Hi Mikey, I just wanted to report that it appears the crawler has stopped crawling pages after its initial run.
The Boost Settings page still indicates that the crawler is active, and so does the Admin Recent Log Entries section. However, the Top Visitors page in the Admin Reports section does not show the server IP address as a visitor. This has been happening for the last three days. On the initial run after installing Boost (dev version dated 27th September 2009) and running cron, the Top Visitors page showed the server IP address with lots of hits. These gradually disappeared over the last 3 days, and now no hits or server IP address show up. I am still using the Boost dev version (dated 27th September 2009).
Thanks

Comment  File                Size       Author
#30      boost-594774.patch  1.67 KB    mikeytown2
#16      boost-594774.patch  791 bytes  mikeytown2

Comments

mikeytown2’s picture

Title: Has the crawler stopped working after initial run? » Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)

What are your crawler settings and any relevant info in that section?

Froggie-2’s picture

Thanks Mikey, for your quick response. The crawler is enabled in the Boost Settings page and I've not modified anything since setting up Boost 3 days back. However, I did add a patch to remove the PHP Safe Mode set_time_limit function error message but that does not seem to be the cause.

Edit: The Admin recent log entries show that Boost crawler thread one started, thread two started etc but no hits show up in the Top Visitors Page

mikeytown2’s picture

Crawler Throttle: ?
Crawler Batch Size: ?
Number Of Threads: ?

Boost crawler - Live info ?

Froggie-2’s picture

Crawler Throttle: 0 (default)
Crawler Batch Size: 15 (default)
Number of Threads: 2 (default)

Boost Crawler Live Info: None displayed at present. It showed up initially during the crawl, but not anymore.

mikeytown2’s picture

Is there any expired content? (bottom of Boost File Cache section, right button has count)

Froggie-2’s picture

it says: Clear Boost Expired Data: 0 pages

mikeytown2’s picture

Nothing is expired, thus nothing needs to be crawled most likely. Does the # in the left button match the # in the bottom button?

Froggie-2’s picture

Yes, the # in the left button matches the # in the bottom button.

But the site has tens of thousands of nodes. Why is it not being crawled by the crawler?

mikeytown2’s picture

The crawler only hits what's inside the boost_cache table. You're looking for this feature: #363077: Add spider to crawler - Cache entire site with new install.
To get around this, enable the "Crawl All URL's in the url_alias table. " setting.

Froggie-2’s picture

Yes, "Crawl All URL's in the url_alias table. " setting has been enabled since the very beginning during set up but crawler stopped after crawling about 7000 nodes or so. The site has about a million nodes.

mikeytown2’s picture

Interesting... here's the code

/**
 * Get URLs from url alias table
 */
function boost_crawler_add_alias_to_table() {
  // Insert batch of html URL's into boost_crawler table
  global $base_url;
  if (!variable_get('boost_crawl_url_alias', FALSE)) {
    return TRUE;
  }
  $count = 1000;
  $total = db_query("SELECT COUNT(*) FROM {url_alias}");
  $loaded = variable_get('boost_crawler_loaded_count_alias', 0);
  if ($total > $loaded) {
    $list = db_query_range("SELECT dst FROM {url_alias}", $loaded, $count);
    while ($url = db_result($list)) {
      @db_query("INSERT INTO {boost_crawler} (url) VALUES ('%s')", $base_url . '/' . $url);
    }
    variable_set('boost_crawler_loaded_count_alias', $loaded + $count);
    return FALSE;
  }
  else {
    return TRUE;
  }
}

This grabs 1,000 URL aliases at a time and inserts them into the DB one at a time, because it has to prepend the base_url to each one. It keeps getting called until it returns TRUE.
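The batching described above can be sketched outside Drupal. This is only an illustration, not the module's code: the function name `load_alias_batch`, the in-memory arrays standing in for the url_alias and boost_crawler tables, and the example.com base URL are all assumptions, and the type declarations need a newer PHP than this thread was written against. The point it shows is that the total must be a plain integer (what wrapping the count query in db_result() yields) for the `$total > $loaded` comparison to mean anything.

```php
<?php
// Hypothetical stand-in for boost_crawler_add_alias_to_table().
// $aliases plays the role of {url_alias}, $queue the role of {boost_crawler},
// and $loaded the role of the boost_crawler_loaded_count_alias variable.
function load_alias_batch(array $aliases, array &$queue, int &$loaded, string $base_url, int $count = 1000): bool {
  // A plain integer total, comparable to db_result(db_query("SELECT COUNT(*) ...")).
  $total = count($aliases);
  if ($total <= $loaded) {
    // Every alias has already been queued.
    return true;
  }
  // Queue the next batch, prepending the base URL to each alias.
  foreach (array_slice($aliases, $loaded, $count) as $alias) {
    $queue[] = $base_url . '/' . $alias;
  }
  $loaded += $count;
  // More batches may remain; the caller should call again.
  return false;
}

// Build 2,500 fake aliases and call the loader until it reports completion.
$aliases = [];
for ($i = 1; $i <= 2500; $i++) {
  $aliases[] = "node/$i";
}
$queue = [];
$loaded = 0;
$calls = 0;
do {
  $calls++;
  $done = load_alias_batch($aliases, $queue, $loaded, 'http://example.com');
} while (!$done);
```

With 2,500 aliases and a batch size of 1,000, the loop needs three loading calls plus one final call that returns true, which is why a large url_alias table takes many cron passes to queue fully.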

Just for kicks, what's the output of this?

  echo db_query("SELECT COUNT(*) FROM {url_alias}");
Froggie-2’s picture

Where do I put this to get the output info?

mikeytown2’s picture

you can create a php file test.php

  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_query("SELECT COUNT(*) FROM {url_alias}");
Froggie-2’s picture

On my browser I am getting a blank page while in the admin logs I am getting this error message:

" Object of class mysqli_result could not be converted to string in /var/www/vhosts/mywebsite.com/httpdocs/test_boost_crawler.php on line 5."

mikeytown2’s picture

bingo - u found a bug!
try this

  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_result(db_query("SELECT COUNT(*) FROM {url_alias}"));
mikeytown2’s picture

Title: Has the crawler stopped working after initial run? (6.x-dev 2009-09-27) » boost_crawler_add_alias_to_table() doesn't do db_result on the total query
Category: support » bug
Status: Active » Needs review
Status: new
Size: 791 bytes
Froggie-2’s picture

Title: boost_crawler_add_alias_to_table() doesn't do db_result on the total query » Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)
Category: bug » support
Status: Needs review » Active

I see this number in my browser: 956821 with #15 above.

Froggie-2’s picture

I have applied the patch boost-594774.patch. I will wait for some time before confirming how it works.
Thanks again to you Mikey for your quick response, time & solution.
Best Regards

mikeytown2’s picture

Title: Has the crawler stopped working after initial run? (6.x-dev 2009-09-27) » boost_crawler_add_alias_to_table() doesn't do db_result on the total query
Category: support » bug
Status: Active » Needs review
Froggie-2’s picture

Still no response from the crawler. No page hits or server IP address are getting recorded in the Admin > Top Visitors Page.

The recent log entries page show:
Crawler Start
Crawler Sleep for 15 seconds
Crawler Sleep for 15 seconds

It's almost an hour now since cron ran.

Boost Crawler Live Info: None displayed at present.

mikeytown2’s picture

It takes a long time to add 956k URLs. What does this output? Run it twice, about 1 minute apart, and make sure the number is increasing.

  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_result(db_query("SELECT COUNT(*) FROM {boost_crawler}"));
Froggie-2’s picture

Yes, thanks a lot! The number is increasing per minute. Now, it is 6,80,000.
I was kinda looking at the Admin Top Visitors Page and expecting to see the Server IP Address and the hits increasing. Didn't realize that the crawler first has to load all the URLs and only then starts crawling. Sorry Bro! My mistake, for sure.
Thanks again

Froggie-2’s picture

Current Situation: Last 15 minutes
Boost Crawler table (boost_crawler) has loaded up fully with 956821 urls.
Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started

Boost Crawler Live Info is showing up

However, No page hits or server IP address are getting recorded in the Admin > Top Visitors Page.

Froggie-2’s picture

Current Situation: Last 1 hour (approx)

Boost Crawler table (boost_crawler) has loaded up fully with 956821 urls, one hour or so ago.

Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started

Boost Crawler Live Info has vanished

No page hits or server IP address noticed in the Admin > Top Visitors Page.

Pages from the site are not getting crawled by the crawler.

Pages visited by anonymous users and search engines are only getting cached at the moment.

Froggie-2’s picture

Same result after approx. 3 hours.

No page hits or server IP address noticed in the Admin > Top Visitors Page.

Pages from the site are not getting crawled by the crawler.

mikeytown2’s picture

I wonder why the crawler stopped (live info disappearing). The reason it wasn't hitting the logs at first is that the first part of the site is already in the Boost cache.

Froggie-2’s picture

I too am wondering as to why the crawler has stopped. It is more than 12 hours now, but not a single page hit by the crawler. As I said earlier, there are heaps of nodes yet to be crawled.

mikeytown2’s picture

Status: Needs review » Active

committed this, now wondering why the crawler dies.

mikeytown2’s picture

Title: boost_crawler_add_alias_to_table() doesn't do db_result on the total query » Crawler stopped on site with 1M pages queued to be cached. (6.x-dev 2009-09-27)
mikeytown2’s picture

Status: new
Size: 1.67 KB

I remember your site taking upwards of 5 min to generate 1 url. I think php is timing out in your case. Try this

mikeytown2’s picture

and/or set
Crawler Batch Size: 3

mikeytown2’s picture

Category: bug » support
Froggie-2’s picture

Sorry for being late with this comment.
Over the last three days, the page generation time on the site has dropped from 5 minutes per page to about 10 to 20 seconds (max) per page, after I removed the Similar Entries module. Even though the Similar Entries module provides highly relevant results, it takes a lot of time to generate each page.

Since this morning I have been running an external crawler (httrack), and the response time per page load was quite satisfactory: a maximum of 20 seconds and a minimum of 6 seconds on fresh uncached pages. Even through the browser, page loads are now much faster than before.
With Boost, most cached pages are delivered even faster.

Latest Situation:

I tried changing the crawler batch size from 15 to 3 without any success. There is still no response from the crawler. The crawler live info is occasionally visible on the Boost page, but the crawler is not crawling the pages.

Now I shall apply the patch from #30 (boost-594774.patch), wait for some time and then report back.
Thanks Mikey for your time and efforts..
As always best regards

Froggie-2’s picture

Latest Info after applying patch stated in #30 above (boost 594774.patch) and cron run:

The Crawler Live Info is visible occasionally after cron runs and vanishes after a while.
The Boost Crawler Live Info section shows: 957237 URL's left.
The crawler is still not crawling the pages.
------------------------------------------------------------------------------

Could it be that the crawler is hitting an error on some rogue URL and stopping, instead of skipping the rogue URL and moving on to the next URL in the database table? Just a wild guess.

Froggie-2’s picture

Hi Mikey, I just found this error message on my server error logs. This could be the reason why the crawler is not crawling the site.
Error Message: PHP Fatal error: Call to undefined function _boost_set_time_limit() in /var/www/vhosts/mywebsite.com/httpdocs/modules/boost/boost.module on line 2510

Froggie-2’s picture

Probably, I made a mistake in adding the first few lines of the patch (boost-590126.patch). I shall rectify it now and report back as soon as possible. Sorry for all the trouble.
Thanks again!

Froggie-2’s picture

An error on my part while adding the patch (boost-590126.patch for avoiding the set_time_limit message in PHP Safe Mode) caused Boost crawler to stop crawling. After rectification, the crawler has started to work again. My apologies to Mikey for causing this inconvenience.
Thanks again!
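
For readers hitting the same fatal error: the contents of boost-590126.patch are not shown in this thread, so the following is only a guess at the general shape of a Safe Mode friendly wrapper, not the module's actual `_boost_set_time_limit()`. The name `example_set_time_limit` is hypothetical. The key point from the error above is that the wrapper must be defined before anything calls it, otherwise PHP stops with "Call to undefined function".

```php
<?php
// Hypothetical wrapper (not the Boost module's real helper): raise the
// execution time limit only when PHP allows it, and never emit a warning.
function example_set_time_limit(int $seconds): bool {
  // The function can be missing when it is listed in disable_functions.
  if (!function_exists('set_time_limit')) {
    return false;
  }
  $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
  if (in_array('set_time_limit', $disabled, true)) {
    return false;
  }
  // Suppress the warning that older PHP versions raise in Safe Mode.
  return (bool) @set_time_limit($seconds);
}

// Usage: ask for a longer limit before a slow crawl batch.
$raised = example_set_time_limit(240);
```

If a patch like this is applied by hand, it is worth checking that both the function definition and every call site made it into boost.module.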

mikeytown2’s picture

does that mean this issue is "fixed"?

Froggie-2’s picture

Status: Active » Fixed

Yes, this issue is fixed. Marking it as fixed. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.