Crawler stopped on site with 1M pages queued to be cached. (6.x-dev 2009-09-27) [#594774]

Comment	File	Size	Author
#30	boost-594774.patch	1.67 KB	mikeytown2
#16	boost-594774.patch	791 bytes	mikeytown2

Comment #1

mikeytown2 commented 3 October 2009 at 06:25

Title:	Has the crawler stopped working after initial run?	» Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)

What are your crawler settings and any relevant info in that section?

Comment #2

Froggie-2 commented 3 October 2009 at 06:40

Thanks Mikey, for your quick response. The crawler is enabled in the Boost Settings page and I've not modified anything since setting up Boost 3 days back. However, I did add a patch to remove the PHP Safe Mode set_time_limit function error message but that does not seem to be the cause.

Edit: The Admin recent log entries show that Boost crawler thread one started, thread two started etc but no hits show up in the Top Visitors Page

Comment #3

mikeytown2 commented 3 October 2009 at 06:41

Crawler Throttle: ?
Crawler Batch Size: ?
Number Of Threads: ?

Boost crawler - Live info ?

Comment #4

Froggie-2 commented 3 October 2009 at 06:51

Crawler Throttle: 0 (default)
Crawler Batch Size: 15 (default)
Number of Threads: 2 (default)

Boost Crawler Live Info: None displayed at present. Initially it showed up during the crawl. Not anymore.

Comment #5

mikeytown2 commented 3 October 2009 at 06:52

Is there any expired content? (bottom of Boost File Cache section, right button has count)

Comment #6

Froggie-2 commented 3 October 2009 at 06:56

it says: Clear Boost Expired Data: 0 pages

Comment #7

mikeytown2 commented 3 October 2009 at 06:58

Nothing is expired, thus nothing needs to be crawled most likely. Does the # in the left button match the # in the bottom button?

Comment #8

Froggie-2 commented 3 October 2009 at 07:02

Yes, the # in the left button matches the # in the bottom button.

But the site has tens of thousands of nodes. Why is it not being crawled by the crawler?

Comment #9

mikeytown2 commented 3 October 2009 at 07:04

The crawler only hits what's inside the boost_cache table. You're looking for this feature: #363077: Add spider to crawler - Cache entire site with new install.
To get around this, enable the "Crawl All URL's in the url_alias table." setting.

Comment #10

Froggie-2 commented 3 October 2009 at 07:08

Yes, "Crawl All URL's in the url_alias table. " setting has been enabled since the very beginning during set up but crawler stopped after crawling about 7000 nodes or so. The site has about a million nodes.

Comment #11

mikeytown2 commented 3 October 2009 at 07:16

Interesting... here's the code

/**
 * Get URLs from url alias table
 */
function boost_crawler_add_alias_to_table() {
  // Insert batch of html URL's into boost_crawler table
  global $base_url;
  if (!variable_get('boost_crawl_url_alias', FALSE)) {
    return TRUE;
  }
  $count = 1000;
  $total = db_query("SELECT COUNT(*) FROM {url_alias}");
  $loaded = variable_get('boost_crawler_loaded_count_alias', 0);
  if ($total > $loaded) {
    $list = db_query_range("SELECT dst FROM {url_alias}", $loaded, $count);
    while ($url = db_result($list)) {
      @db_query("INSERT INTO {boost_crawler} (url) VALUES ('%s')", $base_url . '/' . $url);
    }
    variable_set('boost_crawler_loaded_count_alias', $loaded + $count);
    return FALSE;
  }
  else {
    return TRUE;
  }
}

This grabs 1,000 URL aliases at a time and inserts them into the DB one at a time, because it has to prepend the base_url to each one. It keeps getting called until it returns TRUE.
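
The batching described above can be sketched in plain PHP, outside Drupal. This is only an illustration under stated assumptions: arrays stand in for the {url_alias} and {boost_crawler} tables, and the function name load_alias_batch() and the sample data are hypothetical, not part of Boost.

```php
<?php
// Hypothetical stand-in for the batching logic: plain arrays replace the
// {url_alias} and {boost_crawler} tables.
function load_alias_batch(array $aliases, array &$queue, int &$loaded, string $base_url, int $count = 1000): bool {
  // The total has to be a plain integer for the comparison below to work.
  $total = count($aliases);
  if ($total > $loaded) {
    // Take the next batch and prepend the base URL to each alias.
    foreach (array_slice($aliases, $loaded, $count) as $alias) {
      $queue[] = $base_url . '/' . $alias;
    }
    $loaded += $count;
    return FALSE; // More batches may remain; call again.
  }
  return TRUE; // Every alias has been queued.
}

// Sample data (made up): five aliases, loaded two at a time.
$aliases = array('about', 'contact', 'node/1', 'node/2', 'node/3');
$queue = array();
$loaded = 0;
$calls = 0;
do {
  $calls++;
  $done = load_alias_batch($aliases, $queue, $loaded, 'http://example.com', 2);
} while (!$done);

echo count($queue), " URLs queued in ", $calls, " calls\n"; // 5 URLs queued in 4 calls
```

Note that the last call does no inserting; it only detects that the offset has passed the total and returns TRUE.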

Just for kicks, what's the output of this?

  echo db_query("SELECT COUNT(*) FROM {url_alias}");

Comment #12

Froggie-2 commented 3 October 2009 at 07:19

Where do I put this to get the output info?

Comment #13

mikeytown2 commented 3 October 2009 at 07:23

You can create a PHP file, test.php, in your Drupal root directory:

  <?php
  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_query("SELECT COUNT(*) FROM {url_alias}");

Comment #14

Froggie-2 commented 3 October 2009 at 07:32

On my browser I am getting a blank page while in the admin logs I am getting this error message:

" Object of class mysqli_result could not be converted to string in /var/www/vhosts/mywebsite.com/httpdocs/test_boost_crawler.php on line 5."

Comment #15

mikeytown2 commented 3 October 2009 at 07:33

Bingo, you found a bug!
Try this:

  <?php
  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_result(db_query("SELECT COUNT(*) FROM {url_alias}"));
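
Why the extra db_result() matters: db_query() hands back a result handle, not a value, and echoing that handle is what triggered the "could not be converted to string" error in #14. A minimal standalone sketch of the difference, assuming nothing about Drupal's internals: FakeResult and fake_db_result() are hypothetical stand-ins that only mimic the documented behaviour of db_result() (return the first field of the next row, or FALSE when there is none).

```php
<?php
// Hypothetical stand-in for a database result handle.
class FakeResult {
  private $rows;

  public function __construct(array $rows) {
    $this->rows = $rows;
  }

  // Returns the next row, or NULL when no rows are left.
  public function fetchRow() {
    return array_shift($this->rows);
  }
}

// Hypothetical stand-in for db_result(): first field of the next row, or FALSE.
function fake_db_result(FakeResult $result) {
  $row = $result->fetchRow();
  return $row === NULL ? FALSE : $row[0];
}

// A COUNT(*) query yields one row with one field.
$result = new FakeResult(array(array('956821')));

// Extract the scalar before echoing or comparing it.
$total = (int) fake_db_result($result);
echo $total, "\n"; // 956821
```

The handle itself is an object, so it can be neither echoed nor meaningfully compared against an integer offset; only the extracted value can.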

Comment #16

mikeytown2 commented 3 October 2009 at 07:37

Title:	Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)	» boost_crawler_add_alias_to_table() doesn't do db_result on the total query
Category:	support	» bug
Status:	Active	» Needs review

Status	File	Size
new	boost-594774.patch	791 bytes

Comment #17

Froggie-2 commented 3 October 2009 at 07:45

Title:	boost_crawler_add_alias_to_table() doesn't do db_result on the total query	» Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)
Category:	bug	» support
Status:	Needs review	» Active

I see this number in my browser: 956821 with #15 above.

Comment #18

Froggie-2 commented 3 October 2009 at 08:04

I have applied the patch boost-594774.patch. I will wait for some time before confirming how it works.
Thanks again to you Mikey for your quick response, time & solution.
Best Regards

Comment #19

mikeytown2 commented 3 October 2009 at 08:05

Title:	Has the crawler stopped working after initial run? (6.x-dev 2009-09-27)	» boost_crawler_add_alias_to_table() doesn't do db_result on the total query
Category:	support	» bug
Status:	Active	» Needs review

Comment #20

Froggie-2 commented 3 October 2009 at 08:54

Still no response from the crawler. No page hits or server IP address are getting recorded in the Admin > Top Visitors Page.

The recent log entries page show:
Crawler Start
Crawler Sleep for 15 seconds
Crawler Sleep for 15 seconds

It's almost an hour now since cron ran.

Boost Crawler Live Info: None displayed at present.

Comment #21

mikeytown2 commented 3 October 2009 at 08:58

It takes a long time to add 956k URLs. What does this output? Run it twice, about a minute apart, and make sure the number is increasing.

  <?php
  require_once './includes/bootstrap.inc';
  drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

  echo db_result(db_query("SELECT COUNT(*) FROM {boost_crawler}"));

Comment #22

Froggie-2 commented 3 October 2009 at 09:10

Yes, thanks a lot! The number is increasing every minute. It is now at 680,000.
I was kinda looking at the Admin Top Visitors Page and expecting to see the server IP address and the hits increasing. Didn't realize that the crawler first has to load all the URLs and only then starts crawling. Sorry Bro! My mistake, for sure.
Thanks again

Comment #23

Froggie-2 commented 3 October 2009 at 09:55

Current Situation: Last 15 minutes
Boost Crawler table (boost_crawler) has loaded up fully with 956821 urls.
Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started

Boost Crawler Live Info is showing up

However, No page hits or server IP address are getting recorded in the Admin > Top Visitors Page.

Comment #24

Froggie-2 commented 3 October 2009 at 10:26

Current Situation: Last 1 hour (approx)

Boost Crawler table (boost_crawler) has loaded up fully with 956821 urls, one hour or so ago.

Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started

Boost Crawler Live Info has vanished

No page hits or server IP address noticed in the Admin > Top Visitors Page.

Pages from the site are not getting crawled by the crawler.

Pages visited by anonymous users and search engines are only getting cached at the moment.

Comment #25

Froggie-2 commented 3 October 2009 at 14:17

Same result after approx. 3 hours.

No page hits or server IP address noticed in the Admin > Top Visitors Page.

Pages from the site are not getting crawled by the crawler.

Comment #26

mikeytown2 commented 3 October 2009 at 16:50

I wonder why the crawler stopped (live info disappearing). The reason it wasn't hitting the logs at first is that the first part of the site is already in the Boost cache.

Comment #27

Froggie-2 commented 4 October 2009 at 03:39

I too am wondering as to why the crawler has stopped. It is more than 12 hours now, but not a single page hit by the crawler. As I said earlier, there are heaps of nodes yet to be crawled.

Comment #28

mikeytown2 commented 4 October 2009 at 07:22

Status:	Needs review	» Active

Committed this; now wondering why the crawler dies.

Comment #29

mikeytown2 commented 4 October 2009 at 07:30

Title:	boost_crawler_add_alias_to_table() doesn't do db_result on the total query	» Crawler stopped on site with 1M pages queued to be cached. (6.x-dev 2009-09-27)

Comment #30

mikeytown2 commented 4 October 2009 at 07:35

Status	File	Size
new	boost-594774.patch	1.67 KB

I remember your site taking upwards of 5 minutes to generate one URL. I think PHP is timing out in your case. Try this patch.

Comment #31

mikeytown2 commented 4 October 2009 at 07:36

and/or set
Crawler Batch Size: 3

Comment #32

mikeytown2 commented 4 October 2009 at 07:40

Category:	bug	» support

Comment #33

Froggie-2 commented 4 October 2009 at 13:04

Sorry for being late with this comment.
Over the last three days, the page generation time on the site has dropped from 5 minutes per page to about 10 to 20 seconds (max) per page, after I removed the Similar Entries module. Even though the Similar Entries module provides highly relevant results, it takes a lot of time to generate each page.

This morning I ran an external crawler (httrack), and the response time per page load was quite satisfactory: a maximum of 20 seconds and a minimum of 6 seconds on fresh uncached pages. Even through the browser, page loads are now much faster than before.
With Boost most cached pages are delivered even faster.

Latest Situation:

I tried changing the crawler batch size from 15 to 3, without any success. There is still no response from the crawler. The crawler live info is occasionally visible on the Boost page, but the crawler is not crawling the pages.

Now I shall apply the patch from #30 (boost-594774.patch), wait for some time, and then report back.
Thanks Mikey for your time and efforts.
As always best regards

Comment #34

Froggie-2 commented 4 October 2009 at 13:01

Latest info after applying the patch from #30 above (boost-594774.patch) and a cron run:

The Crawler Live Info is visible occasionally after cron runs and vanishes after a while.
The Boost Crawler Live Info section shows: 957237 URL's left.
The crawler is still not crawling the pages.
------------------------------------------------------------------------------

Could it be that the crawler is hitting an error on some rogue URL and stopping, instead of skipping it and moving on to the next URL in the database table? Just a wild guess.

Comment #35

Froggie-2 commented 5 October 2009 at 11:05

Hi Mikey, I just found this error message on my server error logs. This could be the reason why the crawler is not crawling the site.
Error Message: PHP Fatal error: Call to undefined function _boost_set_time_limit() in /var/www/vhosts/mywebsite.com/httpdocs/modules/boost/boost.module on line 2510
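
A fatal error of this kind means the call site exists but the function definition was never loaded, so the script dies at that line. A minimal sketch of guarding such a call, purely as an illustration: `_example_set_time_limit` is a hypothetical name standing in for a helper that may or may not be defined, not a Boost function.

```php
<?php
// Hypothetical helper name; it is deliberately not defined in this script.
$helper = '_example_set_time_limit';

if (function_exists($helper)) {
  // Safe to call: the definition was loaded.
  $helper(240);
  $status = 'called';
}
else {
  // Calling it here would abort the script with "Call to undefined function".
  $status = 'missing';
  echo "Helper $helper() is not defined; check that the whole patch was applied.\n";
}
```

A guard like this only turns a hard stop into a visible message; the real fix is still to make sure the definition is present.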

Comment #36

Froggie-2 commented 5 October 2009 at 11:13

Probably, I made a mistake in adding the first few lines of the patch (boost-590126.patch). I shall rectify it now and report back as soon as possible. Sorry for all the trouble.
Thanks again!

Comment #37

Froggie-2 commented 5 October 2009 at 13:13

An error on my part while adding the patch (boost-590126.patch for avoiding the set_time_limit message in PHP Safe Mode) caused Boost crawler to stop crawling. After rectification, the crawler has started to work again. My apologies to Mikey for causing this inconvenience.
Thanks again!

Comment #38

mikeytown2 commented 5 October 2009 at 20:37

Does that mean this issue is "fixed"?

Comment #39

Froggie-2 commented 6 October 2009 at 03:28

Status:	Active	» Fixed

Yes, this issue is fixed. Marking it as fixed. Thanks!

Comment #40

20 October 2009 at 03:30

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
