Hi Mikey, I just wanted to report that it appears the crawler has stopped crawling pages after its initial run.
The Boost Settings Page still indicates that the crawler is active, and so does the Admin Recent Log Entries section. However, the Top Visitors Page in the Admin Reports section does not show the server IP address as a visitor. This has been happening for the last three days. On the initial run after installing Boost (dev version dated 27th September 2009) and running cron, the Top Visitors Page showed the server IP address with lots of hits, which gradually disappeared over the last 3 days; now no hits or server IP show up. I am still using the Boost dev version (dated 27th September 2009).
Thanks
| Comment | File | Size | Author |
|---|---|---|---|
| #30 | boost-594774.patch | 1.67 KB | mikeytown2 |
| #16 | boost-594774.patch | 791 bytes | mikeytown2 |
Comments
Comment #1
mikeytown2 commented
What are your crawler settings and any relevant info in that section?
Comment #2
Froggie-2 commented
Thanks Mikey, for your quick response. The crawler is enabled in the Boost Settings page and I've not modified anything since setting up Boost 3 days ago. However, I did apply a patch to remove the PHP Safe Mode set_time_limit function error message, but that does not seem to be the cause.
Edit: The Admin recent log entries show that Boost crawler thread one started, thread two started, etc., but no hits show up in the Top Visitors Page.
Comment #3
mikeytown2 commented
Crawler Throttle: ?
Crawler Batch Size: ?
Number Of Threads: ?
Boost crawler - Live info ?
Comment #4
Froggie-2 commented
Crawler Throttle: 0 (default)
Crawler Batch Size: 15 (default)
Number of Threads: 2 (default)
Boost Crawler Live Info: None displayed at present. Initially it showed up during the crawl. Not anymore.
Comment #5
mikeytown2 commented
Is there any expired content? (bottom of Boost File Cache section, right button has count)
Comment #6
Froggie-2 commented
It says: Clear Boost Expired Data: 0 pages
Comment #7
mikeytown2 commented
Nothing is expired, so most likely nothing needs to be crawled. Does the # in the left button match the # in the bottom button?
Comment #8
Froggie-2 commented
Yes, the # in the left button matches the # in the bottom button.
But the site has tens of thousands of nodes. Why is it not being crawled by the crawler?
Comment #9
mikeytown2 commented
The crawler only hits what's inside the boost_cache table. You're looking for this feature: #363077: Add spider to crawler - Cache entire site with new install.
To get around this, enable the "Crawl All URL's in the url_alias table. " setting.
Comment #10
Froggie-2 commented
Yes, the "Crawl All URL's in the url_alias table. " setting has been enabled since the very beginning, during setup, but the crawler stopped after crawling about 7000 nodes or so. The site has about a million nodes.
Comment #11
mikeytown2 commented
Interesting... here's the code
This grabs 1,000 nodes at a time and inserts them into the DB one at a time, because it has to add the base_url to the beginning of each one. This keeps getting called until it returns true.
Just for kicks, what's the output of this?
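The batching described in this comment can be sketched roughly as follows. This is an illustrative Python sketch only, not Boost's PHP code and not the snippet asked about above; the column names (`pid`, `dst`, `url`), the function names and the SQLite backend are all assumptions made for the example.

```python
import sqlite3

BATCH_SIZE = 1000  # the comment above says 1,000 rows are read per call


def make_demo_db(alias_count):
    """Build an in-memory database with hypothetical url_alias and boost_crawler tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE url_alias (pid INTEGER PRIMARY KEY, dst TEXT)")
    db.execute("CREATE TABLE boost_crawler (url TEXT)")
    db.executemany(
        "INSERT INTO url_alias (dst) VALUES (?)",
        [(f"node/{i}",) for i in range(alias_count)],
    )
    db.commit()
    return db


def load_batch(db, base_url, offset):
    """Copy one batch of aliases into boost_crawler; return True once the last batch is reached."""
    rows = db.execute(
        "SELECT dst FROM url_alias ORDER BY pid LIMIT ? OFFSET ?",
        (BATCH_SIZE, offset),
    ).fetchall()
    for (dst,) in rows:
        # Rows are inserted one at a time because each needs base_url prepended.
        db.execute("INSERT INTO boost_crawler (url) VALUES (?)", (f"{base_url}/{dst}",))
    db.commit()
    # A short (or empty) batch means there is nothing left to copy.
    return len(rows) < BATCH_SIZE


def load_all(db, base_url):
    """Call load_batch repeatedly until it returns True; return the number of calls made."""
    calls = 0
    while True:
        calls += 1
        if load_batch(db, base_url, (calls - 1) * BATCH_SIZE):
            return calls
```

In this sketch, `load_all(make_demo_db(2500), "http://example.com")` makes three calls to `load_batch` and leaves 2,500 rows in `boost_crawler`.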
Comment #12
Froggie-2 commented
Where do I put this to get the output info?
Comment #13
mikeytown2 commented
You can create a PHP file, test.php
Comment #14
Froggie-2 commented
In my browser I am getting a blank page, while in the admin logs I am getting this error message:
"Object of class mysqli_result could not be converted to string in /var/www/vhosts/mywebsite.com/httpdocs/test_boost_crawler.php on line 5."
Comment #15
mikeytown2 commented
Bingo - you found a bug!
Try this
Comment #16
mikeytown2 commented
Comment #17
Froggie-2 commented
I see this number in my browser: 956821 with #15 above.
Comment #18
Froggie-2 commented
I have added the patch boost-594774.patch. I will wait for some time before confirming how it works.
Thanks again to you Mikey for your quick response, time & solution.
Best Regards
Comment #19
mikeytown2 commented
Comment #20
Froggie-2 commented
Still no response from the crawler. No page hits or server IP address are getting recorded in the Admin > Top Visitors Page.
The recent log entries page shows:
Crawler Start
Crawler Sleep for 15 seconds
Crawler Sleep for 15 seconds
It's been almost an hour since cron ran.
Boost Crawler Live Info: None displayed at present.
Comment #21
mikeytown2 commented
It takes a long time to add 956k URLs. What does this output? Run it twice, about 1 minute apart, and make sure the number is increasing.
Comment #22
Froggie-2 commented
Yes, thanks a lot! The number is increasing every minute. Now it is 680,000.
I was kinda looking at the Admin Top Visitors Page and expecting to see the server IP address and the hits increasing. I didn't realize that the crawler first has to load all the URLs and only then starts crawling. Sorry Bro! My mistake, for sure.
Thanks again
Comment #23
Froggie-2 commented
Current Situation: Last 15 minutes
Boost Crawler table (boost_crawler) has loaded up fully with 956821 URLs.
Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started
Boost Crawler Live Info is showing up
However, no page hits or server IP address are getting recorded in the Admin > Top Visitors Page.
Comment #24
Froggie-2 commented
Current Situation: Last 1 hour (approx)
Boost Crawler table (boost_crawler) loaded up fully with 956821 URLs an hour or so ago.
Recent Log Entries Page shows:
Crawler- Thread 1 of 2 started
Crawler- Thread 2 of 2 started
Boost Crawler Live Info has vanished
No page hits or server IP address noticed in the Admin > Top Visitors Page.
Pages from the site are not getting crawled by the crawler.
Only pages visited by anonymous users and search engines are getting cached at the moment.
Comment #25
Froggie-2 commented
Same result after approx. 3 hours.
No page hits or server IP address noticed in the Admin > Top Visitors Page.
Pages from the site are not getting crawled by the crawler.
Comment #26
mikeytown2 commented
I wonder why the crawler stopped (live info disappearing). The reason it wasn't hitting the logs at first is that the first part of the site is already in the boost cache.
Comment #27
Froggie-2 commented
I too am wondering why the crawler has stopped. It has been more than 12 hours now, but not a single page has been hit by the crawler. As I said earlier, there are heaps of nodes yet to be crawled.
Comment #28
mikeytown2 commented
Committed this; now wondering why the crawler dies.
Comment #29
mikeytown2 commented
Comment #30
mikeytown2 commented
I remember your site taking upwards of 5 min to generate 1 URL. I think PHP is timing out in your case. Try this
Comment #31
mikeytown2 commented
and/or set
Crawler Batch Size: 3
Comment #32
mikeytown2 commented
Comment #33
Froggie-2 commented
Sorry for being late with this comment.
Over the last three days the page generation time on the site has dropped from 5 minutes per page to about 10 to 20 seconds (max) per page, after I removed the Similar Entries module. Even though the Similar Entries module provides highly relevant results, it takes a lot of time to generate each page.
Since this morning I have been running an external crawler (httrack), and the response time per page load has been quite satisfactory: a maximum of 20 seconds and a minimum of 6 seconds on fresh uncached pages. Even through the browser, page loads are now much faster than before.
With Boost most cached pages are delivered even faster.
Latest Situation:
I tried changing the crawler batch size from 15 to 3 without any success. There is still no response from the crawler. The crawler live info is occasionally visible on the Boost Page but the crawler is not crawling the pages.
Now I shall use the patch stated in #30 (boost-594774.patch), wait for some time and then report back.
Thanks Mikey for your time and efforts.
As always best regards
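For a sense of scale, the figures quoted in this thread (956821 URLs, 2 crawler threads, 10 to 20 seconds per uncached page) allow a rough estimate of how long a full crawl would take. The sketch below is back-of-the-envelope arithmetic in Python, assuming every page is uncached, the two threads run fully in parallel and nothing else slows the crawl; the function name is made up for the example.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(url_count, seconds_per_page, threads):
    """Estimate the days needed to crawl url_count pages split evenly across threads."""
    return url_count * seconds_per_page / threads / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures taken from the thread: 956821 URLs, 2 threads, 10 to 20 s per page.
    for seconds in (10, 20):
        days = crawl_days(956_821, seconds, 2)
        print(f"{seconds} s/page: about {days:.0f} days")
```

Under those assumptions a complete crawl would take somewhere between roughly 55 and 111 days, so even a healthy crawler would need a long time to cover the whole site.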
Comment #34
Froggie-2 commented
Latest Info after applying the patch stated in #30 above (boost-594774.patch) and a cron run:
The Crawler Live Info is visible occasionally after cron runs and vanishes after a while.
The Boost Crawler Live Info section shows: 957237 URL's left.
The crawler is still not crawling the pages.
------------------------------------------------------------------------------
Could it be that the crawler is encountering an error on some rogue URL and stopping, instead of skipping the rogue URL and moving on to the next URL in the database table? Just a wild guess.
Comment #35
Froggie-2 commented
Hi Mikey, I just found this error message in my server error logs. This could be the reason why the crawler is not crawling the site.
Error Message: PHP Fatal error: Call to undefined function _boost_set_time_limit() in /var/www/vhosts/mywebsite.com/httpdocs/modules/boost/boost.module on line 2510
Comment #36
Froggie-2 commented
I probably made a mistake in adding the first few lines of the patch (boost-590126.patch). I shall rectify it now and report back as soon as possible. Sorry for all the trouble.
Thanks again!
Comment #37
Froggie-2 commented
An error on my part while adding the patch (boost-590126.patch for avoiding the set_time_limit message in PHP Safe Mode) caused Boost crawler to stop crawling. After rectification, the crawler has started to work again. My apologies to Mikey for causing this inconvenience.
Thanks again!
Comment #38
mikeytown2 commented
Does that mean this issue is "fixed"?
Comment #39
Froggie-2 commented
Yes, this issue is fixed. Marking it as fixed. Thanks!