Hey all,
I'm new to Drupal and have just developed a small site on Drupal 7. I am seeing very sluggish response times from my shared hosting provider, so I started playing around with caching options. Boost is by far the best option for this kind of site, which sees very few hits at the moment. Response times went from 6+ seconds to 200 ms, which is wonderful! The issue is that this obviously only applies when cache files exist, and at the moment they only seem to get generated when a page is requested by a visitor. I know version 6 of Boost included a crawler to generate cache files, but it seems this component hasn't made it to the 7 dev release just yet? Or have I configured something incorrectly?
Anyone have any other ideas for a crawling module in the meantime? I'm not looking to do anything other than load every page on the site (less than 50 for now) every hour or so...
Comments
Comment #1
mikeytown2 commented
The crawler in 6.x needs a lot of love; thus it hasn't made its way to 7.x.
For a single-threaded version (which will likely time out, because your shared host kills long-running scripts), try the code below and have cron call it.
BTW, this script will bypass the boost cache ;)
Comment #2
timfarley commented
Cheers, Mikeytown!
That seems to do the trick, for now at least. The first run took 183,000+ ms, and it didn't seem to time out for me on Bluehost. We'll see whether it holds up, especially if the number of nodes increases drastically, but for now all looks good. Thanks again.
Comment #3
mikeytown2 commented
Note to self:
->addTag('node_access')
Comment #4
NPC commented
Subscribing.
Comment #5
mikeytown2 commented
2 more notes:
http://www.leveltendesign.com/blog/randall-knutson/cron-queues-processin...
http://drupal.org/node/1138098#comment-4566670
Comment #6
emmeade commented
I created a module and used hook_cron with the code in #1. It looks like the pages are crawled but they are not added to the boost cache. Is there a different way to set this up with cron?
Comment #7
johnlutz commented
I am also trying to get this to work, but I can't figure out how. Like emmeade, I got the code from above working so that files are cached in the cache directory, but when you go to the pages initially it isn't pulling up the cached version.
Comment #8
jamescook commented
Why not just use a cron call to wget?
e.g.
wget -r -nd --delete-after http://example.com
Comment #9
mikeytown2 commented
HTTPRL is the way forward with this. If/when I get around to making a Boost submodule to do crawling, it will require HTTPRL.
Comment #10
voodootea commented
This is great (although it did cause my VPS to die from too much memory use). However, the script tries to access pages using the server's IP address rather than the domain name, which results in the following folder in my boost cache directory:
/cache/normal/xxx.xxx.xxx.xxx/
rather than
/cache/normal/domain.com/
which is what most visitors would use to access the site...
Any ideas on how to modify the script to make it crawl as a 'normal' user?
Many thanks
Comment #11
x3cion commented
Try replacing $_SERVER['HTTP_HOST'] with the real address ("example.com", for example).
Didn't test it.
Comment #12
ofir commented
I solved this using a script and cron. It requires the XML Sitemap module. I found this script: http://beajar.blogspot.co.uk/2011/05/quick-unix-shell-script-to-crawl-xm... and modified it to my requirements.
On my shared hosting it takes 3,000 ms to build a cached page and 500 ms to fetch a cached page. I run this once an hour.
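The linked script isn't reproduced in this comment; purely as an illustration of the same idea, here is a minimal sketch that pulls each <loc> URL out of an XML Sitemap file and requests it. The file name and URLs below are placeholders, and the actual HTTP fetch is stubbed out with an echo:

```shell
# extract_sitemap_urls: print each <loc> URL found in the sitemap file given as $1
extract_sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's|<loc>||' -e 's|</loc>||'
}

# Demo with a tiny inline sitemap (URLs are placeholders):
cat > /tmp/sitemap_demo.xml <<'EOF'
<urlset>
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
EOF

# Request each URL so Boost can write a fresh cache file; here we just echo.
extract_sitemap_urls /tmp/sitemap_demo.xml | while read -r url; do
  echo "crawl: $url"        # stand-in for: wget -q -O /dev/null "$url"
done
```

Swapping the echo for a real `wget -q -O /dev/null "$url"` makes it actually warm the cache; point it at the sitemap your XML Sitemap module generates.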
Comment #13
bgm commented
I wrote a small module, "boost_crawler", included as a submodule of Boost, which can crawl pages (using httprl) that have been flushed by the 'expire' module.
* the expire hook queues the URLs for the crawler (based on comment #5)
* the cron queue runs after the hook_cron() calls, which means that Boost cache clear also happens before we start crawling.
* for now, httprl calls are not async. I kind of like being able to limit, this way, how much crawling we do per run (a hook tells the cron queue how much time we want to spend crawling, and the queue takes care of keeping an eye on that).
It's nowhere near what the crawler did in 6.x-1.x, but it's a start. The main use case is for editors who update content: the cache gets deleted, and we should immediately regenerate the cache.
TODO: the boost block should show whether the page is queued for crawling, and how many items are in the queue.
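The time-budget idea described above could be sketched roughly like this. The function name, file paths, and the 30-second budget are assumptions, and the HTTP fetch is stubbed out with an echo:

```shell
# crawl_queue FILE BUDGET_SECS -- sketch of a time-budgeted crawl queue:
# process queued URLs until the time budget is spent, then stop.
crawl_queue() {
  queue="$1"; budget="$2"
  start=$(date +%s)
  while read -r url; do
    # Stop crawling once the budget is exhausted; remaining URLs wait
    # for the next cron run.
    [ $(( $(date +%s) - start )) -ge "$budget" ] && break
    echo "crawl: $url"   # stand-in for an HTTP fetch that re-warms the cache
  done < "$queue"
}

# Demo (URLs and queue file are placeholders):
printf 'http://example.com/\nhttp://example.com/about\n' > /tmp/boost_queue_demo.txt
crawl_queue /tmp/boost_queue_demo.txt 30
```

The actual module uses Drupal's cron queue and httprl rather than a shell loop; this only illustrates the budget mechanism.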
Comment #14
volker23 commented
I tried the dev, but with no success. Boost stopped working completely. Went back to beta-1, and everything worked again. Bummers.
Comment #15
bgm commented
Hmm, any chance you could provide more information? Anything in the watchdog? Apache error log?
Comment #16
jamix commented
A quick test over here worked fine. On node change, Expire expired the node page and Boost removed the cached copy of it. On a subsequent cron run, the cached copy was re-created by the crawler.
@Volker23: Make sure you use the latest dev of Expire. There's a bug in alpha3 that makes it create invalid expire URLs: #1471926: Invalid expire URLs when "Include base URL in expires" is enabled.
Comment #17
yannisc commented
It worked for me also with the dev version of Expire.
I suggest adding an entry about the crawler to the documentation.
Comment #18
bgm commented
Thanks, marking this issue as fixed.
Created a short handbook page here: http://drupal.org/node/1736568
Please review/improve.
Thoughts on how to re-crawl the base site on a full cache flush? (I think 6.x-1.x used the Drupal 'menu' entries. I also saw someone mention xmlsitemap once, which was an interesting idea, although minimally we should systematically re-crawl the frontpage.) But that's for a new issue ;)
Comment #20
Mario Baron commented
Following all available instructions and using the dev version of the Expire module, I still can't get expired pages to regenerate on cron.
Boost deletes the expired pages but they are never regenerated until anonymous visitor first hits them.
I've also looked at the https://drupal.org/node/1785292 thread and now I am seriously confused.
Does boost crawler only generate new/edited pages or all expired pages?
Boost is an awesome module. I'd say it's gotta be somewhere very high in my top 10 Drupal modules, and the performance gain for anonymous users is incredible, but we really need the Boost crawler to regenerate expired pages on cron.
Comment #21
Anonymous (not verified) commented
The crawler is misnamed, which is explained here (any idea who edited the front page of the project? It should point to the explanation, not this thread):
#1785292-16: Cron Crawler Not Running
If you want a never-expiring cache, there is code in that thread that can help, along with instructions for wget and cron. Although, if your pages have gotten to the point where they expired, then no one is visiting or editing them, so chances are a search engine spider would regenerate them regularly.
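As a rough illustration of the wget-and-cron approach mentioned here, a wrapper script along these lines could be scheduled hourly. The domain, paths, and flags are assumptions, and the sketch only prints the command it would run:

```shell
#!/bin/sh
# warm-cache.sh -- hypothetical hourly cache warmer for cron.
# Example crontab entry:  0 * * * * /path/to/warm-cache.sh
# The site URL is a placeholder; pass your real domain as $1.
SITE="${1:-http://example.com/}"
# -r recurse links, -nd no directory tree, --delete-after discard
# the fetched files, -q keep cron mail quiet
CMD="wget -r -nd --delete-after -q $SITE"
echo "would run: $CMD"   # drop this echo (run $CMD directly) to actually crawl
```

Because the requests go over HTTP through the normal domain, Boost serves or regenerates its cache files exactly as it would for a regular visitor.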
Comment #22
vacilando commented
@Mario Baron -- check out this recipe for Boost & Crawler I put together recently. It works perfectly for me.