Is the crawler falling behind on recrawling expired content?

dankohn - December 28, 2009 - 06:57
Project: Boost
Version: 6.x-1.17
Component: Expiration logic
Category: support request
Priority: normal
Assigned: Unassigned
Status: active
Description

Hi, over a 24-hour period, my site is building up a 1.7 GB cache. I expect this to grow significantly when Google does a deep re-index next month and finds that our therapist directory is larger than it currently thinks.

My main question is about having 14,208 pages in the cache and 8,566 pages expired. I'm particularly curious about the setting "Do not flush expired content on cron run, instead recrawl and overwrite it." Is this appropriate for my site?

On every cron run, I get a "Crawler already running" message. Does this mean that the crawler is running far behind where it needs to be and will never catch up? If you could offer any suggestions on settings, I'd greatly appreciate it. I'm attaching a PDF of my current settings.

Boost is an amazing module, but the number of knobs to tweak can be intimidating.

Attachment    Size
boost.pdf     267.6 KB

#1

mikeytown2 - December 29, 2009 - 23:37

It does appear that there is no way to recrawl your entire site in the 12-hour time frame given. The crawler would have to generate 4 pages per second in order to keep up. According to the stats on the boost configuration page, it generates around 2 pages per second for your site. If this is the case, then the crawler will be "stuck", always crawling because there is always expired content.
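To make the arithmetic concrete, here is a rough sketch of the check in Python. This is not anything Boost runs; the function names are made up for illustration, and the total page count is an assumption, since the thread only gives the two rates (4 pages per second needed, about 2 observed) and the 12-hour lifetime. The 172,800 figure is simply what 4 pages per second works out to over 12 hours.

```python
# Back-of-the-envelope check: can the crawler regenerate every page
# before the cache lifetime runs out? All names here are illustrative.

def required_rate(total_pages, lifetime_hours):
    """Pages per second needed to regenerate every page once per lifetime."""
    return total_pages / (lifetime_hours * 3600)


def pages_per_window(rate_per_second, lifetime_hours):
    """How many pages a crawler at the given rate can cover in one lifetime."""
    return rate_per_second * lifetime_hours * 3600


def keeps_up(observed_rate, total_pages, lifetime_hours):
    """True if the observed crawl rate is enough to avoid a permanent backlog."""
    return observed_rate >= required_rate(total_pages, lifetime_hours)


# ASSUMPTION: page total back-derived from "4 pages per second over 12 hours".
TOTAL_PAGES = 172_800
LIFETIME_HOURS = 12
OBSERVED_RATE = 2  # pages per second, from the Boost stats mentioned above

if __name__ == "__main__":
    print("needed pages/sec:", required_rate(TOTAL_PAGES, LIFETIME_HOURS))
    print("pages covered at observed rate:", pages_per_window(OBSERVED_RATE, LIFETIME_HOURS))
    print("crawler keeps up:", keeps_up(OBSERVED_RATE, TOTAL_PAGES, LIFETIME_HOURS))
```

Whenever the needed rate is above the observed rate, each cron run will find the previous crawl still in progress, which would be consistent with the "Crawler already running" message.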

How often does your site change? If some pages never change, setting their expiration to a very high number (like 52 weeks) might be ideal. If you've set up different node types, then this should be easy to do. Enable the boost configuration block (http://drupal.org/node/545908#blocks), set the scope to content type, and set the expiration time. This will keep those parts of your site from expiring every 12 hours. Experiment with it; it sounds like your site needs it. The alternative approach is to get a list of the pages that change all the time, make the default 52 weeks, and then set those pages to expire every hour or so.
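As a rough illustration of why splitting the expiration times helps, the sketch below compares the steady-state crawl rate for one uniform 12-hour lifetime against a split into long-lived and short-lived pages. The page counts in both scenarios are hypothetical, picked only to show the shape of the calculation, and the helper name is made up. It also ignores the one-time cost of the initial crawl.

```python
# Steady-state crawl load: each group of pages must be regenerated once
# per lifetime, so the needed rate is the sum of pages / lifetime.
# All page counts below are HYPOTHETICAL, not taken from the site.

HOUR = 3600
WEEK = 7 * 24 * HOUR


def steady_state_rate(groups):
    """groups: iterable of (page_count, lifetime_seconds) pairs.

    Returns the pages per second needed to keep every group fresh.
    """
    return sum(pages / lifetime for pages, lifetime in groups)


# Scenario A: everything expires after 12 hours.
uniform = [(172_800, 12 * HOUR)]

# Scenario B: most pages rarely change (52 weeks); a few expire hourly.
split = [
    (170_000, 52 * WEEK),  # e.g. static directory listings
    (2_800, 1 * HOUR),     # e.g. front page and frequently edited nodes
]

if __name__ == "__main__":
    print("uniform lifetime, pages/sec:", steady_state_rate(uniform))
    print("split lifetimes, pages/sec:", round(steady_state_rate(split), 3))
```

With numbers like these, the split setup needs well under the roughly 2 pages per second the crawler manages, while the uniform 12-hour setup needs twice that.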

 
 
