Boost 7.x: crawler to regenerate the cache automatically

The Boost cache is generated when an anonymous user first visits a web page. The cache is then cleared when new content is published (if you have enabled the Expire module) or when cron runs.

However, depending on the type of traffic your site receives, you may want pages to be re-cached automatically when their cache expires. For example, a complex page could take over 10 seconds to render. Visitors who occasionally land on it while it is uncached will wonder why it is so slow, and your Google ranking may suffer if the search crawler finds that your website responds slowly.

This is where the Boost crawler can be helpful: it will automatically regenerate the cache for expired pages.

Installation

Requirements:

httprl module
expire module (later than 7.x-1.0-alpha3; as of August 2012, this means the "dev" snapshot)

To enable the crawler:

Go to your modules page and enable the "Boost Crawler" module (boost_crawler).
Go to Admin > Configure > System > Boost > Crawler and tick the checkbox that enables the crawler to run on cron.

How it works

When content is updated, the Expire module may invoke a hook in boost_crawler to tell it which pages were flushed from the cache. The crawler adds these pages to a queue of tasks (Drupal Queue API), which is processed on cron in batches lasting at most 30 seconds.
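
The queue-and-time-budget behaviour described above can be illustrated with a minimal sketch. This is not the boost_crawler code (which is PHP and uses the Drupal Queue API); it is a language-neutral illustration in Python, and the names enqueue_expired, run_cron_batch, fetch and clock are hypothetical, chosen only for this example.

```python
import time
from collections import deque

# Per-cron-run limit, matching the 30 seconds mentioned in the text.
TIME_BUDGET_SECONDS = 30


def enqueue_expired(queue, urls):
    """Append the URLs of flushed pages to the pending queue (hypothetical helper)."""
    for url in urls:
        queue.append(url)


def run_cron_batch(queue, fetch, budget=TIME_BUDGET_SECONDS, clock=time.monotonic):
    """Re-request queued pages until the queue is empty or the budget is spent.

    Returns the URLs handled in this run. Anything still in the queue
    simply waits for the next cron run.
    """
    deadline = clock() + budget
    processed = []
    while queue and clock() < deadline:
        url = queue.popleft()
        fetch(url)  # requesting the page is what would cause it to be cached again
        processed.append(url)
    return processed
```

In this sketch a caller would create the queue with deque(), pass the flushed URLs to enqueue_expired, and call run_cron_batch once per cron run with a function that requests a page. The time check happens before each page is taken from the queue, so a single slow page can push a run slightly past the budget; the remaining pages are picked up on later runs.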

Historical notes

Since Boost 7.x-1.x, the crawler lives in a separate submodule, boost_crawler, which was added to the 7.x-1.x branch in version 7.x-1.0-beta2. The crawler in 6.x-1.x had become complex and hard to debug. For now, we aim to have a minimal crawler that is "good enough" for most websites. As of 7.x-1.0-beta2, boost_crawler has around 100 lines of code.

Important issues from the issue queue: