We have a music website where we import our artists' Twitter, Instagram and Facebook feeds using the Feeds module. This creates tens of thousands of nodes which all end up in the Boost Crawler queue. However, those nodes are only displayed via Views and don't have node pages of their own (thanks to Rabbit Hole).

It would be nice if we could exclude all such nodes from Boost crawling. Perhaps it makes sense to introduce a setting that allows nodes to be included or excluded based on their content type.

Comments

Anonymous

There is a setting on the main page of the Boost configuration that enables the exclusion of files. Whether it applies here depends on the URLs.

The "crawler" is not really a spider; it generates cached pages from the queue when it detects changed content. If you plug in a module with frequent updates, it's going to do exactly that. It's probably better to turn off the crawler, set low Boost and cron times, and select the "remove stale files from cache" option on cron runs. Boost isn't really designed for a situation where you get a large amount of changing content from something like Facebook. The better option would probably be to iframe the Facebook content, so the page stays statically correct while the new content is up to date, or to limit the content to specific excluded pages.

jamix

Status: Active » Closed (works as designed)

Thanks. We ultimately solved this at the Expire level by implementing hook_expire_cache_alter(). The hook implementation skips cache expiration for node pages that have been disabled through Rabbit Hole:

/**
 * Implements hook_expire_cache_alter().
 */
function mymodule_expire_cache_alter(&$expire, $node, $paths) {
  // Do not expire node pages that have been disabled with Rabbit Hole.
  if (module_exists('rabbit_hole') && rabbit_hole_get_action('node', $node) != RABBIT_HOLE_DISPLAY_CONTENT) {
    $expire = array();
  }
}
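
For sites that don't use Rabbit Hole, the content-type idea from the original request could probably be handled in the same hook. The following is only a rough sketch, not something we run: the module name mymodule_feeds and the content type machine names are made up for illustration, and it assumes the same hook signature as the implementation above.

```php
<?php

/**
 * Implements hook_expire_cache_alter().
 *
 * Hypothetical variant: skips expiration based on content type instead of
 * the Rabbit Hole action. "mymodule_feeds" is an assumed module name.
 */
function mymodule_feeds_expire_cache_alter(&$expire, $node, $paths) {
  // Assumed machine names of the feed-imported content types.
  $excluded_types = array('tweet', 'instagram_post', 'facebook_post');

  // Do not expire node pages for content types that have no pages of their own.
  if (isset($node->type) && in_array($node->type, $excluded_types, TRUE)) {
    $expire = array();
  }
}
```

As with the Rabbit Hole version, emptying $expire means nothing is expired for that node at all, so this only makes sense if the Views listing those nodes are refreshed some other way.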