Project: Boost
Version: 7.x-1.x-dev
Component: Cron Crawler
Category: feature request
Priority: minor
Assigned: Unassigned
Status: active

Issue Summary

Hey all,

I'm new to Drupal and have just developed a small site on Drupal 7. I am seeing very sluggish response times from my shared hosting provider, so I started playing around with caching options. Boost is by far the best option for this kind of site, which sees very few hits at the moment. Response times went from 6+ seconds to 200ms, which is wonderful! The catch is that this only holds when cache files exist, and at the moment they only seem to get generated when a visitor requests a page. I know version 6 of Boost included a crawler to generate cache files, but it seems this component hasn't made it to the 7 dev release just yet? Or have I configured something incorrectly?

Anyone have any other ideas for a crawling module in the meantime? I'm not looking to do anything other than load every page on the site (less than 50 for now) every hour or so...

Comments

#1

The crawler in 6.x needs a lot of love; thus it hasn't made its way to 7.x.

For a single-threaded version, which will likely time out because your shared host kills long-running scripts, try the code below and have cron call it.

<?php
// Bootup Drupal.
define('DRUPAL_ROOT', getcwd());
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Get the nids of all published nodes, newest first.
$nids = db_select('node', 'n')
  ->fields('n', array('nid'))
  ->condition('status', 1)
  ->orderBy('n.created', 'DESC')
  ->execute()
  ->fetchCol();

// Set request headers.
$options = array(
  'headers' => array(
    'Host' => $_SERVER['HTTP_HOST'],
    'Cookie' => 'DRUPAL_UID=0',
  ),
);

// Output number of urls we will hit (every node plus the front page).
echo (count($nids) + 1) . " urls to hit<br>\n";
echo "<br>\n";

// Hit Frontpage.
$url = 'http://' . $_SERVER['SERVER_ADDR'] . url('<front>', array('absolute' => FALSE));
$data = drupal_http_request($url, $options);
echo '1 - ' . $url . ' ...' . $data->code . "<br>\n";

// Hit every node's URL.
foreach ($nids as $count => $nid) {
  $url = 'http://' . $_SERVER['SERVER_ADDR'] . url('node/' . $nid, array('absolute' => FALSE));
  // Keep the response so the status code printed below belongs to this URL.
  $data = drupal_http_request($url, $options);
  echo ($count + 2) . ' - ' . $url . ' ...' . $data->code . "<br>\n";
}

// Output time.
echo "<br>\n";
echo 'Total Time: ' . timer_read('page') . "ms <br>\n";
?>

BTW this script will bypass the boost cache ;)
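
If the timeout does bite as the node count grows, one workaround is to crawl a slice of the list on each cron run instead of everything at once. The following is only a minimal sketch of the slicing logic, not something Boost ships: the function name `example_crawl_next_batch()` and the per-run limit are hypothetical, and it assumes the offset is stored between runs somewhere persistent (in Drupal 7, `variable_get()` and `variable_set()` would be one place).

```php
<?php
/**
 * Hypothetical helper: pick the slice of node IDs to crawl on this run.
 *
 * @param array $nids   All node IDs to crawl, in crawl order.
 * @param int   $offset Position reached by the previous run.
 * @param int   $limit  Maximum number of pages to request per run.
 *
 * @return array array(batch of nids, offset to store for the next run).
 */
function example_crawl_next_batch(array $nids, $offset, $limit) {
  $total = count($nids);
  // Start over once the previous run reached the end of the list.
  if ($offset >= $total) {
    $offset = 0;
  }
  $batch = array_slice($nids, $offset, $limit);
  return array($batch, $offset + count($batch));
}

// Example: 5 nodes, at most 2 requests per run, previous run stopped at 4.
list($batch, $next) = example_crawl_next_batch(array(11, 12, 13, 14, 15), 4, 2);
// $batch is array(15) and $next is 5, so the following run wraps to the start.
```

Each run would then request only the nids in `$batch` and save `$next` for the following run, which keeps any single run short at the cost of a slower full pass over the site.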

#2

Cheers, Mikeytown!

That seems to do the trick for now at least. First run took 183,000+ ms - and it didn't seem to time out for me on Bluehost. We'll see if it works into the future, especially if the number of nodes increases drastically, but for now all looks good. Thanks again.

#3

note to self:
->addTag('node_access')
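
In context, that tag would go on the node query from #1, so that node access modules can filter out nodes the current user is not allowed to see. A minimal sketch, assuming a fully bootstrapped Drupal 7 site; the wrapper function name `example_crawl_accessible_nids()` is hypothetical and only there to keep the snippet in one piece:

```php
<?php
/**
 * Hypothetical helper: nids of published nodes that pass node access checks.
 * Only meaningful after a full Drupal 7 bootstrap, where db_select() exists.
 */
function example_crawl_accessible_nids() {
  return db_select('node', 'n')
    ->fields('n', array('nid'))
    ->condition('n.status', 1)
    ->orderBy('n.created', 'DESC')
    // Let node access modules restrict the result set for the current user.
    ->addTag('node_access')
    ->execute()
    ->fetchCol();
}
```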

#4

Subscribing.

#6

I created a module and used hook_cron with the code in #1. It looks like the pages are crawled but they are not added to the boost cache. Is there a different way to set this up with cron?

<?php
// Bootup Drupal.
define('DRUPAL_ROOT', getcwd());
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

function hslcrawl_cron(){
  // Get all published nodes.
  $nids = db_select('node', 'n')
    ->fields('n', array('nid'))
    ->condition('status', 1)
    ->orderBy('n.created', 'DESC')
    ->execute();

  // Set request headers.
  $options = array(
    'headers' => array(
      'Host' => $_SERVER['HTTP_HOST'],
      'Cookie' => 'DRUPAL_UID=0',
    ),
  );

  // Output number of urls we will hit.
  // count() does not work on the query result object; use its row count.
  echo ($nids->rowCount() + 1) . " urls to hit<br>\n";
  echo "<br>\n";

  // Hit Frontpage.
  $url = 'http://' . $_SERVER['HTTP_HOST'] . url('<front>', array('absolute' => FALSE));
  $data = drupal_http_request($url, $options);
  echo '1 - ' . $url . ' ...' . $data->code . "<br>\n";

  // Hit every nodes URL.
  foreach ($nids as $count => $nid) {
    $url = 'http://' . $_SERVER['HTTP_HOST'] . url('node/' . $nid->nid, array('absolute' => FALSE));

    // Keep the response so the status code logged below belongs to this URL.
    $data = drupal_http_request($url, $options);
    echo $count+2 . ' - ' . $url . ' ...' . $data->code . "<br>\n";
    $pagecount = $count+2;
    $pagecode = $data->code;
    drupal_set_message(t(' @pagecount %url @pagecode', array('%url' => $url,'@pagecount' => $pagecount,'@pagecode' => $pagecode)));
    watchdog('hslcrawl',' @pagecount %url @pagecode', array('%url' => $url,'@pagecount' => $pagecount,'@pagecode' => $pagecode));
  }

  // Output time.
  echo "<br>\n";
  echo "Total Time: " . timer_read('page') . "ms <br>\n";
  $timer = timer_read('page');

  drupal_set_message(t('Crawler Total Time: @count', array('@count' => $timer)));
  watchdog('hslcrawl', 'Total Time: %timer', array('%timer' => timer_read('page')));
}