Hey all,
I'm new to Drupal and have just developed a small site on Drupal 7. I am seeing very sluggish response times from my shared hosting provider, so I started playing around with caching options. Boost is by far the best option for this kind of site, which sees very few hits at the moment. Response times went from 6+ seconds to 200 ms, which is wonderful! The issue is that this obviously only applies when cache files exist, and at the moment they only seem to get generated when a page is requested by a visitor. I know version 6 of Boost included a crawler to generate cache files, but it seems this component hasn't made it to the 7 dev release just yet? Or have I configured something incorrectly?
Anyone have any other ideas for a crawling module in the meantime? I'm not looking to do anything other than load every page on the site (less than 50 for now) every hour or so...
Comments
Comment #1
mikeytown2 commented
The crawler in 6.x needs a lot of love; thus it hasn't made its way to 7.x.
For a single-threaded version (which will likely time out, because your shared host kills long-running scripts), try the code below and have cron call it.
BTW, this script will bypass the boost cache ;)
Comment #2
timfarley commented
Cheers, Mikeytown!
That seems to do the trick, for now at least. The first run took 183,000+ ms, and it didn't seem to time out for me on Bluehost. We'll see whether it holds up, especially if the number of nodes increases drastically, but for now all looks good. Thanks again.
Comment #3
mikeytown2 commented
Note to self:
->addTag('node_access')
Comment #4
NPC commented
Subscribing.
Comment #5
mikeytown2 commented
2 more notes:
http://www.leveltendesign.com/blog/randall-knutson/cron-queues-processin...
http://drupal.org/node/1138098#comment-4566670
Comment #6
emmeade commented
I created a module and used hook_cron with the code in #1. It looks like the pages are crawled but they are not added to the boost cache. Is there a different way to set this up with cron?
Comment #7
johnlutz commented
I am also trying to get this to work, but I can't figure out how. Like emmeade, I got the code from above working so that files are cached in the cache directory, but when you go to the pages initially it isn't pulling up the cached version.
Comment #8
jamescook commented
Why not just use a cron call to wget?
e.g.
wget -r -nd --delete-after http://example.com
Comment #9
mikeytown2 commented
HTTPRL is the way forward with this. If/when I get around to making a Boost submodule to do crawling, it will require HTTPRL.
Comment #10
voodootea commented
This is great (although it did cause my VPS to die from too much memory use). However, the script tries to access pages using the server's IP address rather than the domain name, which results in the following folder in my boost cache directory:
/cache/normal/xxx.xxx.xxx.xxx/
rather than
/cache/normal/domain.com/
which is what most visitors would use to access the site...
Any ideas on how to modify the script to make it crawl as a 'normal' user?
Many thanks
Comment #11
x3cion commented
Try replacing $_SERVER['HTTP_HOST'] with the real address ("example.com", for example).
Didn't test it.
Comment #12
ofir commented
I solved this using a script and cron. It requires the XML Sitemap module. I found this script: http://beajar.blogspot.co.uk/2011/05/quick-unix-shell-script-to-crawl-xm... and modified it to my requirements.
On my shared hosting it takes 3,000 ms to build a cached page and 500 ms to fetch a cached page. I run this once an hour.
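The linked script isn't reproduced in this comment; purely as an illustration of the same idea, here is a minimal sketch that pulls each <loc> URL out of an XML Sitemap file and requests it. The file name and URLs below are placeholders, and the actual HTTP fetch is stubbed out with an echo:

```shell
# extract_sitemap_urls: print each <loc> URL found in the sitemap file given as $1
extract_sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's|<loc>||' -e 's|</loc>||'
}

# Demo with a tiny inline sitemap (URLs are placeholders):
cat > /tmp/sitemap_demo.xml <<'EOF'
<urlset>
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
EOF

# Request each URL so Boost can write a fresh cache file; here we just echo.
extract_sitemap_urls /tmp/sitemap_demo.xml | while read -r url; do
  echo "crawl: $url"        # stand-in for: wget -q -O /dev/null "$url"
done
```

Swapping the echo for a real `wget -q -O /dev/null "$url"` makes it actually warm the cache; point it at the sitemap your XML Sitemap module generates.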
Comment #13
bgm commented
I wrote a small module, "boost_crawler", included as a submodule of Boost, which can crawl pages (using httprl) that have been flushed by the 'expire' module.
* the expire hook queues the URLs for the crawler (based on comment #5)
* the cron queue runs after the hook_cron() calls, which means that Boost cache clear also happens before we start crawling.
* for now, httprl calls are not async. I kind of like being able to limit, this way, how much crawling we do per run (a hook tells the cron queue how much time we want to spend crawling, and the queue takes care of keeping an eye on that).
It's nowhere near what the crawler did in 6.x-1.x, but it's a start. The main use case is for editors who update content: the cache gets deleted, and we should immediately regenerate the cache.
TODO: the boost block should show whether the page is queued for crawling, and how many items are in the queue.
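The time-budget idea described above could be sketched roughly like this. The function name, file paths, and the 30-second budget are assumptions, and the HTTP fetch is stubbed out with an echo:

```shell
# crawl_queue FILE BUDGET_SECS -- sketch of a time-budgeted crawl queue:
# process queued URLs until the time budget is spent, then stop.
crawl_queue() {
  queue="$1"; budget="$2"
  start=$(date +%s)
  while read -r url; do
    # Stop crawling once the budget is exhausted; remaining URLs wait
    # for the next cron run.
    [ $(( $(date +%s) - start )) -ge "$budget" ] && break
    echo "crawl: $url"   # stand-in for an HTTP fetch that re-warms the cache
  done < "$queue"
}

# Demo (URLs and queue file are placeholders):
printf 'http://example.com/\nhttp://example.com/about\n' > /tmp/boost_queue_demo.txt
crawl_queue /tmp/boost_queue_demo.txt 30
```

The actual module uses Drupal's cron queue and httprl rather than a shell loop; this only illustrates the budget mechanism.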
Comment #14
volker23 commented
I tried the dev, but with no success. Boost stopped working completely. Went back to beta-1, and everything worked again. Bummers.
Comment #15
bgm commented
Hmm, any chance you could provide more information? Anything in the watchdog? Apache error log?
Comment #16
jamix commented
A quick test over here worked fine. On node change, Expire expired the node page and Boost removed the cached copy of it. On a subsequent cron run, the cached copy was re-created by the crawler.
@Volker23: Make sure you use the latest dev of Expire. There's a bug in alpha3 that makes it create invalid expire URLs: #1471926: Invalid expire URLs when "Include base URL in expires" is enabled.
Comment #17
yannisc commented
It worked for me also with the dev version of Expire.
I suggest adding an entry about the crawler to the documentation.
Comment #18
bgm commented
Thanks, marking this issue as fixed.
Created a short handbook page here: http://drupal.org/node/1736568
Please review/improve.
Thoughts on how to re-crawl the base site on a full cache flush? (I think 6.x-1.x used the Drupal 'menu' entries. I also saw someone mention xmlsitemap once, which was an interesting idea, although minimally we should systematically re-crawl the frontpage.) But that's for a new issue ;)
Comment #20
Mario Baron commented
Following all available instructions and using the dev version of the Expire module, I still can't get expired pages to regenerate on cron.
Boost deletes the expired pages but they are never regenerated until anonymous visitor first hits them.
I've also looked at the https://drupal.org/node/1785292 thread and now I am seriously confused.
Does boost crawler only generate new/edited pages or all expired pages?
Boost is an awesome module. I'd say it's gotta be somewhere very high in my top 10 Drupal modules, and the performance gain for anonymous users is incredible, but we really need the Boost crawler to regenerate expired pages on cron.
Comment #21
Anonymous (not verified) commented
The crawler is misnamed, which is explained here (any idea who edited the front page of the project? It should point to the explanation, not this thread):
#1785292-16: Cron Crawler Not Running
If you want a never-expiring cache, there is code in that thread that can help, along with instructions for wget and cron. Although, if your pages have gotten to the point where they expired, then no one is visiting or editing them, so chances are a search engine spider would regenerate them regularly.
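As a rough illustration of the wget-and-cron approach mentioned here, a wrapper script along these lines could be scheduled hourly. The domain, paths, and flags are assumptions, and the sketch only prints the command it would run:

```shell
#!/bin/sh
# warm-cache.sh -- hypothetical hourly cache warmer for cron.
# Example crontab entry:  0 * * * * /path/to/warm-cache.sh
# The site URL is a placeholder; pass your real domain as $1.
SITE="${1:-http://example.com/}"
# -r recurse links, -nd no directory tree, --delete-after discard
# the fetched files, -q keep cron mail quiet
CMD="wget -r -nd --delete-after -q $SITE"
echo "would run: $CMD"   # drop this echo (run $CMD directly) to actually crawl
```

Because the requests go over HTTP through the normal domain, Boost serves or regenerates its cache files exactly as it would for a regular visitor.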
Comment #22
vacilando commented
@Mario Baron -- check out this recipe for Boost & Crawler I put together recently. It works perfectly for me.