If this could auto-generate the cached file after it expires (push instead of pull), that would be nice.
Various checkboxes would be nice as well, such as:
Homepage
Primary Links
Secondary Links
All
Custom (with a textarea below)
| Comment | File | Size | Author |
|---|---|---|---|
| #23 | boost_crawler.php.txt | 8.67 KB | mikeytown2 |
| #23 | boost_crawler_stats.php.txt | 1.45 KB | mikeytown2 |
| #18 | boost_crawler.php.txt | 9.26 KB | mikeytown2 |
| #12 | cron.php.txt | 8.1 KB | mikeytown2 |
| #8 | cron.php.txt | 9.17 KB | mikeytown2 |
Comments
Comment #1
Terko CreditAttribution: Terko commented
I think it would be nice to specify a few items with different cache lifetimes. For example, 6 hours for all content but 1 hour for index.html.
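The idea above can be sketched as a small lookup. This is only an illustration; the paths and lifetimes are placeholders, not part of the boost module.

```php
<?php
// Sketch: per-path cache lifetimes. Paths and lifetimes are placeholders.
function cache_lifetime($path, array $overrides, $default) {
  // Return the per-path override if one exists, else the site-wide default.
  return isset($overrides[$path]) ? $overrides[$path] : $default;
}

$overrides = array('index.html' => 3600); // 1 hour for the front page
$default   = 21600;                       // 6 hours for everything else

echo cache_lifetime('index.html', $overrides, $default), "\n";   // 3600
echo cache_lifetime('node/42.html', $overrides, $default), "\n"; // 21600
```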
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
Made a crawler, but because of PHP timeouts I can't crawl my entire site in one shot, so this script saves its state to disk and reloads the page. Running it from a web browser works fine, but I doubt it will work when run from cron. How can you call another file from cron to keep the processing going? Some sort of system call? In other words, what does my host use when calling cron.php? I would like to have the script call itself until it's done regenerating the cache.
http://mcapewell.wordpress.com/2006/09/02/calling-php-from-a-cron-job/
Also, does anyone know of a better URL parser?
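The save-state-and-reload approach described above can be sketched roughly as follows. The state file name, batch size, and seed URL are placeholders, and the live driver is shown commented out because it performs network and file I/O.

```php
<?php
// Sketch of the save-state-and-reload idea: crawl a few URLs per request,
// persist the remaining queue, then redirect to this same script until the
// queue is empty. Helper is illustrative, not the attached script.
function take_batch(array &$queue, $n) {
  // Remove and return the first $n entries from the queue.
  return array_splice($queue, 0, $n);
}

// Driver (placeholder file name and seed URL), web context assumed:
// $state = 'crawl_state.txt';
// $queue = file_exists($state)
//   ? unserialize(file_get_contents($state))
//   : array('http://example.com/');
// foreach (take_batch($queue, 5) as $url) {
//   @file_get_contents($url);           // fetching the page primes the cache
// }
// if ($queue) {
//   file_put_contents($state, serialize($queue));
//   header('Location: ' . $_SERVER['PHP_SELF']); // reload before the timeout
// }
// else {
//   @unlink($state);                    // all done
// }
```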
Save this as test_cache.php inside the junk folder
The above code could be modified so that it only crawls an array and doesn't go looking for more links, thus pre-caching certain nodes. Throw in a menu DB call and we are crawling Primary & Secondary Links. The only thing is, the code is meant to be run outside of the Drupal system, so it might need to be reworked if one wanted to do selective pre-caching using Drupal assets.
EDIT Feb, 7th 2008 2:12 -8 GMT : Fixed a couple of errors
EDIT Feb, 7th 2008 2:39 -8 GMT : Better output IMHO
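A minimal sketch of the selective pre-caching idea mentioned above (crawl a fixed array instead of discovering links). The base URL and paths are placeholders, not part of the attached script.

```php
<?php
// Sketch: fetch a fixed list of paths instead of spidering for links.
// Requesting each page causes boost to write its static HTML file.
function precache(array $paths, $base) {
  $done = array();
  foreach ($paths as $path) {
    if (@file_get_contents($base . $path) !== FALSE) {
      $done[] = $path;  // record which pages were successfully primed
    }
  }
  return $done;
}

// Usage (placeholder URLs):
// precache(array('', 'node/1', 'about'), 'http://example.com/');
```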
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
This can be run as a PHP cron job on GoDaddy.
EDIT - Feb 8, 2009 3:17am -8GMT: Writes to log file at end of run.
EDIT - Feb 8, 2009 4:26am -8GMT: Better formatting of log file.
EDIT - Feb 8, 2009 4:35am -8GMT: Remove file warnings.
EDIT - Feb 8, 2009 5:53am -8GMT: Better code comments & screen output.
EDIT - Feb 10, 2009 1:40am -8GMT: Make script more robust.
Comment #4
mikeytown2 CreditAttribution: mikeytown2 commented
Use this file when running from cron. The above code was having trouble calling itself when first called from the command line, so I made the code below to go to the above script's URL. I call this cron.php.
BTW, I am using this right now on a live site and it's working wonders! In my cron manager the first cron job to run is the Drupal one, and it kills the cache. The second one calls the script above, which regenerates the entire site; the cache is primed and the site is fast! My live site on a shared host is faster than my dev site on my local box, and my box isn't that slow. This is also the fastest way to prime the cache, because the TCP/IP packets don't have to travel; they stay right on the server. It now takes only about 4 seconds to generate a page on my site; via the web (TCP/IP) it can take double that, using a tool like GSiteCrawler. I think the basic functionality is there.
Future:
Make it so the script only runs if it's being called by its own server or a pre-set IP (to prevent a lucky spider/user from hogging the CPU).
Merge the 2 scripts (detect being called via cron vs. browser).
Better-looking code in the crawler function, or integrate an external one (the code looks ugly).
Integrate with Drupal (only pre-cache certain pages, etc.).
Be a part of the boost distribution?
Other Ideas???
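The first item on the list above (only run when called by the server itself or a pre-set IP) might look roughly like this; the admin IP is a placeholder, and the live check is shown commented out since it reads $_SERVER.

```php
<?php
// Sketch: allow the crawler to run only when the caller is the server
// itself or an explicitly configured IP address.
function caller_allowed($remote_addr, $server_addr, array $extra_ips) {
  // True when the request comes from this host or an allow-listed IP.
  return $remote_addr === $server_addr
      || in_array($remote_addr, $extra_ips, TRUE);
}

// In the crawler entry point (web context assumed; placeholder admin IP):
// if (!caller_allowed($_SERVER['REMOTE_ADDR'], $_SERVER['SERVER_ADDR'],
//                     array('203.0.113.7'))) {
//   exit('CRON script can only be run via system');
// }
```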
Comment #5
mikeytown2 CreditAttribution: mikeytown2 commented
The script is back down to 1 file, and it only runs if it's called by itself or from a user-entered IP address.
Comment #6
mikeytown2 CreditAttribution: mikeytown2 commented
Updated, cosmetic changes. I think I'm done.
Comment #7
mths CreditAttribution: mths commented
Subscribe.
looks awesome, definitely going to test & use.
Comment #8
mikeytown2 CreditAttribution: mikeytown2 commented
Started to clean it up... the next step is to get it down to 1 temp file, then break the structure into more functions.
Comment #10
rsvelko CreditAttribution: rsvelko commented
Nice script, yes.
Question: what does Arto think about this new mode of operation/ideology for boost? Correct me if I am wrong, but we took this path just recently.
I ask the above question because there are not enough docs on that matter.
Comment #11
rsvelko CreditAttribution: rsvelko commented
2nd question: what is the time needed to run this script for, say, 1000 nodes? Per node/page?
Comment #12
mikeytown2 CreditAttribution: mikeytown2 commented
Time per page depends on your server. I do 2,000 pages in 80-90 minutes most of the time. It's the fastest way to generate the pages. There should be a log.txt file in the same directory that has some interesting stats, like which URL took the longest to generate.
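The per-URL timing stats mentioned here could be collected along these lines. The log format and URLs are assumptions, not the attached script's actual output, and the fetching loop is shown commented out since it performs network I/O.

```php
<?php
// Sketch: time each page fetch and report the slowest URL.
function slowest(array $timings) {
  arsort($timings);              // sort by elapsed seconds, descending
  $url = key($timings);          // first key is now the slowest URL
  return array($url, $timings[$url]);
}

// Collecting timings around each crawl (placeholder URLs):
// $timings = array();
// foreach ($urls as $url) {
//   $t0 = microtime(TRUE);
//   @file_get_contents($url);
//   $timings[$url] = microtime(TRUE) - $t0;
// }
// list($url, $secs) = slowest($timings);
// file_put_contents('log.txt',
//   sprintf("slowest: %s (%.2fs)\n", $url, $secs), FILE_APPEND);
```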
Comment #13
rsvelko CreditAttribution: rsvelko commented
Code improvement TODOs:
1. Better comments.
2. Function names in the Drupal way: main functions like name_of_func() and helper functions with a "_" in front.
3. Rename the file to something like boost_build_html_cache.php.
4. Make it smarter so it does not need more configuration than necessary.
5. An idea: if you crawl the site to get the list of pages to cache (afaik yes), that seems like too much HTML fetching. Wouldn't it be faster to use the node table to get a list of all nodes (and probably the menu and/or term tables)?
5.1. Maybe the node table list can just work as a helper to the crawler, or could this technique make crawling unnecessary?
6. One more idea: some access log analysis (Google Analytics export) can help the script pick up just the pages worth caching... this seems faster too, I hope I am right.
Anyway, the code needs to be made more extensible.
7. And lastly: have you thought about using a ready-made PHP crawler implementation?
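Idea 5 above (use the node table instead of crawling) might be sketched like this. The query assumes Drupal 6 conventions ({node}, {url_alias}) and a bootstrapped Drupal providing db_query(); the helper function name is made up.

```php
<?php
// Sketch: build the crawl list from the database instead of fetching HTML.
function node_paths_sql() {
  // Published nodes, preferring the path alias when one exists.
  return "SELECT COALESCE(ua.dst, CONCAT('node/', n.nid)) AS path
          FROM {node} n
          LEFT JOIN {url_alias} ua ON ua.src = CONCAT('node/', n.nid)
          WHERE n.status = 1";
}

// Inside a bootstrapped Drupal 6:
// $result = db_query(node_paths_sql());
// while ($row = db_fetch_object($result)) {
//   $urls[] = $row->path;
// }
```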
Comment #14
mikeytown2 CreditAttribution: mikeytown2 commented
Found a related issue with code!
#337391: Setting to grab url's from url_alias table.
Comment #15
rsvelko CreditAttribution: rsvelko commented
Haha, this guy read my mind 3-4 months ago... he is doing exactly what I proposed above. I suggest using your code to help him if your code has something to deliver. Maybe for panel/views pages crawling will still be easier. And maybe for very dynamic pages, like views with exposed filters or search pages, we cache only on the if-accessed principle; in other words, we pre-cache all nodes and terms and sleep calmly at night :)
Marking this one as a complement to the other.
Comment #16
ferrangil CreditAttribution: ferrangil commented
Tested on my local box, and it generates a lot of static HTML files. I stopped the process as it might take a lot of time.
My site has around 60k nodes (and a few views).
Now my home page (and most of the other cached pages) are really outdated: cached 5 hours ago. What is the workaround here?
I can't clear all the files once an hour, as it takes more than 1 hour to generate them (and the idea is not to have the server creating thousands of pages, then deleting them, then recreating them...).
Maybe I should use the idea from the latest post and just cache a few hundred of the most accessed pages (rebuilding them soon after). That could work better.
Ideas?
Comment #17
mikeytown2 CreditAttribution: mikeytown2 commented
Identified 2 bottlenecks in the posted code:
The first can be fixed by using a foreach() and placing array_unique() at the end so it's not called every time an item is added. Easy fix.
The second can be fixed by passing the last key value and doing an array_slice() on it, with an array_merge() to bring the 2 arrays back together before the script ends. Harder fix.
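The first fix (avoid calling array_unique() on every insertion) can be sketched with a hash-map membership test instead; the helper name is illustrative.

```php
<?php
// Sketch: collect links using isset() on a hash map so no repeated
// de-duplication pass is needed while crawling.
function add_links(array &$seen, array $found) {
  foreach ($found as $url) {
    if (!isset($seen[$url])) {   // O(1) membership test per URL
      $seen[$url] = TRUE;
    }
  }
}

$seen = array();
add_links($seen, array('/a', '/b', '/a'));
add_links($seen, array('/b', '/c'));
// array_keys($seen) now holds the unique URLs in discovery order,
// with no array_unique() calls along the way.
```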
Comment #18
mikeytown2 CreditAttribution: mikeytown2 commentedRewrote crawler, it's now faster and shouldn't have scaling issues (double bonus).
Comment #19
mikeytown2 CreditAttribution: mikeytown2 commented
Can increase speed and reduce memory usage by killing the $urls_crawled array, since it's a counter now. A setting to control output and the usage of timers would make it more efficient. PHP doesn't support threading, but I could make it run multiple processes or roll my own. The easy thing to do is use a DB to keep track of what's been crawled, etc.; the hard thing would be to split up the crawling operation via a modulus operation. If not using a DB, each thread gets its own temp file, and the parent coordinates the child processes and combines their output before restarting itself. If using a DB, have 2 tables: a list of URLs, and a pointer. Each thread grabs 25 URLs and moves the pointer up by 25.
If I were to multi-thread this, I would bootstrap the Drupal DB, create 1 table, and keep the pointer in the variables table. This would allow multi-core boxes to be crawled using all their cores.
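The 25-URL pointer scheme described above, sketched with an in-memory pointer; in the actual proposal the pointer would live in Drupal's variables table behind a lock, which is omitted here.

```php
<?php
// Sketch: each worker claims the next batch of URLs by advancing a
// shared pointer past what it took.
function claim_batch(array $urls, &$pointer, $size = 25) {
  $batch = array_slice($urls, $pointer, $size);
  $pointer += count($batch);     // move the pointer past the claimed batch
  return $batch;
}

$urls = range(1, 60);            // stand-in for the URL table
$pointer = 0;                    // stand-in for the variables-table pointer
$a = claim_batch($urls, $pointer); // items 1-25
$b = claim_batch($urls, $pointer); // items 26-50
$c = claim_batch($urls, $pointer); // items 51-60 (short final batch)
```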
Comment #20
ferrangil CreditAttribution: ferrangil commentedHi,
I've been following the whole thread. In my case, I have around 75,000 nodes, some of them with views and large pagers (but I think those are not being cached, as I didn't change the ?page=1 in the URL (using Clean Pagination, for example)).
I now have the normal Boost module enabled and expiring every 45 minutes (which is a lot of time, especially as the site is dynamic: people upload materials and they get listed, so anonymous users see the same list for a while).
I was thinking of caching all my nodes, using either the crawler or the SELECT nid FROM node... approach shown in another thread. The thing is, would it be worth the effort to generate that many cached files "just in case"? MySQL would be eating CPU all the time. Maybe it's better to keep caching only the most popular pages, as is happening now. Maybe I could make a selection of the pages I want to have in the cache...
Suggestions?
Comment #21
mikeytown2 CreditAttribution: mikeytown2 commented
Note to self: DB or arrays; having pure worker threads that get/pass batch info to the parent is the way to go for MT. Table or file locks... for files, you might want to use append and get rid of serialization for faster reads/writes.
@ferrangil
#337391: Setting to grab url's from url_alias table is currently in the issue queue, so it will eventually get done. It will allow fine-grained control of the crawler with settings from #453426: Merge Cache Static into boost - Create GUI for database operations.
If you need something now rather than a couple of months away (in other words, when I get to it), then I can do some custom work in exchange for money. Whether that's a custom crawler (crawl based on past hits, content type, per-page settings, etc.), different content types having different expiration times, or something else, let me know.
Here's how I envision the issues getting fixed:
http://drupal.org/node/326515#comment-1796028
The answer to your question depends on how long it takes for a non-cached/boosted page to get served to your end user. If it takes too long (like 5 seconds), then having pages pre-cached is a good idea. If your server is generating pages fairly quickly, then crawling your server wouldn't change the end user's experience. In short, if boost is working for you as is, then don't sweat it.
Because you're supporting both anonymous and registered users, you might want to look into more advanced things like APC & memcache. If boost can't make your server fly, start with APC and go from there.
Comment #22
mikeytown2 CreditAttribution: mikeytown2 commented
Idea to make this work independently of the server's setup: have PHP return before it's done processing. This takes care of having to do a system call with a & at the end of it.
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
Code now works on Windows & Linux. It doesn't output info like before, so I made a separate program to print stats. Uses the above trick to do async execution.
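The "return before done" trick from #22 can be sketched as follows. The helper is hypothetical, and the live sequence is shown commented out since it manipulates headers and output buffers; where available, fastcgi_finish_request() is a cleaner route.

```php
<?php
// Sketch: send the full response, let the client disconnect, then keep
// crawling. This replaces backgrounding a shell command with "&".
function early_response_headers($body) {
  // Headers that let the client hang up once $body has been flushed.
  return array('Connection: close', 'Content-Length: ' . strlen($body));
}

// In the crawler entry point (web context assumed):
// ignore_user_abort(TRUE);     // keep running after the client disconnects
// set_time_limit(0);
// ob_start();
// echo 'crawl started';
// foreach (early_response_headers('crawl started') as $h) {
//   header($h);
// }
// ob_end_flush();
// flush();                     // response is done; long crawl continues here
```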
Comment #24
capellic commented
Can you provide some instructions on how to run this? I read the thread but think I may have missed something.
Since this isn't a module, I thought I would throw it into the "files" directory and call it from my browser to test. I get a "CRON Script can only be run via system" error. I tried to run it from my server's cron manager over HTTP (curl) and got the same error. Then I changed the invocation method to the command line. The output from the cron job was empty, and it doesn't look like any log file was created, which is what I would expect.
My steps:
1. Uploaded the two files from #23 to my server and placed them in the following directory:
sites/default/files/boost_crawler/boost_crawler.php
2. I changed the permissions on the boost_crawler directory to 777 (chmod 777) so that the script can write a log file.
Comment #25
mikeytown2 CreditAttribution: mikeytown2 commented
You have to edit the script to match your setup. It is not a simple drop-in in its current form.
Comment #26
capellic commented
@mikeytown2: I am happy to report that I have created a module that suits my use case perfectly. I am able to define URLs within a settings page and have those URLs called during cron. No, it can't handle a lot of URLs, and it slows cron down noticeably, but for the handful of pages I need to have lightning fast, it gets the job done.
You can read all about it here:
http://capellic.com/blog/pre-caching-low-volume-website
Comment #27
mikeytown2 CreditAttribution: mikeytown2 commented
@capellic
I looked at the code; it's simple and it works! I'm not sure if it's worth it, but since you're inside Drupal you can use drupal_http_request() instead of file_get_contents(). I like the idea of specifying only the URLs you want to crawl... then again, with the boost block, I'm about to do the same with the "push" setting in the database (the code's there, I just need to unhide the form).
My next step is to make this crawler (#23) use Drupal and at first only crawl URLs you tell it to. Then use that to replace the batch API with this code so it can be run at cron. In short, the 2 crawler threads will be merged together, now that the database is in and working correctly (as far as I know). The step after that is to make a reverse URL lookup function that figures out the URL, given the filename; in regards to this, I filed my first bug, #537186: Better prevention of URL collisions.
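The suggested swap to drupal_http_request() might look like this. drupal_http_request() is a real Drupal 6 API, but the wrapper function and paths here are illustrative, not code from either module.

```php
<?php
// Sketch: prime a boost page via Drupal's own HTTP client instead of
// file_get_contents(). Assumes a bootstrapped Drupal 6.
function precache_path($base, $path) {
  $response = drupal_http_request($base . $path);
  // A 200 means boost has written (or already had) the static file.
  return isset($response->code) && (int) $response->code == 200;
}

// Usage (placeholder base URL and paths):
// foreach (array('', 'node/1') as $path) {
//   precache_path('http://example.com/', $path);
// }
```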
Comment #28
capellic commented
@mikeytown2
Thanks for the review. I've got a couple of PHP warnings to fix in the Pre-cache module, and I'll definitely update to drupal_http_request; so cool that it supports POST!
As for configuration options, I see specifying URLs as one of a couple of approaches that could be used. In reality, I can see wanting to specify some URLs, plus a checkbox indicating that I want all primary, secondary, and tertiary menu items cached, as well as the most popular pages (because a popular page may be a blog post that is not in the menu system).
Looking forward to seeing your progress with pre-cache/crawling. Again, thanks for all your discussion; you made my module possible.
Comment #29
mikeytown2 CreditAttribution: mikeytown2 commentedThis is the future direction of this thread.
#538460: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats
Comment #30
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #33
mikeytown2 CreditAttribution: mikeytown2 commented
I need to make the crawler concurrent; right now it operates in a parallel manner. In short, this means that once a URL has been added to the crawler queue, it starts crawling even if more URLs still need to be added. Starting and ending crawler threads on the fly is the trick...
Comment #34
mikeytown2 CreditAttribution: mikeytown2 commented
Postponing this again. With the new cron bypass, this feature could make the crawler very slow. Need to think about this one some more.
Comment #35
YK85 CreditAttribution: YK85 commented
Subscribing.