If this could auto-generate the cached file after it expires (push instead of pull), that would be nice.
Various checkboxes would be nice as well, such as:
Homepage
Primary Links
Secondary Links
All
Custom (with a textarea below)
| Comment | File | Size | Author |
|---|---|---|---|
| #23 | boost_crawler.php.txt | 8.67 KB | mikeytown2 |
| #23 | boost_crawler_stats.php.txt | 1.45 KB | mikeytown2 |
| #18 | boost_crawler.php.txt | 9.26 KB | mikeytown2 |
| #12 | cron.php.txt | 8.1 KB | mikeytown2 |
| #8 | cron.php.txt | 9.17 KB | mikeytown2 |
Comments
Comment #1
Terko CreditAttribution: Terko commented
I think it would be nice to specify a few items with different cache lifetimes. For example, 6 hours for all content but 1 hour for index.html.
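The idea above can be sketched as a small lookup. This is only an illustration; the paths and lifetimes are placeholders, not part of the boost module.

```php
<?php
// Sketch: per-path cache lifetimes. Paths and lifetimes are placeholders.
function cache_lifetime($path, array $overrides, $default) {
  // Return the per-path override if one exists, else the site-wide default.
  return isset($overrides[$path]) ? $overrides[$path] : $default;
}

$overrides = array('index.html' => 3600); // 1 hour for the front page
$default   = 21600;                       // 6 hours for everything else

echo cache_lifetime('index.html', $overrides, $default), "\n";   // 3600
echo cache_lifetime('node/42.html', $overrides, $default), "\n"; // 21600
```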
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
Made a crawler, but because of PHP timeouts I can't crawl my entire site in one shot, so this script saves its state to disk and reloads the page. Running it from a web browser works fine, but I doubt it will work when run from cron. How can you call another file from cron to keep the processing going? Some sort of system call? In other words, what does my host use when calling cron.php? I would like to have the script call itself until it's done regenerating the cache.
http://mcapewell.wordpress.com/2006/09/02/calling-php-from-a-cron-job/
Also, does anyone know of a better URL parser?
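The save-state-and-reload approach described above can be sketched roughly as follows. The state file name, batch size, and seed URL are placeholders, and the live driver is shown commented out because it performs network and file I/O.

```php
<?php
// Sketch of the save-state-and-reload idea: crawl a few URLs per request,
// persist the remaining queue, then redirect to this same script until the
// queue is empty. Helper is illustrative, not the attached script.
function take_batch(array &$queue, $n) {
  // Remove and return the first $n entries from the queue.
  return array_splice($queue, 0, $n);
}

// Driver (placeholder file name and seed URL), web context assumed:
// $state = 'crawl_state.txt';
// $queue = file_exists($state)
//   ? unserialize(file_get_contents($state))
//   : array('http://example.com/');
// foreach (take_batch($queue, 5) as $url) {
//   @file_get_contents($url);           // fetching the page primes the cache
// }
// if ($queue) {
//   file_put_contents($state, serialize($queue));
//   header('Location: ' . $_SERVER['PHP_SELF']); // reload before the timeout
// }
// else {
//   @unlink($state);                    // all done
// }
```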
Save this as test_cache.php inside the junk folder
The above code could be modified so that it only crawls an array and doesn't go looking for more links, thus pre-caching certain nodes. Throw in a menu DB call and we are crawling Primary & Secondary Links. The only thing is, the code is meant to be run outside of the Drupal system, so it might need to be reworked if one wanted to do selective pre-caching using Drupal assets.
EDIT Feb, 7th 2008 2:12 -8 GMT : Fixed a couple of errors
EDIT Feb, 7th 2008 2:39 -8 GMT : Better output IMHO
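A minimal sketch of the selective pre-caching idea mentioned above (crawl a fixed array instead of discovering links). The base URL and paths are placeholders, not part of the attached script.

```php
<?php
// Sketch: fetch a fixed list of paths instead of spidering for links.
// Requesting each page causes boost to write its static HTML file.
function precache(array $paths, $base) {
  $done = array();
  foreach ($paths as $path) {
    if (@file_get_contents($base . $path) !== FALSE) {
      $done[] = $path;  // record which pages were successfully primed
    }
  }
  return $done;
}

// Usage (placeholder URLs):
// precache(array('', 'node/1', 'about'), 'http://example.com/');
```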
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
This can be run as a PHP cron job on GoDaddy.
EDIT - Feb 8, 2009 3:17am -8GMT: Writes to log file at end of run.
EDIT - Feb 8, 2009 4:26am -8GMT: Better formatting of log file.
EDIT - Feb 8, 2009 4:35am -8GMT: Remove file warnings.
EDIT - Feb 8, 2009 5:53am -8GMT: Better code comments & screen output.
EDIT - Feb 10, 2009 1:40am -8GMT: Make script more robust.
Comment #4
mikeytown2 CreditAttribution: mikeytown2 commented
Use this file when running from cron. The above code was having trouble calling itself when first called from the command line, so I made the code below to go to the above script's URL. I call this cron.php.
BTW, I am using this right now on a live site and it's working wonders! In my cron manager the first cron job to run is the Drupal one, and it kills the cache. The second one calls the script above, which regenerates the entire site; the cache is primed and the site is fast! My live site on a shared host is faster than my dev site on my local box, and my box isn't that slow. This is also the fastest way to prime the cache, because the TCP/IP packets don't have to travel; they stay right on the server. It now takes only about 4 seconds to generate a page on my site; via the web (TCP/IP) it can take double that, using a tool like GSiteCrawler. I think the basic functionality is there.
Future:
Make it so the script only runs if it's being called by its own server or a pre-set IP (to prevent a lucky spider/user from hogging the CPU).
Merge the 2 scripts (detect being called via cron vs. browser).
Better-looking code in the crawler function, or integrate an external one (the code looks ugly).
Integrate with Drupal (only pre-cache certain pages, etc.).
Be a part of the boost distribution?
Other Ideas???
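The first item on the list above (only run when called by the server itself or a pre-set IP) might look roughly like this; the admin IP is a placeholder, and the live check is shown commented out since it reads $_SERVER.

```php
<?php
// Sketch: allow the crawler to run only when the caller is the server
// itself or an explicitly configured IP address.
function caller_allowed($remote_addr, $server_addr, array $extra_ips) {
  // True when the request comes from this host or an allow-listed IP.
  return $remote_addr === $server_addr
      || in_array($remote_addr, $extra_ips, TRUE);
}

// In the crawler entry point (web context assumed; placeholder admin IP):
// if (!caller_allowed($_SERVER['REMOTE_ADDR'], $_SERVER['SERVER_ADDR'],
//                     array('203.0.113.7'))) {
//   exit('CRON script can only be run via system');
// }
```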
Comment #5
mikeytown2 CreditAttribution: mikeytown2 commented
The script is back down to 1 file, and it only runs if it's called by itself or from a user-entered IP address.
Comment #6
mikeytown2 CreditAttribution: mikeytown2 commented
Updated, cosmetic changes. I think I'm done.
Comment #7
mths CreditAttribution: mths commented
Subscribe.
looks awesome, definitely going to test & use.
Comment #8
mikeytown2 CreditAttribution: mikeytown2 commented
Started to clean it up... the next step is to get it down to 1 temp file, then break the structure into more functions.
Comment #10
rsvelko CreditAttribution: rsvelko commented
Nice script, yes.
Question: what does Arto think about this new mode of operation/ideology for boost? Correct me if I am wrong, but we took this path just recently.
I ask the above question because there are not enough docs on that matter.
Comment #11
rsvelko CreditAttribution: rsvelko commented
2nd question: what is the time needed to run this script for, say, 1000 nodes? Per node/page?
Comment #12
mikeytown2 CreditAttribution: mikeytown2 commented
Time per page depends on your server. I do 2,000 pages in 80-90 minutes most of the time. It's the fastest way to generate the pages. There should be a log.txt file in the same directory that has some interesting stats, like which URL took the longest to generate.
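The per-URL timing stats mentioned here could be collected along these lines. The log format and URLs are assumptions, not the attached script's actual output, and the fetching loop is shown commented out since it performs network I/O.

```php
<?php
// Sketch: time each page fetch and report the slowest URL.
function slowest(array $timings) {
  arsort($timings);              // sort by elapsed seconds, descending
  $url = key($timings);          // first key is now the slowest URL
  return array($url, $timings[$url]);
}

// Collecting timings around each crawl (placeholder URLs):
// $timings = array();
// foreach ($urls as $url) {
//   $t0 = microtime(TRUE);
//   @file_get_contents($url);
//   $timings[$url] = microtime(TRUE) - $t0;
// }
// list($url, $secs) = slowest($timings);
// file_put_contents('log.txt',
//   sprintf("slowest: %s (%.2fs)\n", $url, $secs), FILE_APPEND);
```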
Comment #13
rsvelko CreditAttribution: rsvelko commented
Code improvement TODOs:
1. Better comments.
2. Function names in the Drupal way: main functions like name_of_func() and helper functions with a "_" in front.
3. Rename the file to something like boost_build_html_cache.php.
4. Make it smarter so it does not need more configuration than necessary.
5. An idea: if you crawl the site to get the list of pages to cache (afaik yes), that seems like too much HTML fetching. Wouldn't it be faster to use the node table to get a list of all nodes (and probably the menu and/or term tables)?
5.1. Maybe the node table list can just work as a helper to the crawler, or could this technique make crawling unnecessary?
6. One more idea: some access log analysis (Google Analytics export) can help the script pick up just the pages worth caching... this seems faster too, I hope I am right.
Anyway, the code needs to be made more extensible.
7. And lastly: have you thought about using a ready-made PHP crawler implementation?
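Idea 5 above (use the node table instead of crawling) might be sketched like this. The query assumes Drupal 6 conventions ({node}, {url_alias}) and a bootstrapped Drupal providing db_query(); the helper function name is made up.

```php
<?php
// Sketch: build the crawl list from the database instead of fetching HTML.
function node_paths_sql() {
  // Published nodes, preferring the path alias when one exists.
  return "SELECT COALESCE(ua.dst, CONCAT('node/', n.nid)) AS path
          FROM {node} n
          LEFT JOIN {url_alias} ua ON ua.src = CONCAT('node/', n.nid)
          WHERE n.status = 1";
}

// Inside a bootstrapped Drupal 6:
// $result = db_query(node_paths_sql());
// while ($row = db_fetch_object($result)) {
//   $urls[] = $row->path;
// }
```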
Comment #14
mikeytown2 CreditAttribution: mikeytown2 commented
Found a related issue with code!
#337391: Setting to grab url's from url_alias table.
Comment #15
rsvelko CreditAttribution: rsvelko commented
Haha, this guy read my mind 3-4 months ago... he is doing exactly what I proposed above. I suggest using your code to help him if your code has something to deliver. Maybe for panel/views pages crawling will still be easier. And maybe for very dynamic pages, like views with exposed filters or search pages, we cache only on the if-accessed principle; in other words, we pre-cache all nodes and terms and sleep calmly at night :)
Marking this one as a complement to the other.
Comment #16
ferrangil CreditAttribution: ferrangil commented
Tested on my local box, and it generates a lot of static HTML files. I stopped the process as it might take a lot of time.
My site has around 60k nodes (and a few views).
Now my home page (and most of the other cached pages) are really outdated: cached 5 hours ago. What is the workaround here?
I can't clear all the files once an hour, as it takes more than 1 hour to generate them (and the idea is not to have the server creating thousands of pages, then deleting them, then recreating them...).
Maybe I should use the idea from the latest post and just cache a few hundred of the most accessed pages (rebuilding them soon after). That could work better.
Ideas?
Comment #17
mikeytown2 CreditAttribution: mikeytown2 commented
Identified 2 bottlenecks in the posted code:
The first can be fixed by using a foreach() and placing array_unique() at the end so it's not called every time an item is added. Easy fix.
The second can be fixed by passing the last key value and doing an array_slice() on it, with an array_merge() to bring the 2 arrays back together before the script ends. Harder fix.
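The first fix (avoid calling array_unique() on every insertion) can be sketched with a hash-map membership test instead; the helper name is illustrative.

```php
<?php
// Sketch: collect links using isset() on a hash map so no repeated
// de-duplication pass is needed while crawling.
function add_links(array &$seen, array $found) {
  foreach ($found as $url) {
    if (!isset($seen[$url])) {   // O(1) membership test per URL
      $seen[$url] = TRUE;
    }
  }
}

$seen = array();
add_links($seen, array('/a', '/b', '/a'));
add_links($seen, array('/b', '/c'));
// array_keys($seen) now holds the unique URLs in discovery order,
// with no array_unique() calls along the way.
```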
Comment #18
mikeytown2 CreditAttribution: mikeytown2 commentedRewrote crawler, it's now faster and shouldn't have scaling issues (double bonus).
Comment #19
mikeytown2 CreditAttribution: mikeytown2 commented
Can increase speed and reduce memory usage by killing the $urls_crawled array, since it's a counter now. A setting to control output and the usage of timers would make it more efficient. PHP doesn't support threading, but I could make it run multiple processes or roll my own. The easy thing to do is use a DB to keep track of what's been crawled, etc.; the hard thing would be to split up the crawling operation via a modulus operation. If not using a DB, each thread gets its own temp file, and the parent coordinates the child processes and combines their output before restarting itself. If using a DB, have 2 tables: a list of URLs, and a pointer. Each thread grabs 25 URLs and moves the pointer up by 25.
If I were to multi-thread this, I would bootstrap the Drupal DB, create 1 table, and keep the pointer in the variables table. This would allow multi-core boxes to be crawled using all their cores.
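The 25-URL pointer scheme described above, sketched with an in-memory pointer; in the actual proposal the pointer would live in Drupal's variables table behind a lock, which is omitted here.

```php
<?php
// Sketch: each worker claims the next batch of URLs by advancing a
// shared pointer past what it took.
function claim_batch(array $urls, &$pointer, $size = 25) {
  $batch = array_slice($urls, $pointer, $size);
  $pointer += count($batch);     // move the pointer past the claimed batch
  return $batch;
}

$urls = range(1, 60);            // stand-in for the URL table
$pointer = 0;                    // stand-in for the variables-table pointer
$a = claim_batch($urls, $pointer); // items 1-25
$b = claim_batch($urls, $pointer); // items 26-50
$c = claim_batch($urls, $pointer); // items 51-60 (short final batch)
```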
Comment #20
ferrangil CreditAttribution: ferrangil commentedHi,
I've been following the whole thread. In my case, I have around 75,000 nodes, some of them with views and large pagers (but I think those are not being cached, as I didn't change the ?page=1 in the URL (using Clean Pagination, for example)).
I now have the normal Boost module enabled and expiring every 45 minutes (which is a lot of time, especially as the site is dynamic: people upload materials and they get listed, so anonymous users see the same list for a while).
I was thinking of caching all my nodes, using either the crawler or the SELECT nid FROM node... approach shown in another thread. The thing is, would it be worth the effort to generate that many cached files "just in case"? MySQL would be eating CPU all the time. Maybe it's better to keep caching only the most popular pages, as is happening now. Maybe I could make a selection of the pages I want to have in the cache...
Suggestions?
Comment #21
mikeytown2 CreditAttribution: mikeytown2 commented
Note to self: DB or arrays; having pure worker threads that get/pass batch info to the parent is the way to go for MT. Table or file locks... for files, you might want to use append and get rid of serialization for faster reads/writes.
@ferrangil
#337391: Setting to grab url's from url_alias table is currently in the issue queue, so it will eventually get done. It will allow fine-grained control of the crawler with settings from #453426: Merge Cache Static into boost - Create GUI for database operations.
If you need something now rather than a couple of months away (in other words, when I get to it), then I can do some custom work in exchange for money. Whether that's a custom crawler (crawl based on past hits, content type, per-page settings, etc.), different content types having different expiration times, or something else, let me know.
Here's how I envision the issues getting fixed:
http://drupal.org/node/326515#comment-1796028
The answer to your question depends on how long it takes for a non-cached/boosted page to get served to your end user. If it takes too long (like 5 seconds), then having pages pre-cached is a good idea. If your server is generating pages fairly quickly, then crawling your server wouldn't change the end user's experience. In short, if boost is working for you as is, then don't sweat it.
Because you're supporting both anonymous and registered users, you might want to look into more advanced things like APC & memcache. If boost can't make your server fly, start with APC and go from there.
Comment #22
mikeytown2 CreditAttribution: mikeytown2 commented
Idea to make this work independently of the server's setup: have PHP return before it's done processing. This takes care of having to do a system call with a & at the end of it.
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
Code now works on Windows & Linux. It doesn't output info like before, so I made a separate program to print stats. Uses the above trick to do async execution.
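The "return before done" trick from #22 can be sketched as follows. The helper is hypothetical, and the live sequence is shown commented out since it manipulates headers and output buffers; where available, fastcgi_finish_request() is a cleaner route.

```php
<?php
// Sketch: send the full response, let the client disconnect, then keep
// crawling. This replaces backgrounding a shell command with "&".
function early_response_headers($body) {
  // Headers that let the client hang up once $body has been flushed.
  return array('Connection: close', 'Content-Length: ' . strlen($body));
}

// In the crawler entry point (web context assumed):
// ignore_user_abort(TRUE);     // keep running after the client disconnects
// set_time_limit(0);
// ob_start();
// echo 'crawl started';
// foreach (early_response_headers('crawl started') as $h) {
//   header($h);
// }
// ob_end_flush();
// flush();                     // response is done; long crawl continues here
```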
Comment #24
capellic commented
Can you provide some instructions on how to run this? I read the thread but think I may have missed something.
Since this isn't a module, I thought I would throw it into the "files" directory and call it from my browser to test. I get a "CRON Script can only be run via system" error. I tried to run it from my server's cron manager over HTTP (curl) and got the same error. Then I changed the invocation method to the command line. The output from the cron job was empty, and it doesn't look like any log file was created, which is what I would expect.
My steps:
1. Uploaded the two files from #23 to my server and placed them in the following directory:
sites/default/files/boost_crawler/boost_crawler.php
2. I changed the permissions on the boost_crawler directory to 777 (chmod 777) so that the script can write a log file.
Comment #25
mikeytown2 CreditAttribution: mikeytown2 commented
You have to edit the script to match your setup. It is not a simple drop-in in its current form.
Comment #26
capellic commented
@mikeytown2: I am happy to report that I have created a module that suits my use case perfectly. I am able to define URLs within a settings page and have those URLs called during cron. No, it can't handle a lot of URLs, and it slows cron down noticeably, but for the handful of pages I need to have lightning fast, it gets the job done.
You can read all about it here:
http://capellic.com/blog/pre-caching-low-volume-website
Comment #27
mikeytown2 CreditAttribution: mikeytown2 commented
@capellic
I looked at the code; it's simple and it works! I'm not sure if it's worth it, but since you're inside Drupal you can use drupal_http_request() instead of file_get_contents(). I like the idea of specifying only the URLs you want to crawl... then again, with the boost block, I'm about to do the same with the "push" setting in the database (the code's there, I just need to unhide the form).
My next step is to make this crawler (#23) use Drupal and at first only crawl URLs you tell it to. Then use that to replace the batch API with this code so it can be run at cron. In short, the 2 crawler threads will be merged together, now that the database is in and working correctly (as far as I know). The step after that is to make a reverse URL lookup function that figures out the URL, given the filename; in regards to this, I filed my first bug, #537186: Better prevention of URL collisions.
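The suggested swap to drupal_http_request() might look like this. drupal_http_request() is a real Drupal 6 API, but the wrapper function and paths here are illustrative, not code from either module.

```php
<?php
// Sketch: prime a boost page via Drupal's own HTTP client instead of
// file_get_contents(). Assumes a bootstrapped Drupal 6.
function precache_path($base, $path) {
  $response = drupal_http_request($base . $path);
  // A 200 means boost has written (or already had) the static file.
  return isset($response->code) && (int) $response->code == 200;
}

// Usage (placeholder base URL and paths):
// foreach (array('', 'node/1') as $path) {
//   precache_path('http://example.com/', $path);
// }
```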
Comment #28
capellic commented
@mikeytown2
Thanks for the review. I've got a couple of PHP warnings to fix in the Pre-cache module, and I'll definitely update to drupal_http_request; so cool that it supports POST!
As for configuration options, I see specifying URLs as one of a couple of approaches that could be used. In reality, I can see wanting to specify some URLs, plus a checkbox indicating that I want all primary, secondary, and tertiary menu items cached, as well as the most popular pages (because a popular page may be a blog post that is not in the menu system).
Looking forward to seeing your progress with pre-cache/crawling. Again, thanks for all your discussion; you made my module possible.
Comment #29
mikeytown2 CreditAttribution: mikeytown2 commentedThis is the future direction of this thread.
#538460: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats
Comment #30
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #33
mikeytown2 CreditAttribution: mikeytown2 commented
I need to make the crawler concurrent; right now it operates in a parallel manner. In short, this means that once a URL has been added to the crawler queue, it starts crawling even if more URLs still need to be added. Starting and ending crawler threads on the fly is the trick...
Comment #34
mikeytown2 CreditAttribution: mikeytown2 commented
Postponing this again. With the new cron bypass, this feature could make the crawler very slow. Need to think about this one some more.
Comment #35
YK85 CreditAttribution: YK85 commented
Subscribing.