Closed (fixed)
Project:
Boost
Version:
6.x-1.x-dev
Component:
Cron Crawler
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
21 Nov 2008 at 15:41 UTC
Updated:
4 Sep 2009 at 02:50 UTC
Comments
Comment #1
moshe weitzman commented
Very nice work! For those who can't be bothered to follow links, this code uses the Batch API to request each node page and write it to cache.
To me, this should be part of core boost - probably in a boost.pages.inc file.
Comment #2
swentel commented
Updated the code, cf. http://drupalbin.com/4138. It is more mature already: it includes terms, provides hook_boost_export_operations support, and uses the boost_is_cacheable() and boost_is_cached functions to determine whether a page has to be cached. I'll post a patch (probably tomorrow) against 6.x-dev that creates a boost.pages.inc file.
Comment #3
swentel commented
Patch attached against DRUPAL-6--1. All batch functionality is in boost.pages.inc. Other changes are:
- boost.module requires boost.pages.inc (inspired by devel_generate batch functions)
- boost.admin.inc form alter function includes the button and the submit callback of it
Comment #4
swentel commented
Bumping, any update on this? Again, I'd be happy to start a separate project for this. We are also currently testing a few patches at work that support views and panels, so there may be lots of patches, and I wouldn't want to bother you guys with that every time :)
Comment #5
rsvelko commented
A similar issue was started on 23 Jan this year: #363077: Add spider to crawler - Cache entire site with new install. We finally found this one and will use it as the main thread for this task.
@swentel: If you have newer code, please post it here so we can use it.
Comment #6
swentel commented
@rsvelko: I have no newer code right now, so that patch is ready to use.
Comment #7
mikeytown2 commented
The problem with the patch right now is that I think it will grab node/* and then only that path will be cached (not that useful anymore). Getting the list of URLs to hit from the url_alias table would be a better idea. It would cache the first page of views as well :)
Comment #8
mikeytown2 commented
Look up URLs for aliases via url($path, array('absolute' => true)). Get the list of previous pages in the cache once #453426: Merge Cache Static into boost - Create GUI for database operations is done.
Comment #9
mikeytown2 commented
Need to break this up, as sites with LOTS of URLs take too long to generate the initial array. Do a count; if there are more than 5k URLs, go into super batch mode and use db_query_range(*,0,1). It then looks up one URL on each batch run, using the count to keep track of total progress.
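For anyone trying to picture the "super batch mode" described above, here is a minimal sketch of the idea, written in Python with sqlite3 rather than as Drupal code. It assumes a url_alias table with pid, src and dst columns; the function names, the `crawl` callback and the 5000 constant are hypothetical names for illustration, not anything in Boost.

```python
import sqlite3

# Threshold from the comment above: past this many URLs, avoid building
# the whole list up front and fetch one row per batch run instead.
BATCH_THRESHOLD = 5000


def count_aliases(conn):
    """Return the number of rows in the (assumed) url_alias table."""
    return conn.execute("SELECT COUNT(*) FROM url_alias").fetchone()[0]


def needs_super_batch(total):
    """True when the site has more URLs than the threshold."""
    return total > BATCH_THRESHOLD


def fetch_alias_at(conn, offset):
    """Fetch exactly one alias by position (a range query of length 1)."""
    row = conn.execute(
        "SELECT dst FROM url_alias ORDER BY pid LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()
    return row[0] if row else None


def run_batch_step(conn, progress, total, crawl):
    """Process one URL and return the updated progress counter."""
    if progress >= total:
        return progress  # nothing left to do
    alias = fetch_alias_at(conn, progress)
    if alias is not None:
        crawl(alias)  # hypothetical callback that requests the page
    return progress + 1
```

Each batch run only needs the progress counter and the total from the initial count, so no large array has to be kept between runs.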
Comment #10
ferrangil commented
Subscribing!!
Comment #11
capellic
I have a huge need for pre-caching and have been looking for a solution to my issue for MONTHS. I simply didn't have the right keyword in Google: pre-caching. I have a lot of small, low-volume sites that have performance problems because every visitor is regenerating the cache due to cron clearing it every 15 minutes. I get about 20 visitors a day, so you can see how this is a problem.
I applied the patch in #3 to the latest dev release (7/18). The patch mostly applied; the only thing that didn't make it was the "export pages to html now" button. I applied that manually, put it in its own field group and added a description to explain what "pre-cache" would do.
I also added some code to the boost_export_done() function to properly output errors and notices through drupal_set_message() and added watchdog logging.
I've attached a new version of the patch so that it works with the 7/18 version of the dev release.
The patch file for pre-caching does a couple of things:
I have read through the comments on this thread http://drupal.org/node/363077 and I agree -- more configuration items would be great including being able to declare only menu items -- and the thread is focused on my greatest need -- generating pre-cache on cron -- so that after Drupal clears the cache, Boost pre-cache will regenerate it. Of course, you should be able to toggle this on/off on the settings page. I haven't really looked at the code too closely to see how easy it would be to bring the cron functionality to this patch, but I will be doing so within the next week. Let me know if you have any tips.
I am also thinking that we should be able to configure whether you want nodes, taxonomy and eventually views, etc. I, for example, don't use taxonomy lists, so caching those isn't interesting to me. I've got some more ideas for Pre-Caching, so maybe you should create a new component for this feature set? This is really cool stuff.
I am new to module dev, but will likely be submitting patches to move this along. I see that this feature is a bit farther out on your roadmap.
#7: I don't see what the problem is here, but maybe I just don't understand what you mean. The cache files are written in the same way when I tested the patch.
Comment #12
mikeytown2 commented
#7 is only an issue if Global Redirect is not installed.
Thanks for showing some interest in Boost by writing some code! Sorry to be picky, but you should use 2 spaces instead of a tab for indentation. This page helped me with writing code for Drupal: http://drupal.org/coding-standards. Also, if you're developing on a Windows box (like me), this is how I currently write code for Drupal: http://drupal.org/node/505974. Marking this as "needs work" until the tab issue is taken care of.
Heads up: the boost_cache database table has a column in it called push. This will be used to "push" the content out so it is pre-cached. That table will also record the page generation time so one can crawl the slow pages first. This is why the crawler is at the bottom of the list, because it will be that awesome!
As for running this code on cron, it might not work; see #229905: Batch API assumes client's support of meta refresh. It could eventually work with #363077: Add spider to crawler - Cache entire site with new install, if I set that up to use a database connection, as that code doesn't need a browser window in order to call itself. The only problem is that I can't make it run on all systems out of the box; it needs to be customized to match your server's setup.
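To illustrate the "crawl the slow pages first" idea mentioned above, here is a rough sketch in Python, not Boost code. It assumes each cache record carries a push flag and a recorded generation time; the `CachedPage` type, its field names and `crawl_order` are hypothetical names for illustration.

```python
from typing import List, NamedTuple


class CachedPage(NamedTuple):
    """One row of a hypothetical cache table."""
    url: str
    push: bool           # whether the page should be pre-cached
    generation_ms: int   # how long the page took to generate last time


def crawl_order(pages: List[CachedPage]) -> List[str]:
    """Return URLs flagged for push, slowest pages first."""
    queued = [page for page in pages if page.push]
    queued.sort(key=lambda page: page.generation_ms, reverse=True)
    return [page.url for page in queued]
```

Crawling the slowest pages first means the pages that would hurt a visitor most on a cache miss are the ones regenerated earliest.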
Comment #13
capellic
@mikeytown2
Please be picky, I'll fix the code. Thanks for the guides.
Yes, this feature does sound great, your plans sound nice. I know it's down on the list, but is there a reasonable ETA?
As for the cron, it warms my heart to see that this has been thoroughly researched and that it will be difficult to provide a "Drupal solution" that will work for everybody. For that reason, it might be best if people roll their own helper module that hooks into cron? I'll be strolling over to 363077 to see what I can cobble together with the code you've posted there.
Another consideration should be, "Do I really need to run cron every 15 minutes?" I didn't know, until halfway through yesterday, that cron cleared cache. I think I'll run it a couple of times a day instead on brochure sites. But, if I am running Notification on cron, then every 5 to 15 minutes is mandatory.
Thanks for this module, it's really nice.
Comment #14
mikeytown2 commented
@capellic
Set the page expiration time to a higher value; then your pages won't all be expired on cron. Nodes have a hook, so if you edit or delete them, the cache for that page gets flushed; the same goes for comments. Views, taxonomy and other content types do not have this, so if your site relies on views, the current option is to use a lower expiration time. Try this patch, as it allows for different expiration times for each page:
http://drupal.org/node/453426#comment-1817832
and here's one that works with the promote checkbox:
#459956: Flush front page when node is edited/created with promote to front page selected.
Comment #15
capellic
@mikeytown2
I've tried to set the expiration time to a higher value, but if my node is included in a view, that content doesn't update and that's a deal breaker for me.
Not sure how setting different expiration times for each page will help me out. If that is something required by someone editing content, then it's going to be a bit too technical.
I don't use the "promote" checkbox.
Thanks for the tips.
Comment #16
mikeytown2 commented
@capellic
You can make it so nodes expire in, say, a week, while views expire in 5 minutes. The interface for this is still a little clunky, so my recommendation is to run the crawler right after you run Drupal cron. That's the simple fix for your current setup.
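A small sketch of what per-type expiration amounts to, in Python and not taken from the patch. The one-week and five-minute lifetimes come from the comment above; the one-hour default and the names `LIFETIMES`, `lifetime_for` and `is_expired` are assumptions for illustration.

```python
# Lifetimes in seconds. "node" and "view" follow the example above;
# the default for other page types is an assumed value.
LIFETIMES = {
    "node": 7 * 24 * 60 * 60,  # one week
    "view": 5 * 60,            # five minutes
}
DEFAULT_LIFETIME = 60 * 60     # assumed: one hour


def lifetime_for(page_type):
    """Return the cache lifetime for a page type, in seconds."""
    return LIFETIMES.get(page_type, DEFAULT_LIFETIME)


def is_expired(page_type, cached_at, now):
    """True when a page cached at `cached_at` is stale at time `now`."""
    return now - cached_at >= lifetime_for(page_type)
```

With settings like these, a cron run every 15 minutes would flush views but leave node pages cached until they are edited or a week has passed.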
Comment #17
capellic
Sorry for the long absence, but here's the new patch with tabs converted into two spaces.
Comment #18
mikeytown2 commented
This is the future direction of this thread.
#538460: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats
But this thread may still be useful on its own...
Comment #19
mikeytown2 commented
Is this still useful with the cron crawler?
Comment #20
mikeytown2 commented
Going to kill the export module, since it is limited by the Batch API; sad.
The cron crawler is going to replace it. Having a start button and selecting URLs from other tables, such as node, taxonomy, user and url_alias, is the next step for the crawler and should replace the export module's functionality. Marking this as active.
Comment #21
mikeytown2 commented
Going to skip the start button; cron can start this. It gets URLs from the url_alias table.
Comment #22
mikeytown2 commented
Committed.
Comment #23
capellic
@mikeytown2 Great! Just trying to understand what you've implemented. So this is a crawler that kicks off on cron. Is there any sort of control to set a limit on the number of pages, or to define the type or priority of certain pages? If I deserve a kick because I didn't install it and look for myself, by all means do so. ;-)
Comment #24
mikeytown2 commented
Need to make sure the URL is published. Also need a way to turn this off: have a crawler field on the Boost settings page.
Comment #25
mikeytown2 commented
Comment #26
mikeytown2 commented
A FALSE should have been a TRUE.
Comment #27
mikeytown2 commented
Committed.