I'm not normally one to post issues... but I've spent countless hours trying to get this to work and can't figure it out.

My current setup is Windows Server 2012 with IIS 8, but I also tested this with Apache and saw the same results. I followed this post to set up my web.config: http://drupal.org/node/1621192.

I have enabled Boost, Boost Crawler, Cache Expiration, and the HTTP Parallel Request Library. When I run cron I see the .htaccess file generated, but no other files. When I anonymously access a page, I see the page get added to the cache. Then, if I try to anonymously access the page again, I see the "Page cached by Boost" message with a timestamp at the bottom of the source. So it appears to me that I have Boost set up correctly, but the crawler is not regenerating my pages on each cron run.

Looking at my log messages with the boost debug flag on, I see these two messages every time I run cron:
boost: Flushed all files (0) from static page cache. (It doesn't always say 0, depending on whether I manually generated some pages)
cron: Cron run completed

Any ideas what I am doing wrong? My site only has about 20 pages and I really would prefer not to require visitors to wait for pages to load the first time.

Comments

jamescook’s picture

I'm having exactly the same issue here.

Anonymous’s picture

Priority: Critical » Major

Me too. It's a virtually hosted site, so I have set httprl to -1 to make sure it is calling the domain name. I can see in the logs that there is something going on with test-key and removing locks (a manual request from an anonymous user outputs this and shows it works, whereas the IP address did not). I have also installed watchdog and Cache Expiration. I also have HTML Purifier and Memcache; HTML Purifier has its own cache, and Memcache requires additions to settings.

Summary: the crawler never runs. I tried triggering it as an anonymous user, a logged-in unprivileged user, and the site administrator, on both cached and un-cached pages.

Anonymous’s picture

I've investigated this quite thoroughly today and found a couple of things. In D7 there is no "cron crawler" as one would expect. What you have is this: when a page is altered, the "family" of related nodes (or whatever your expiration settings are set to) is added to a list, and when cron runs, these pages alone are regenerated by the crawler. You can see the list after editing a page by going into your database and running

SELECT *, FROM_UNIXTIME(created) FROM queue WHERE name LIKE "boost%" ;

which also gives you the time in a readable format from the Unix timestamp. After cron is run, you find that these pages alone have been recreated, but if you follow the "recommended" settings in Boost, to expire pages past their max cache lifetime, then that same cron run can wipe out all the cached pages where users visited.

As such, there does not appear to be anything that one would call a "cron crawler" in the normal sense, where you would expect it to fetch the nodes and then build a cache. To get around this, I am using a cache-warming cron job using wget, but really there needs to be some addition to the module to do this; I am aware that there are a few comments and associated scripts that claim to do this, but I have yet to test them out. Another point worth noting is that the module boost_expire claims to immediately rebuild an altered page. This could be quite important, as stale pages are going to be served up until the moment that the "crawler" gets around to expiring the family of affected pages, since it runs on the cron job, which could be set for in excess of an hour.

It also appears that anonymous users never trigger the cron job that would rebuild the family of expired pages, so if a user has created content, its visibility could be vastly delayed.

jamescook’s picture

#2 and #3 thanks for your investigations Philip.

but if you follow the "recommended" settings in boost, to expire pages past their max cache lifetime, then that same cron run can wipe out all the cached pages where users visited.

Not sure about the logic of the "where users visited" part; surely the file should be expired regardless of being visited?

With Boost Comment in html enabled I can see that (according to the comment) a particular file is set to become stale in an hour, say 13:43, but the cache file is never deleted/replaced.

I'm also using cron:
1: to expire - well, it is a find -mmin delete after so and so many minutes/hours
2: to re-get the page (have the cache file remade) with wget

Not ideal, but I don't have too much time to investigate Boost in depth at the moment.
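The two jobs described above might look something like this in a crontab. This is only a hedged sketch: the cache path, the 60-minute age threshold, and www.example.com are placeholder assumptions, not the actual values used above.

```shell
# Sketch of the two cron jobs described above (placeholder paths/domain).

# 1) expire: delete cached HTML older than 60 minutes
0 * * * *  find /var/www/html/cache/normal -name '*.html' -mmin +60 -delete

# 2) re-get: spider the site so Boost rebuilds the cache files
5 * * * *  wget -r -q --delete-after -nd -R jpg,png,gif,css,js http://www.example.com/
```

The second job is deliberately offset a few minutes after the first so the fetch regenerates pages that the expiry pass just removed.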

Anonymous’s picture

I think I was trying to highlight a disparity, in that the cron crawler doesn't work as expected in D7, as in "generate pages". If one has pages set to expire at 1 day, and page Y is modified, then the expire module schedules page X (by the same author) to be expired regardless of when it was generated. Page X may have been generated correctly by a visit 30 seconds ago, but then "expired" as part of a family by a daily cron job, queued because page Y was edited the day beforehand; a second visit is then required to regenerate page X, which was already correct. So generation of X should check for and remove it from the expiry queue.

I suspect there are multiple issues in D7 that need to be re-arranged.

My wget is

17 * * * * /bin/wget -r --delete-after -nd -w 0.5 -R jpg,png,gif,css,js http://www.example.com -o ~/cache_log

which, at 17 minutes past the hour (using @hourly puts too much load on the server at some times of the day), generates all html pages and then deletes them, leaving a log file behind (though /dev/null could replace the file ~/cache_log).

I don't bother with fetching css and js files, as they are auto-generated and aggregated by the Drupal cache.

jamescook’s picture

I use a very similar wget (nice idea with the images, js, etc.), but is that really all you need?

My problem is, as far as I can remember(!), that stale files are not deleted from the cache directory, so wget alone does not regenerate them.

Anonymous’s picture

I'll try to speed up your debugging by stating the obvious so you can cut and paste URLs, as opposed to being patronising :-) There are quite a few settings scattered around in different modules.

What do you have set in admin/config/system/boost/expiration?

Is cron running? admin/config/system/cron

And your Expire settings: admin/config/development/performance/expire

In your cache, the standard configuration places things under cache/normal, and possibly it could be cache/normal/www.example.com/.

At that point you should check the Expire protocol settings, which include the base URL, and for httprl in admin/config/development/httprl set it to -1 so that it uses the domain name. My debugging procedure would be to set no minimum cache lifetime, to eliminate one set of variables and focus on the max. Set a short max time in Boost and the normal cache (admin/config/development/performance).

a) Is a domain name set?
b) Do Expire and httprl reflect this, either yes or no?
c) Try setting the Expire protocol and httprl to match, to force the issue.
d) Run cron manually as an authenticated user (I have found that as an anonymous user cron does not appear to expire pages, which I'm still investigating and may be your issue).

dblog should show the expiration of stale pages: admin/reports/dblog
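As a supplement to the checklist above, here is a quick shell sketch for inspecting the cache directory itself. The path is an assumption based on the default cache/normal/www.example.com/ layout mentioned above; adjust it to your install.

```shell
# Check whether Boost is writing cache files at all, and how fresh they are.
# cache/normal/www.example.com is the assumed default location; adjust it.
CACHE_DIR="cache/normal/www.example.com"

if [ -d "$CACHE_DIR" ]; then
  total=$(find "$CACHE_DIR" -name '*.html' | wc -l)
  fresh=$(find "$CACHE_DIR" -name '*.html' -mmin -60 | wc -l)
  echo "cached pages: $total (written in the last hour: $fresh)"
else
  echo "no cache directory at $CACHE_DIR - check Boost's file path settings"
fi
```

If the totals never change across cron runs even though pages expire, that points at the crawler/expiry side rather than at Boost's file writing.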

bgm’s picture

Category: bug » support
Priority: Major » Normal

The Boost handbook has a page on the crawler for 7.x: http://drupal.org/node/1736568

Someone will need to step up to re-implement a more robust crawler. The one in 6.x-1.x is too much work to maintain imho, and there are probably other modules or methods we could use to make it simpler.

Anonymous’s picture

Assigned: Unassigned »

I'll step up unless there are any objections.

IckZ’s picture

Same here! My crawler does not cache any page. I don't see anything in the logs and have to visit the pages manually. I've installed everything as mentioned in the docs. Did you solve your problem? Would be nice to hear how you did it ;)

Anonymous’s picture

Category: support » bug

Still looking into it after providing support for other issues. There's a relationship between a failed cron callback and an "unset args" message that appears in the dblog and is triggered by cron being run manually (it tries to cache itself and doesn't find the global $_boost variable).

Short answer: no, only expired pages ever appear in the queue table. Still investigating.

IckZ’s picture

Hey Philip!

It's crazy! I don't get any error in the dblog... My crawler also does nothing. It does not even regenerate the expired pages... In the Apache logs there are also no errors. It's confusing :(

Anonymous’s picture

Now, really frustrating is having a site where it works and a site where it doesn't on the same machine :-)

Anonymous’s picture

Category: bug » feature
Priority: Normal » Major

Changed to a feature request and bumped it up, specifically because of bgm's comment:

Someone will need to step up to re-implement a more robust crawler.

as almost all of this thread is a misunderstanding of how the crawler is currently implemented (including my own at times), which is:

/**
 * @file
 * Minimal crawler to regenerate the cache as pages are expired.
 */

The crawler only runs if nodes are changed by editing, comments, etc. Other than that, Boost expires them (without reference to expiry) and expects anonymous users to regenerate them.

I'd like bgm's input on this. Do you want the crawler re-implemented (it is working well and as designed)? Or do you want me to make Cache Expiration a requirement, slightly rewrite Boost's current expiry, and push the expired pages into a queue? Or I can make HTTPRL a requirement and then queue a regeneration request, or build an entirely new crawler, in which case we may need to look at renaming the existing crawler, or maybe adding something titled "boost cache warmer"?

bgm’s picture

If I understand correctly, the question is whether to have a more fully-featured crawler?

I'm open to the idea of adding features to it and better documenting what it does, but we should avoid the issues from the crawler in 6.x-1.x:

  • the 6.x-1.x crawler code is complicated and hard to maintain.
  • on shared hosting servers of an organisation where I used to work, maintainers of Drupal sites with Boost would enable the crawler without understanding what it does, and would end up hammering the server until their site was disabled. e.g. default settings should be server-friendly.
  • determining which pages to crawl in Drupal is not obvious (by looking at the 6.x-1.x code). We could, for example, crawl only the basic pages by request of the user + expiration, such as to crawl URLs in the menus selected by the user, but not "everything"?

Which reminds me that currently we do not queue for crawling pages that are expired in hook_cron(). That sounds like a low-hanging fruit.

For more complicated use-cases, I think we should recommend examples with wget.

Otherwise, for a more complex crawler as in Boost 6.x-1.x, I'd recommend creating a separate project completely, since it could be useful to people using other types of caching such as Varnish (which is what mikeytown2, the official maintainer of Boost, has been doing with the "expire" and "httprl" modules).

Anonymous’s picture

Which reminds me that currently we do not queue for crawling pages that are expired in hook_cron().

Which I do think is a perception problem: we queue pages that are expired manually by updates or node changes, etc. We have had questions where people have wondered why their site's not crawled (because nothing is edited but everything has been expired by cron), or how long it will take to crawl the entire site (i.e. it never will).

It's quite a large decision to crawl the pages expired by hook_cron, as probably the easiest way to implement it would be to hook into the queue like your existing expire code (it's 5am, I may have that wrong). The trouble with advising wget/cron is that a lot of smaller ISPs disable it on shared hosting (along with lynx) because of bots, plus one is also relying on the user to install a cron job.

I believe httprl was split out deliberately to implement queued requests, but Drupal 7 also has a cron limit of 1 hour and a number of seconds to avoid timeouts, so there could be the realistic situation where the pages expired exceed the number of requests that cron could make in an hour. Above my pay grade to make this kind of decision. Interesting point, though. On standard Drupal you can have an extremely creative "user" page (for someone who has never written an article) and it will never be boosted by a wget-based crawler, or anything external, unless the sitemap is read, because user pages never appear on the front page or linked from articles.

bgm’s picture

I agree on the perception problem. If we have an option to automatically crawl pages expired in hook_cron(), wouldn't this be sufficient for most use-cases? (+ we should document this clearly in the settings UI)
Sounds like just a few lines of code, and a setting in the admin interface, disabled by default?

A possible DDoS: if someone hammers the site by requesting example.org/?test={1...1000}, those pages will be cached, and then the crawler will re-hit those pages when they expire. Might be worth having some sanity check on the URL before re-crawling?

We could also provide a few options to pre-seed the cache (crawl) for some use cases: items linked to menus, user pages, url aliases, list of URLs entered by the user?

If your "user/*" pages are not accessible by wget/bots, how do people access them? Although using the sitemap as a list of pages to crawl is an option too.

I'd be curious to hear from people who are requesting a better crawler, to avoid implementing features that no one will use.

I'd also like to re-iterate that the only valid use-case I know for the crawler is for "small presentation sites" on cheap hosting, where content rarely changes and visitor frequency is low, so you want to make sure that the content is cached (and usually the expiry time is very high). Otherwise, you will have the googlebot and others constantly hitting your site anyway, probably before Drupal cron runs.

Anonymous’s picture

I agree on the perception problem. If we have an option to automatically crawl pages expired in hook_cron(), wouldn't this be sufficient for most use-cases? (+ we should document this clearly in the settings UI)
Sounds like just a few lines of code, and a setting in the admin interface, disabled by default?

I agree, but then aren't we moving towards needing httprl and expire as "required"? So perhaps it should go under the existing crawler, since that is the perception.

A possible DDoS: if someone hammers the site by requesting example.org/?test={1...1000}, those pages will be cached, and then the crawler will re-hit those pages when they expire. Might be worth having some sanity check on the URL before re-crawling?

I think we could limit the crawled pages to the URL aliases, etc., and skip the query strings just in case, but then some would complain about missing pages. I haven't gone over the spec for httprl and whether it returns a 404, but then that would defeat the point, as the page would be boosted by the visit (sorry, thinking out loud). I suggest that it just be limited to verified pages stored in the db that expire (though we could append no-cache to the query string, but it's still using resources and extending the DDoS).

If your "user/*" pages are not accessible by wget/bots, how do people access them? Although using the sitemap as a list of pages to crawl is an option too.

Bots only find user pages on a default Drupal install by looking at the sitemap, or through external URLs.

I'd also like to re-iterate that the only valid use-case I know for the crawler is for "small presentation sites" on cheap hosting, where content rarely changes and visitor frequency is low, so you want to make sure that the content is cached (and usually the expiry time is very high). Otherwise, you will have the googlebot and others constantly hitting your site anyway, probably before Drupal cron runs.

I'm not so sure. I've just helped some people here with a tourism site and a newspaper, and both are large, with the archived news articles infrequently read, but they certainly didn't want to tie up resources with people looking at older articles. I think there is a case for extending the module with regular expressions and long cache times, from what I've seen recently.

beatnikdude’s picture

example Boost cache warmer PHP script #1916906: Boost cache warmer via PHP and XML sitemap

marcoka’s picture

I can confirm that it's not working.
I used this for setup: http://drupal.org/node/1736568

For testing purposes I added a breakpoint inside the function boost_crawler_run($url) { ... but it's not even getting called.

Anonymous’s picture

Do you understand how the crawler works? If you edit a page, it is added to the queue table and the crawler then regenerates that page; the crawler in 7.x is not a crawler per se. So you should check that your breakpoint works by just editing a page, even if you're only adding a space.

Philip.

marcoka’s picture

Yes, you are right. It works.
I thought it would work like a real crawler. Now I used XENU, a crawler tool, to initially crawl the whole site and continue testing.

So it only works on editing a node, meaning if I want to update the cached front page I need to flush the whole cache?

Anonymous’s picture

Status: Active » Closed (works as designed)

Yep, this catches a lot of people out; really we should rename it from "crawler" to something else.

SandraZ’s picture

Hi Philip!

I am a bit confused...
So I use Drupal 7 and Boost, plus Expire to clean up my cache at cron runs.
(I have "Ignore a cache flush command if cron issued the request." unchecked.)
I use the cron crawler to regenerate it automatically at the same time.
After cron my cache dir is empty (I have the .htaccess in it).

Is this normal?
If not, please help.
If it is, OK; but if the cron crawler regenerates the page cache just when I edit the page, please tell me which module I should use.

Thanks

Anonymous’s picture

Yes, for 7 this is normal. The crawler only regenerates pages that have been inserted, edited, or deleted, plus their family as determined by the cron expiry rules (the documentation is about to change, as it is misleading). If your site is not going to change layout, then untick "clear pages on cron run" and the cache will build up. I believe it is recommended to clear the cache on a cron run to save disk space in the case of sites with thousands of pages (though this was before my time).

There are various solutions using wget, spiders and scripts to generate a full cache of the site if needed.

The 6.x crawler became unusable, as it could time out cron runs past the PHP maximum limits, with knock-on effects such as the site not being searchable.

SandraZ’s picture

Thank you for your quick reply!

So at this time there is no module to regenerate the cache automatically?
Anyway, the cron crawler is a fine mod!

Anonymous’s picture

No; as it stands, anonymous visitors to the site generate the cache, which has the benefit that no resources are wasted on less popular areas. There are no plans to develop a crawler for the entire site; priority of pages, URL aliases, and redirects suggest that it is better to crawl from outside the site, much as a search engine spider would, or to use one of the suggestions in this thread to use a sitemap as a basis.

Mario Baron’s picture

Can anyone post an example wget to generate pages from the sitemap, please?

Anonymous’s picture

The wget example above follows the links like a spider, but if you're set on the sitemap then see this bash script:

#1074080-12: Cron crawler in 7.x?

Then read the wget manual for the -i option as well as time limits, spidering, and -w.
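For reference, a hedged sketch of the sitemap approach (www.example.com and the sitemap path are placeholders; the actual network fetch lines are commented out so the URL-extraction step can be sanity-checked offline first):

```shell
# Pull page URLs out of an XML sitemap on stdin (the <loc> entries).
extract_urls() {
  grep -o 'http://www\.example\.com[^<]*'
}

# Offline check of the extraction on a sample fragment:
printf '<loc>http://www.example.com/node/1</loc>\n' | extract_urls
# prints: http://www.example.com/node/1

# Real run (uncomment): fetch the sitemap, extract the URLs, then fetch
# each page once with a 1-second pause, discarding the files afterwards.
# wget -q -O - http://www.example.com/sitemap.xml | extract_urls \
#   | wget -q --delete-after -w 1 -i -
```

The -i - flag makes the second wget read its URL list from stdin, which is what lets the extraction step feed it directly.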

Mario Baron’s picture

Thanks Philip_Clarke. I ended up going with this:

wget --quiet http://mywebsite.xy/sitemap.xml --output-document - | egrep -o "http://mywebsite.xy[^<]+" | wget -q --delete-after -i -

and it's working beautifully. Super fast website now on shared hosting :) Boost should be part of Drupal core, it's that great!

dkre’s picture

Issue summary: View changes

Wow, thanks Philip_Clarke and Mario, the wget solution is really elegant.

+1 to Boost in core; cron for crawling is the best solution.

I should add, I'm using boost 7.x-1.0-beta2 with elysia cron and haven't been able to get it to crawl.

RAWDESK’s picture

Slightly off-topic, although I was looking at Mario's wget script also, to have a sitemap-triggered re-population of my cached html files.
Since I've installed the Boost, Expire and crawler module triangle successfully, I am coping with some IE caching issues I have no clue how to get around. Could anyone here with some experience have a look at my post in this related thread?
https://groups.drupal.org/node/35652#comment-1077863
It looks to me like the cached pages are rendered browser-independently (same html in FF and IE), yet no CSS seems to be applied in repeated IE views.
PS. I am running IE10.

UPDATE 29/12: This issue seems to be related to the compatibility view mode IE has put my site in. If it is switched off from the IE dev toolbar, the site is shown correctly above IE7. Similar report here: http://stackoverflow.com/questions/21891842/stuck-with-ie-compatibility-...
Digging further into it.

UPDATE 29/12 2nd: Issue solved using <meta http-equiv="X-UA-Compatible" content="IE=Edge" />

alexx-alexx@list.ru’s picture

Awful stuff... I did not manage to get it working. The idea and the result are good, but not the crawler implementation. How am I able to make it work if I need to pre-cache nodes? That should be simple. What am I doing wrong?