Hi, this might be a bit of a general question, so sorry about that.
I have Varnish set up and I'm trying to find a nice way to create a crawler to cache the site. I thought using wget would be kind of easy, but it seems that hitting http://mydomain.com/page with wget doesn't actually cause it to cache.
Visiting the page in a browser does.
Is there something obvious I'm missing?
Comments
Comment #1
spazfox
Did you ever find a solution to this? I'm having trouble getting either wget or cURL crawlers to cause pages to cache, too...
Comment #2
spazfox
As I mention over at #1681700: Possible way to prime/warm authcache?, this issue may have to do with incorrect or poor cookie handling in cURL and wget, and/or with their inability to execute JavaScript (Authcache, at least, checks for the js cookie before marking a page as cacheable). In the meantime, I've taken to warming my cache by running an automated browser script (iMacros) after cron runs. Not ideal, but it has worked well for me so far.
Comment #3
mropanen commented
I was just able to get some promising results with
curl -H "Accept-Encoding: gzip" http://mydomain.com/page
since I finally realized that Drupal is sending the "Vary: Accept-Encoding" header and what that means...
Comment #4
mropanen commented
To elaborate on that: my cache warming didn't seem to work because Varnish caches requests with different Accept-Encoding headers separately, and wget and curl do not send that header by default (at least not with the same value as normal browsers). The header can be added manually with a command-line option (-H for curl, --header for wget).
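To make the effect of "Vary: Accept-Encoding" concrete, here is a toy model of a cache that keys its entries on the URL plus the request headers named in Vary. It is only an illustration of the idea, not how Varnish is actually implemented; the class name VaryCache and its lookup method are made up for this sketch.

```python
class VaryCache:
    """Toy cache keyed on URL plus the request headers listed in Vary.

    Hypothetical illustration only; this is not Varnish's implementation.
    """

    def __init__(self):
        self._store = {}

    def lookup(self, url, request_headers, vary=("Accept-Encoding",)):
        # Header names are case-insensitive, so normalise them first.
        headers = {k.lower(): v for k, v in request_headers.items()}
        # One cache variant per distinct value of each header named in Vary.
        variant = tuple((name.lower(), headers.get(name.lower(), "")) for name in vary)
        key = (url, variant)
        if key in self._store:
            return "HIT"
        self._store[key] = True  # pretend the backend response was stored
        return "MISS"


cache = VaryCache()
url = "http://mydomain.com/page"
print(cache.lookup(url, {}))                                    # warmer, no header: MISS
print(cache.lookup(url, {"Accept-Encoding": "gzip, deflate"}))  # browser: MISS
print(cache.lookup(url, {"Accept-Encoding": "gzip, deflate"}))  # browser again: HIT
```

In this model the warmer's request without the header fills a variant that the browser never asks for, which matches the behaviour described above.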
The easy way to warm the cache would be to run wget recursively, but that does not work once the header is added: the pages come back gzipped, and wget cannot find and follow links in the compressed content.
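A crawler can get around the gzip problem by decompressing the response body itself before looking for links. The following is a minimal sketch using only the Python standard library; the helper names LinkCollector and extract_links are hypothetical, and it assumes the body is UTF-8 HTML and that the Content-Encoding value is passed in by the caller.

```python
import gzip
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(body, content_encoding, base_url):
    """Decompress a gzipped body if needed, then return the links it contains."""
    if content_encoding.lower() == "gzip":
        body = gzip.decompress(body)
    parser = LinkCollector(base_url)
    parser.feed(body.decode("utf-8", errors="replace"))
    return parser.links


# Simulated gzipped response body, so the example runs without a network.
sample = gzip.compress(b'<a href="/about">About</a> <a href="http://mydomain.com/contact">Contact</a>')
print(extract_links(sample, "gzip", "http://mydomain.com/page"))
```

Each extracted link could then be requested with the same Accept-Encoding header, so the crawl keeps filling the variant that browsers use.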
I came across this module, http://drupal.org/project/cache_warmer, which uses curl and seems to be doing quite OK as long as I manually add
CURLOPT_HTTPHEADER => array("Accept-Encoding: gzip,deflate")
into its crawl options (~line 202, cache_warmer.drush.inc).
Another option would be to manually write a crawler that parses URLs from sitemap.xml and hits them with curl or wget one by one.
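A hand-written sitemap crawler along those lines might look roughly like the sketch below. It uses Python's standard library instead of curl or wget, and the function names (sitemap_urls, build_request, warm) are made up for illustration. It assumes the sitemap is a plain urlset in the standard sitemaps.org namespace (sitemap index files are not followed), and that "gzip,deflate" is close enough to what your visitors' browsers send.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap document (str or bytes)."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def build_request(url, accept_encoding="gzip,deflate"):
    """Build a request that sends the given Accept-Encoding header."""
    return urllib.request.Request(url, headers={"Accept-Encoding": accept_encoding})


def warm(sitemap_url):
    """Fetch the sitemap, then request each listed page once."""
    # Ask for the sitemap uncompressed so it can be parsed directly.
    with urllib.request.urlopen(build_request(sitemap_url, "identity")) as resp:
        urls = sitemap_urls(resp.read())
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as resp:
            resp.read()  # the body is discarded; the request alone fills the cache
            print(resp.status, url)


# Example call (needs network access; the sitemap location is an assumption):
# warm("http://mydomain.com/sitemap.xml")
```

Whether a page actually ends up cached still depends on the cookie and Vary behaviour discussed earlier, so it is worth checking the response headers of a few warmed pages afterwards.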
Comment #5
misc commented
Closing issues that have had no activity in the last year; please re-open if the issue is still relevant.