Using wget, curl or something to cache pages [#1398698]

Hi, this might be a bit of a general question, so sorry about that.

I have Varnish set up and I'm trying to find a nice way to create a crawler to cache the site. I thought using wget would be kind of easy, but it seems that hitting http://mydomain.com/page with wget doesn't actually cause it to cache.
Visiting the page in a browser does.

Is there something obvious I'm missing?

Comments

Comment #1

spazfox

Olympia, WA

commented 16 July 2012 at 03:11

Did you ever find a solution to this? I'm having trouble getting either wget or cURL crawlers to cause pages to cache, too...

Comment #2

spazfox

Olympia, WA

commented 17 July 2012 at 21:57

As I mention over at #1681700: Possible way to prime/warm authcache?, this issue may have to do with incorrect or poor handling of cookies, and/or with cURL and wget not being able to run JavaScript (Authcache, at least, checks for the js cookie before marking a page as cacheable). In the meantime, I've taken to warming my cache by running an automated browser script (iMacros) after cron runs. Not ideal, but it works well for me so far.

Comment #3

mropanen commented 23 July 2012 at 13:19

I was just able to get some promising results with curl -H "Accept-Encoding: gzip" http://mydomain.com/page since I finally realized that Drupal is sending the "Vary: Accept-Encoding" header and what that means...

Comment #4

mropanen commented 24 July 2012 at 09:57

To elaborate on that: the reason my cache warming didn't seem to work is that Varnish caches requests with different Accept-Encoding headers separately, and wget and curl do not send that header by default (at least not with the same value as normal browsers). So the crawler was warming a cache variant that browsers never ask for. The header can be added manually via a parameter.
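For anyone who would rather script this than pass the header on the command line, here is a minimal Python sketch of the same idea. The function names and the "gzip, deflate" value are my assumptions, not anything from Varnish or Drupal; check what your own browser actually sends and match it. Reading the Age response header is just one rough way to guess whether a later request was served from cache.

```python
import urllib.request

# Assumption: this should match the Accept-Encoding value your visitors'
# browsers send, since Varnish stores one variant per distinct value.
BROWSER_ACCEPT_ENCODING = "gzip, deflate"


def build_warm_request(url, accept_encoding=BROWSER_ACCEPT_ENCODING):
    """Build a GET request that carries a browser-like Accept-Encoding header."""
    return urllib.request.Request(url, headers={"Accept-Encoding": accept_encoding})


def warm(url, timeout=10):
    """Fetch one URL so the cache can store the matching variant.

    Returns the HTTP status and the Age header (None if the server sent none).
    """
    with urllib.request.urlopen(build_warm_request(url), timeout=timeout) as response:
        response.read()  # read the (still compressed) body to complete the transfer
        return response.status, response.headers.get("Age")


if __name__ == "__main__":
    # Hypothetical URL from this thread; replace with a real page on your site.
    print(warm("http://mydomain.com/page"))
```

Calling warm() twice on the same URL and comparing the Age values is a quick sanity check, assuming your Varnish setup passes that header through.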

The easy way to warm the cache would be to run wget recursively, but that does not work since the pages it gets are downloaded in gzipped form and wget cannot find and follow links in that.
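A hand-rolled crawler can get around this by decompressing the response itself before looking for links. The following is only a sketch of that step, using Python's standard library; the class and function names are made up for illustration, and it assumes the page is UTF-8 HTML and that only gzip (not deflate) needs handling.

```python
import gzip
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(body, content_encoding, base_url):
    """Decompress a gzipped response body if needed and return its links."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    collector = LinkCollector(base_url)
    collector.feed(body.decode("utf-8", errors="replace"))
    return collector.links
```

Here body is the raw bytes of the response and content_encoding is the value of its Content-Encoding header. A real crawler would also need to keep track of visited URLs and stay on its own domain.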

I came across this module http://drupal.org/project/cache_warmer which uses curl, and seems to be doing quite OK as long as I manually add CURLOPT_HTTPHEADER => array("Accept-Encoding: gzip,deflate") into its crawl options (~line 202, cache_warmer.drush.inc)

Another option would be to manually write a crawler that parses URLs from sitemap.xml and hits them with curl or wget one by one.
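A rough sketch of what such a crawler could look like in Python, in case it helps someone. It assumes a plain sitemap.xml in the standard sitemaps.org format (not a sitemap index), and the function names and the "gzip, deflate" header value are my own choices rather than anything prescribed by Varnish or Drupal.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_urls(xml_bytes):
    """Return the <loc> URL of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def warm_from_sitemap(sitemap_url, accept_encoding="gzip, deflate", timeout=10):
    """Request every sitemap URL once with a browser-like Accept-Encoding.

    Returns a dict mapping each URL to its HTTP status, or to an error message.
    """
    with urllib.request.urlopen(sitemap_url, timeout=timeout) as response:
        urls = parse_sitemap_urls(response.read())

    results = {}
    for url in urls:
        request = urllib.request.Request(
            url, headers={"Accept-Encoding": accept_encoding}
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as page:
                page.read()  # the body is discarded; the request is what matters
                results[url] = page.status
        except urllib.error.URLError as exc:
            results[url] = str(exc)
    return results


if __name__ == "__main__":
    # Hypothetical sitemap location; adjust to your site.
    for url, outcome in warm_from_sitemap("http://mydomain.com/sitemap.xml").items():
        print(outcome, url)
```

Since the URLs come from the sitemap, nothing has to be parsed out of the gzipped pages, which sidesteps the link-following problem with recursive wget.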

Comment #5

misc commented 3 March 2016 at 19:02

Status:	Active	» Closed (outdated)

Closing issues that have had no activity in the last year. Please re-open if the issue is still relevant.
