I hope I haven't missed a direct reference to this in another issue. Although Boost itself appears to work fine, the crawler is generating lots of 404 errors because it duplicates the subdirectory in which my Drupal installation lives. My logs are filling up with messages like "Crawler fetched http://123.345.6.789/cr-d7/cr-d7/node/21465", each followed by a separate "page not found" warning. Note that the subdirectory of the install is listed twice. This does not happen unless the crawler is enabled.
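For anyone trying to spot the same symptom in their own logs, here is a minimal sketch of a check for a doubled install subdirectory in a fetched URL. The helper name `has_doubled_base` and the idea of passing the subdirectory as an argument are illustrative assumptions; nothing here is part of Boost.

```python
from urllib.parse import urlsplit


def has_doubled_base(url, base_path):
    """Return True if the URL path starts with base_path repeated twice.

    base_path is the install subdirectory, e.g. "/cr-d7".
    Hypothetical helper for log inspection, not a Boost function.
    """
    base = "/" + base_path.strip("/")
    path = urlsplit(url).path
    doubled = base + base
    return path == doubled or path.startswith(doubled + "/")


# First URL is the one from the log message quoted above.
print(has_doubled_base("http://123.345.6.789/cr-d7/cr-d7/node/21465", "/cr-d7"))
print(has_doubled_base("http://123.345.6.789/cr-d7/node/21465", "/cr-d7"))
```

This only detects the symptom; it says nothing about which component builds the bad URL.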
Even though Boost without the crawler appears to be working properly, I should note that we are using a symlink for our web root ("var/www"), and I get this note under the Document Root heading on the htaccess settings page: "Value of /mnt/data is recommended for this server. Please open an boost issue on Drupal.org, since apache and php might not be configured correctly." "mnt/data" is where the Drupal files are stored, with a symlink to "var/www". I'm not sure whether this is relevant to my crawler problem, or whether it's some combination of using symlinks with a subdirectory installation. Maybe the crawler's handling of these things has not caught up with the core module? Or maybe it has something to do with the Cache Expiration module?
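One way to see why a symlinked web root can produce two different "document root" values is to resolve the symlink yourself. The sketch below builds a throwaway directory layout that stands in for /mnt/data and /var/www and compares the link path with its resolved path. It assumes, without confirmation from the module, that the settings page notice comes from a mismatch of this kind; the directory names and helper names are hypothetical.

```python
import os
import tempfile


def resolve_docroot(path):
    """Return the path with all symlinks resolved."""
    return os.path.realpath(path)


def demo():
    # Throwaway layout mimicking the setup described above:
    # a real data directory plus a symlink that points at it.
    with tempfile.TemporaryDirectory() as tmp:
        real_dir = os.path.join(tmp, "mnt_data")  # stands in for /mnt/data
        link = os.path.join(tmp, "var_www")       # stands in for /var/www
        os.mkdir(real_dir)
        os.symlink(real_dir, link)
        resolved = resolve_docroot(link)
        # The configured path and the resolved path differ as strings,
        # even though they refer to the same directory.
        return link != resolved and resolved == os.path.realpath(real_dir)


print(demo())
```

On a real server, running `resolve_docroot` on the configured web root would show whether it resolves to the value the settings page recommends.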
Anyway, here's what the relevant part of my .htaccess file currently looks like:
# Modify the RewriteBase if you are using Drupal in a subdirectory or in a
# VirtualDocumentRoot and the rewrite rules are not working properly.
# For example if your site is at http://example.com/drupal uncomment and
# modify the following line:
RewriteBase /cr-d7
#
# If your site is running in a VirtualDocumentRoot at http://example.com/,
# uncomment the following line:
# RewriteBase /
### BOOST START ###
# Allow for alt paths to be set via htaccess rules; allows for cached variants (future mobile support)
RewriteRule .* - [E=boostpath:normal]
# Caching for anonymous users
# Skip boost IF not get request OR uri has wrong dir OR cookie is set OR request came from this server
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD)$ [OR]
RewriteCond %{REQUEST_URI} (^/cr-d7/(admin|cache|misc|modules|sites|system|openid|themes|node/add|comment/reply))|(/(edit|user|user/(login|password|register))$) [OR]
RewriteCond %{HTTP_COOKIE} DRUPAL_UID [OR]
RewriteCond %{ENV:REDIRECT_STATUS} 200
RewriteRule .* - [S=3]
# GZIP
RewriteCond %{HTTP:Accept-encoding} !gzip
RewriteRule .* - [S=1]
RewriteCond /mnt/data/cr-d7/cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html\.gz -s
RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html\.gz [L,T=text/html,E=no-gzip:1]
# NORMAL
RewriteCond /mnt/data/cr-d7/cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s
RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]
### BOOST END ###
Any red flags here?
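As a reading aid for the two file-existence checks in the rules above (the `RewriteCond ... -s` lines), here is a minimal sketch that builds the same test string in Python. The function name, its parameters, and the host `example.com` are illustrative assumptions; the path layout is copied from the rules as posted.

```python
def boost_cache_file(doc_root, base_dir, host, request_uri,
                     query_string="", boostpath="normal", gzip=False):
    """Mirror the test string of the RewriteCond lines in the posted rules.

    Hypothetical helper: it only reproduces the string concatenation,
    it does not check whether the file exists.
    """
    suffix = ".html.gz" if gzip else ".html"
    return (f"{doc_root}/{base_dir}/cache/{boostpath}/"
            f"{host}{request_uri}_{query_string}{suffix}")


# REQUEST_URI is the full request path, so it already contains /cr-d7.
print(boost_cache_file("/mnt/data", "cr-d7", "example.com", "/cr-d7/node/21465"))
# /mnt/data/cr-d7/cache/normal/example.com/cr-d7/node/21465_.html
```

Because `%{REQUEST_URI}` includes the subdirectory, `cr-d7` shows up twice in the file path these rules look for. That follows from the rules as written and does not by itself explain the doubled subdirectory in the crawled URLs.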
I'd appreciate any help and can provide more info if need be.
Thanks
Comments
Comment #1
firfin commented
Do you actually mean that the files are stored in the same directory as the symlink to /var/www?
If so, why?
If not, can you elaborate about the setup?
I personally had 404 errors but fixed them by correcting the path settings and regenerating the .htaccess file as outlined in the installation instructions,
and by checking the site status report and recent logs.
Comment #2
Anonymous (not verified) commented
Boost 7.x does not have a crawler in the sense that you are referring to. 6.x has a crawler that would go through the site, but it became unmanageable; the more efficient solution is to regenerate pages when they are edited or commented upon.
Please go through your logs and see what is requesting the 404 pages. Requests made by Drupal itself leave an entry in the Apache log with a user agent like the one in the line below.
94.76.246.74 - - [24/Jan/2013:23:44:08 +0000] "GET / HTTP/1.0" 200 31246 "-" "Drupal (+http://drupal.org/)"
Comment #3
drett commented
Thanks for the comments on this. We ended up going with Varnish for all our caching needs.