
Google caches unstyled pages - links to obsolete bundles

Project: BundleCache
Version: 6.x-1.x-dev
Component: Code
Category: bug report
Priority: normal
Assigned: Unassigned
Status: active
Issue tags: aggregation, JavaScript, JS aggregation

Issue Summary

We have found that Google caches our pages complete with links to sfcache bundles, which all have a hash/checksum in their names, e.g. bundle_79172574ac22_1.css

Then when you go and create a new bundle, the filename changes completely and sfcache deletes the old bundle.
So anyone who looks at a Google cache of your page that was created when you were using an old bundle sees only a plain, unstyled page, because the old stylesheet bundle no longer exists.

May I suggest the following solution:

Instead of changing the name completely each time the bundle is recreated, increment a version counter each time and append it as a query string. This acts as a mechanism to inform the user's browser when to download a new version of the same file. e.g. bundle.css?1, bundle.css?2 etc

Browsers will download the new bundle.css file if the querystring has changed but will fetch from their local cache if it hasn't. Google will still get whatever current version of the file exists because if it requests bundle.css?1 and you're up to bundle.css?3 it still sends down the latest file anyway (we used to do this manually at our sites before I found sfcache).
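A minimal sketch of the counter idea in PHP. The helper name `bundle_url()` and the `$version` argument are hypothetical and not part of sfcache, and how the counter is stored and incremented is left out. It assumes a server that maps static files by path only, so the query string does not affect which file is served.

```php
<?php
/**
 * Build a bundle URL whose file name never changes; only the query string does.
 * Hypothetical helper for illustration, not an sfcache function.
 */
function bundle_url($path, $version) {
  return $path . '?' . (int) $version;
}

// The counter would be bumped each time the bundle is regenerated.
$before = bundle_url('files/sf_css/bundle.css', 1);
$after  = bundle_url('files/sf_css/bundle.css', 2);

// Both URLs share the same path component, so an old cached page that
// asks for ?1 is still pointed at the current bundle.css on disk.
$served_for_old = parse_url($before, PHP_URL_PATH);
$served_for_new = parse_url($after, PHP_URL_PATH);
```

The browser treats the two URLs as different resources and re-downloads when the number changes, while the file on disk keeps a single name.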

The other thing I thought of was 301 redirecting requests for the old files to the new ones, but I'd expect that would involve some ugly Drupal witchcraft if you wanted to avoid manually maintaining .htaccess

I realise that Google fetching the latest version of a stylesheet may still not make an old page look like it should, but it is still better than no style at all in the vast majority of cases.

Comments

#1

The problem with adding query strings to the file is that some systems (e.g. reverse proxies, CDNs) do not cache these files by default (most can be configured that way, but it might have other implications). Another idea is to allow the user to select whether sf_cache should delete outdated bundles.

Redirecting outdated URLs might work on some occasions, but since sf_cache allows the user to configure the URL from which files are loaded, it can't cover all cases.

#2

Hmm, doesn't sound too promising then.
Here is the manual process we are using in the meantime, in case anyone else has this same problem.

We add something like this to our .htaccess file, using the correct current version of the bundles.
This should redirect all requests for obsolete sf_cache files to the current versions.

# Redirect obsolete SF-Cache files to the current version
# DISABLE THIS BEFORE REGENERATING NEW FILES, then change to new filename and re-enable
# (dots are escaped so that "." only matches a literal dot)
RewriteCond %{REQUEST_FILENAME} !files/sf_css/core_9029c875a8a4_1\.css$
RewriteRule ^files/sf_css/core_[a-z0-9]+_1\.css$ files/sf_css/core_9029c875a8a4_1.css [R=301,L,NC]
RewriteCond %{REQUEST_FILENAME} !files/sf_css/theme_54bcd22c7896_1\.css$
RewriteRule ^files/sf_css/theme_[a-z0-9]+_1\.css$ files/sf_css/theme_54bcd22c7896_1.css [R=301,L,NC]

Now the important thing: during the update process, you must comment out those lines in .htaccess before regenerating the bundles. Otherwise your server will redirect new requests to the old filenames even after they no longer exist.

When the regeneration is complete, the sf_cache module gives you the new bundle names in a message. Copy the new filenames into your .htaccess file, uncomment the lines, save it on your server and you're good to go. No more unstyled Google cache pages, or watchdog error messages when people with cached pages request out-of-date stylesheets.

#3

Project: Support File Cache » BundleCache
Version: 5.x-1.1-a » 6.x-1.x-dev

Moving to the Bundlecache issue queue, as per #962178: Rename/migrate to Bundlecache and provide a stable D6 release.

#4

I've got the same problem due to Boost caching old versions of pages with the previous bundle, leading to temporary unstyled content.

I think you should follow the discussion over at #721400: Don't change filenames for aggregated JS/CSS about core aggregation moving to not changing filenames. It appears that in D7 things went towards using a GET parameter instead.

Also see http://drupal.org/node/721400#comment-2695488 where it was recommended that sf_cache (now bundlecache I suppose) become the backported option for this change.

I'm happy to roll a patch that provides this as a configurable option if you like?

#5

That issue makes me sad. The reason we started using sf_cache is that I regard the Drupal default aggregation as fundamentally broken, because of the way it forces a re-download of everything even if only one page has one changed CSS/JS file. I vented about this a long time ago here: #246722: CSS preprocessor: vast amount of redundant CSS. It's frustrating to see people raising all the same issues again on that issue you linked to.

However it seemed to me that they decided in the end to change the filenames - like sf_cache does - but leave the old files around on the server for 30 days.

There was a link to this page which I found useful in describing why using a GET parameter can be bad (which kkaefer did say above in comment #1):
http://www.stevesouders.com/blog/2008/08/23/revving-filenames-dont-use-q...

Interesting non-drupal solution here. The real filename does not change, but the filenames output to the browser in the html source are versioned and then .htaccess rewrites requests to the versioned filename into the actual filenames. That seems to me to be a better solution as it catches all possible versioned filenames, and requires no manual intervention once set up.
http://particletree.com/notebook/automatically-version-your-css-and-java...

#6

subscribe

#7

subscribe

#8

module that fully addresses this issue
http://drupal.org/project/advagg

#9

advagg doesn't seem to offer the kind of control over what goes in which bundle that bundlecache does. I backed out of advagg and went back to bundlecache, and had a go at implementing the option to use a query string.

Unfortunately, because of the way that paths are created at bundle create time, and the importance of those paths to determining whether the hash has changed, I ultimately failed to patch bundlecache to do this without having to make some extensive changes.

I instead went down the path of using bundlecache's hooks to do it: I hardlink each composite file to its base name, then use a new alter hook (which I submitted in #1132708: New hook for altering the URL at render time) to rewrite the URL on output to the form bundlename.js?hash

Here's the module code that implements these hooks. Just change it from mymodule to whatever module you use for cache behaviour on your site:

/**
 * Implementation of hook_bundlecache_composite_deploy().
 */
function mymodule_bundlecache_composite_deploy($bundle, $processed) {
  // When a bundle file is deployed, strip off the hash part and write the file plain,
  // e.g. primary_js_abc123_1.js -> primary_js.js
  list($new_path, $hash) = _mymodule_bundlecache_processed_strip($processed);
  if ($new_path && $new_path != $processed) {
    // Replace any existing hardlink with one to the latest file.
    if (file_exists($new_path)) {
      file_delete($new_path);
    }
    link($processed, $new_path);
  }
}

/**
 * Implementation of hook_bundlecache_composite_retract().
 */
function mymodule_bundlecache_composite_retract($bundle, $processed) {
  // Delete our copy of the file (only if the path was recognised).
  list($new_path, $hash) = _mymodule_bundlecache_processed_strip($processed);
  if ($new_path && file_exists($new_path)) {
    file_delete($new_path);
  }
}

/**
 * Implementation of hook_bundlecache_process_file_alter().
 */
function mymodule_bundlecache_process_file_alter(&$file) {
  // Rewrite the file to exclude the hash on output
  if (!empty($file['processed'])) {
    list($new_path, $hash) = _mymodule_bundlecache_processed_strip($file['processed']);
    if ($new_path && file_exists($new_path)) {
      $file['processed'] = $new_path . '?' . $hash;
    }
  }
}

/**
 * Strips off the bundlecache hash from a bundlecache path and returns a pure path and the hash.
 *
 * @param string $processed the string, such as sites/default/files/bundlecache_js/primary_js_976b5ee51491_1.js
 * @return array a two-element array, the first being the stripped path, e.g. sites/default/files/bundlecache_js/primary_js.js
 *   and the second being the hash, e.g. 976b5ee51491_1. Both elements are NULL if the path does not match.
 */
function _mymodule_bundlecache_processed_strip($processed) {
  // Greedy name plus an explicit hash_counter suffix, so a bundle name that
  // happens to contain hex-only segments (e.g. theme_ad) is not split wrongly.
  if (preg_match('@sites/default/files/bundlecache_(css|js)/(.+)_([0-9a-f]+_[0-9]+)\.(js|css)$@', $processed, $m)) {
    return array(
      'sites/default/files/bundlecache_' . $m[1] . '/' . $m[2] . '.' . $m[4],
      $m[3],
    );
  }
  return array(NULL, NULL);
}

#10

I agree about the lack of control over what goes into bundles in advagg. I decided not to use it for the same reason.

Your approach is interesting. If I understand it correctly this is the result:

- bundlecache saves the file on the server with a plain name, i.e. bundle.js
- the HTML page is served with bundle.js?123456
- there is a hard link from bundle.js to bundle_123456.js. This hard link is destroyed when the bundle is changed (?)

I think even better would be to do this (I actually use this approach using .htaccess for some non-bundlecache CSS that we serve specific to IE browsers):

- save the file on the server with a plain name, i.e. bundle.js
- serve the bundle on the HTML page with either a hash or simply the file's timestamp in the filename, not the query string, i.e. bundle_123456.js
- avoid the use of a hard link by intercepting the request and stripping the hash/timestamp from it (you can do this with .htaccess, not sure if it is possible via Drupal hooks).

Then it avoids the problems with using query strings mentioned by kkaefer in #1 above, and any old files which are referenced by Google or elsewhere will always be caught and redirected to the most current bundle. The key is whether we can intercept the request via Drupal to strip the hash/timestamp.
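As a rough sketch of those steps in PHP (the .htaccess route needs no PHP at all), the two directions could look like this. The function names, the `name_version.ext` pattern and the use of a numeric timestamp are assumptions for illustration only; none of this is a Bundlecache hook.

```php
<?php
/**
 * Add a version to the file name for output in the HTML,
 * e.g. files/bundle.js + 1300000000 -> files/bundle_1300000000.js
 * Hypothetical helper.
 */
function versioned_name($path, $version) {
  $info = pathinfo($path);
  $dir = ($info['dirname'] === '.') ? '' : $info['dirname'] . '/';
  return $dir . $info['filename'] . '_' . $version . '.' . $info['extension'];
}

/**
 * Map any requested versioned name back to the one real file, so that
 * requests for obsolete versions still resolve to the current bundle.
 * Assumes real bundle names never end in "_<digits>" themselves.
 */
function real_name($requested) {
  return preg_replace('@_[0-9]+\.(js|css)$@', '.$1', $requested);
}

$html_name = versioned_name('files/bundle.js', 1300000000);
$current   = real_name($html_name);
$obsolete  = real_name('files/bundle_1200000000.js');
```

Because the stripping rule ignores which version was asked for, every old reference lands on the same file without any per-release maintenance.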

I think a possible problem with your approach is that requests for obsolete bundles will not be caught, as the hard link will have been deleted when the bundle was destroyed?

#11

No, the hardlink is the name of the file without any hash. An example:

  1. Bundlecache generates a composite file called homepage_976b5ee51491_1.js
  2. I catch the deploy hook and hardlink homepage.js -> homepage_976b5ee51491_1.js (so they're now the same file as far as the filesystem is concerned)
  3. I rewrite the output link to go to homepage.js?976b5ee51491_1 instead of homepage_976b5ee51491_1.js
  4. The webserver receives a request for homepage.js and serves homepage.js, which is hardlinked to the latest bundle

Every time Bundlecache updates the bundle, it calls retract on the old bundle, then deploy on the new one (which will have a new name). On retract, I kill the hardlink, then on deploy I create a new hardlink to the latest bundle. This way, homepage.js is always hardlinked to the latest file. The only real reason I used a hardlink rather than a symlink to do this is hardlinks are slightly faster, and more compatible with anything that wants to perform an operation on the file, and also less likely to randomly break (since you can't have a dangling hardlink).
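To illustrate the point about dangling links, here is a small stand-alone PHP sketch. The temporary directory and file names are made up for the demo, and it assumes a filesystem that supports both hard links and symlinks.

```php
<?php
// Hypothetical demo directory and file names; nothing here is Bundlecache API.
$dir = sys_get_temp_dir() . '/bundle_demo_' . uniqid();
mkdir($dir);

$bundle = $dir . '/homepage_976b5ee51491_1.js';
file_put_contents($bundle, 'alert(1);');

link($bundle, $dir . '/hard.js');     // second name for the same file data
symlink($bundle, $dir . '/soft.js');  // pointer to the path of $bundle

unlink($bundle);                      // simulate the old bundle being deleted

$hard_ok   = file_exists($dir . '/hard.js');  // data is still reachable
$soft_ok   = file_exists($dir . '/soft.js');  // symlink target is gone
$hard_body = file_get_contents($dir . '/hard.js');

// Clean up the demo files.
unlink($dir . '/hard.js');
unlink($dir . '/soft.js');
rmdir($dir);
```

The hard link keeps the content alive after the hashed name is removed, whereas the symlink is left pointing at a path that no longer exists.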

Similar to your solution, any cached page will contain a reference to the latest JS file via the hardlink, but doesn't require the webserver to help.

We use nginx as our webserver, and I have tried an alternative method that maintains the original filenames, just like what you do:

  # Rewrite bundlecache files back to their base filename
  location ~ ^/sites/default/files/bundlecache_(css|js) {
    rewrite ^/sites/default/files/bundlecache_(css|js)/(.+?)_[0-9a-f]+_[0-9]+\.(js|css)$ /sites/default/files/bundlecache_$1/$2.$3 last;
  }

With that rule in place, remove the hook_bundlecache_process_file_alter() implementation.

The only downside to this is that Boost doesn't pre-gzip the file and add it to perm, because Boost looks for JS/CSS files inside the HTML of a page it's about to cache. I got around this by explicitly calling _boost_copy_css_files(array($new_path)); or _boost_copy_js_files(array($new_path)); inside the deploy hook. I haven't quite got it working right yet. Of course this only matters if you care about Boost's gzip cache...

I agree that the query string method probably causes more trouble with proxies, and Chrome's audit tool is unhappy about it. I'll keep pursuing my other solution I think :)

#12

Created an issue for advagg based on feedback here.
#1140624: Create GUI for full control over bundles used

#13

I don't think it makes sense to serve new versions of bundles in response to requests for the old ones. I would treat a cached page and its CSS files as a single entity. When Google, or any other cache, serves an old version of the page together with a new version of the CSS, the page can break very easily (especially when you make changes to the design).
Because of this, I would think of page+CSS as a versioned unit: in order to support serving old pages, you must also keep the old versions of the CSS forever (or for as long as they can be requested by visitors).
Am I missing something here?

#14

Agreed. Keep all older versions, and add a "clean up" button and a cron option to auto-clean after N days?
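The selection step of such a clean-up could be sketched like this in PHP. The function name and its parameters are hypothetical, not an existing Bundlecache option; a cron run would collect the timestamps with filemtime() and delete whatever is returned.

```php
<?php
/**
 * Return the bundle files that are old enough to delete.
 * Hypothetical helper for illustration.
 *
 * @param array $mtimes   map of file path => last-modified Unix timestamp
 * @param array $current  paths of bundles still in use (never returned)
 * @param int   $days     retention period in days
 * @param int   $now      current Unix timestamp
 * @return array          paths older than the retention period
 */
function expired_bundles(array $mtimes, array $current, $days, $now) {
  $cutoff = $now - $days * 86400;
  $expired = array();
  foreach ($mtimes as $path => $mtime) {
    if ($mtime < $cutoff && !in_array($path, $current, TRUE)) {
      $expired[] = $path;
    }
  }
  return $expired;
}

$now = 1300000000;
$mtimes = array(
  'core_aaaaaaaaaaaa_1.css' => $now - 40 * 86400,  // obsolete and old
  'core_bbbbbbbbbbbb_1.css' => $now - 10 * 86400,  // obsolete but recent
  'core_cccccccccccc_1.css' => $now - 40 * 86400,  // old but still current
);
$to_delete = expired_bundles($mtimes, array('core_cccccccccccc_1.css'), 30, $now);
```

Passing the current bundles explicitly guards against deleting a bundle that simply has not changed for longer than the retention period.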
