We need locking around the css and js aggregation. We added locking to other subsystems in Drupal 7, and this one was just missed. Without locking, multiple processes try to build and rebuild the css and js. On our site, this puts a high load on the file system, which causes Apache to throw 503 errors.

Also, if you simply update the css or js, the files will not rebuild. Say you upgrade a module and it adds 5 lines to its css, or say you modify your theme and change things around: if the bundle is already built, and no new files were added to the page, then the bundle will not rebuild. I've introduced a "version" for this. The intent of the "version" is that some contrib module can be written to allow incrementing the version either from an admin page or via drush, and that the site developers will know when this needs to be done, and will simply increment the version, forcing a rebuild.

An additional problem surfaced while trying to solve this, which is that if the creation of the css file fails, we currently return FALSE; however, the function that calls drupal_build_css_cache() doesn't interpret the result, and simply adds the empty string as a css file. This is an obvious bug. We need to handle the FALSE return in css, just like we do in js.

So my solution is: use locks so that each bundle is generated by only one process (multiple bundles can build simultaneously, but only one process builds any given bundle), use a global lock when saving the map, and return stale bundles during the rebuild or on error.
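The lock-per-bundle flow described above can be sketched like this. This is an illustration only, not the patch itself: lock_acquire_sim() and lock_release_sim() are hypothetical in-process stand-ins for Drupal's lock API, so the flow is runnable on its own.

```php
<?php
// Hypothetical stand-ins for Drupal's lock_acquire()/lock_release(),
// simulated with an in-process array.
function lock_acquire_sim(array &$locks, $name) {
  if (isset($locks[$name])) {
    return FALSE;           // Someone else holds the lock.
  }
  $locks[$name] = TRUE;
  return TRUE;
}

function lock_release_sim(array &$locks, $name) {
  unset($locks[$name]);
}

// Build one bundle under a per-bundle lock; if another process is already
// building it, fall back to the stale copy instead of rebuilding.
function build_bundle(array &$locks, $key, $stale_uri, callable $builder) {
  $lock_name = "css_cache:$key";
  if (lock_acquire_sim($locks, $lock_name)) {
    $uri = $builder($key);  // Generate and save the new aggregate.
    lock_release_sim($locks, $lock_name);
    return $uri;
  }
  return $stale_uri;        // Serve the old bundle during the rebuild.
}
```

With no contention, build_bundle() returns the freshly built URI; if the lock is already held, it returns the stale one, which is the whole point of the approach.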

Anyone new to this issue can probably skip ahead to #30 or so.

Files:
Comment · File · Size · Author · Test result
#88 · css-stampede-886488-86.patch · 14.26 KB · marcingy · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css-stampede-886488-86_0.patch. Unable to apply patch. See the log in the details link for more information.
#86 · css-stampede-886488-86.patch · 14.26 KB · marcingy · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css-stampede-886488-86.patch.
#81 · 886488.stampede.patch · 14.41 KB · beejeebus · PASSED: [[SimpleTest]]: [MySQL] 31,527 pass(es).
#80 · 886488-68.patch · 14.38 KB · carlos8f · PASSED: [[SimpleTest]]: [MySQL] 29,849 pass(es).
#71 · 886488.patch · 14.66 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 28,844 pass(es).
#68 · 886488-68.patch · 14.38 KB · carlos8f · PASSED: [[SimpleTest]]: [MySQL] 28,809 pass(es).
#65 · stampede.patch · 14.05 KB · beejeebus · PASSED: [[SimpleTest]]: [MySQL] 28,782 pass(es).
#56 · stampede.patch · 14.05 KB · beejeebus · PASSED: [[SimpleTest]]: [MySQL] 28,839 pass(es).
#55 · css.stamped.patch · 13.69 KB · beejeebus · PASSED: [[SimpleTest]]: [MySQL] 28,820 pass(es).
#53 · css.stamped.patch · 13.65 KB · beejeebus · FAILED: [[SimpleTest]]: [MySQL] 28,790 pass(es), 0 fail(s), and 19 exception(s).
#50 · css.stamped.patch · 13.65 KB · beejeebus · FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in common.inc.
#48 · 886488-2-debug.patch.txt · 13.81 KB · Jeremy
#47 · 886488-2.patch · 12.86 KB · Jeremy · PASSED: [[SimpleTest]]: [MySQL] 28,780 pass(es).
#46 · 886488.patch · 14.12 KB · Jeremy · PASSED: [[SimpleTest]]: [MySQL] 28,737 pass(es).
#46 · 886488-debug.patch.txt · 15.49 KB · Jeremy
#46 · 886488-nopatch-debug.patch.txt · 1.76 KB · Jeremy
#42 · 886488.patch · 13.74 KB · nnewton · PASSED: [[SimpleTest]]: [MySQL] 28,788 pass(es).
#37 · 886488.patch · 13.25 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_12.patch.
#34 · 886488.patch · 13.23 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 27,393 pass(es).
#30 · 886488.patch · 13.23 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] 27,393 pass(es), 0 fail(s), and 20 exception(s).
#28 · 886488.patch · 12.51 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] 27,373 pass(es), 0 fail(s), and 2 exception(s).
#26 · 886488.patch · 12.41 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] 27,373 pass(es), 0 fail(s), and 39 exception(s).
#24 · 886488.patch · 10.12 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_7.patch.
#23 · 886488.patch · 10.09 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_6.patch.
#21 · 886488.patch · 9.74 KB · douggreen · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_5.patch.
#20 · 886488.patch · 10.11 KB · catch · PASSED: [[SimpleTest]]: [MySQL] 24,652 pass(es).
#19 · 886488.patch · 9.77 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 23,299 pass(es).
#18 · 886488.patch · 9.75 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 23,314 pass(es).
#17 · 886488.patch · 9.16 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 23,293 pass(es).
#13 · 886488.patch · 7.71 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 23,293 pass(es).
#12 · 886488.patch · 7.59 KB · douggreen · PASSED: [[SimpleTest]]: [MySQL] 23,291 pass(es).
#11 · css_js_stampede.patch · 6.3 KB · catch · PASSED: [[SimpleTest]]: [MySQL] 22,935 pass(es).
#10 · css_js_stampede.patch · 6.23 KB · catch · PASSED: [[SimpleTest]]: [MySQL] 22,961 pass(es).
#8 · css.patch · 8.23 KB · catch · FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css_39.patch.
#5 · css_js_stampede.patch · 6.16 KB · catch · PASSED: [[SimpleTest]]: [MySQL] 22,924 pass(es).
#3 · css_stampede.patch · 6.17 KB · catch · FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in common.inc.
(original post) · css_stampede.patch · 4.34 KB · catch · PASSED: [[SimpleTest]]: [MySQL] 22,941 pass(es).

Comments

Status: Needs review » Needs work

The last submitted patch, css_stampede.patch, failed testing.

Status: Needs work » Needs review

css_stampede.patch queued for re-testing.

Attached: 6.17 KB. FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in common.inc.

Adding js.

Status: Needs review » Needs work

The last submitted patch, css_stampede.patch, failed testing.

Status: Needs work » Needs review
Attached: 6.16 KB. PASSED: [[SimpleTest]]: [MySQL] 22,924 pass(es).

minus syntax error.

Status: Needs review » Needs work

Please initialize $lock_acuire_attempts = 0; in both cases. Also, the logic for $lock_acquired seems to be wrong, or relies on some mystical operator precedence. I'd change it to:

<?php
if ($lock_acquired = lock_acquire($key)) {
  ...
}
else {
  lock_wait($key);
}
?>

Status: Needs work » Needs review
Attached: 8.23 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css_39.patch.

Updated patch. Some of these fixes are direct from doug, so please credit him in any commits.

Status: Needs review » Needs work

The last submitted patch, css.patch, failed testing.

Status: Needs work » Needs review
Attached: 6.23 KB. PASSED: [[SimpleTest]]: [MySQL] 22,961 pass(es).

Should hopefully apply this time; this version also uses a global lock instead of a per-file lock.

Attached: 6.3 KB. PASSED: [[SimpleTest]]: [MySQL] 22,935 pass(es).

One more time, use a variable for $lock_name.

Attached: 7.59 KB. PASSED: [[SimpleTest]]: [MySQL] 23,291 pass(es).

The previous patch by catch still does not solve our cache stampede problem. Forcing everything on the site to rebuild simultaneously causes too long of a wait. This new approach uses a single lock for css and another for js, and lets the system rebuild them one at a time, serving up the older files when it can't get a lock.

Attached: 7.71 KB. PASSED: [[SimpleTest]]: [MySQL] 23,293 pass(es).

And a fix to that release, that unlocks the right key (catch's last patch did this right, but mine did not), and also unlocks the key on write failure.

Write-through seems the right way to go for this; however, it looks like the only window for that to work is between the variable_set() and the file deletion. If that's right, is there a way for us to do the delete itself a bit later?

Doug pointed out that drupal_delete_file_if_stale() defaults to 30 days, so there's a reasonable chance of the old files still existing, and you can set the threshold higher if you need it.

The only problem I see with this approach is that the maps might contain entries that are obsolete. When you push new code that requires a js or css rebuild, other things probably changed, and the hashes on some pages might change. We probably want to occasionally remove stale map entries (maybe on cron) that exist in the map but have an older time than the clear. I'll add that later today.
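The cron cleanup idea amounts to something like the following sketch. The map layout here (a 'clear' timestamp plus per-key build times) is an assumption for illustration, not the patch's actual structure.

```php
<?php
// Drop map entries that were built before the last clear. The map shape
// (a 'clear' timestamp plus ['built' => ...] entries) is hypothetical.
function cleanup_stale_map_entries(array $map) {
  if (!isset($map['clear'])) {
    return $map;            // Nothing has been cleared; keep everything.
  }
  foreach ($map as $key => $entry) {
    if ($key !== 'clear' && $entry['built'] < $map['clear']) {
      unset($map[$key]);    // Built before the clear: obsolete.
    }
  }
  return $map;
}
```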

Attached: 9.16 KB. PASSED: [[SimpleTest]]: [MySQL] 23,293 pass(es).

New patch
* adds map cleanup on system cron
* only sets $map['clear'] if there's something in it

Attached: 9.75 KB. PASSED: [[SimpleTest]]: [MySQL] 23,314 pass(es).

One small tweak, to unset 'clear' if it's no longer needed.

Also, I see an additional problem, which is that the map is retrieved before we get a lock. It's probable that if we don't get the lock on the first attempt, another process will update the map, and that once this process gets the lock, it will revert anything the other process did to the map. It's even possible that this was the root cause of needing a global lock (and a global lock just minimized the time that this was an issue), and that after this change, we might be able to go back to a lock on the individual keys.
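The clobbering race described above can be shown with a small runnable sketch: a shared array stands in for the variable system, and all function names are illustrative.

```php
<?php
// Sketch of the bug: process A reads the map, then waits for the lock;
// process B writes key 'b' meanwhile. If A writes its stale copy back,
// B's entry is lost. Re-reading the map after acquiring the lock fixes it.
function write_map_stale(array &$storage, array $stale_map, $key, $uri) {
  $stale_map[$key] = $uri;
  $storage['map'] = $stale_map;   // Clobbers concurrent updates.
}

function write_map_fresh(array &$storage, $key, $uri) {
  $map = isset($storage['map']) ? $storage['map'] : [];  // Re-read under the lock.
  $map[$key] = $uri;
  $storage['map'] = $map;         // Preserves concurrent updates.
}
```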

Attached: 9.77 KB. PASSED: [[SimpleTest]]: [MySQL] 23,299 pass(es).

... and now, only re-read the map if file_unmanaged_save_data() succeeds. We probably never hit this condition, but it saves us from re-reading the variables on failure.

Attached: 10.11 KB. PASSED: [[SimpleTest]]: [MySQL] 24,652 pass(es).

lock_wait() needs the name of the lock as an argument.

Attached: 9.74 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_5.patch.

Updated patch always rebuilds the file if it is missing or being cleared. The extra check for file existence was preventing rebuilds on clear; it was a legacy check from the old code. The new code should only get here if the file doesn't exist or it needs to be cleared.

Status: Needs review » Needs work

The last submitted patch, 886488.patch, failed testing.

Attached: 10.09 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_6.patch.

Oops, I did the same thing last time.

Attached: 10.12 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_7.patch.

We're still having problems. The updated patch makes sure to change the map file atomically. We're also trying a version that uses cache_set()/cache_get(), and will post that next if it appears to work.

Status: Needs work » Needs review
Attached: 12.41 KB. FAILED: [[SimpleTest]]: [MySQL] 27,373 pass(es), 0 fail(s), and 39 exception(s).

Fixing the patch file again, and updating to HEAD.

Status: Needs review » Needs work

The last submitted patch, 886488.patch, failed testing.

Status: Needs work » Needs review
Attached: 12.51 KB. FAILED: [[SimpleTest]]: [MySQL] 27,373 pass(es), 0 fail(s), and 2 exception(s).

Add a missing contents line that was lost in the merge.

Status: Needs review » Needs work

The last submitted patch, 886488.patch, failed testing.

Status: Needs work » Needs review
Attached: 13.23 KB. FAILED: [[SimpleTest]]: [MySQL] 27,393 pass(es), 0 fail(s), and 20 exception(s).

Updated version checks if the map exists before writing it, which was the root cause of the last testbot failure.

The updated version returns the $uri on file_unmanaged_save_data() write failures. It returns the $uri because we have one, and returning a stale version is likely better than returning no version. The updated version also duplicates, for css, the js code for handling a FALSE returned by drupal_build_css_cache(), by simply not putting the file into the css group. Surprisingly, if you do put it into the group with a value of FALSE, what gets generated is a css link to the root of your site.
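The FALSE handling described here boils down to a simple guard, sketched below. $builder is a stand-in for drupal_build_css_cache(), and the group structure is simplified.

```php
<?php
// Skip groups whose aggregate failed to build, instead of emitting an href
// of FALSE (which renders as a link to the site root). $builder stands in
// for drupal_build_css_cache().
function collect_aggregate_uris(array $groups, callable $builder) {
  $uris = [];
  foreach ($groups as $group) {
    $uri = $builder($group);
    if ($uri !== FALSE) {
      $uris[] = $uri;       // Only successful builds make it onto the page.
    }
  }
  return $uris;
}
```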

Priority: Normal » Critical

This issue hasn't gotten much attention yet. I just rewrote the description, and would like to make a case for this being a Drupal 7 critical issue. It's amazing we've done so well on Drupal 7, which speaks to how mature it really was, even pre-launch. But this issue, the one outlined here, has been our biggest cause of downtime. Without this patch, we can't change our theme or update our code without risking downtime.

I'm not convinced yet that this is the final solution. I'm advocating that others look at the problem, and help solve it.

I'm overstepping a bit by making this critical, but I'd like to make sure that some key people look.

Status: Needs review » Needs work

The last submitted patch, 886488.patch, failed testing.

subscribe.

Status: Needs work » Needs review
Attached: 13.23 KB. PASSED: [[SimpleTest]]: [MySQL] 27,393 pass(es).

ugh, I can't believe I did that, fixing variable name...

subscribing

Priority: Critical » Major

While I think this is a great change, and much needed for D7, I don't think it is critical (which doesn't mean it won't make D7, just that it won't block the release).

Put simply:
  • It doesn't break the API (although it does add some cleanup functions, so it could count as an API change).
  • It doesn't contain any strings.
  • It doesn't stop Drupal working in a fundamental way.

See #974066: Friendly reminder: on critical issues and RCs for information on criticals.

If someone else disagrees with me, please bump again.

Priority: Major » Critical
Attached: 13.25 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 886488_12.patch.

This is new behavior in D7 that prevents a high-usage site from updating their code without downtime; that, to me, stops D7 from working in a critical way (consider security updates). No disrespect Jeremy, I'm sure you've done great work in Drupal, but I hope you agree it would be nice to let a few more people see this before lowering the priority. I work with chx and catch, and both of those guys know about this issue. I know that either of them (or any of a dozen or more others) will lower the priority when they want to get Drupal out the door, or if they agree with you, and I'll accept it then.

Here's one more update: adding the version to the filename, so that upping the version, which already forces a new bundle to build, also updates the filename, forcing CDNs to clear. I thought that was already in here, but it wasn't. It's kind of a key component of the version idea.

Status: Needs review » Needs work

The last submitted patch, 886488.patch, failed testing.

The return FALSE bug should probably be its own issue; that's quite serious just by itself if you run into it, and it's not directly related to the stampede.

I think this is strictly speaking 'major' in the sense that it will only occur with a certain degree of traffic (or possibly on a site that's i/o constrained with medium traffic), however I don't think we should release with both this issue and #973436: variable_set() should rebuild the variable cache, not just clear it unresolved - since to me the combination of the two is critical (the css/js stampede can itself lead to a variable_set() stampede since that's where the map is stored).

These are extremely hard issues to track down, hard to fix, and there are currently limited sites which can be used as test-beds - which means it makes sense to fix it before people's sites start going down all over the place rather than after when we're flooded with obscure support requests.

Priority: Critical » Major

Let's stop with the status ping-pong. But I'm happy to take a look at this if it gets RTBC in time.

RE: The patch in #37

I've been testing this patch today on an EC2 VM. It is fairly easy to generate a css/js stampede simply using ab. You don't even need to simulate NFS-level IO conditions (although one could argue that EC2 is fairly close to that at times), which makes me wonder how long this has been a problem. I forward-ported the patch in #37 to work with the latest HEAD and tested it against a vanilla HEAD. The results were not encouraging: it basically had no impact, except for slowing down page load due to the locks. I added some logging and I think I know what is going on. While this patch introduces a lock and makes only a single process build a given css group at a time, the function never checks to see if another process has built the CSS group it's looking for after waiting for the lock (unless I'm missing something). Thus, all you really end up doing is forcing multiple processes to repeatedly build the same css bundles in sequence.

I've run out of time today, but will continue with this tomorrow and post something that checks for new entries in the $map after waiting for a lock.

Attached: 13.74 KB. PASSED: [[SimpleTest]]: [MySQL] 28,788 pass(es).

Attached is a slight edit of the patch in #37. As a note, I don't think this is an acceptable way to do this as it makes the logical flow of this function just weird and introduces another break in the loop. However, I wanted to post it to allow for testing of this method of checking the map.

With this patch, after clearing the js/css and hitting a site with 50 concurrent requests we see the following:

[23-Nov-2010 13:18:29] Got the css lock, building
[23-Nov-2010 13:18:29] Saving file public://css/css_nTykqH7_P-Sl6RvZSU5FglBKJax_EHxtkLwXS_CelI8.css
[23-Nov-2010 13:18:29] Did not get the lock and mtime is not set, waiting
[23-Nov-2010 13:18:29] Got the css lock, building
[23-Nov-2010 13:18:29] Saving file public://css/css_UZZSfikU7d17FOgy4Ft2z8A59kA-jeQ0IbTgYg1SDQ0.css
[23-Nov-2010 13:18:29] Got the css lock, building
[23-Nov-2010 13:18:29] Did not get the lock and mtime is not set, waiting
[23-Nov-2010 13:18:29] Did not get the lock and mtime is not set, waiting
[23-Nov-2010 13:18:29] Saving file public://css/css_h0Hq534EpnZFH3ps1FOXyyFyDR5sn8iiIadImLqr0FE.css
[23-Nov-2010 13:18:29] Did not get the lock and mtime is not set, waiting
[23-Nov-2010 13:18:29] Got the css lock, building
[23-Nov-2010 13:18:29] Saving file public://css/css_1Uc_CBofbQGqeIKgDSH-akZMSYiWg_pKxfxa4ffiBCk.css
[23-Nov-2010 13:18:29] Saving file public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js
[23-Nov-2010 13:18:29] Saving file public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz

As compared to this with just #37:

[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_f8Gzuz3XG20eax5GvJQHwYj--jYXFIfSk4pvMCGjzjc.css
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_f8Gzuz3XG20eax5GvJQHwYj--jYXFIfSk4pvMCGjzjc.css
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_f8Gzuz3XG20eax5GvJQHwYj--jYXFIfSk4pvMCGjzjc.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_451-qZiWBOIVu7qDk0wv_5WnfXM-OdrdCzYy8zvxbhU.css
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_451-qZiWBOIVu7qDk0wv_5WnfXM-OdrdCzYy8zvxbhU.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_451-qZiWBOIVu7qDk0wv_5WnfXM-OdrdCzYy8zvxbhU.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_451-qZiWBOIVu7qDk0wv_5WnfXM-OdrdCzYy8zvxbhU.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css__ZKx9t7pbm1l_4rHrOG_PayYpFTwjzZSwPKhiaWl-0w.css
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css__ZKx9t7pbm1l_4rHrOG_PayYpFTwjzZSwPKhiaWl-0w.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css__ZKx9t7pbm1l_4rHrOG_PayYpFTwjzZSwPKhiaWl-0w.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_2THG1eGiBIizsWFeexsNe1iDifJ00QRS9uSd03rY9co.css
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_2THG1eGiBIizsWFeexsNe1iDifJ00QRS9uSd03rY9co.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://css/css_2THG1eGiBIizsWFeexsNe1iDifJ00QRS9uSd03rY9co.css.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz
[22-Nov-2010 16:55:07] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://css/css_nTykqH7_P-Sl6RvZSU5FglBKJax_EHxtkLwXS_CelI8.css
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://css/css_UZZSfikU7d17FOgy4Ft2z8A59kA-jeQ0IbTgYg1SDQ0.css
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://css/css_h0Hq534EpnZFH3ps1FOXyyFyDR5sn8iiIadImLqr0FE.css
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://css/css_nTykqH7_P-Sl6RvZSU5FglBKJax_EHxtkLwXS_CelI8.css
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://css/css_1Uc_CBofbQGqeIKgDSH-akZMSYiWg_pKxfxa4ffiBCk.css
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js
[22-Nov-2010 16:55:48] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js.gz
[22-Nov-2010 16:55:49] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js
[22-Nov-2010 16:55:50] Running file_unmanaged_save_data with dest public://css/css_nTykqH7_P-Sl6RvZSU5FglBKJax_EHxtkLwXS_CelI8.css
[22-Nov-2010 16:55:50] Running file_unmanaged_save_data with dest public://js/js_VF53tELsw3kPql1FJD633rc2wNa0fCZV5Foz0aUfhvo.js

Great progress Narayan :) Yes, I see where the extra variable_initialize and map re-check is required only while we're rebuilding the bundle. I'm somewhat concerned about the extra variable initialize after the lock_acquire because of issues raised by catch, and would like to hear what he has to say, but I'm comforted that we only do this during the time that we try to rebuild the bundle.

this looks to be badly needed.

<?php
+        //We may have just waited, if so we need to recheck the map
+        $vars = variable_initialize(array());
+
+        $map = isset($vars['drupal_css_cache_files']) ? $vars['drupal_css_cache_files'] : array();
?>

i think the intent is to make sure we get the updated map if another thread already did the work? so, can we add a comment about why we need to use variable_initialize() instead of variable_get() here? also, the comment needs a space at the start and a full-stop at the end, and there's no need to pass the 'array()' to variable_initialize, as that's the default.

we could really use #973436: variable_set() should rebuild the variable cache, not just clear it here...

Yes this issue is a big part of why I opened the variable_set() one. Narayan's fix looks right to me pending Justin's comments.

Attached: three files (1.76 KB, 15.49 KB, 14.12 KB). PASSED: [[SimpleTest]]: [MySQL] 28,737 pass(es).

Updated patch to address Justin's comments.

Does this patch intentionally remove support for gzip'ing the cache files? That's a critical bug with this patch imo, unless I'm missing where this logic was moved to.

Also, when we find that a CSS file has been built while we were waiting for a lock we simply exit out of the loop, leaving the lock around. I suppose it doesn't much matter, anyone already waiting on a lock will detect the new map and break, and anyone that comes along afterwards won't bother trying to grab the lock.

I'm also attaching two debug patches. The "nopatch-debug" patch simply adds a sleep (to increase the race time) and debug output to Drupal 7 without adding locking -- with this patch applied, enable/disable a theme or otherwise trigger a CSS rebuild in one browser, and then load a page in another browser. This easily illustrates the race condition this patch is trying to fix, as we end up building multiple copies of the CSS cache (the same is true for the JS cache, though this patch doesn't show that).

The "-debug" patch adds a sleep and debug output to Drupal 7 along with the new locking logic -- with this patch applied, enable/disable a theme or otherwise trigger a CSS rebuild in one browser, and then load a page in another browser. You can then confirm that the lock is preventing multiple copies of the CSS cache from being built. (Navigate a second page to see the drupal_set_message output...) This also shows how necessary #973436: variable_set() should rebuild the variable cache, not just clear it is -- loading the page with a number of browsers while the CSS cache is being rebuilt, we reload variables from the database an excessive number of times (an additional sleep could make that more obvious).

Status: Needs work » Needs review
Attached: 12.86 KB. PASSED: [[SimpleTest]]: [MySQL] 28,780 pass(es).

I took another shot at this, trying to simplify the patch a bit. I removed the $version stuff as it's a new feature and not directly relevant to this bug. I also removed the patch to the system module, as I couldn't find anywhere that $map['clear'] is actually being set in the latest version of the patch. I also removed the special handling for the first time we fail to grab a lock as it was uncommented and I wasn't certain I understood why we do that, though perhaps that should be left in with comments added?

Otherwise, I restored the gzip functionality, perform the atomic map update for css and js caching inline, and added more comments to make things more readable. Sorry if this patch is a functional regression, but I was aiming to be sure there's nothing unnecessary in there that would prevent this from getting merged into core. All it's attempting to do is properly protect against the js/css stampede.

Attached: 13.81 KB.

And the debug version in case it's helpful to anyone trying to test this.

#46: 886488.patch queued for re-testing.

Attached: 13.65 KB. FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in common.inc.

i agree with jeremy's approach here - i think we need to change as little as possible at this stage.

attached patch is an attempt to improve on #47.

* don't need the $mtime stuff at all, we're only interested in whether an old file exists for $key
* don't lock around the gzip file, because the rewrite rules will handle a missing gzip file, so there's no need to make other process wait for that to hit the disc
* try harder to return the old $uri if we have issues saving the newly generated css data to disc
* lock around variable_del('drupal_css_cache_files') to avoid the race where we read the old map, delete, then write the old values again

i've only done the CSS part, but i think js should have similar treatment, pending feedback on the implementation.

oh yeah, this is another issue that would actually be testable if we had #850782: allow testing lock code via async http calls.

Status: Needs review » Needs work

The last submitted patch, css.stamped.patch, failed testing.

Status: Needs work » Needs review
Attached: 13.65 KB. FAILED: [[SimpleTest]]: [MySQL] 28,790 pass(es), 0 fail(s), and 19 exception(s).

this time without the syntax errors....

Status: Needs review » Needs work

The last submitted patch, css.stamped.patch, failed testing.

Status: Needs work » Needs review
Attached: 13.69 KB. PASSED: [[SimpleTest]]: [MySQL] 28,820 pass(es).

initialise lock_attempts.

Attached: 14.05 KB. PASSED: [[SimpleTest]]: [MySQL] 28,839 pass(es).

ok, js now gets the same treatment.

we really need tests of this functionality (locking aside). the patch at #55 was completely broken for the gzip stuff, but all tests passed :-(

Status: Needs review » Needs work

The last submitted patch, stampede.patch, failed testing.

Status: Needs work » Needs review

#56: stampede.patch queued for re-testing.

can't reproduce that fail locally...

Status: Needs review » Needs work

The last submitted patch, stampede.patch, failed testing.

Status: Needs work » Needs review

blurgh! not having much luck with the bot.

Issue tags: +Performance, +scalability

tagging.

I need to review this tonight. The bot should be in a better mood today.

#56: stampede.patch queued for re-testing.

Attached: 14.05 KB. PASSED: [[SimpleTest]]: [MySQL] 28,782 pass(es).

updates to keep up with HEAD only.

Status: Needs review » Needs work

The $mtime stuff already has its own issue: #678292: Make CSS and JS aggregation sensitive to file changes and be enabled by default so I'm OK with removing that from the patch.

Starting with #50 though it looks like we are introducing a silly regression: we are re-aggregating on every request! This should be put at the top:

<?php
if (!empty($map[$key]) && file_exists($map[$key])) {
  return $map[$key];
}
?>

I'm still digesting the rest of the patch, but we at least need to address the obvious :)

Assigned: Unassigned » carlos8f

Working on some cleanups.

Status: Needs work » Needs review
Attached: 14.38 KB. PASSED: [[SimpleTest]]: [MySQL] 28,809 pass(es).
  • Fixed bug -- drupal_build_*_cache() should aggregate only if $map[$key] file doesn't exist.
  • Fixed bug -- lock not released in the if ($lock_acquired) block of the CSS function before returning.
  • Fixed bug -- js_cache lock not respected in drupal_clear_js_cache().
  • Style -- simplified while () conditions a bit. Now does 5 tries instead of 4.
  • Style -- define $data closer to where it's used.
  • Style -- rewrote some parts of the comments.

Additionally, drupal_build_*_cache() are virtually identical and tough to update since changes need to be made to both js and css functions in separate parts of the behemoth common.inc. I think we should merge these into one and have a $type = 'css' or 'js' for sanity. That doesn't have to happen now though.
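The suggested merge could look roughly like the sketch below. Everything here (the function name, the hash step, the derived map variable) is illustrative of the idea, not the eventual API.

```php
<?php
// One builder parameterized by $type ('css' or 'js') instead of two
// near-identical copies in common.inc. The map variable name, filename
// prefix, and extension all derive from $type; $generate stands in for the
// css- or js-specific aggregation step.
function drupal_build_aggregate_cache_sketch($type, $key, callable $generate) {
  $map_variable = "drupal_{$type}_cache_files";  // e.g. drupal_css_cache_files
  $data = $generate($key);
  $filename = $type . '_' . hash('sha256', $data) . '.' . $type;
  // The shared lock, map lookup/update, and gzip handling would live here
  // once, instead of being duplicated per type.
  return "public://$type/$filename";
}
```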

thanks for the review! you've caught some nice bugs there.

i'll work on some tests today.

Assigned: carlos8f » Unassigned

.

Attached: 14.66 KB. PASSED: [[SimpleTest]]: [MySQL] 28,844 pass(es).

I've restored a somewhat simplified version system for the bundles. I think we need the version system for high volume sites because even with the race fixes here, some processes are going to have to wait, and our current wait is 1 second.

With a version system, someone can write a contrib module that (1) pre-generates new bundles before upping the version, completely bypassing the contention, and (2) creates new bundles after an update to your code, without breaking cached pages that point to the old bundles.

Why'd you change this to 5 tries instead of 4? If we had #802856: Make lock_wait() wait less I wouldn't mind bumping the number even more; but without it, on our site, every second is 25-50 processes, and an extra second adds too many waiting processes.

@douggreen, I changed it to 5 tries to make the code more readable, so the "< [number]" matches the number of tries. Otherwise you might think there are 5 tries when there are actually 4, due to the ++ being before the variable. And because I couldn't figure out why you would want 4 tries instead of 5 :) That is certainly negotiable.
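The pre- vs post-increment difference being discussed is easy to check with a small runnable sketch (the function name is illustrative):

```php
<?php
// With a pre-increment in the condition, `while (++$attempts < 5)` runs the
// body 4 times; with a post-increment, `while ($attempts++ < 5)` it runs 5.
function count_tries($pre_increment) {
  $attempts = 0;
  $tries = 0;
  while ($pre_increment ? (++$attempts < 5) : ($attempts++ < 5)) {
    $tries++;
  }
  return $tries;
}
```

count_tries(TRUE) gives 4 and count_tries(FALSE) gives 5, which is exactly the readability point: with the post-increment, the "< 5" matches the actual number of tries.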

maybe i'm missing something, but i don't see why versioning is necessary.

in the case where the file exists, but the cache hasn't been cleared, you could write a drush command to generate new cache files when you know your code has changed. the filename is based off of contents, so there are no caching issues.

a call to drupal_build_*_cache() writes the new file and updates the map, and new requests will just serve the existing file.

so, i think we should proceed with #68.

I'm in a rush, but last time I looked at this I also thought the filename was based on the contents, but it's not; it's based on the filenames only. This is a bug in itself, but afaik we can't fix it without loading all the files into PHP on every request, which seems silly.

Will try to review this more in depth later but the recent patches look a lot more straightforward.

filename is based on contents:

<?php
$filename = 'css_' . drupal_hash_base64($data) . '.css';
?>

Re: #75,

- $key is based on the list of files to aggregate.
- The value of $map[$key] ($uri) is a filename derived from a hash of the files' contents.
- If the $uri file exists, no action is taken (true in HEAD and in #68).
- If it's desired that $uri automatically changes with the file contents, mtime() would work for that, which already has an initiative in #678292: Make CSS and JS aggregation sensitive to file changes and be enabled by default.
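A rough sketch of the distinction (the names follow the discussion above; the exact hashing in the patch may differ):

<?php
// $key identifies the *set* of files; $uri embeds a hash of their *contents*.
$key = drupal_hash_base64(serialize($css_files));
$data = '';
foreach ($css_files as $file) {
  $data .= file_get_contents($file);
}
$uri = 'public://css/css_' . drupal_hash_base64($data) . '.css';
// The map keeps serving the stale $uri until a rebuild replaces it.
$map[$key] = $uri;
?>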

Patch #71 changes $key to take a $version into account. As it stands, it looks like that idea needs more hashing out (pun intended). Setting a variable to denote a "version" is messy and undocumented, and would be better suited as an API function that auto-increments (but we're beyond the point of new features like that).

Re: @justinrandell #74, for that idea to work you would need a cache of all the css/js groups or "bundles" to re-aggregate. I don't think that exists currently. So with those things in mind, I think #68 can stand on its own without a versioning scheme, and versioning can be attacked in a different issue.

Justin/Carlos thanks - makes sense with $key vs. $filename, I agree with moving discussion of that to a separate issue (or the existing mtime() one).

Status: new, 14.38 KB
PASSED: [[SimpleTest]]: [MySQL] 29,849 pass(es).

Re-upload of #68, because versioning is a feature and should have its own issue.

Status: new, 14.41 KB
PASSED: [[SimpleTest]]: [MySQL] 31,527 pass(es).

updated #80 to fix patch noise, no code changes.

Been thinking about this recently and realised two things:

1. Drupal currently saves a gzipped copy of every css/js aggregate in addition to the .css/.js version - this doubles the i/o load. Any site using a CDN and/or Varnish can afford to use mod_deflate on those files instead of having Drupal do it. That can save around 400-500ms per request.

2. We could also prevent the write stampede, or most of it at least, by not having every request try to build the file itself; instead, imagecache-style callbacks would do the generation - see #1014086: Consider using imagecache style generation of CSS and JS aggregates. This would be an even bigger change than the patch here, though, and I don't have code for it yet.
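A hedged sketch of what such an imagecache-style callback could look like (all function and path names here are hypothetical, not from any existing patch):

<?php
// Hypothetical page callback hit when a requested aggregate does not
// exist yet; build it once under a lock, then serve it.
function example_css_deliver($filename) {
  $uri = 'public://css/' . $filename;
  if (!file_exists($uri)) {
    $lock_name = 'css_deliver:' . $filename;
    if (lock_acquire($lock_name)) {
      // example_rebuild_aggregate() is an assumed helper that looks up
      // the source files for this aggregate and writes $uri.
      example_rebuild_aggregate($filename, $uri);
      lock_release($lock_name);
    }
    else {
      // Another process is building the file; wait for the lock to clear.
      lock_wait($lock_name);
    }
  }
  file_transfer($uri, array('Content-Type' => 'text/css'));
}
?>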

Issue tags:+i/o

Tagging.

Subscribing.

Version:7.x-dev» 8.x-dev
Category:bug» task
Issue tags:+needs backport to D7

Nice improvement for high-performance sites, but only borderline a bug.

Status: new, 14.26 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css-stampede-886488-86.patch.

Just a reroll to convert to git patch.

Status:Needs review» Needs work

The last submitted patch, css-stampede-886488-86.patch, failed testing.

Status:Needs work» Needs review
Status: new, 14.26 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch css-stampede-886488-86_0.patch. Unable to apply patch. See the log in the details link for more information.

Hopefully this is better

Tagging issues not yet using summary template.

Bump

fwiw we are running this code and enabled php-fpm during a site crash to help with debugging. It turns out the usleep() being used here was stacking up across multiple requests, causing the site to crash:

[13-Feb-2012 17:47:22] [pool www] pid 23743
script_filename = /var/www/html/index.php
[0x0000000015a27a90] usleep() /var/www/html/sites/all/modules/contrib/memcache/memcache-lock-code.inc:106
[0x0000000015a25c00] lock_wait() /var/www/html/includes/common.inc:3464
[0x0000000015a25580] drupal_build_css_cache() /var/www/html/includes/common.inc:3136
[0x0000000015a23ac8] drupal_aggregate_css() /var/www/html/includes/common.inc:3222
[0x0000000015a22ba8] drupal_pre_render_styles() /var/www/html/includes/common.inc:5829
[0x0000000015a223f0] drupal_render() /var/www/html/includes/common.inc:2968

Seems that using any kind of application-level locking to achieve this is a recipe for disaster on high-performance sites. This type of processing should probably be queued in some way, so that the end result is simply copying the rendered files into place rather than locking.
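One possible shape for a queued approach, as a sketch only (the function names are invented; this is not in the patch):

<?php
// Enqueue a rebuild instead of building inline; keep serving the stale
// aggregate until a worker swaps the new file into place.
function example_enqueue_rebuild(array $files, $uri) {
  $queue = DrupalQueue::get('example_css_rebuild');
  $queue->createItem(array('files' => $files, 'uri' => $uri));
}

// Worker (registered via hook_cron_queue_info()): build to a temporary
// file, then move it into place so readers never see a partial file.
function example_rebuild_worker($item) {
  $data = example_aggregate_contents($item['files']); // assumed helper
  $temp = drupal_tempnam('temporary://', 'css');
  file_unmanaged_save_data($data, $temp, FILE_EXISTS_REPLACE);
  file_unmanaged_move($temp, $item['uri'], FILE_EXISTS_REPLACE);
}
?>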

While it's not queuing, there is #1014086: Consider using imagecache style generation of CSS and JS aggregates and http://drupal.org/project/agrcache for D7 that takes this out of the page rendering pipeline. No locking, but the potential for stampedes ought to be reduced since it waits until the very last minute to generate files, and a single very popular aggregate that takes a long time to generate won't hold up any other processing at all.

I agree that the lock mechanisms/logic in this patch are problematic. Was going to try it but after reviewing it and reading other comments decided against it. catch's suggestion for an alternative approach seems more sensible to me.

just pointing out that #1014086: Consider using imagecache style generation of CSS and JS aggregates is active, and will make this easier.

Status:Needs review» Needs work
Issue tags:+Performance, +Needs issue summary update, +scalability, +i/o, +needs backport to D7

The last submitted patch, css-stampede-886488-86.patch, failed testing.

Issue tags:-i/o+i-o

(the slash in the i/o tag breaks the autocomplete from adding new tags)

Should this maybe be postponed on one of those issues? Or maybe just start with a reroll (http://drupal.org/patch/reroll).

This could use an issue summary update. Tips for writing an issue summary: http://drupal.org/node/1427826

Priority:Major» Normal
Status:Needs work» Postponed

Yes postponing this on #1014086: Consider using imagecache style generation of CSS and JS aggregates - that massively reduces the window and also avoids the page requests themselves being blocked on asset generation.