Patch for D7 is available.
We need locking around the css and js aggregation. We added locking to other subsystems in Drupal 7, and this one was simply missed. Without locking, multiple processes try to build and rebuild the css and js. On our site, this puts a high load on the file system, which causes Apache to throw 503 errors.
Also, if you simply update the css or js, the files will not rebuild. Say you upgrade a module and it adds 5 lines to its css, or say you modify your theme and change things around: if the bundle is already built, and no new files were added to the page, then the bundle will not rebuild. I've introduced a "version" for this. The intent of the "version" is that some contrib module can be written to allow incrementing the version, either from an admin page or via drush; the site developers will know when this needs to be done, and will simply increment the version, forcing a rebuild.
An additional problem surfaced while trying to solve this: if the creation of the css file fails, we currently return FALSE; however, the function that calls drupal_build_css_cache() doesn't interpret the result, and simply adds the empty string as a css file. This is an obvious bug. We need to handle the FALSE return in css, just like we do in js.
So my solution is to use locks so that we generate each bundle in only one process (allowing multiple bundles to build simultaneously, but making sure only one process builds each bundle), to use a global lock when saving the map, and to return stale bundles during the rebuild or on error.
Anyone new to this issue can probably skip ahead to #30 or so.
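The per-bundle locking idea described above can be sketched roughly as follows. This is a simplified, hypothetical illustration using Drupal 7's lock API (lock_acquire(), lock_wait(), lock_release()); the helper name and exact flow are illustrative, not the actual patch:

```php
<?php
// Hypothetical sketch of per-bundle locking for css aggregation.
// Not the patch itself; an illustration of the approach.
function example_build_css_cache($key, $data) {
  $map = variable_get('drupal_css_cache_files', array());
  $lock_name = 'css_cache_build:' . $key;
  $attempts = 0;
  while (empty($map[$key]) || !file_exists($map[$key])) {
    if (lock_acquire($lock_name)) {
      // We hold the lock: build the bundle, then update the map.
      $uri = 'public://css/css_' . drupal_hash_base64($data) . '.css';
      if (file_unmanaged_save_data($data, $uri, FILE_EXISTS_REPLACE)) {
        $map[$key] = $uri;
        variable_set('drupal_css_cache_files', $map);
      }
      lock_release($lock_name);
      break;
    }
    if (++$attempts > 4) {
      // Give up and serve whatever we have, possibly a stale bundle.
      break;
    }
    // Another process is building this bundle: wait, then re-read the map
    // in case it finished. (In stock D7 a fresh read needs
    // variable_initialize(), since variables are statically cached.)
    lock_wait($lock_name);
    $map = variable_get('drupal_css_cache_files', array());
  }
  return isset($map[$key]) ? $map[$key] : FALSE;
}
```

The key property is that only the lock holder writes the file and the map; everyone else either waits briefly or serves the stale bundle.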
Issue fork drupal-886488
Comments
Comment #2
catch: css_stampede.patch queued for re-testing.
Comment #3
catch: Adding js.
Comment #5
catch: minus syntax error.
Comment #7
douggreen CreditAttribution: douggreen commented: Please initialize $lock_acuire_attempts = 0; in both cases. Also, the logic for $lock_acquired seems to be wrong, or relies on some mystical operator precedence. I'd change this to
Comment #8
catch: Updated patch. Some of these fixes are direct from Doug, so please credit him in any commits.
Comment #10
catch: Should hopefully apply this time, and this time with a global lock instead of one per file.
Comment #11
catch: One more time; use a variable for $lock_name.
Comment #12
douggreen CreditAttribution: douggreen commented: The previous patch by catch still does not solve our cache stampede problem. Forcing everything on the site to rebuild simultaneously causes too long of a wait. This new approach uses a single lock for css and another for js, and lets the system rebuild them one at a time, serving up the older files when it can't get a lock.
Comment #13
douggreen CreditAttribution: douggreen commented: And a fix to that release, which unlocks the right key (catch's last patch did this right, but mine did not), and also unlocks the key on write failure.
Comment #14
catch: Write-through seems the right way to go for this; however, it looks like the only window for that to work is between the variable_set() and the file deletion. If that's right, is there a way for us to do the delete itself a bit later?
Comment #15
catch: Doug pointed out that drupal_delete_if_stale defaults to 30 days, so there's a reasonable chance of the old files still existing, and you can set the threshold higher if you need it.
Comment #16
douggreen CreditAttribution: douggreen commented: The only problem I see with this approach is that the maps might contain entries that are obsolete. When you push new code that requires a js or css rebuild, other things probably changed, and the hashes on some pages might change. We probably want to occasionally remove stale map entries, maybe on cron, that exist in the map but have an older time than the clear. I'll add that later today.
Comment #17
douggreen CreditAttribution: douggreen commented: New patch:
* adds map cleanup on system cron
* only sets $map['clear'] if there's something in it
Comment #18
douggreen CreditAttribution: douggreen commented: One small tweak, to unset 'clear' if it's no longer needed.
Also, I see an additional problem: the map is retrieved before we get a lock. It's probable that if we don't get the lock on the first attempt, another process will update the map, and that once this process gets the lock, it will revert anything the other process did to the map. It's even possible that this was the root cause of needing a global lock (and a global lock just minimized the time that this was an issue), and that after this change, we might be able to go back to a lock on the individual keys.
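The read-before-lock race described here can be sketched as follows (a hypothetical simplification; variable names are illustrative):

```php
<?php
// Buggy ordering: the map is read before the lock is acquired, so the
// copy we hold can be stale by the time we write it back, reverting
// updates made by other processes in the meantime.
$map = variable_get('drupal_css_cache_files', array());
lock_acquire($lock_name);
$map[$key] = $new_uri;
variable_set('drupal_css_cache_files', $map);  // May clobber concurrent updates.
lock_release($lock_name);

// Safer ordering: (re-)read the map only after the lock is held, so the
// write is based on the latest state. In D7 a truly fresh read requires
// variable_initialize(), since variables are statically cached per request.
lock_acquire($lock_name);
$map = variable_get('drupal_css_cache_files', array());
$map[$key] = $new_uri;
variable_set('drupal_css_cache_files', $map);
lock_release($lock_name);
```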
Comment #19
douggreen CreditAttribution: douggreen commented: ... and now, only re-read the map if file_unmanaged_save_data() succeeds. We probably never hit this condition, but it saves us from re-reading the variables on failure.
Comment #20
catch: lock_wait() needs the name of the lock as an argument.
Comment #21
douggreen CreditAttribution: douggreen commented: Updated patch always rebuilds the file if it is missing or being cleared. The extra check for file existence was preventing rebuilds on clear; it was a legacy check from the old code, and the new code should only get here if the file doesn't exist or needs to be cleared.
Comment #23
douggreen CreditAttribution: douggreen commented: Oops, I did the same thing last time.
Comment #24
douggreen CreditAttribution: douggreen commented: We're still having problems. The updated patch makes sure to change the map file atomically. We're also trying a version that uses cache_set()/cache_get(), and will post that next if it appears to work.
Comment #26
douggreen CreditAttribution: douggreen commented: Fixing the patch file again, and updating to HEAD.
Comment #28
douggreen CreditAttribution: douggreen commented: Add the missing contents line that was lost in the merge.
Comment #30
douggreen CreditAttribution: douggreen commented: The updated version checks if the map exists before writing it, which was the root cause of the last testbot failure.
The updated version returns the $uri on file_unmanaged_save_data() write failures. It returns the $uri because we have one, and returning a stale version is likely better than returning no version. It also duplicates, for css, the js code for handling a FALSE return from drupal_build_css_cache(), by simply not putting the entry into the css group. Surprisingly, if you do put it into the group with a value of FALSE, what gets generated is a css link to the root of your site.
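The css-side FALSE handling described here mirrors the existing js pattern; roughly (illustrative sketch, not the exact patch hunk):

```php
<?php
// Only add the aggregate if the build succeeded. Adding a FALSE entry
// would render a stylesheet link pointing at the site root, as noted above.
foreach ($css_groups as $group_key => $group) {
  $uri = drupal_build_css_cache($group);
  if ($uri) {
    $files[$uri] = $uri;
  }
  // On FALSE, the group is simply skipped; a stale aggregate or the
  // unaggregated files are a better fallback than a broken link.
}
```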
Comment #31
douggreen CreditAttribution: douggreen commented: This issue hasn't gotten much attention yet. I just rewrote the description, and would like to make a case for this being a Drupal 7 critical issue. It's amazing we've done so well on Drupal 7, which speaks to how mature it really was, even pre-launch. But this issue, the one outlined here, has been our biggest cause of downtime. Without this patch, we can't change our theme, and we can
I'm not convinced yet that this is the final solution. I'm advocating that others look at the problem, and help solve it.
I'm overstepping a bit by making this critical, but I'd like to make sure that some key people look.
Comment #33
Anonymous (not verified) CreditAttribution: Anonymous commented: subscribe.
Comment #34
douggreen CreditAttribution: douggreen commented: Ugh, I can't believe I did that. Fixing variable name...
Comment #35
carlos8f CreditAttribution: carlos8f commented: subscribing
Comment #36
JeremyFrench CreditAttribution: JeremyFrench commented: While I think this is a great change, and much needed for D7, I don't think it is critical (which doesn't mean it won't make D7, just that it won't block the release).
Put simply:
It doesn't break the API (although it does add some clean-up functions, so it could count as an API change).
It doesn't contain any strings.
It doesn't stop Drupal working in a fundamental way.
See #974066: Friendly reminder: on critical issues and RCs for information on criticals.
If someone else disagrees with me, please bump again.
Comment #37
douggreen CreditAttribution: douggreen commented: This is a new feature of D7 that prevents a high-usage site from updating its code without downtime; to me, that stops D7 from working in a critical way. Consider security updates. No disrespect, Jeremy; I'm sure you've done great work in Drupal. I hope you agree it would be nice to let a few more people see this before lowering the priority. I work with chx and catch, and both of those guys know about this issue. I know that either of them (or any of a dozen or more others) will lower the priority when they want to get Drupal out the door or if they agree with you, and I'll accept it then.
Here's one more update: adding the version to the filename, so that upping the version, which already forces a new bundle to build, also updates the filename, forcing CDNs to clear. I thought that was already in here, but it wasn't. It's kind of a key component of the version idea.
Comment #39
catch: The return FALSE bug should probably be its own issue; that's quite serious just by itself if you run into it, and it's not directly related to the stampede.
I think this is strictly speaking 'major', in the sense that it will only occur with a certain degree of traffic (or possibly on a site that's i/o constrained with medium traffic). However, I don't think we should release with both this issue and #973436: Overzealous locking in variable_initialize() unresolved, since to me the combination of the two is critical (the css/js stampede can itself lead to a variable_set() stampede, since that's where the map is stored).
These are extremely hard issues to track down, hard to fix, and there are currently limited sites which can be used as test beds. That means it makes sense to fix this before people's sites start going down all over the place, rather than after, when we're flooded with obscure support requests.
Comment #40
webchick: Let's stop with the status ping-pong. But I'm happy to take a look at this if it gets RTBC in time.
Comment #41
nnewton CreditAttribution: nnewton commented: RE: The patch in #37
I've been testing this patch today on an EC2 VM. It is fairly easy to generate a css/js stampede simply using ab. You don't even need to simulate NFS-level IO conditions (although one could argue that EC2 is fairly close to that at times), which makes me wonder how long this has been a problem. I forward-ported the patch in #37 to work with the latest HEAD and tested it against a vanilla HEAD. The results were not encouraging: it basically had no impact except for slowing down page load due to the locks. I added some logging and I think I know what is going on. While this patch introduces a lock and makes only a single process build a given css group at a time, the function never checks to see if another process has built the css group it's looking for after waiting for the lock (unless I'm missing something). Thus, all you really end up doing is forcing multiple processes to repeatedly build the same css bundles in sequence.
I've run out of time today, but will continue with this tomorrow and post something that checks for new entries in the $map after waiting for a lock.
Comment #42
nnewton CreditAttribution: nnewton commented: Attached is a slight edit of the patch in #37. As a note, I don't think this is an acceptable way to do this, as it makes the logical flow of this function just weird and introduces another break in the loop. However, I wanted to post it to allow for testing of this method of checking the map.
With this patch, after clearing the js/css and hitting a site with 50 concurrent requests we see the following:
As compared to this with just #37:
Comment #43
douggreen CreditAttribution: douggreen commented: Great progress, Narayan :) Yes, I see where the extra variable_initialize() and map re-check are required only while we're rebuilding the bundle. I'm somewhat concerned about the extra variable_initialize() after the lock_acquire() because of issues raised by catch, and would like to hear what he has to say, but I'm comforted that we only do this while trying to rebuild the bundle.
Comment #44
Anonymous (not verified) CreditAttribution: Anonymous commented: this looks to be badly needed.
i think the intent is to make sure we get the updated map if another thread already did the work? so, can we add a comment about why we need to use variable_initialize() instead of variable_get() here? also, the comment needs a space at the start and a full-stop at the end, and there's no need to pass the 'array()' to variable_initialize, as that's the default.
we could really use #973436: Overzealous locking in variable_initialize() here...
Comment #45
catch: Yes, this issue is a big part of why I opened the variable_set() one. Narayan's fix looks right to me, pending Justin's comments.
Comment #46
Jeremy CreditAttribution: Jeremy commented: Updated patch to address Justin's comments.
Does this patch intentionally remove support for gzipping the cache files? That's a critical bug with this patch imo, unless I'm missing where this logic was moved to.
Also, when we find that a CSS file has been built while we were waiting for a lock, we simply exit out of the loop, leaving the lock around. I suppose it doesn't much matter: anyone already waiting on a lock will detect the new map and break, and anyone that comes along afterwards won't bother trying to grab the lock.
I'm also attaching two debug patches. The "nopatch-debug" patch simply adds a sleep (to widen the race window) and debug output to Drupal 7 without adding locking. With this patch applied, enable/disable a theme or otherwise trigger a CSS rebuild in one browser, and then load a page in another browser. This easily illustrates the race condition this patch is trying to fix, as we end up building multiple copies of the CSS cache (the same is true for the JS cache, though this patch doesn't show that).
The "-debug" patch adds sleep and debug output to Drupal 7 along with the new locking logic. With this patch applied, enable/disable a theme or otherwise trigger a CSS rebuild in one browser, and then load a page in another browser. You can then confirm that the lock is preventing multiple copies of the CSS cache from being built. (Navigate to a second page to see the drupal_set_message() output...) This also shows how necessary #973436: Overzealous locking in variable_initialize() is: loading the page with a number of browsers while the CSS cache is being rebuilt, we reload variables from the database an excessive number of times (an additional sleep could make that more obvious).
Comment #47
Jeremy CreditAttribution: Jeremy commented: I took another shot at this, trying to simplify the patch a bit. I removed the $version stuff, as it's a new feature and not directly relevant to this bug. I also removed the patch to the system module, as I couldn't find anywhere that $map['clear'] is actually being set in the latest version of the patch. I also removed the special handling for the first time we fail to grab a lock, as it was uncommented and I wasn't certain I understood why we do that; though perhaps that should be left in, with comments added?
Otherwise, I restored the gzip functionality, performed the atomic map update for css and js caching inline, and added more comments to make things more readable. Sorry if this patch is a functional regression, but I was aiming to be sure there's nothing unnecessary in there that would prevent this from getting merged into core. All it's attempting to do is properly protect against the js/css stampede.
Comment #48
Jeremy CreditAttribution: Jeremy commented: And the debug version, in case it's helpful to anyone trying to test this.
Comment #49
Jeremy CreditAttribution: Jeremy commented: #46: 886488.patch queued for re-testing.
Comment #50
Anonymous (not verified) CreditAttribution: Anonymous commented: i agree with jeremy's approach here - i think we need to change as little as possible at this stage.
attached patch is an attempt to improve on #47.
* don't need the $mtime stuff at all, we're only interested in whether an old file exists for $key
* don't lock around the gzip file, because the rewrite rules will handle a missing gzip file, so there's no need to make other process wait for that to hit the disc
* try harder to return the old $uri if we have issues saving the newly generated css data to disc
* lock around variable_del('drupal_css_cache_files') to avoid the race where we read the old map, delete, then write the old values again
i've only done the CSS part, but i think js should have similar treatment, pending feedback on the implementation.
Comment #51
Anonymous (not verified) CreditAttribution: Anonymous commented: oh yeah, this is another issue that would actually be testable if we had #850782: allow testing lock code via async http calls.
Comment #53
Anonymous (not verified) CreditAttribution: Anonymous commented: this time without the syntax errors....
Comment #55
Anonymous (not verified) CreditAttribution: Anonymous commented: initialise lock_attempts.
Comment #56
Anonymous (not verified) CreditAttribution: Anonymous commented: ok, js now gets the same treatment.
we really need tests of this functionality (locking aside). the patch at #55 was completely broken for the gzip stuff, but all tests passed :-(
Comment #58
Anonymous (not verified) CreditAttribution: Anonymous commented: #56: stampede.patch queued for re-testing.
Comment #59
Anonymous (not verified) CreditAttribution: Anonymous commented: can't reproduce that fail locally...
Comment #61
Anonymous (not verified) CreditAttribution: Anonymous commented: blurgh! not having much luck with the bot.
Comment #62
Anonymous (not verified) CreditAttribution: Anonymous commented: tagging.
Comment #63
carlos8f CreditAttribution: carlos8f commented: I need to review this tonight. The bot should be in a better mood today.
Comment #64
carlos8f CreditAttribution: carlos8f commented: #56: stampede.patch queued for re-testing.
Comment #65
Anonymous (not verified) CreditAttribution: Anonymous commented: updates to keep up with HEAD only.
Comment #66
carlos8f CreditAttribution: carlos8f commented: The $mtime stuff already has its own issue, #678292: Add a development mode to allow changes to be picked up even when CSS/JS aggregation is enabled, so I'm OK with removing that from the patch.
Starting with #50, though, it looks like we are introducing a silly regression: we are re-aggregating on every request! This should be put at the top:
I'm still digesting the rest of the patch, but we at least need to address the obvious :)
Comment #67
carlos8f CreditAttribution: carlos8f commented: Working on some cleanups.
Comment #68
carlos8f CreditAttribution: carlos8f commented: if ($lock_acquired) {
block of the CSS function before returning. Additionally, drupal_build_*_cache() are virtually identical and tough to update, since changes need to be made to both the js and css functions in separate parts of the behemoth common.inc. I think we should merge these into one and have a $type = 'css' or 'js', for sanity. That doesn't have to happen now, though.
Comment #69
Anonymous (not verified) CreditAttribution: Anonymous commented: thanks for the review! you've caught some nice bugs there.
i'll work on some tests today.
Comment #70
carlos8f CreditAttribution: carlos8f commented: .
Comment #71
douggreen CreditAttribution: douggreen commented: I've restored a somewhat simplified version system for the bundles. I think we need the version system for high-volume sites because, even with the race fixes here, some processes are going to have to wait, and our current wait is 1 second.
With a version system, someone can write a contrib module that (1) pre-generates new bundles before upping the version, and thus completely bypasses the contention, and (2) creates new bundles after an update to your code, without breaking cached pages that reference the old bundles.
Comment #72
douggreen CreditAttribution: douggreen commented: Why'd you change this to 5 tries instead of 4? If we had #802856: Make lock_wait() wait less, I wouldn't mind bumping the number even more; but without it, on our site, every second is 25-50 processes, and an extra second adds too many waiting processes.
Comment #73
carlos8f CreditAttribution: carlos8f commented: @douggreen, I changed it to 5 tries to make the code more readable, so the "< [number]" matches the number of tries. Otherwise you might think there are 5 tries when there are actually 4, due to the ++ being before the variable. And because I couldn't figure out why you would want 4 tries instead of 5 :) That is certainly negotiable.
Comment #74
Anonymous (not verified) CreditAttribution: Anonymous commented: maybe i'm missing something, but i don't see why versioning is necessary.
in the case where the file exists, but the cache hasn't been cleared, you could write a drush command to generate new cache files when you know your code has changed. the filename is based off of contents, so there are no caching issues.
a call to drupal_build_*_cache() writes the new file and updates the map, and new requests will just serve the existing file.
so, i think we should proceed with #68.
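The drush idea floated in this comment could look something like the following. This is an entirely hypothetical command (no such command exists in core or the patch), and, as noted later in the thread, it assumes a stored record of which file groups have been seen, which core does not keep:

```php
<?php
// Hypothetical drush command: pre-generate aggregate bundles after a code
// push so live requests never build them. 'example_seen_css_groups' is an
// invented variable; core keeps no such list of bundles.
function drush_example_rebuild_aggregates() {
  $seen_groups = variable_get('example_seen_css_groups', array());
  foreach ($seen_groups as $group) {
    // Rebuild each known bundle; a content-based filename means changed
    // contents produce a new file and an updated map entry.
    drupal_build_css_cache($group);
  }
  drush_log(dt('Rebuilt @count css bundles.', array('@count' => count($seen_groups))), 'ok');
}
```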
Comment #75
catch: I'm in a rush, but last time I looked at this I also thought the filename was based on contents. It's not; it's based on the filenames only. This is a bug in itself, but afaik we can't fix it without loading all the files into php on every request, which seems silly.
Comment #76
catch: Will try to review this more in depth later, but the recent patches look a lot more straightforward.
Comment #77
Anonymous (not verified) CreditAttribution: Anonymous commented: filename is based on contents:
Comment #78
carlos8f CreditAttribution: carlos8f commented: Re: #75,
- $key is based on the list of files to aggregate.
- The value of $map[$key] ($uri) is a filename hashed with the file's contents.
- If the $uri file exists, no action is taken (true in HEAD and in #68).
- If it's desired that $uri automatically change with the file contents, mtime() would work for that, which already has an initiative in #678292: Add a development mode to allow changes to be picked up even when CSS/JS aggregation is enabled.
Patch #71 changes $key to take a $version into account. As it stands, it looks like that idea needs more hashing out (pun intended). Setting a variable to denote a "version" is messy, not documented, and would be better suited as an API function that auto-increments (but we're beyond the point of new features like that).
Re: @justinrandell #74: for that idea to work, you would need a cache of all the css/js groups or "bundles" to re-aggregate. I don't think that exists currently. So with those things in mind, I think #68 can stand on its own without a versioning scheme, and versioning can be attacked in a different issue.
Comment #79
catch: Justin/Carlos, thanks. That makes sense with $key vs. $filename; I agree with moving discussion of that to a separate issue (or the existing mtime() one).
Comment #80
carlos8f CreditAttribution: carlos8f commented: Re-upload of #68, because versioning is a feature and should have its own issue.
Comment #81
Anonymous (not verified) CreditAttribution: Anonymous commented: updated #80 to fix patch noise, no code changes.
Comment #82
catch: Been thinking about this recently and realised two things:
1. Drupal currently saves a gzipped copy of all css/js aggregates in addition to the .css/.js version; this doubles the i/o load. Any site using a CDN and/or Varnish can afford to use mod_deflate on those files instead of having Drupal do it. This can save around 400-500 ms per request.
2. We could also prevent the write stampede, or most of it at least, by not trying to have every request build the file themselves, instead have imagecache style callbacks that do the generation - see #1014086: Stampedes and cold cache performance issues with css/js aggregation. This would be an even bigger change than the patch here, and I don't have code for it yet though.
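For point 1 above, offloading compression to the web server might look roughly like this. This is an illustrative Apache sketch, not Drupal's shipped .htaccess rules:

```apache
# Compress aggregated css/js on the fly instead of having Drupal write
# .gz copies to disk. Assumes mod_deflate is enabled.
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/css application/javascript
</IfModule>
```

The trade-off is CPU on each response versus disk i/o at build time; a CDN or Varnish in front caches the compressed response anyway, which is why the comment argues those sites can afford it.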
Comment #83
catch: Tagging.
Comment #84
Wim Leers: Subscribing.
Comment #85
sun: Nice improvement for high-performance sites, but only borderline a bug.
Comment #86
marcingy CreditAttribution: marcingy commented: Just a reroll to convert to a git patch.
Comment #88
marcingy CreditAttribution: marcingy commented: Hopefully this is better.
Comment #89
xjm: Tagging issues not yet using the summary template.
Comment #90
bryancasler CreditAttribution: bryancasler commented: Bump.
Comment #91
cyberswat CreditAttribution: cyberswat commented: FWIW, we are running this code and enabled php-fpm during a site crash to help with debugging. It turns out the usleep() being used here was stacking up with multiple requests, causing the site to crash:
It seems that using any type of application-level locking to achieve this is a recipe for disaster on high-performance sites. This type of processing should probably be queued in some way, so that the end result is simply copying the rendered files into place rather than locking.
Comment #92
catch: While it's not queuing, there is #1014086: Stampedes and cold cache performance issues with css/js aggregation, and http://drupal.org/project/agrcache for D7, which takes this out of the page rendering pipeline. No locking, but the potential for stampedes ought to be reduced, since it waits until the very last minute to generate files, and a single very popular aggregate that takes a long time to generate won't hold up any other processing at all.
Comment #93
Mark Theunissen CreditAttribution: Mark Theunissen commented: I agree that the lock mechanisms/logic in this patch are problematic. I was going to try it, but after reviewing it and reading other comments, decided against it. catch's suggestion of an alternative approach seems more sensible to me.
Comment #94
Anonymous (not verified) CreditAttribution: Anonymous commented: just pointing out #1014086: Stampedes and cold cache performance issues with css/js aggregation is active, and will make this easier.
Comment #95
YesCT CreditAttribution: YesCT commented: #88: css-stampede-886488-86.patch queued for re-testing.
Comment #97
YesCT CreditAttribution: YesCT commented: (the slash in the i/o tag breaks the autocomplete from adding new tags)
Comment #98
YesCT CreditAttribution: YesCT commented: should this maybe be postponed on one of those issues?
or maybe just start with
reroll (http://drupal.org/patch/reroll)
could use an issue summary update. tips to do an issue summary: http://drupal.org/node/1427826
Comment #99
catch: Yes, postponing this on #1014086: Stampedes and cold cache performance issues with css/js aggregation. That massively reduces the window, and also avoids the page requests themselves being blocked on asset generation.
Comment #100
MiroslavBanov CreditAttribution: MiroslavBanov commented: I got a deadlock on a high-profile (Drupal 7) site. The patches do not apply any more, so I re-rolled the latest. This applies to 7.34.
Comment #101
mikeytown2 CreditAttribution: mikeytown2 commented: Testing #100.
Comment #102
mikeytown2 CreditAttribution: mikeytown2 commented: Back to D8 postponed. #100 passed the tests.
Just a heads-up that AdvAgg offers this for D7 without hacking core.
Comment #105
anavarre
Comment #114
Darren Oh: Reroll.
Comment #115
Darren Oh
Comment #118
nod_: duplicate of #1014086: Stampedes and cold cache performance issues with css/js aggregation?
Comment #119
catch: The approach is different, but it's trying to resolve the same issue, and (hopefully) #1014086: Stampedes and cold cache performance issues with css/js aggregation can land before too much longer.
Comment #120
Darren Oh
Comment #121
bburg: The patch in #114 appears to work, so far at least, on this one high-traffic site I was managing. CSS was randomly being generated as empty files, which would happen intermittently. I had been pulling my hair out over this for months until I came across this issue and deployed the patch last week. I'll follow up if it regresses again, but fingers crossed! This really should be applied to core.
Comment #123
Darren Oh
Comment #124
Darren Oh
Comment #125
joseph.olstad
Comment #126
joseph.olstad
Comment #127
joseph.olstad: A D10 test looks something like:
new file core/tests/Drupal/FunctionalTests/Asset/AssetOptimizationTest.php
Comment #128
catch: This isn't going to be testable, or not easily, because it's specifically dealing with a race condition where multiple requests hit the same URL at the same time.
The Drupal 10 issue landed, but took an entirely new approach [#3301716].
Comment #129
joseph.olstad: OK, sounds good; removing needs tests.
Comment #130
joseph.olstad