It seems that there is a concurrency problem with the way Drupal 5.x handles variable caching. Consider the following:
Assume that variables are not cached, i.e., there's no 'variables'-row in {cache}-table. Let's say that there's a variable defined in {variable} table: foo->bar
Process A starts executing variable_init:
* read variables from {variable} table by db_query('SELECT * FROM {variable}')
Process B sets a variable:
* variable_set('foo', 'baz');
Process A finishes executing variable_init:
* write cache by: cache_set('variables', 'cache', serialize($variables));
Now the variable cache holds foo->bar and not foo->baz as it should. Thus, the cache inconsistency.
This is a real problem. In a customer site this breaks cron once or twice a day, because of the cache inconsistency of cron_semaphore-variable. The cron_semaphore is deleted correctly in drupal_cron_run() but not from cache. Therefore, every cron.php execution attempt until 1 hour semaphore timeout results in a bogus
watchdog('cron', t('Attempting to re-run cron while it is already running.'), WATCHDOG_WARNING);
and the cron tasks are not run.
| Comment | File | Size | Author |
|---|---|---|---|
| #4 | bootstrap-inc-249185-4.patch | 1.73 KB | skiminki |
Comments
Comment #1
skiminki commentedNot actually tested, but by glancing the code, this seems to affect Drupal 6 too.
Comment #2
damien tournoud commentedThis is one of the race conditions that Drupal it is suffering from. I'm not sure there is any perfect solution.
Comment #3
skiminki commentedOk, I'm going to look at this and offer a patch in few days.
Comment #4
skiminki commentedHere's a patch to fix non-transient cache/variable inconsistency. For analysis, the idea of reworked variable_init() is as follows:
Functions variable_set and variable_del are untouched.
Some remarks:
NOTE!!! The patch is against Drupal 5.7.
Comment #5
damien tournoud commentedThis would solve the problem, but I fear the cost of this is huge (locking both the whole cache table and the variables table???). Could you show some performance analysis?
Bugs like this one should be corrected first in the current HEAD version (7.x), than back-ported.
Comment #6
skiminki commentedBriefly:
However, I'd still prefer not locking cache table. However, this would mean that we'd have to move variable cache to another table. I'm not sure if that'd be wise idea to do this late in 5.x series.
Comment #7
skiminki commentedIn a test site (D5.7 / mysql) with 75 enabled modules and 606 variables, cache rebuild operation takes 0.01-0.02 seconds under low load conditions. This is with regular unaccelerated mod_php. The test site is a full replica of a real production site.
Comment #8
skiminki commentedA simple question: Are you expecting me to make the D7 version of the patch? If so, this could be your lucky day, because I've got some slack time until my next commitments ;)
However, I'd like to propose a bit extended approach for D7:
I think this would make this variables thing as good as possible with the current architecture.
I could split this into three patches. The first patch would cover item 1 and should be backported into 5.x and 6.x. The second patch would cover item 2 and we could consider, whether or not to backport it. The third patch would cover items 3 and 4 and should probably not be backported into 5.x at least.
I'd like to have feedback on this before starting to work.
Comment #9
damien tournoud commentedI suggested a patch implementing soft locking for that type of operation in #251792.
Comment #12
moshe weitzman commentedWe have lock API in core now but I don't know that this solved in d7. I do think it is solved in latest pressflow for d6.
Comment #13
moshe weitzman commentedFYI, here is the Pressflow work in this area: http://bazaar.launchpad.net/~pressflow/pressflow/6/annotate/head:/includ.... note variable_init() locking and variable_cache_rebuild() has deferred db write
Comment #14
fgmI just discovered a problem with the PF patch for this issue under very heavy load: in such a situation, if a slow (or even failing) page is the one holding the variable regeneration lock, the attempt to lock for cache generation can cause every page to wait for the full duration until rebuild success of up to 30s in case of forced lock release, causing requests to pile up and overall throughput to decrease massively until one request finally succeeds in acquiring the lock and building the cache entry at the expiration of the delay.
I looked at the equivalent D7 implementation and, although different - notably because is uses the specific cache_bootstrap bin instead of the default bin - it should suffer from the same problem: if the lock is taken, every page will be waiting during rebuild and, if for any reason that rebuild fails, the whole site will be crawling to a halt for 30 seconds.
The current logic is like this (abstracting the maintenance case):
Pressflow:
Drupal 7 (swapped the "if" branches for easier comparison):
It seems that since the point here is to build a *cache* and not a stable reference, if would make better sense not to try to build the cache if the lock can not be taken, and just proceed from there by querying the DB, just like normally happens on any case of cache unavailability. It could go like this:
That way, if the lock is not acquired, the cache is not initialized with the possibly invalid data concurrency-wise (as to the topic of the thread), and all page requests can still be served from the DB while the cache is being built, only incurring the extra cost of the db query (liable to be in the mysql cache quite often anyway), instead of waiting for the cache reconstruction, which can be long if the page holding the lock has some problem like being aborted of even just being slow because it happens perchance to be a heavy one.
Operations modifying the variables and clearing the variable cache for this, (vset/vdel) are not affected, be they immediate as in D7 or registered for shutdown as in PF.
(on a related note, not sure why D7 performs no check on recursion depth ?)
Comment #15
Anonymous (not verified) commentedi agree that this needs to be changed - the recursive lock call stuff that has creeped in to D7 is an anti-pattern we've got to stamp out. for reference, see #848044: image_style_generate() should use lock_wait() and #802446: Cache stampede protection: drupal_render() and page cache, where i'm trying to replace the recursion in favour of some number of attempts to get the lock, then deal with failure in whatever way makes sense in the context of the code trying to acquire the lock.
i'd favour still trying to get the lock a couple of times in this case, with a much shorter lock period, then proceed to hit the database to rebuild the variables cache failure. there's very little point locking around the cache_set, as it doesn't protect us from the stampede (the cache_set is not the expensive part) and it doesn't help with the concurrency issues raised by the OP.
if we do a couple of quick lock attempts, in the worst case, we'll lose a few seconds, and have a stampede rebuilding the cache, but not as a big as hit as the current code, and in the normal case, we get to keep the stampede protection.
Comment #16
Anonymous (not verified) commentedi think this is now a dup of #973436: Overzealous locking in variable_initialize().
please reopen if i'm missing something.
Comment #17
fgmI fail to appreciate how an issue worked on since 2008 can be a duplicate of one opened in 11/2010, but nevermind.