It seems that there is a concurrency problem with the way Drupal 5.x handles variable caching. Consider the following:

Assume that variables are not cached, i.e., there's no 'variables'-row in {cache}-table. Let's say that there's a variable defined in {variable} table: foo->bar

Process A starts executing variable_init:
* read variables from {variable} table by db_query('SELECT * FROM {variable}')

Process B sets a variable:
* variable_set('foo', 'baz');

Process A finishes executing variable_init:
* write cache by: cache_set('variables', 'cache', serialize($variables));

Now the variable cache holds foo->bar and not foo->baz as it should. Thus, the cache inconsistency.

This is a real problem. In a customer site this breaks cron once or twice a day, because of the cache inconsistency of cron_semaphore-variable. The cron_semaphore is deleted correctly in drupal_cron_run() but not from cache. Therefore, every cron.php execution attempt until 1 hour semaphore timeout results in a bogus
watchdog('cron', t('Attempting to re-run cron while it is already running.'), WATCHDOG_WARNING);
and the cron tasks are not run.

CommentFileSizeAuthor
#4 bootstrap-inc-249185-4.patch1.73 KBskiminki

Comments

skiminki’s picture

Not actually tested, but by glancing the code, this seems to affect Drupal 6 too.

damien tournoud’s picture

This is one of the race conditions that Drupal it is suffering from. I'm not sure there is any perfect solution.

skiminki’s picture

Ok, I'm going to look at this and offer a patch in few days.

skiminki’s picture

Version: 7.x-dev » 5.7
StatusFileSize
new1.73 KB

Here's a patch to fix non-transient cache/variable inconsistency. For analysis, the idea of reworked variable_init() is as follows:

$cached = cache_get('variables', 'cache');
if (!$cached) {
  LOCK TABLES(cache, variable) IN EXCLUSIVE MODE;
  $cached = cache_get('variables', 'cache');
  if (!$cached) {
    // rebuild cache from variable table
  }
  UNLOCK TABLES;
}
return $variables=unserialize($cached);

Functions variable_set and variable_del are untouched.

Some remarks:

  • There is still a slight possibility that variable_init returns inconsistent data, because functions variable_set and variable_del modify the variable-table first and then invalidates the cache. However, as cache invalidation cannot be missed by variable_init, this problem is transient.
  • Note that adding extra synchronization to variable_set and variable_del would not remedy the above, because there's still global $conf array (used by variable_get), which may contain stale information.

NOTE!!! The patch is against Drupal 5.7.

damien tournoud’s picture

Version: 5.7 » 7.x-dev

This would solve the problem, but I fear the cost of this is huge (locking both the whole cache table and the variables table???). Could you show some performance analysis?

Bugs like this one should be corrected first in the current HEAD version (7.x), than back-ported.

skiminki’s picture

Briefly:

  • the overall cost of locking is not huge, because variable-table should be quite stable. Locking is done only when cache is invalidated until cache is rebuilt
  • the fast path is unchanged, i.e., when cache is not invalidated no locking is performed
  • when cache is invalidated, with this patch at most one cache rebuild is performed per invalidation. For busy sites, there could be lots of simultaneous cache rebuilds if no locking was performed. Therefore, I argue that locking while rebuilding is actually faster - or at least eats less CPU and DB resources.

However, I'd still prefer not locking cache table. However, this would mean that we'd have to move variable cache to another table. I'm not sure if that'd be wise idea to do this late in 5.x series.

skiminki’s picture

Version: 5.7 » 7.x-dev

In a test site (D5.7 / mysql) with 75 enabled modules and 606 variables, cache rebuild operation takes 0.01-0.02 seconds under low load conditions. This is with regular unaccelerated mod_php. The test site is a full replica of a real production site.

skiminki’s picture

A simple question: Are you expecting me to make the D7 version of the patch? If so, this could be your lucky day, because I've got some slack time until my next commitments ;)

However, I'd like to propose a bit extended approach for D7:

  1. Locking in variable_init, as in my previous patch.
  2. Postpone variable cache invalidation until exit hooks have been called. This would make certain that there's at most one cache invalidation per processed request, even if the request would change lots of variables.
  3. Move variables cache into a different table because of the locking. Instead, what would be a better table? Could we use variable-table, with special $$cached$$-entry or something like that?
  4. Postpone persisting the modified variables into database until exit hooks have been called. Explore the possibility of persisting the variables and invalidating the variables cache in an atomic operation.

I think this would make this variables thing as good as possible with the current architecture.

I could split this into three patches. The first patch would cover item 1 and should be backported into 5.x and 6.x. The second patch would cover item 2 and we could consider, whether or not to backport it. The third patch would cover items 3 and 4 and should probably not be backported into 5.x at least.

I'd like to have feedback on this before starting to work.

damien tournoud’s picture

I suggested a patch implementing soft locking for that type of operation in #251792.

moshe weitzman’s picture

We have lock API in core now but I don't know that this solved in d7. I do think it is solved in latest pressflow for d6.

moshe weitzman’s picture

FYI, here is the Pressflow work in this area: http://bazaar.launchpad.net/~pressflow/pressflow/6/annotate/head:/includ.... note variable_init() locking and variable_cache_rebuild() has deferred db write

fgm’s picture

I just discovered a problem with the PF patch for this issue under very heavy load: in such a situation, if a slow (or even failing) page is the one holding the variable regeneration lock, the attempt to lock for cache generation can cause every page to wait for the full duration until rebuild success of up to 30s in case of forced lock release, causing requests to pile up and overall throughput to decrease massively until one request finally succeeds in acquiring the lock and building the cache entry at the expiration of the delay.

I looked at the equivalent D7 implementation and, although different - notably because is uses the specific cache_bootstrap bin instead of the default bin - it should suffer from the same problem: if the lock is taken, every page will be waiting during rebuild and, if for any reason that rebuild fails, the whole site will be crawling to a halt for 30 seconds.

The current logic is like this (abstracting the maintenance case):

Pressflow:

if (cache_get succeeds) {
  $variables = cached data;
}
else {
  if (lock_aquire(somekey, 30 seconds)) {
    build $variables from db_query
    cache_set($variables)
    lock_release
  }
  else {
    lock_wait
    $variables = variable_init unless recursion too deep or array() otherwise
  }
  
return $variables;
}

Drupal 7 (swapped the "if" branches for easier comparison):

if (cache_get succeeds) {
  $variables = cached data;
}
else {
  if (lock_aquire(somekey, 1 second)) {
    build $variables from db query
    cache_set($variables)
    lock_release
  }
  else {
    lock_wait
    $variables = variable_initialize without recursion limit
  }
return $variables;
}

It seems that since the point here is to build a *cache* and not a stable reference, if would make better sense not to try to build the cache if the lock can not be taken, and just proceed from there by querying the DB, just like normally happens on any case of cache unavailability. It could go like this:

if (cache_get succeeds) {
  $variables = cached data;
}
else {
  build $variables from db query
  if (lock = lock_aquire(somekey, 1 second)) {
    cache_set($variables)
    lock_release
  }    
}
return $variables;
}

That way, if the lock is not acquired, the cache is not initialized with the possibly invalid data concurrency-wise (as to the topic of the thread), and all page requests can still be served from the DB while the cache is being built, only incurring the extra cost of the db query (liable to be in the mysql cache quite often anyway), instead of waiting for the cache reconstruction, which can be long if the page holding the lock has some problem like being aborted of even just being slow because it happens perchance to be a heavy one.

Operations modifying the variables and clearing the variable cache for this, (vset/vdel) are not affected, be they immediate as in D7 or registered for shutdown as in PF.

(on a related note, not sure why D7 performs no check on recursion depth ?)

Anonymous’s picture

i agree that this needs to be changed - the recursive lock call stuff that has creeped in to D7 is an anti-pattern we've got to stamp out. for reference, see #848044: image_style_generate() should use lock_wait() and #802446: Cache stampede protection: drupal_render() and page cache, where i'm trying to replace the recursion in favour of some number of attempts to get the lock, then deal with failure in whatever way makes sense in the context of the code trying to acquire the lock.

i'd favour still trying to get the lock a couple of times in this case, with a much shorter lock period, then proceed to hit the database to rebuild the variables cache failure. there's very little point locking around the cache_set, as it doesn't protect us from the stampede (the cache_set is not the expensive part) and it doesn't help with the concurrency issues raised by the OP.

if we do a couple of quick lock attempts, in the worst case, we'll lose a few seconds, and have a stampede rebuilding the cache, but not as a big as hit as the current code, and in the normal case, we get to keep the stampede protection.

Anonymous’s picture

Status: Active » Closed (duplicate)

i think this is now a dup of #973436: Overzealous locking in variable_initialize().

please reopen if i'm missing something.

fgm’s picture

I fail to appreciate how an issue worked on since 2008 can be a duplicate of one opened in 11/2010, but nevermind.