We run a reasonably high-traffic site (around 50k visitors daily) that uses memcache for caching. Lately we have been seeing strange issues on the site, including:

a) Contents of panes seem to vanish randomly
b) White screen on home page (http://drupal.org/node/979912)
c) Variables cache getting dropped silently (http://drupal.org/node/1174676)
d) Page requests get slowed down heavily by lock_wait calls (visible via New Relic)
e) Random menu rebuilds, possibly because of missing variables or missing cache items. Visible implication: imagecache menu paths get reset to the default path sites/default/files.

We are also pounded by bots regularly, which worsens the situation because they pull up old nodes that are not in the cache.

We are looking for a performance guru who can take care of these problems for us. I am not sure the memcache issue queue is the right place for this, but the problems appear to be related to caching and memcache, and we are hoping to reach somebody experienced in these areas who can help.


Comments

catch’s picture

It looks like you might be running into #853864: views_get_default_view() - race conditions and memory usage although you've listed multiple problems and not all of them would be due to this.

If you want to pay someone to investigate these issues for you then http://tag1consulting.com/ provides that (disclaimer, I work for them).

mshmsh5000’s picture

FileSize
34.01 KB

I've seen the exact set of symptoms on another site, and the same delays via New Relic. These long lock_wait times are a result of stampede protection. I bet you have

$conf['memcache_stampede_protection'] = TRUE;

in settings.php, correct? I haven't found the root cause, but it involves the stampede protection clause in cache_get().

if (variable_get('memcache_stampede_protection', FALSE)) {
  // The process that acquires the lock will get a cache miss, all
  // others will get a cache hit.
  if (lock_acquire("memcache_$cid:$table", variable_get('memcache_stampede_semaphore', 15))) {
    $cache = FALSE;
  }
}

Just took off 5-15 sec of execution time per page by disabling memcache_stampede_protection.
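For anyone wanting to try the same workaround, here is a minimal settings.php sketch. It assumes the variable name quoted above and simply switches the feature off, so it removes stampede protection entirely rather than fixing the lock handling.

```php
// settings.php: turn off stampede protection (a workaround, not a fix).
$conf['memcache_stampede_protection'] = FALSE;
```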

I've attached a screenshot of one of the New Relic traces that doesn't involve the views_get_default_view() race condition. I'll post back to this thread if I make progress on this.

mgriego’s picture

Just out of curiosity, are you guys using the memcache lock replacement? i.e.:

$conf['lock_inc'] = './sites/all/modules/memcache/memcache-lock.inc';

dirtabulous’s picture

Have you run into any issues turning stampede protection off?

dirtabulous’s picture

And, yes, we have the memcache lock replacement in place.

EKM’s picture

Version: 6.x-1.9 » 6.x-1.10
Status: Active » Needs review
FileSize
1.66 KB

I hit this issue. It looks like the stampede protection causes cache_get() to take additional locks that aren't released until lock_release_all() runs at the end of the page load, causing other processes to spin waiting on the same cache item.
The fix was simply to release the lock as soon as possible. Patch is attached, needs review!

thedavidmeister’s picture

Status: Needs review » Reviewed & tested by the community

I have been using the patch attached to #6 in production for a couple of weeks now. It has improved performance quite a bit when concurrent users request heavy pages: we were getting very spiky page load times of 15+ seconds for some requests, and now the spikes have completely gone. I haven't noticed any cache problems caused by this patch so far.

I'm not sure this patch addresses all the issues that the OP raised, but it definitely fixed the lock_wait problems we were having. The issue referenced in catch's reply in #1 looks like a better place to discuss the other issues, such as missing content panes, that this patch doesn't address.

Setting this to RTBC simply because the lock_wait issues seem the most appropriate thing to discuss here in the Memcache queue, and the other things seem to be either symptoms of that or unrelated.

yonailo’s picture

I am trying to understand the code, and I don't quite see the point of releasing the lock inside cache_get().

If I understand correctly, the process that receives the cache miss should call cache_set() at the end of its execution, and that is where the lock should be released (and that's what the implementation does, btw).

IMHO releasing the lock in cache_get() could result in a cache stampede, which is exactly what we're trying to avoid here.

EKM’s picture

I missed the responses, but the original / current problem is that locks aren't released until the page is finished, which the above patch does deal with.

On thinking about it, away from the time pressure to ship, it feels like there need to be two sets of locks, one for write and one for read. Read locks do not block other reads, but write locks block both other writes and reads. In addition, if a write is blocked by another write, it should probably return the result of that other write if possible, assuming it arrives within a given time. This could lead to cache corruption, but handling that properly, without the possibility of a read being overwritten halfway through, would really require locks to be implemented on memcache (which would make all this much easier anyway).

Comments?

thedavidmeister’s picture

Status: Reviewed & tested by the community » Needs work

markpavlitski’s picture

Version: 6.x-1.10 » 7.x-1.x-dev
Component: Code » memcache.inc
Status: Needs work » Needs review
FileSize
595 bytes

The lock shouldn't be released in cache_get() / $this->get(). The lock is acquired here under the assumption that a later execution path will generate and set the cache entry.

The attached patch releases the lock inside $this->set() once the value has been stored.
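The idea can be sketched roughly as follows. This is a simplified in-memory model, not the module's code: the SketchCache class and its $store and $locks properties are hypothetical stand-ins for memcache and the lock backend, and only the lock name format is taken from the snippet quoted earlier in the thread.

```php
<?php
// Hypothetical stand-in for memcache plus the lock backend (not the module's API).
class SketchCache {
  public $store = array();  // "bin:cid" => cached value
  public $locks = array();  // lock name => TRUE while held

  public function get($cid, $bin = 'cache') {
    $key = "$bin:$cid";
    if (isset($this->store[$key])) {
      return $this->store[$key];
    }
    // Miss: take the stampede lock so this caller is the one that rebuilds.
    $this->locks["memcache_$cid:$bin"] = TRUE;
    return FALSE;
  }

  public function set($cid, $value, $bin = 'cache') {
    $this->store["$bin:$cid"] = $value;
    // Release as soon as the value is stored, not at the end of the request.
    unset($this->locks["memcache_$cid:$bin"]);
  }
}

// One request: miss, rebuild, store.
$cache = new SketchCache();
$miss = $cache->get('menu');
$held_during_rebuild = isset($cache->locks['memcache_menu:cache']);
$cache->set('menu', 'rebuilt');
$held_after_set = isset($cache->locks['memcache_menu:cache']);
```

In this model the lock is held only between the miss and the set, which is the window in which other requests should be kept from rebuilding the same entry.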

Performance testing shows a reasonable decrease in average lock_wait time resulting in faster page load under high concurrency.

SocialNicheGuru’s picture

Is this related to http://drupal.org/node/2099893?

Jeremy’s picture

Category: Support request » Bug report
Issue summary: View changes
Status: Needs review » Fixed

Agreed we should be releasing the lock once we've finished with our set. But first we check the globals to be sure the lock is actually set before we release it, avoiding an otherwise unnecessary call to dmemcache_delete if not.
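A rough illustration of why the check matters, using a hypothetical SketchLocks class (none of these names come from the committed patch): the $deletes counter stands in for the round trip to memcache that an unconditional release would cost.

```php
<?php
// Hypothetical model: $deletes counts the calls a real lock release
// would make to memcache.
class SketchLocks {
  public $held = array();  // lock name => TRUE while held by this request
  public $deletes = 0;

  public function acquire($name) {
    $this->held[$name] = TRUE;
  }

  public function release($name) {
    $this->deletes++;  // Stands in for the delete call to memcache.
    unset($this->held[$name]);
  }

  // Release only when this request actually holds the lock.
  public function releaseIfHeld($name) {
    if (isset($this->held[$name])) {
      $this->release($name);
    }
  }
}

$locks = new SketchLocks();
$locks->releaseIfHeld('memcache_a:cache');  // Not held: no delete issued.
$after_unheld = $locks->deletes;
$locks->acquire('memcache_b:cache');
$locks->releaseIfHeld('memcache_b:cache');  // Held: one delete issued.
$after_held = $locks->deletes;
```

Most set() calls are not preceded by a stampede lock, so the guard keeps the common path free of an extra memcache call.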

Committed:
http://drupalcode.org/project/memcache.git/commitdiff/7c0c55f

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Jeremy’s picture

Version: 7.x-1.x-dev » 6.x-1.x-dev
Assigned: Unassigned » Jeremy
Priority: Normal » Major
Status: Closed (fixed) » Patch (to be ported)

Re-opening as this still affects the D6 branch.

Jeremy’s picture

Status: Patch (to be ported) » Fixed

Actually, this was fixed as of 6.x-1.9, was looking at a very old codebase. Re-closing.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.