engines/memcache.php, lines 151-155

      $time = time();
      while ($this->memcache->add('lock_' . $this->name, $this->settings['compress'], 0) === FALSE) {
        if (time() - $time >= 3) {
          return FALSE;
        }

If the cache is flushed and you get many concurrent hits, this spikes CPU utilization on your app server (httpd) and your memcached process. It worsens until all CPU resources are consumed, causing response times to plateau above 3 s until the cache is sufficiently repopulated.
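For illustration, here is a hypothetical variant of the quoted loop with a short usleep() back-off instead of a busy-wait, sketched against an in-memory stand-in for the Memcache object so it runs without a memcached server. This is not the module's code; the class and function names are made up for the sketch.

```php
<?php
// FakeMemcache: hypothetical in-memory stand-in for the Memcache object,
// so this sketch runs without a memcached server.
class FakeMemcache {
  private $store = array();
  public function add($key, $var, $flag = 0, $expire = 0) {
    if (isset($this->store[$key])) {
      return FALSE;            // key exists: the "lock" is held
    }
    $this->store[$key] = $var;
    return TRUE;
  }
  public function delete($key, $timeout = 0) {
    unset($this->store[$key]);
    return TRUE;
  }
}

// Variant of the lock loop: same 3-second give-up, but with a 10 ms
// back-off between attempts so the loop yields the CPU instead of
// hammering add() as fast as PHP can issue it.
function acquire_lock($memcache, $key, $timeout = 3) {
  $time = time();
  while ($memcache->add($key, 1, 0, 10) === FALSE) {
    if (time() - $time >= $timeout) {
      return FALSE;            // give up after $timeout seconds
    }
    usleep(10000);             // sleep 10 ms before retrying
  }
  return TRUE;
}
```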

Comments

slantview’s picture

Status: Active » Postponed (maintainer needs more info)

Can you show me any proof that this is really a problem? I tested this with concurrency at over 150 and never saw a problem. I definitely debated this, but I don't want to sleep() there because the lock will usually get removed in less than a hundredth of a second. Cache flushing takes .00001 seconds in my tests, even with a fully populated cache.

pmcg’s picture

StatusFileSize
new11.02 KB

The speed of flushing has nothing to do with it; it is the speed of cache rebuilding that's the issue.
If you want proof that this is happening, do a small test with the following watchdog line added, starting from a clean cache. The speed problem is obvious, and the log spam will show you why.

      while ($this->memcache->add('lock_' . $this->name, $this->settings['compress'], 0) === FALSE) {
        watchdog('CacheRouter', 'The bin is locked.', NULL, WATCHDOG_NOTICE, NULL);
        if (time() - $time >= 3) {
          return FALSE;
        }

The following tests were done without this change to memcache.php:
A clean Drupal install with 700 automatically generated nodes via devel.
The only non-core module installed is either cacherouter, or memcache with gaolei's patch (http://drupal.org/node/224772), one or the other.
All tests were performed with an empty cache, so every request is building cache objects.
The tests were performed with siege with a full url list. (random)
My siege config is attached.

$conf = array(
  'site_name' => 'Haxies',
  'anonymous' => 'Visitor',
#  'cache_inc' => './sites/all/modules/memcache/memcache.inc', 
#  'session_inc' => './sites/all/modules/memcache/memcache-session.inc',
#  'memcache_servers' => array('localhost:11211' => 'default'),
#  'memcache_bins' => array('cache' => 'default'),
);

$conf['cache_inc'] = './sites/all/modules/cacherouter/cacherouter.inc';
$conf['cacherouter'] = array(
  'default' => array(
    'engine' => 'memcache',
    'server' => array('localhost:11211',),
    'shared' => TRUE,
    'prefix' => '',
    'path' => '/tmp/filecache',
  ),
);

With cacherouter

#/etc/init.d/memcache restart
#siege
.....
Lifting the server siege...      done.
Transactions:                    152 hits
Availability:                 100.00 %
Elapsed time:                  59.38 secs
Data transferred:               0.74 MB
Response time:                  9.68 secs
Transaction rate:               2.56 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                   24.77
Successful transactions:         152
Failed transactions:               0
Longest transaction:           17.11
Shortest transaction:           0.01

with no caching

<edit settings.php, disabled and cleared caching @ admin/settings/performance>
#/etc/init.d/memcache stop
#siege
.....
Lifting the server siege...      done.
Transactions:                    136 hits
Availability:                 100.00 %
Elapsed time:                  11.77 secs
Data transferred:               0.75 MB
Response time:                  1.84 secs
Transaction rate:              11.55 trans/sec
Throughput:                     0.06 MB/sec
Concurrency:                   21.27
Successful transactions:         136
Failed transactions:               0
Longest transaction:            3.57
Shortest transaction:           0.04

with patched memcache

<edit settings.php, enabled and cleared caching @ admin/settings/performance>
#/etc/init.d/memcache start
#siege
...
Lifting the server siege...      done.
Transactions:                    184 hits
Availability:                 100.00 %
Elapsed time:                  14.29 secs
Data transferred:               0.96 MB
Response time:                  1.71 secs
Transaction rate:              12.88 trans/sec
Throughput:                     0.07 MB/sec
Concurrency:                   22.03
Successful transactions:         184
Failed transactions:               0
Longest transaction:            3.05
Shortest transaction:           0.12

Please let me know if I've missed something or you see an error in my configuration.
These are just quick tests to show initial results; I can provide more in-depth tests if you really want them. Also note that this is not saying that no caching or memcache is better, only that this bug is real.

pmcg’s picture

Status: Postponed (maintainer needs more info) » Active

Setting this back to active.

slantview’s picture

Thank you for your thorough testing. I have been kinda out of the loop for the past week or so due to some personal problems, but this is in my queue to check out this week.

-s

pmcg’s picture

No problem, I'm excited to see what you come up with. I'll admit I worked on a similar implementation (memcache key tracking), and locking was a real issue there too. The final decision was to scrap the whole idea before I was finished, but my (unimplemented) idea was to skip cache writes if the bin was locked, since the content will be cached eventually if it's under a significant amount of demand. Secondly, flushes/updates are the highest priority, so they should be queued if the bin is locked and handled sequentially the next time a caching operation fires.

As I said, I never had the opportunity to implement it, and I don't have the time now to look into it for cacherouter, but maybe these ideas will help, if not ... whatever :)

~paul

triclops’s picture

We solved this in #264440: memcache locking causes unnaturally high number of sets by removing the memcache lock entirely.

pasc’s picture

I'm still unsure that a lock is the best thing to have in a caching system, as there seems to be too much risk of things going wrong when multiple web servers start hitting the same memcached. I'm also concerned about the size the lookup table would reach once you start throwing a few thousand keys at it, considering it has to be sent both ways across the network every time you want to do a set.

While talking about this with triclops earlier on today, we came up with a couple of ideas...

a) I still think it's better to try and not lock at all, even though it makes cache gets go slightly slower, as it's easier to scale that way. The patch in #264440: memcache locking causes unnaturally high number of sets uses that solution. That patch can probably be improved by storing the invalidation data somewhere else, but it's still working nicely. We've been using that patch in production for the last 3 days now.

b) We came up with an alternate solution in case people want to have a lookup table, that doesn't involve blocking every time we can't lock the lookup table in the cache in a set.

Basically, the idea would be to get a "lookup_offset" key and append its value to the lookup table key. Then, if we get a lock failure, we use incr on that "lookup_offset" key to give us a new offset, and we start writing to the "lookup_{$lookup_offset}" entry from then on. We can then change the behavior on flush to look at all the lookup_* tables and reset the lookup_offset key. Since incr is atomic, this will not break if you have multiple sets at the same time.
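The offset idea can be sketched roughly as follows. FakeCache and the function names are hypothetical stand-ins (not cacherouter code), and the stub's incr() stands in for memcached's atomic increment:

```php
<?php
// FakeCache: hypothetical in-memory stand-in for a memcached client,
// so the sketch is runnable; its incr() models memcached's atomic incr.
class FakeCache {
  private $store = array();
  public function get($key) {
    return isset($this->store[$key]) ? $this->store[$key] : FALSE;
  }
  public function set($key, $value) {
    $this->store[$key] = $value;
    return TRUE;
  }
  public function incr($key, $by = 1) {
    $current = isset($this->store[$key]) ? $this->store[$key] : 0;
    $this->store[$key] = $current + $by;
    return $this->store[$key];
  }
}

// On a lock failure, atomically bump the offset and start writing to a
// fresh lookup table instead of blocking.
function next_lookup_key($cache, $bin) {
  $offset = $cache->incr($bin . '_lookup_offset');
  return $bin . '_lookup_' . $offset;
}

// On flush, walk every lookup_* table up to the current offset, then
// reset the offset so new sets start over at table 1.
function flush_lookup_keys($cache, $bin) {
  $offset = (int) $cache->get($bin . '_lookup_offset');
  $keys = array();
  for ($i = 1; $i <= $offset; $i++) {
    $keys[] = $bin . '_lookup_' . $i;
  }
  $cache->set($bin . '_lookup_offset', 0);
  return $keys;
}
```

Since the only shared mutation on the hot path is incr(), concurrent sets never contend on a lock; they just fan out across lookup tables that the flush path gathers up later.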

Both of these solutions can be made to work with partial flushes if we want to go that way too.

andypost’s picture

At #224772: problem with cache_clear_all I posted statistics on clearing the cache by wildcards.
So the only place where we need lookups is clearing the cache by wildcards and removing cache_temporary items.

For cache_temporary we can use a TTL based on the cron run (in cacherouter, ttl = 180 sec).
As for wildcards, I see no solution other than clearing the whole cache, but that brings poor performance in shared environments: very often a flush eliminates all caching!

slantview’s picture

I still haven't had time to fix this, but I see that there are 3 options.

1. Not support partial flushes and get rid of all locking and lookup tables (bad?)
2. Get rid of locks and use expire (#264440: memcache locking causes unnaturally high number of sets)
3. Use some sort of content increment so that you don't have to lock.

Personally I am a fan of #2 and I would evaluate #3 if I could see some sort of code example that shows how this works.

Anybody up for helping me get a prototype of #3 going?

Steve

andypost’s picture

Version: 6.x-1.0-beta3 » 6.x-1.x-dev

Just thoughts...

1) Using locks is the first idea that comes to mind, but every lock is a bottleneck.
2) Increments are not possible on every backend (APC and eAccelerator have none).
3) apc_cache_info('user'), eaccelerator_list_keys(), and xcache_list(XC_TYPE_VAR) can be used for enumeration.

andypost’s picture

About 3): increments bring a performance penalty on full-key generation. We need to check the bin version, and maybe the "tag" (mask) version, then check the key with strpos against all clearing tags.

Example (thoughts):

1) For each bin, store a version number which is incremented on cache_clear_all('*').
2) For subkeys which are cleared by mask, store a version as well.
3) Another system key is "last_cleared_version".
4) Change the key function to include the version numbers.
5) Write a cron job to clear permanent keys with old version numbers.

For example, the cache table's "locale:" keys are cleared by mask:

// locale.inc, Drupal 6
cache_clear_all('locale:', 'cache', TRUE);

We already have _lock and _lookup keys, so add _clears (stores keys which come from cache_clear_all), _cleared (last cleared version, used to delete permanent keys), and _version (current version of the bin).

On init:
- get _version and _clears, and store them in the cache object for the bin.

Full-key generation, with $key = 'locale:en':
- check $key against each entry of _clears; on a hit, replace the matching substring with substring + version.
Now $key looks like 'locale:-{version}-en'.
- then append the bin version:
$key is 'locale:-{version}-en-{bin-version}'.

On _get and _set there are no changes.

On cache_clear_all, if $wildcard:
- if $cid == '*', increment the _version key;
- else check whether $cid is already in _clears (add it if not), then increment the version stored for $cid in _clears.

So the storage holds (version numbers can be stored as ints):
{prefix}_clears {array ('locale:')}
{prefix}_clears-locale: {version=3}
{prefix}_version {version=2}

{prefix}-locale:-1-en-1
{prefix}-locale:-2-en-1
{prefix}-locale:-3-en-1
{prefix}-locale:-1-ru-1
{prefix}-locale:-2-ru-1
{prefix}-locale:-3-ru-1

{prefix}-locale:-1-en-2
{prefix}-locale:-2-en-2
{prefix}-locale:-3-en-2
{prefix}-locale:-1-ru-2
{prefix}-locale:-2-ru-2
{prefix}-locale:-3-ru-2

PS: this looks expensive on memory, but most of the keys are temporary, so they expire via the backend; the rest should be cleared on the cron run.
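The key-generation step of this versioned scheme might look roughly like the sketch below. All names are hypothetical, the storage side is ignored, and the versions are passed in directly for clarity:

```php
<?php
// Sketch of versioned full-key generation: a bin version plus per-prefix
// "clears" versions are folded into every cache key, so bumping a version
// invalidates old keys without deleting them. Hypothetical names only.
function full_key($cid, $bin_version, array $clears) {
  foreach ($clears as $prefix => $version) {
    // strpos check: does $cid start with a prefix cleared by mask?
    if (strpos($cid, $prefix) === 0) {
      // Fold the prefix's clear version into the key.
      $cid = $prefix . '-' . $version . '-' . substr($cid, strlen($prefix));
      break;
    }
  }
  // Always append the bin-wide version.
  return $cid . '-' . $bin_version;
}

// cache_clear_all(..., $wildcard = TRUE) becomes a version bump instead
// of a scan: '*' bumps the bin version, a prefix bumps its entry in
// $clears. Returns the (possibly new) bin version.
function clear_wildcard($cid, $bin_version, array &$clears) {
  if ($cid === '*') {
    return $bin_version + 1;
  }
  $clears[$cid] = isset($clears[$cid]) ? $clears[$cid] + 1 : 1;
  return $bin_version;
}
```

For $cid = 'locale:en' with the 'locale:' clear version at 3 and the bin version at 2, this yields 'locale:-3-en-2', matching the key shapes listed above.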

angch’s picture

Status: Needs review » Active
StatusFileSize
new1.13 KB

I have this bug, or one very similar with what I think is the same cause, and tracked it down to a faulty lock() and unlock() implementation in engines/memcache.php (Drupal 5 version, beta9).

First (minor) bug: in lock(), the call should be $memcache->add($key, $var, $flag, $expire), but $memcache->add($key, $flag, $expire) is used instead. In the attached patch, I fixed this as well as adding a timeout of 10 seconds to the lock, just in case.

Second, the probably critical bug is in unlock(), where $memcache->delete($key) is used without a timeout. I'm not sure what the default timeout is, but you need to set it to 0 to delete the key immediately. My hunch is that unpatched unlock() does not work, because memcache::delete's timeout does not clear the key immediately.

Edit: ...some wireshark sessions later... Seems like a bug in PHP? I'm on the default Ubuntu Intrepid PHP 5.2.6, and $memcache->delete($key) without a timeout value does not send a memcached DELETE command at all...

Edit: There's a strong possibility that this bug only affects the php5-memcache module (version 3.0.1-1) in Ubuntu Intrepid, as it appears to be bug-free under Ubuntu Hardy's php5-memcache version 2.2.0-1. If this is true, then my bug and this bug may not be related.

Edit:Yup, my bug is a different bug, sorry.

In my case, the lock(menu), unlock(menu), lock(menu) pattern will lock up on the second lock(menu).

Attached patch is for drupal 5.

See also
http://www.php.net/manual/en/function.memcache-add.php
http://www.php.net/manual/en/function.memcache-delete.php
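Based on the manual pages linked above, a corrected lock()/unlock() pair could look like the sketch below. FakeMemcache is a hypothetical in-memory stand-in so the sketch runs without a server; this is illustrative only, not the attached patch:

```php
<?php
// FakeMemcache: hypothetical stand-in implementing the manual signatures
// Memcache::add($key, $var, $flag, $expire) and
// Memcache::delete($key, $timeout).
class FakeMemcache {
  private $store = array();
  public function add($key, $var, $flag = 0, $expire = 0) {
    if (isset($this->store[$key])) {
      return FALSE;                // add() fails if the key already exists
    }
    $this->store[$key] = $var;
    return TRUE;
  }
  public function delete($key, $timeout = 0) {
    unset($this->store[$key]);     // timeout 0: remove the key immediately
    return TRUE;
  }
}

function lock($memcache, $name) {
  // $var and $flag passed explicitly (fixing the shifted-argument bug),
  // with a 10 s expiry so a crashed process cannot hold the lock forever.
  return $memcache->add('lock_' . $name, 1, 0, 10);
}

function unlock($memcache, $name) {
  // Explicit timeout of 0 so the key is deleted immediately and a
  // following lock() can succeed.
  return $memcache->delete('lock_' . $name, 0);
}
```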

angch’s picture

Status: Active » Needs review
StatusFileSize
new1.12 KB

The same patch as above, but for Drupal 6. Untested; feedback needed.

andypost’s picture

Status: Active » Needs review
StatusFileSize
new1.27 KB

I suppose it's much better to set the timeout to 1 second and forget about the compress setting.

andypost’s picture

StatusFileSize
new1.27 KB
mikl’s picture

Hmm, I just had a similar problem – a single Apache worker using 60% CPU and memcached using 40% CPU. I'm not quite sure that it's this bug, but it seemed to be some kind of race condition.

vacilando’s picture

Subscribing

mcurry’s picture

subscribe

andypost’s picture

A good solution was committed to memcache: http://drupal.org/cvs?commit=310354

andypost’s picture

Status: Needs review » Needs work
andypost’s picture

#13 was committed to all branches.

andypost’s picture

bump

andypost’s picture

Committed the 10 second interval: http://drupal.org/cvs?commit=410214

webdevbyjoss’s picture

Hi guys,

I'm able to reproduce the issue in my environment on the latest Drupal 6.22 + the latest CacheRouter 6.x-1.0-rc2.
The issue is reproducible at a load of about ~150 requests/second (up to 10,000 requests/minute).

The PHP script hangs in an infinite loop with ~100% CPU utilization.

Profiler dump image: drupal-6-cacherouter-memcache-fail.png

The profiling dump clearly shows that it loops in the following code, calling time() and MemcachePool::add():

cacherouter/engines/memcache.php:186

    // Lock once by trying to add lock file, if we can't get the lock, we will loop
    // for 3 seconds attempting to get lock.  If we still can't get it at that point,
    // then we give up and return FALSE.
    if ($this->memcache->add($this->lock, 1, FALSE, 10) === FALSE) {
      $time = time();
      while ($this->memcache->add($this->lock, 1, FALSE, 10) === FALSE) {
        if (time() - $time >= 3) {
          return FALSE;
        }
      }
    }

Clearing the memcached data (a daemon restart) helps break the infinite loop, but after some time a new httpd process appears again, eating ~100% CPU.