Unthrottled while loop in memcache engine causing high cpu utilization. [#245341]

engines/memcache.php line 151-155

      $time = time();
      while ($this->memcache->add('lock_' . $this->name, $this->settings['compress'], 0) === FALSE) {
        if (time() - $time >= 3) {
          return FALSE;
        }

If cache is flushed, and you get many concurrent hits, this will/does spike cpu utilization on your app server (httpd) and your memcache process. It worsens until all cpu resources are consumed, causing response time to plateau above 3s until cache is sufficiently repopulated.

Comment	File	Size	Author
#15	cr6_memcache.patch	1.27 KB	andypost
#14	cr6_memcache.patch	1.27 KB	andypost
#12	memcache.php_.patch	1.13 KB	angch
#13	drupal6.memcache.php_.patch	1.12 KB	angch
#2	seigerc.txt	11.02 KB	pmcg

Comments

Comment #1

slantview commented 11 April 2008 at 17:30

Status:

Active

» Postponed (maintainer needs more info)

Can you show me any proof that this is really a problem? I tested this with concurrency at over 150 and never saw a problem. I definitely debated this, but I don't want to sleep() there cause the lock will usually get removed after less than a 100th of a second. cache flushing takes .00001 second in my tests even with a fully populated cache.

Comment #2

pmcg commented 14 April 2008 at 05:36

Status	File	Size
new	seigerc.txt	11.02 KB

The speed of flushing has nothing to do with it, it is the speed of cache rebuilding that's an issue.
If you want proof that this is happening, do a small test with the proceeding line added, and from a clean cache. The speed problem is obvious and the log spam will show you why.

      while ($this->memcache->add('lock_' . $this->name, $this->settings['compress'], 0) === FALSE) {
        watchdog('CacheRouter', 'The bin is locked.',null, WATCHDOG_NOTICE, null);
        if (time() - $time >= 3) {
          return FALSE;
        }

The following test were done without this change to memcache.php.
Clean drupal install with 700 automatically generated nodes via devel.
The only non-core module installed is cacherouter or memcache with gaolei's patch ( http://drupal.org/node/224772 ) one or the other.
All tests were performed with an empty cache, so every request is building cache objects.
The tests were performed with siege with a full url list. (random)
My siege config is attached.

$conf = array(
  'site_name' => 'Haxies',
  'anonymous' => 'Visitor',
#  'cache_inc' => './sites/all/modules/memcache/memcache.inc', 
#  'session_inc' => './sites/all/modules/memcache/memcache-session.inc',
#  'memcache_servers' => array('localhost:11211' => 'default'),
#  'memcache_bins' => array('cache' => 'default'),
);

$conf['cache_inc'] = './sites/all/modules/cacherouter/cacherouter.inc';
$conf['cacherouter'] = array(
  'default' => array(
    'engine' => 'memcache',
    'server' => array('localhost:11211',),
    'shared' => TRUE,
    'prefix' => '',
    'path' => '/tmp/filecache',
  ),
);

With cacherouter

#/etc/init.d/memcache restart
#siege
.....
Lifting the server siege...      done.
Transactions:                    152 hits
Availability:                 100.00 %
Elapsed time:                  59.38 secs
Data transferred:               0.74 MB
Response time:                  9.68 secs
Transaction rate:               2.56 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                   24.77
Successful transactions:         152
Failed transactions:               0
Longest transaction:           17.11
Shortest transaction:           0.01

with no caching

<edit settings.php, disabled and cleared caching @ admin/settings/performance>
#/etc/init.d/memcache stop
#siege
.....
Lifting the server siege...      done.
Transactions:                    136 hits
Availability:                 100.00 %
Elapsed time:                  11.77 secs
Data transferred:               0.75 MB
Response time:                  1.84 secs
Transaction rate:              11.55 trans/sec
Throughput:                     0.06 MB/sec
Concurrency:                   21.27
Successful transactions:         136
Failed transactions:               0
Longest transaction:            3.57
Shortest transaction:           0.04

with patched memcache

<edit settings.php, enabled and cleared caching @ admin/settings/performance>
#/etc/init.d/memcache start
#siege
...
Lifting the server siege...      done.
Transactions:                    184 hits
Availability:                 100.00 %
Elapsed time:                  14.29 secs
Data transferred:               0.96 MB
Response time:                  1.71 secs
Transaction rate:              12.88 trans/sec
Throughput:                     0.07 MB/sec
Concurrency:                   22.03
Successful transactions:         184
Failed transactions:               0
Longest transaction:            3.05
Shortest transaction:           0.12

Please let me know if I've missed something or you see an error with my configuration.
These tests are just quick to show initial results. I can provide more in depth tests if you really want them. Also note that this is not saying that no caching or memcache is better, only that this bug is real.

Comment #3

pmcg commented 22 April 2008 at 04:33

Status:

Postponed (maintainer needs more info)

» Active

Setting this back to active.

Comment #4

slantview commented 22 April 2008 at 19:16

Thank you for your thorough testing. I have been kinda out of the loop for the past week or so due to some personal problems, but this is in my queue to check out this week.

-s

Comment #5

pmcg commented 24 April 2008 at 02:13

No problem, I'm excited to see what you come up with. I'll have to admit I worked on a similar implementation (memcache key tracking) and the locking was a real issue. The final decision was to scrap the whole idea before I was finished but my (un-implemented) idea was to skip cache writes if the bin was locked, since the content will be cached eventually if it's under a significant amount of demand. Secondly, flushes/updates are the highest priority so they should be queued if the bin is locked and handled sequentially the next time a caching operation is fired.

As I said, I never had the opportunity to implement it, and I don't have the time now to look into it for cacherouter, but maybe these ideas will help, if not ... whatever :)

~paul

Comment #6

triclops commented 30 May 2008 at 02:11

We solved this in #264440: memcache locking causes unnaturally high number of sets by removing the memcache lock entirely.

Comment #7

pasc commented 2 June 2008 at 13:31

I'm still unsure about a lock being the best thing to have in a caching system, as there seems to be too much of a risk for things to go wrong when multiple web servers start hitting the same memcached. I'm also concerned by the size the lookup table would get when you start throwing a few thousand keys at it when you consider it has to be sent both ways across the network every time you want to do a set.

While talking about this with triclops earlier on today, we came up with a couple of ideas...

a) I still think it's better to try and not lock at all, even though it makes cache gets go slightly slower, as it's easier to scale that way. The patch in #264440: memcache locking causes unnaturally high number of sets uses that solution. That patch can probably be improved by storing the invalidation data somewhere else, but it's still working nicely. We've been using that patch in production for the last 3 days now.

b) We came up with an alternate solution in case people want to have a lookup table, that doesn't involve blocking every time we can't lock the lookup table in the cache in a set.

Basically, the idea would be to get a "lookup_offset" key and appending that value to the lookup table key. Then, if we get a lock failure, we use incr on that "lookup_offset" key to give us a new offset, and we start writing to the "lookup_{$lookup_offset}" entry from then on. We can then change the behavior on flush to look at all the lookup_* tables, and reset the lookup_offset key. Since incr is atomic, this will not break if you have multiple sets at the same time.

Both of these solutions can be made to work with partial flushes if we want to go that way too.

Comment #8

andypost

he/him

Russian

commented 5 June 2008 at 01:54

At #224772: problem with cache_clear_all I post a statistics of clearing cache by wildcards
So only one place where we need lookups is clearing cache by wildcards and removing cache_temporary items

For cache_temporary we can use TTL based on cron run (in cache_router ttl = 180 sec)
As about wildcards - i see no solution as clear whole cache but it brings poor performance in shared eviroments - very often flush eliminates all caching!

Comment #9

slantview commented 30 June 2008 at 21:20

I still haven't had time to fix this, but I see that there are 3 options.

1. Not support partial flushes and get rid of all locking and lookup tables (bad?)
2. Get rid of locks and use expire (#264440: memcache locking causes unnaturally high number of sets)
3. Use some sort of content increment so that you don't have to lock.

Personally I am a fan of #2 and I would evaluate #3 if I could see some sort of code example that shows how this works.

Anybody up for helping me get a prototype of #3 going?

Steve

Comment #10

andypost

he/him

Russian

commented 7 November 2008 at 00:09

Version:

6.x-1.0-beta3

» 6.x-1.x-dev

Just thoughts...

1) Using locks it's first idea which comes but every lock is bottleneck
2) increments is not possible on every backend (apc, eaccelerator have none)
3) apc_cache_info('user') eaccelerator_list_keys() xcache_list(XC_TYPE_VAR)can be used for enumeration

Comment #11

andypost

he/him

Russian

commented 14 March 2009 at 11:07

About 3) - increment brings performance penalty on fullkey-generation - need to check bin version and maybe "tag" (mask) version then check key with strpos for all clearing tags

example (thoughts)

1) for each bin store version number which inc on cache_clear_all('*'
2) for subkeys which cleared by mask store version
3) another system key is "last_cleared_version"
4) change key-function to include version number
5) write cron job to clear permanent keys with old version numbers

for example: cache table with "locale:" key which is cleared by mask

 //locale.inc drupal 6
cache_clear_all('locale:', 'cache', TRUE);

we alredy have _lock and _lookup keys so add _clears (store keys which come from cache_clear_all), _cleared (last cleared version - to delete permanent keys), _version (current version of bin)

init:
- get _version, _clears and store in cache object for bin

fullkey generation:
$key = 'locale:en'
- check $key foreach of _clears if there is hits replace $key substring with substring+version on substring
now $key looks like 'locale:-{version}-en
- now add bin version
$key 'locale:-{version}-en-{bin-version}

on _get and _set there's no changes

on cache_clear_all
if $wildcard
- if $cid = '*' increment _version key
- else check if $cid in _clears else add, increment $cid in _clears (maybe use for it some key)

so storage holds (version numbers can be stored as int
{prefix}_clears {array ('locale:')}
{prefix}_clears-locale: {version=3}
{prefix}_version {version=2}

{prefix}-locale:-1-en-1
{prefix}-locale:-2-en-1
{prefix}-locale:-3-en-1
{prefix}-locale:-1-ru-1
{prefix}-locale:-2-ru-1
{prefix}-locale:-3-ru-1

{prefix}-locale:-1-en-2
{prefix}-locale:-2-en-2
{prefix}-locale:-3-en-2
{prefix}-locale:-1-ru-2
{prefix}-locale:-2-ru-2
{prefix}-locale:-3-ru-2

PS looks expensive on memory but most of keys are temp so they expire by backend other should be cleared on cron run

Comment #12

angch commented 25 March 2009 at 11:51

Status:

Needs review

» Active

Status	File	Size
new	memcache.php_.patch	1.13 KB

I have this bug, or one similar to it, but I think the cause of it is the same, and tracked this down to faulty lock() and unlock() implementation in engines/memcache.php (Drupal 5 version, beta9).

First (minor) bug is that in lock(), the syntax should be $memcache->add($key, $var, $flag, $expire), but memcache->add($key, $flag, $expire) is used instead. In attached patch, I fixed this as well as adding a timeout of 10 seconds of the lock, just in case.

Second, the probable critical bug is in unlock(), where $memcache->delete($key) is used without timeout. Not sure what the default timeout is, but you need to set it to 0 to delete the key immediately. My hunch is that unlock() unpatched does not work, because memcache::delete's timeout does not clear the key immediately.

Edit:...some wireshark sessions later...Seems like a bug in PHP? I'm on the default Ubuntu Intrepid's php5.2.6, and $memcache->delete($key) without a timeout values does not send a memcache DELETE command at all...

Edit:There's a strong possibility that this bug only affects the php5-memcache (version 3.0.1-1) module in Ubuntu Intrepid only as it appears to be bugfree under Ubuntu Hardy php5-memcache version 2.2.0-1. If this is true, then my bug and this bug may not be related.

Edit:Yup, my bug is a different bug, sorry.

In my cases, the lock(menu), unlock(menu), lock(menu) pattern will lock up in the second lock(menu).

Attached patch is for drupal 5.

Comment #13

angch commented 25 March 2009 at 08:19

Status:

Active

» Needs review

Status	File	Size
new	drupal6.memcache.php_.patch	1.12 KB

Patch of the above, but for Drupal 6. Untested, need feedback.

Comment #14

andypost

he/him

Russian

commented 5 April 2009 at 19:02

Status:

Active

» Needs review

Status	File	Size
new	cr6_memcache.patch	1.27 KB

Suppose much better set timeout to 1 second and forget about compress setting

Comment #15

andypost

he/him

Russian

commented 5 April 2009 at 19:03

Status	File	Size
new	cr6_memcache.patch	1.27 KB

Comment #16

mikl

Møn

commented 17 August 2009 at 08:34

Hmm, I just had a similar problem – a single Apache worker using 60% CPU and memcached using 40% CPU. I'm not quite sure that it's this bug, but it seemed to be some kind of race condition.

Comment #17

vacilando commented 29 September 2009 at 16:52

Subscribing

Comment #18

mcurry commented 18 January 2010 at 08:34

Comment #19

andypost

he/him

Russian

commented 22 January 2010 at 13:50

Good solution was commited to memcache http://drupal.org/cvs?commit=310354

Comment #20

andypost

he/him

Russian

commented 29 January 2010 at 19:20

Status:

Needs review

» Needs work

Marked as duplicate #598474: Memcache not working?, #493498: Memcache and cache_clear_all wildcard

#493448: Memcache and cache_clear_all wildcard is implementation for memcache module

Comment #21

andypost

he/him

Russian

commented 29 January 2010 at 19:35

#13 was commited to all branches

Comment #22

andypost

he/him

Russian

commented 14 April 2010 at 03:39

bump

Comment #23

andypost

he/him

Russian

commented 20 August 2010 at 22:25

http://drupal.org/cvs?commit=410214
commited 10 sec interval

Comment #24

webdevbyjoss commented 16 December 2011 at 16:20

hi, guys

I'm able to reproduce the issue on my env on latest Drupal 6.22 + latest CacheRouter 6.x-1.0-rc2
This issue is reproducible with load about ~150 requests/second (up to 10000 requests / minute).

PHP script hangs in infinite loop with ~100% CPU utilization

Profiler dump image: drupal-6-cacherouter-memcache-fail.png

Profiling dump clearly shows that it loops in the following code calling time() and MemcachePool::add():

cacherouter/engines/memcache.php:186

    // Lock once by trying to add lock file, if we can't get the lock, we will loop
    // for 3 seconds attempting to get lock.  If we still can't get it at that point,
    // then we give up and return FALSE.
    if ($this->memcache->add($this->lock, 1, FALSE, 10) === FALSE) {
      $time = time();
      while ($this->memcache->add($this->lock, 1, FALSE, 10) === FALSE) {
        if (time() - $time >= 3) {
          return FALSE;
        }
      }
    }

Clearing "memcached" data (daemon restart) helps to break the infinite loop, but after some time new "httpd" process appears again eating ~100% of CPU.

Unthrottled while loop in memcache engine causing high cpu utilization.

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Comment #20

Comment #21

Comment #22

Comment #23

Comment #24

News items

Our community

Documentation

Drupal code base

Governance of community