Status: Needs work
Project: Cache Router
Version: 6.x-1.x-dev
Component: Code
Priority: Critical
Category: Bug report
Assigned: Unassigned
Reporter:
Created: 11 Apr 2008 at 08:19 UTC
Updated: 16 Dec 2011 at 16:20 UTC
Comments
Comment #1
slantview commented: Can you show me any proof that this is really a problem? I tested this with concurrency at over 150 and never saw a problem. I definitely debated this, but I don't want to sleep() there because the lock will usually be removed after less than a hundredth of a second; cache flushing takes .00001 seconds in my tests even with a fully populated cache.
Comment #2
pmcg commented: The speed of flushing has nothing to do with it; it is the speed of cache rebuilding that's the issue.
If you want proof that this is happening, do a small test with the preceding line added, starting from a clean cache. The speed problem is obvious, and the log spam will show you why.
The following tests were done without this change to memcache.php:
- Clean Drupal install with 700 automatically generated nodes via devel.
- The only non-core module installed is cacherouter, or memcache with gaolei's patch (http://drupal.org/node/224772), one or the other.
- All tests were performed with an empty cache, so every request is building cache objects.
- The tests were performed with siege against a full URL list (randomized).
- My siege config is attached.
- with cacherouter
- with no caching
- with patched memcache
Please let me know if I've missed something or you see an error with my configuration.
These tests were just quick ones to show initial results; I can provide more in-depth tests if you really want them. Also note that this is not saying that no caching or memcache is better, only that this bug is real.
Comment #3
pmcg commented: Setting this back to active.
Comment #4
slantview commented: Thank you for your thorough testing. I have been kinda out of the loop for the past week or so due to some personal problems, but this is in my queue to check out this week.
-s
Comment #5
pmcg commented: No problem, I'm excited to see what you come up with. I'll admit I worked on a similar implementation (memcache key tracking), and the locking was a real issue. The final decision was to scrap the whole idea before I finished, but my (unimplemented) idea was to skip cache writes if the bin was locked, since the content will be cached eventually if it's under a significant amount of demand. Secondly, flushes/updates are the highest priority, so they should be queued if the bin is locked and handled sequentially the next time a caching operation fires.
As I said, I never had the opportunity to implement it, and I don't have the time now to look into it for cacherouter, but maybe these ideas will help; if not... whatever :)
~paul
Comment #6
triclops commented: We solved this in #264440: memcache locking causes unnaturally high number of sets by removing the memcache lock entirely.
Comment #7
pasc commented: I'm still unsure about a lock being the best thing to have in a caching system, as there seems to be too much risk of things going wrong when multiple web servers start hitting the same memcached. I'm also concerned by the size the lookup table would reach once you throw a few thousand keys at it, considering it has to be sent both ways across the network every time you want to do a set.
While talking about this with triclops earlier on today, we came up with a couple of ideas...
a) I still think it's better to try and not lock at all, even though it makes cache gets go slightly slower, as it's easier to scale that way. The patch in #264440: memcache locking causes unnaturally high number of sets uses that solution. That patch can probably be improved by storing the invalidation data somewhere else, but it's still working nicely. We've been using that patch in production for the last 3 days now.
b) We came up with an alternate solution, in case people want to keep a lookup table, that doesn't involve blocking every time we can't lock the lookup table during a set.
Basically, the idea is to get a "lookup_offset" key and append its value to the lookup table key. Then, if we get a lock failure, we use incr on that "lookup_offset" key to give us a new offset, and we start writing to the "lookup_{$lookup_offset}" entry from then on. We can then change the behavior on flush to look at all the lookup_* tables and reset the lookup_offset key. Since incr is atomic, this will not break if you have multiple sets at the same time.
Both of these solutions can be made to work with partial flushes if we want to go that way too.
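The rotation idea in (b) could be sketched roughly like this. This is only an illustration, not code from any patch: all function names are hypothetical, and it assumes the PECL Memcache extension, where add() fails if the key already exists and increment() is atomic on the server. It also assumes the "{$bin}_lookup_offset" key has been seeded once with a numeric 0.

```php
<?php
// Sketch of idea (b): rotate to a fresh lookup table on lock failure
// instead of blocking. $memcache is a connected Memcache object.
// Hypothetical helper names, for illustration only.

function lookup_add($memcache, $bin, $cid) {
  $lock = "{$bin}_lookup_lock";
  if ($memcache->add($lock, 1, 0, 10)) {
    // Got the lock: append $cid to the current lookup table.
    $offset = (int) $memcache->get("{$bin}_lookup_offset");
    $key = "{$bin}_lookup_{$offset}";
    $table = $memcache->get($key);
    $table = $table ? $table : array();
    $table[$cid] = TRUE;
    $memcache->set($key, $table, 0, 0);
    $memcache->delete($lock, 0);
  }
  else {
    // Lock is held: atomically grab a new offset and start a new
    // lookup table. increment() is atomic, so concurrent sets each
    // get their own table and nothing is lost.
    $offset = $memcache->increment("{$bin}_lookup_offset");
    $memcache->set("{$bin}_lookup_{$offset}", array($cid => TRUE), 0, 0);
  }
}

function lookup_flush($memcache, $bin) {
  // On flush, walk every lookup table up to the current offset,
  // delete the recorded cids, then reset the offset.
  $max = (int) $memcache->get("{$bin}_lookup_offset");
  for ($i = 0; $i <= $max; $i++) {
    $table = $memcache->get("{$bin}_lookup_{$i}");
    if ($table) {
      foreach (array_keys($table) as $cid) {
        $memcache->delete($cid, 0);
      }
      $memcache->delete("{$bin}_lookup_{$i}", 0);
    }
  }
  $memcache->set("{$bin}_lookup_offset", 0, 0, 0);
}
```

The design point is that the contended path (a set that cannot take the lock) does one atomic incr and one set, and never waits; only flush, which is rare, pays the cost of walking the lookup_* tables.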
Comment #8
andypost commented: At #224772: problem with cache_clear_all I posted statistics on clearing the cache by wildcards.
So the only place where we need lookups is clearing the cache by wildcards and removing cache_temporary items.
For cache_temporary we can use a TTL based on the cron run (in cacherouter, ttl = 180 sec).
As for wildcards, I see no solution other than clearing the whole cache, but that brings poor performance in shared environments: very often a flush eliminates all caching!
Comment #9
slantview commented: I still haven't had time to fix this, but I see that there are three options:
1. Not support partial flushes and get rid of all locking and lookup tables (bad?)
2. Get rid of locks and use expire (#264440: memcache locking causes unnaturally high number of sets)
3. Use some sort of content increment so that you don't have to lock.
Personally I am a fan of #2, and I would evaluate #3 if I could see some sort of code example that shows how it works.
Anybody up for helping me get a prototype of #3 going?
Steve
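For context, option 2 (the expire-based approach in #264440) is usually implemented by comparing each item's creation time against a stored flush timestamp rather than deleting keys. A rough sketch, not the actual patch: the key names are made up, and it assumes each cached item records its creation time (as Drupal cache objects do in ->created).

```php
<?php
// Sketch of expire-based invalidation (option 2): a flush just
// records a timestamp; reads discard anything older than it.
// $memcache is a connected Memcache object; key names are illustrative.

function cache_flush_bin($memcache, $bin) {
  // No keys are deleted; everything written before now becomes stale.
  $memcache->set($bin . '_flush', time(), 0, 0);
}

function cache_get_valid($memcache, $bin, $cid) {
  $item = $memcache->get($bin . '-' . $cid);
  $flush = (int) $memcache->get($bin . '_flush');
  if ($item && $item->created >= $flush) {
    return $item;  // Written after the last flush: still valid.
  }
  return FALSE;    // Stale or missing; the caller rebuilds it.
}
```

This trades one extra get per read for the ability to "flush" a bin in O(1) with no locks and no lookup table at all.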
Comment #10
andypost commented: Just thoughts...
1) Using locks is the first idea that comes to mind, but every lock is a bottleneck.
2) Increment is not available on every backend (APC and eAccelerator have none).
3) apc_cache_info('user'), eaccelerator_list_keys(), and xcache_list(XC_TYPE_VAR) can be used for key enumeration.
Comment #11
andypost commented: About 3): increment brings a performance penalty on full-key generation. We need to check the bin version, and maybe a "tag" (mask) version, then check the key with strpos against all clearing tags.
Example (thoughts):
1) For each bin, store a version number that is incremented on cache_clear_all('*', $bin, TRUE).
2) For subkeys that are cleared by mask, store a version as well.
3) Another system key is "last_cleared_version".
4) Change the key function to include the version number.
5) Write a cron job to clear permanent keys with old version numbers.
For example, take a cache table with a "locale:" key that is cleared by mask.
We already have _lock and _lookup keys, so add _clears (stores the keys that come from cache_clear_all), _cleared (the last cleared version, used to delete permanent keys), and _version (the current version of the bin).
Init:
- Get _version and _clears and store them in the cache object for the bin.
Full-key generation:
$key = 'locale:en'
- Check $key against each entry of _clears; if there is a hit, replace the matching substring of $key with substring + version.
Now $key looks like 'locale:-{version}-en'.
- Now add the bin version:
$key = 'locale:-{version}-en-{bin-version}'
On _get and _set there are no changes.
On cache_clear_all, if $wildcard:
- If $cid == '*', increment the _version key.
- Else check whether $cid is in _clears; if not, add it, then increment the version for $cid in _clears (maybe using a separate key for it).
So storage holds (version numbers can be stored as ints):
{prefix}_clears {array('locale:')}
{prefix}_clears-locale: {version=3}
{prefix}_version {version=2}
{prefix}-locale:-1-en-1
{prefix}-locale:-2-en-1
{prefix}-locale:-3-en-1
{prefix}-locale:-1-ru-1
{prefix}-locale:-2-ru-1
{prefix}-locale:-3-ru-1
{prefix}-locale:-1-en-2
{prefix}-locale:-2-en-2
{prefix}-locale:-3-en-2
{prefix}-locale:-1-ru-2
{prefix}-locale:-2-ru-2
{prefix}-locale:-3-ru-2
PS: this looks expensive on memory, but most of the keys are temporary so the backend expires them; the others should be cleared on the cron run.
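The key-generation step of comment #11's scheme could look roughly like this. A sketch only: the helper name and the shape of $clears are invented for illustration, and only the string manipulation is shown, not the storage of the version keys.

```php
<?php
// Sketch of versioned full keys (comment #11). $clears maps a clearing
// mask (e.g. 'locale:') to its current version; $bin_version is the
// counter bumped by a wildcard '*' clear of the whole bin.

function full_key($cid, $bin_version, array $clears) {
  foreach ($clears as $mask => $version) {
    if (strpos($cid, $mask) === 0) {
      // 'locale:en' -> 'locale:-3-en' when mask 'locale:' is at version 3.
      $cid = $mask . '-' . $version . '-' . substr($cid, strlen($mask));
      break;
    }
  }
  // Append the bin-wide version: 'locale:-3-en-2'.
  return $cid . '-' . $bin_version;
}

// A wildcard clear never touches stored items; it just bumps a version,
// which makes every old full key unreachable.
$clears = array('locale:' => 3);
echo full_key('locale:en', 2, $clears);  // prints "locale:-3-en-2"
```

Bumping either version orphans the old entries rather than deleting them, which is exactly why the PS above notes the memory cost and the need for a cron cleanup of permanent keys.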
Comment #12
angch commented: I have this bug, or one similar to it, but I think the cause is the same; I tracked it down to a faulty lock() and unlock() implementation in engines/memcache.php (Drupal 5 version, beta9).
The first (minor) bug is that in lock() the call should be $memcache->add($key, $var, $flag, $expire), but $memcache->add($key, $flag, $expire) is used instead. In the attached patch I fixed this, as well as adding a 10-second timeout to the lock, just in case.
Second, the probable critical bug is in unlock(), where $memcache->delete($key) is used without a timeout. I'm not sure what the default timeout is, but you need to set it to 0 to delete the key immediately. My hunch is that unlock() unpatched does not work, because Memcache::delete()'s timeout does not clear the key immediately.
Edit: ...some Wireshark sessions later... Seems like a bug in PHP? I'm on the default Ubuntu Intrepid php 5.2.6, and $memcache->delete($key) without a timeout value does not send a memcache DELETE command at all...
Edit: There's a strong possibility that this bug only affects the php5-memcache module (version 3.0.1-1) in Ubuntu Intrepid, as it appears to be bug-free under Ubuntu Hardy's php5-memcache version 2.2.0-1. If this is true, then my bug and this bug may not be related.
Edit: Yup, my bug is a different bug, sorry.
In my case, the lock(menu), unlock(menu), lock(menu) pattern will lock up on the second lock(menu).
The attached patch is for Drupal 5.
See also
http://www.php.net/manual/en/function.memcache-add.php
http://www.php.net/manual/en/function.memcache-delete.php
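For reference, the two fixes described in #12 amount to something like this. This is a sketch of corrected lock()/unlock() methods against the PECL Memcache class, not the actual attached patch; the '_lock' key suffix is an assumption.

```php
<?php
// Sketch of the two fixes from comment #12.

function lock($memcache, $key) {
  // Memcache::add($key, $var, $flag, $expire): note the $var argument,
  // which the original code omitted. The 10-second expiry means a
  // stale lock can never live forever.
  return $memcache->add($key . '_lock', 1, 0, 10);
}

function unlock($memcache, $key) {
  // Memcache::delete($key, $timeout): an explicit timeout of 0 deletes
  // the key immediately. Without it, some client builds (e.g. the
  // Intrepid 3.0.1-1 package mentioned above) never send DELETE at all.
  return $memcache->delete($key . '_lock', 0);
}
```

Because add() fails when the key already exists, it doubles as the atomic test-and-set; the expiry on the lock key is the safety net that makes the lock(menu), unlock(menu), lock(menu) hang self-heal after at most 10 seconds.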
Comment #13
angch commented: Patch of the above, but for Drupal 6. Untested; feedback needed.
Comment #14
andypost commented: I suppose it's much better to set the timeout to 1 second and forget about the compress setting.
Comment #15
andypost commented
Comment #16
mikl commented: Hmm, I just had a similar problem: a single Apache worker using 60% CPU and memcached using 40% CPU. I'm not quite sure that it's this bug, but it seemed to be some kind of race condition.
Comment #17
vacilando commented: Subscribing
Comment #18
mcurry commented: subscribe
Comment #19
andypost commented: A good solution was committed to memcache: http://drupal.org/cvs?commit=310354
Comment #20
andypost commented: Marked as duplicates: #598474: Memcache not working?, #493498: Memcache and cache_clear_all wildcard.
#493448: Memcache and cache_clear_all wildcard is the implementation for the memcache module.
Comment #21
andypost commented: #13 was committed to all branches.
Comment #22
andypost commented: bump
Comment #23
andypost commented: http://drupal.org/cvs?commit=410214
Committed the 10-second interval.
Comment #24
webdevbyjoss commented: Hi guys,
I'm able to reproduce the issue in my environment on the latest Drupal 6.22 + the latest CacheRouter 6.x-1.0-rc2.
The issue is reproducible at a load of about ~150 requests/second (up to 10,000 requests/minute).
The PHP script hangs in an infinite loop with ~100% CPU utilization.
Profiler dump image: drupal-6-cacherouter-memcache-fail.png
The profiling dump clearly shows that it loops in the following code, calling time() and MemcachePool::add():
cacherouter/engines/memcache.php:186
Clearing the "memcached" data (a daemon restart) helps break the infinite loop, but after some time a new "httpd" process appears and again eats ~100% of a CPU.
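One hedged guess at what the profile shows: the engine spins on Memcache::add(), waiting for a lock key that is never released (compare the delete() bug in #12). A bounded wait along these lines would at least turn the hang into a cache bypass. Illustrative only, not the actual cacherouter code at memcache.php:186:

```php
<?php
// Illustrative bounded lock wait: give up after $max_wait seconds
// instead of spinning forever on a stuck lock key.
function lock_with_timeout($memcache, $key, $max_wait = 2) {
  $start = time();
  while (!$memcache->add($key . '_lock', 1, 0, 10)) {
    if (time() - $start >= $max_wait) {
      return FALSE;  // Caller should skip the cache write, not hang.
    }
    usleep(10000);   // Back off 10 ms instead of busy-looping.
  }
  return TRUE;
}
```

An unbounded version of this loop, plus a lock key that outlives its owner, would produce exactly the symptoms reported: ~100% CPU in time()/add() until the memcached daemon is restarted and the lock key vanishes.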