I have tested this on both D7 and D6 and the symptoms are the same -- when you click the clear all caches button, it freezes, the PHP backend hits its timeout, Nginx displays a 504 Gateway Timeout, etc., while Redis appears to go into a loop and MONITOR displays lines like:
+1384401800.803547 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401800.853805 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401800.929112 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.029426 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.154711 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.304995 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.480312 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.680627 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.905955 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.156259 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.431601 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.731968 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.057254 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.407626 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.782965 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401804.183303 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401804.608594 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
Changing the flush mode has no effect on this, so it is probably specific to the lock code. I can reproduce this on almost every attempt to click the clear all caches button.
Comments
Comment #1
omega8cc commented: Another example -- this one is from a D6 site (while the above is from D7):
Comment #2
omega8cc commented: More precisely, in D7 the cache clear button freezes when you click it again after it worked on the first click, while in D6 it always freezes.
Comment #3
pounard commented: Thanks for the detailed report; that's something I stumbled upon but never took the time to debug. It's very helpful, I'll look into it as soon as I can.
Comment #4
pounard commented: I am unable to reproduce the bug. I even ran three concurrent threads, each holding the variable lock for 5 seconds; no matter what happens, it works and always ends up releasing the lock very quickly.
Comment #5
pounard commented: Also tried with 4 concurrent cache clears.
Comment #6
pounard commented: OK, some important questions:
Comment #7
pounard commented: I ran the same test once again with the Predis driver, using 8 concurrent drush processes with random sleeps during variable creation and inside the lockAcquire and lockRelease mechanisms; I never managed to create a deadlock.
Edit: Got one by running 8 concurrent `drush cc all`! What happens is that a few of those threads sleep longer and longer, waiting for the others to give the various locks back:
The lockWait() function might be one of the reasons why this fails so badly. Drupal core's lock API is really wrong. Edit (2): Still trying; lockWait() is not guilty (it is almost never called except, sometimes, in variable_initialize()), and no cache clear ends up being so long that it breaks everything...
Comment #8
omega8cc commented: Here is my testing environment:
After looking at the list of post-2.2.4 commits, I think I could try PhpRedis master/HEAD again, even though I reverted to 2.2.4 because HEAD caused some major WTF moments in my previous tests with an older version of the Redis integration module.
Comment #9
omega8cc commented: In this environment I can reproduce an immediate freeze (vanilla core only, no contrib) after a single hit on the cache clear button in Pressflow 6.28, while on Drupal 7.23 it always freezes on the second hit (just *after* the first completes OK).
Comment #10
omega8cc commented: And for reference, my default config.
Comment #11
pounard commented: Thank you once again.
Comment #12
pounard commented: OK, I'm running:
I trust that PHP and Redis are not the problem; I would suspect PhpRedis. But first I'll run extensive tests of my implementation of the WATCH/EXEC block.
Comment #13
pounard commented: Manual testing of my WATCH/MULTI/EXEC block actually works fine, however the operations of the two concurrent threads end up interleaved. I'll now try with the same PhpRedis version.
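For readers unfamiliar with the pattern being tested here, an optimistic check-and-set with WATCH/MULTI/EXEC in PhpRedis generally looks like the following. This is a hedged sketch and not the module's actual code; the key name, owner token, and TTL are illustrative, and it requires a running Redis server.

```php
<?php
// Sketch of an optimistic lock acquire via WATCH/MULTI/EXEC (PhpRedis).
// $key, $owner and $ttl are illustrative names, not the module's own.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key   = 'example_:lock:menu_rebuild';
$owner = uniqid('', true);   // unique token identifying this page run
$ttl   = 30;

$redis->watch($key);          // EXEC will abort if $key changes meanwhile
$current = $redis->get($key); // returns false when the key does not exist
if ($current === false || $current === $owner) {
  // Queue the write; exec() returns false if another client touched $key
  // between WATCH and EXEC, in which case the acquire must be retried.
  $result = $redis->multi()->setex($key, $ttl, $owner)->exec();
  $acquired = ($result !== false);
} else {
  $redis->unwatch();          // someone else holds the lock; stop watching
  $acquired = false;
}
```

The `UNWATCH` in the else branch is the same command that shows up in the monitor logs later in this thread.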
Comment #14
pounard commented: More questions:
Comment #15
pounard commented: Your key symptom (the GET coming again and again) can only happen in the lockWait() method. There is no other possible way to reproduce this exact command so many times. I think that at some point in your environment, locks are not released as they should be, and another thread kicks in and ends up in an infinite waiting loop in lockWait(). This can happen only if you acquire a lock for an amount of time equal to or greater than the PHP execution time limit, the only case where it could end on a fatal error without running the shutdown handler that is supposed to release all locks. This would make every other concurrent PHP process find an already-acquired lock and run lockWait() for the maximum time, i.e. 30 seconds.
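The repeated GETs with steadily widening intervals in the monitor output at the top of this issue match a poll-and-back-off wait loop. A minimal sketch of that general pattern (function name and back-off constants are illustrative, not the module's actual code):

```php
<?php
// Poll-and-back-off wait: re-check the lock until it disappears or we give
// up. The growing sleep between polls explains the widening gaps between
// the GET commands seen in MONITOR. Illustrative sketch only.
function lock_wait(callable $lockExists, int $maxSeconds = 30): bool {
  $sleepUs  = 25000; // start polling every 25 ms
  $deadline = microtime(true) + $maxSeconds;
  while (microtime(true) < $deadline) {
    if (!$lockExists()) {  // one GET per iteration in a real backend
      return true;         // lock released, caller may proceed
    }
    usleep($sleepUs);
    // Back off geometrically, capped at 0.5 s between polls.
    $sleepUs = min((int) ($sleepUs * 1.25), 500000);
  }
  return false; // timed out: the 30-second spin described above
}
```

With a lock that is never released, this loop spins for the full timeout, which is exactly the flood of GETs the reporter captured.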
Comment #16
omega8cc commented: Just tested this again, with PhpRedis 2.2.2 this time; same problem. I will now try to downgrade Redis itself as well.
Comment #17
pounard commented: I don't think this is linked to PhpRedis or Redis; it is rather an application-level bug in the lock backend or Drupal core.
Comment #18
omega8cc commented: Nope, everything is the same with Redis 2.6.13 and PhpRedis 2.2.2 in my tests. I'm attaching the complete monitor output for two subsequent cache clears on a D7 site. Note that only the second cache clear produces the lines shown below. The interesting part is that the loop starts after the `UNWATCH` command.
Comment #19
omega8cc commented: That said, the cache clear on D7 doesn't hang completely; it just takes much longer on the second clear.
With D6, however (using PhpRedis 2.2.2 and Redis 2.6.13), the first cache clear is as slow as the second one in D7 but doesn't hang completely, whereas the second clear hangs until it hits the PHP timeout.
Comment #20
pounard commented: Do you have any AJAX requests running on your site, or any other kind of concurrent HTTP requests, while you clear your caches? Does the first cache clear finish cleanly, with no WSOD or anything of the sort? Do you have any errors or warnings in either the Drupal watchdog or the error log?
Comment #21
pounard commented: I definitely cannot reproduce it. I need answers to the questions I asked in #14.
If I recall correctly, I experienced it once or twice at home on another box with a different Linux distro and a buggy PhpRedis; I will try on that one, maybe tonight.
Comment #22
omega8cc commented: Here is where it hangs on the D6 site:
It hangs like this not only on cache clear, but also on submit in the modules admin section.
It is Debian Squeeze with vanilla Drupal core and only the Admin module/theme active. Maybe that is the culprit? It would be the only thing that comes to my mind, but it is only on the D6 site, while D7 uses the default Seven theme and its second cache clear is still always sloooow (though it doesn't hang completely).
[EDIT] Nope, it is the same in D6 with Garland in the admin.
> What is your PHP time limit?
max_execution_time = 300
max_input_time = 300
> What is your nginx CGI gateway timeout?
fastcgi_connect_timeout 60;
fastcgi_send_timeout 300;
fastcgi_read_timeout 300;
> Do you have relevant error messages in error.log?
There are no errors reported in the PHP-FPM backend.
> Do you have relevant error messages in Drupal watchdog?
There is nothing in the dblog/watchdog because the action never completes and just hits timeout.
Comment #23
omega8cc commented: In a few subsequent tests, the Redis monitor freezes here on a D6 cache clear (this is with default modes, not modified):
Not sure if this hints at anything helpful :/
Comment #24
pounard commented: They are helpful; in the worst case they help me know where the problem is not :) It should not freeze anywhere other than where it plays with locks. The #22 log is a lock release; without the return code of the EXEC command I cannot say whether it succeeded. The #23 log starts with a lock acquire, which seems successful, I guess, because even without any return value the following command is a menu links cache clear, so the menu rebuild actually happens. It sounds weird to have a bug anywhere else. So, if your site really freezes in the second log when the KEYS command is invoked, I guess it could be due to too large a packet being sent or received. I must explore this path too.
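One path worth noting about the halt at KEYS: Redis executes KEYS as a single O(N) pass that blocks the server while it runs, so a wildcard clear over a large bin stalls every other client for the duration. A non-blocking alternative is the incremental SCAN command; a hedged sketch (assuming Redis >= 2.8 and a PhpRedis build with SCAN support -- newer than the versions tested in this thread -- with an example key pattern, not the module's code):

```php
<?php
// Incremental wildcard deletion with SCAN instead of the blocking KEYS.
// Requires a running Redis server; the pattern below is an example.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY); // retry empty pages

$it = NULL;
while (($keys = $redis->scan($it, 'example_:cache_menu:*')) !== false) {
  if ($keys) {
    $redis->del($keys); // delete each small batch as it is found
  }
}
```

Each SCAN call only inspects a small slice of the keyspace, so other clients (including lock polling) keep making progress between batches.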
Comment #25
omega8cc commented: The weird part is that I can reproduce the same issue every time, in both D7 and D6, with various Redis and PhpRedis versions, and with essentially empty sites (a single node added, zero contrib modules, etc.).
Comment #26
omega8cc commented: I guess I should also include the Redis config used (mostly default, but why not), for the complete picture. The value for `maxmemory` is a placeholder, automatically tuned depending on the RAM available on the system.
Comment #27
omega8cc commented: OK, once again, here is what the Redis monitor displays and where it stays "frozen" while the site goes into its "spinning" mode until it hits the timeout; nothing more is displayed in the monitor after the last line with "KEYS" "d6.o1.v174q.nyc.host8.biz_:cache_menu:*", as shown below:
Meanwhile, the verbose log enabled in Redis shows only that there is an active connection and nothing else, until the last line says: "Closing idle client".
You can see from this log how long it takes until Redis gives up, while Nginx displays 504 Gateway Time-out because PHP-FPM timed out.
And this happens on every attempt to clear all caches in the Pressflow 6.28 site.
Comment #28
omega8cc commented: Now I have tried `drush cc all -d` on that D6 site a few times, and on the command line it performs exactly as it does via the browser (PHP-FPM) in D7 -- the first clear is fast with no issues, the second is slower with the same flood of lines as shown earlier -- but it doesn't hang!
This seems to prove that there is something specific to PHP-FPM plus D6, since only this combination produces the full "freeze" syndrome, while PHP-CLI for D6 and D7, and PHP-FPM for D7, always give a normal first clear and a slow second one, with the same flood of `lock:menu_rebuild` lines.
Comment #29
omega8cc commented: No idea what happened. I have re-installed Redis and PhpRedis from HEAD/master:
and used the same config, with custom modes added only in D6 (which may not play any role, though). [EDIT] Just turned them off and everything still just works!
And it no longer happens! I can click the clear all caches button many times in a row in D6 and D7 and it is always instant!
Go figure!
I'm going to close this giant ticket for now, until someone can reproduce it (hopefully not!).
Thank you for your assistance!
Comment #30
omega8cc commented: Well, it happens again after some time. I suspect it is something at the PHP level then, which is weird, but restarting redis-server and then php-fpm helps immediately. I will continue to debug this, but at this stage I believe it is not related to Redis, PhpRedis, or the integration module -- rather something on the php-fpm side (and possibly also php-cli, but for different reasons), e.g. the Redis cache "poisoned" by php-fpm, maybe opcode-cache related, etc. I will report anything I find to help others debug and avoid this major WTF experience.
Comment #31
omega8cc commented: It looks like these partially random issues were caused by having `tcp-keepalive` enabled in Redis. After disabling it, we couldn't reproduce the problem any longer.
Comment #32
omega8cc commented: Title updated to include the real condition behind this mess we have experienced.
Comment #33
pounard commented: Wow, that's very helpful! Thank you very much. I'll do some tests over the weekend and update the documentation accordingly. I'll try to see whether it's possible to patch the module so we can avoid asking people to change their Redis server configuration.
Comment #34
omega8cc commented: There must be something else/more, since now, after a few hours of not touching anything, so still with `tcp-keepalive` disabled in Redis, all those symptoms are back in full glory. Restarting php-fpm has no effect, so it is also not related to anything in the opcode cache. Only restarting Redis helps. I guess there are some other settings to modify, but we will start with the default, vanilla Redis config to see whether this is specific to our (almost default) config.
Comment #35
omega8cc commented: Correction: restarting redis-server is *not* enough. One must also delete the Redis dump.rdb file. This obviously suggests that the problem is caused by some mess stored in dump.rdb -- the question is what causes this mess and why. We tested redis-server restarts while gradually reverting redis.conf to the vanilla version and nothing helped. Only deleting dump.rdb helped.
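For reference, the two redis.conf settings being toggled throughout these tests are RDB snapshotting (`save`) and TCP keepalive. A cache-only sketch disabling both might look like this (an illustration, not the actual config used in this thread):

```conf
# redis.conf -- cache-only sketch: no persistence, no TCP keepalive probes
save ""           # disable RDB snapshots (equivalent to removing all "save" lines)
tcp-keepalive 0   # 0 disables SO_KEEPALIVE probes on client connections
```

With snapshots disabled, nothing ever lands in dump.rdb, so there is no stale state to delete between restarts.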
Comment #36
omega8cc commented: We have disabled all `save *` lines in redis.conf to use RAM only, and we will see what happens during the next few hours of testing.
Comment #37
omega8cc commented: Re-opening while we still debug this to determine WTF is happening here. After thinking a bit more, I suspect that having Redis persistence enabled (the Redis default), so that everything gets dumped from memory to dump.rdb, may silently affect basically all cache bins in various bad ways, because cache bins by definition should not use any kind of "permanent" storage with its own TTL management that is not directly controlled by the integration module or Drupal. It may be just the lock feature that manifests this problem in a spectacular way.
Comment #38
pounard commented: Hum... Me.. Needs.. Thinking...
These kinds of problems I deeply hate.
Comment #39
omega8cc commented: Further cross-testing revealed that with persistence disabled but `tcp-keepalive` enabled, the problem reappears after a fresh Redis restart. So it looks like both `tcp-keepalive` must be disabled (set to `0`) and persistence must be disabled (by commenting out all `save *` lines). We will report back if further testing proves this correct.
Comment #40
pounard commented: Thank you very much once again for all this. I am afraid the way the Redis module uses Redis is not good at all. I should go and ask questions on Stack Overflow and other forums about this. I have no time for it right now; I hope I will next week.
Comment #41
pounard commented: From what I read, I'm looking at the Redis side for a trail to find out what is happening.
Just putting in this post some random URLs I found, for later reading:
Comment #42
omega8cc commented: After a few more hours there are still absolutely no issues with both persistence (`save`) and `tcp-keepalive` disabled.
[EDIT] I think it is pretty understandable that when Redis is used as a backend for caching only, any settings that could interfere (and both persistence and `tcp-keepalive` will interfere) must be turned off. I can't imagine a workaround or fix for this requirement, since Drupal must control this backend 100% via the integration module. But obviously I don't know Redis internals well enough to turn this opinion into a fact :)
Comment #43
pounard commented: They interfere for sure, but they should not create such a freeze. This is something very weird.
I will document that as a first step towards fixing it, but I'll also check whether my locking algorithm has potential race conditions, which is very plausible.
The URLs I linked above, which I will start reading when I have dedicated time for it at work, contain good trails for looking into this.
Comment #44
omega8cc commented: Thank you for being so responsive, it is greatly appreciated! We at the Barracuda team look forward to continuing to contribute, in a small but hopefully helpful way, to this great project!
Comment #45
pounard commented: I have to admit that I more or less left the project in an abandoned state, since I didn't get many contributions (not forgetting those I did get from various people). But knowing this module is being used in real-life use cases encourages me to keep maintaining it! So I have to thank you for being so responsive too.
Comment #46
omega8cc commented: Looks like there are still more hidden issues. After several hours without any of the previous symptoms, the cache clear still works instantly on the D7 site on every click, but the D6 site is back to its weird "freeze on cache clear" state -- on every click, even with RAM-only storage and no `tcp-keepalive`. Only a redis-server restart helps. Uh-oh :/
Comment #47
omega8cc commented: Further tests confirmed (so far) that the non-standard mode settings shown below, enabled for the D6 site, could be a reason for the recurring problems in D6, even with both persistence and `tcp-keepalive` disabled:
Initially we tried these more aggressive modes in D6 with the intention of helping D6 *avoid* the "frozen" syndrome. But after confirming in a few hours-long tests that both persistence and `tcp-keepalive` must be disabled or the problem hits both D7 and D6 again, we did not disable these custom modes for D6, and, as reported above, D6 experienced the "frozen" syndrome again.
But now, after several more hours of further testing with both persistence and `tcp-keepalive` disabled and all modes left at their defaults (i.e. with the lines shown above removed), it no longer happens, at least not in the last 6 hours.
Comment #48
omega8cc commented: OK, so it didn't happen again on the D6 site only because the only URI visited was admin/settings/performance, to hit the cache clear button. Once we tried clicking around, editing a node, and then returning to admin/settings/performance to hit the button again, it froze again, while no custom modes were active and both persistence and `tcp-keepalive` were disabled.
What is worse, we tried the same steps with D7 (instead of just repeatedly hitting the cache clear button) and it experienced the same issue again, with the same flood of `lock:menu_rebuild` lines as before, while both persistence and `tcp-keepalive` were disabled.
Comment #49
pounard commented: This is a very weird issue; I have to reproduce it. I do some WATCH/MULTI/EXEC transactions in the cache set operation; maybe this is related.
Comment #50
pounard commented: OK, I reproduced it on my personal box. The symptom is there:
This code programmatically runs the first bit of the locking transaction:
What happens there is that this line:
returns the Redis instance instead of the value. This should never happen unless a pipeline or a MULTI/EXEC block has already been started, which makes me think that at some point a race condition happens and a MULTI or a pipeline is never closed.
I guess this is my fault; I have to find the broken piece of code now.
Comment #51
pounard commented: I'm not sure, but PHP, PhpRedis, and the use I made of them could all together be the culprit.
Let me explain. I often write:
This piece of algorithm is not viable, because `$id` is a string and `$client->get()` can return various types of output. First, it can return an empty string, in which case I don't get into the if statement because `"" != "some other string"`: this is wrong, because when there is no lock set I am therefore the legitimate owner.
Secondly, if the return value is not a string, there might be unexpected behavior due to PHP's implicit type casting; I should have used `===` or `!==` instead of `==` or `!=`.
I replaced those occurrences everywhere using:
And the bug disappeared from my machine. I don't know why; I did not find the real culprit here, but it just disappeared.
After that I also reworked the PhpRedis transaction handling, added more conditions where the UNWATCH command would be run, and removed some useless DISCARD calls (I don't know if this has any effect, but hey, you never know).
I'm not really sure what happens there. The bug just disappeared from my box. Worse even, when I switched back to the old code version, it did not reappear (WTF?).
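The loose-comparison hazard described in this comment can be reproduced in isolation. A small demonstration (the owner token is an example value in the format seen in the monitor logs; no Redis server involved):

```php
<?php
// Why '==' is dangerous when a Redis GET can return false (missing key),
// '' (empty value), or a string token. Values here are illustrative.
$owner = '429959810528796f7a47ef4.44059211';

// PhpRedis get() returns false when the key does not exist; we model
// that case with a literal here.
$value = false;

var_dump($value == $owner);   // false: false == non-empty string
var_dump(false == '');        // true: loose rules conflate the two cases
var_dump('' == '0');          // false: loose rules are also inconsistent
var_dump($value === $owner);  // false: strict compare is type-safe
var_dump(false === '');       // false: strict compare keeps them distinct
```

The `false == ''` case is the trap: "no lock key" and "empty lock value" become indistinguishable under `==`, which is exactly the kind of ambiguity `===` removes.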
Comment #52
pounard commented: Could you please test this patch?
Comment #53
omega8cc commented: Thanks, I will test this patch and report back. This issue is not easy to reproduce, and we have already been confused a few times, as documented above. You need to start with a clean Redis restart, then do some things like editing a node and visiting a few admin pages, and only then click the cache clear button a few times (on D7) or just once (on D6) while running the Redis monitor. Let's see how it goes!
Comment #54
omega8cc commented: The patch from #52 made no difference. Here are the exact steps to reproduce the freeze in D6 and the slowdown on the second clear in D7:
1. Restart redis-server and enter its monitor.
2. Visit the site as admin, edit some node.
3. Go to the performance page and hit "Clear cached data" in D6 or "Clear all caches" in D7.
You should get an instant freeze in D6, and the Redis monitor will halt at this line:
"KEYS" "prefix-foo:cache_menu:*"
In D7 the first hit on the "Clear all caches" button will work fine, but the second will kind-of-freeze while the monitor is flooded with lines:
Comment #55
omega8cc commented: The suspicious thing is that in both D6 and D7 it seems to be related to the `cache_menu` bin.
omega8cc commented: Well, I went ahead and excluded the cache_menu bin with:
And guess what? I can't reproduce the freeze any longer! Even with the latest patch reverted.
Wow. Now I understand even less than before.
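For context, excluding a bin in D7 usually means a per-bin cache class override in settings.php. A hypothetical sketch (the exact snippet used above was not quoted; paths and class names follow the 7.x Redis module's documented setup) that keeps cache_menu on the core database backend:

```php
// settings.php -- route caches to Redis but keep cache_menu on the DB backend.
$conf['cache_backends'][]    = 'sites/all/modules/redis/redis.autoload.inc';
$conf['cache_default_class'] = 'Redis_Cache';
// Hypothetical exclusion: leave the problematic bin on the database cache.
$conf['cache_class_cache_menu'] = 'DrupalDatabaseCache';
```

Per-bin `cache_class_<bin>` overrides always win over `cache_default_class`, so only the listed bin falls back to the database.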
Comment #57
omega8cc commented: Now I went even further and enabled `tcp-keepalive 60` -- and still no freeze syndrome with `cache_menu` excluded.
Comment #58
omega8cc commented: OK, why not re-enable the (default) persistence mode in Redis as the next step? Guess what? It made no difference; still no sign of the freeze syndrome or any slowdown. Everything just works if I exclude `cache_menu`.
I guess this bin needs some special attention? I don't know why, but the results are pretty clear.
Comment #59
pounard commented: I can see that. It's pretty weird; maybe it's due to the massive number of keys being stored, with extremely long names -- maybe at some point it tickles Redis' limits a bit. I don't know; I'll try and see.
Comment #60
omega8cc commented: OK, so I have applied the patch from #52 again, and we will see how it goes with `cache_menu` excluded over the next few hours.
Comment #61
omega8cc commented: Also note that all those tests ran on a vanilla site with no contrib, so the same syndrome related to `cache_menu` could theoretically affect other bins associated with contrib modules.
Comment #62
pounard commented: Hmm, maybe some chunks sent to Redis are too big; could you try this one, with cache_menu back in Redis?
Comment #63
pounard commented: Forget about my last patch; I don't see why this could hurt when dealing with only 2 or 3 menu entries...
Comment #64
omega8cc commented: Right, so on a more complex D6 site (the ManagingNews distro) it freezes on another bin:
"KEYS" "mn.o1.v174q.nyc.host8.biz_:cache_views:*"
I have also tested this with a more complex D7 site (Open Atrium 2.0), but it didn't halt on any bin while `cache_menu` is excluded.
Comment #65
omega8cc commented: Yeah, the last patch didn't make any difference.
Hmm, should I still see lines like this one (I mean the `lock:menu_rebuild` part):
+1384617719.718692 [0 127.0.0.1:33204] "SETEX" "mn.o1.v174q.nyc.host8.biz_:lock:menu_rebuild" "30" "429959810528796f7a47ef4.44059211"
after I have excluded the `cache_menu` bin from the Redis backend?
[EDIT] Locks are separate from bins, but still..
Comment #66
pounard commented: This is a lock, not a cache value. The value you see at the end is the unique identifier for the page run, which allows the lock to act as a mutex (the $owner variable). It is normal that you have this line.
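The owner-token scheme explained here can be sketched abstractly: release may only succeed for the client that acquired the lock. In the sketch below, a plain PHP array stands in for Redis, and the function names are hypothetical (in the real backend, acquisition would be a SETEX/SET NX EX with a TTL):

```php
<?php
// Owner-token mutex sketch: the stored token identifies the acquirer.
// $store is an in-memory stand-in for Redis, for illustration only.
function lock_acquire(array &$store, string $name, string $owner): bool {
  if (isset($store[$name])) {
    // Re-acquiring our own lock is allowed; anyone else is refused.
    return $store[$name] === $owner; // strict compare, as per comment #51
  }
  $store[$name] = $owner; // in Redis: SETEX with a TTL so crashes self-heal
  return true;
}

function lock_release(array &$store, string $name, string $owner): bool {
  if (($store[$name] ?? null) === $owner) {
    unset($store[$name]);
    return true;
  }
  return false; // another page run owns it; leave it alone
}
```

The token is what prevents one page run from releasing (or stealing) a lock held by another, which is the mutex property being discussed.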
Comment #67
omega8cc commented: To avoid the freeze syndrome on a D6 site (ManagingNews) I had to exclude three bins:
This doesn't look good.
Comment #68
omega8cc commented: Just tested this again, and both the D6-based ManagingNews and OpenAtrium 1 sites require all 3 bins listed above to be excluded to avoid the freeze syndrome, while the D7-based OpenAtrium 2 needs only `cache_menu` excluded. Weird, weird..
Comment #69
pounard commented: Says Yoda: weird it is, yes.
Comment #70
pounard commented: Don't worry, I did not forget this issue; I really don't have time for it right now. I'll continue as soon as I can.
Comment #71
omega8cc commented: No problem, I understand.
Comment #72
j0rd commented: I have a Redis install (PhpRedis) with around 250k keys after pre-caching what I needed. `drush cc all` seems to take roughly two minutes, where on memcache it took a couple of seconds. So I read over this thread; I don't believe my issue is related.
--
I've run into problems on D6 with the lock-related logic in the code that rebuilds menus.
I've found that Drupal code which uses locks doesn't always act properly on the return values, which can lead to tramples / race conditions.
Here are my old findings with regard to cache_menu rebuilds. Not sure if it ever got fixed, or whether it is even related to the problems you're having.
https://drupal.org/comment/6982994#comment-6982994
There are also race conditions in image_style file creation with regard to database locks. Again, not sure whether they've been fixed either.
https://drupal.org/comment/7367644#comment-7367644
So while your lock code might work just fine, the Drupal code which uses your locks has been implemented poorly and could be causing its own race conditions or tramples.
Not sure if this helps at all though.
---
PS: I'm in here because my `drush cc all` runs don't seem to clear caches like they did with memcache, and I'm looking into stale cache issues. No answers here, but hopefully you guys will figure this one out. I'll let you know if I run into this problem.
Comment #73
omega8cc commented: Well, I just tested this by applying and reverting this simple patch back and forth probably more than 30 times, and I can confirm that it fixes the problem in both D6 and D7. Wow!
Comment #74
pounard commented: @#73 wow, will look into that ASAP.
Comment #75
pounard commented: Fixed by mja in #2140897: cache_clear_all() not properly handled in PhpRedis.php.
See commit http://drupalcode.org/project/redis.git/commit/810cce5
Happily closing this one now; it has been a great deal of lost hair.
Comment #76
pounard commented: Release 7.x-2.6 is 'en route' and coming soon, containing this bugfix.
Comment #77
omega8cc commented: OK, but please note that I have also tested everything with the patch from #52, which is not included in 7.x-2.6. So, to document the exact code I have tested, attached is a diff between 7.x-2.6 and my test branch.
Comment #78
pounard commented: OK, so I will review this patch and take whatever needs to be taken.