I have tested this on both D7 and D6 and the symptoms are the same -- when you click the clear all caches button, it freezes, the PHP backend hits its timeout, Nginx displays a 504 Gateway Timeout, etc., while Redis appears to go into a loop and MONITOR displays lines like:
+1384401800.803547 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401800.853805 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401800.929112 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.029426 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.154711 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.304995 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.480312 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.680627 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401801.905955 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.156259 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.431601 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401802.731968 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.057254 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.407626 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401803.782965 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401804.183303 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
+1384401804.608594 [0 127.0.0.1:49987] "GET" "uc3.o1.v174q.nyc.host8.biz_:lock:menu_rebuild"
Changing the flush mode has no effect on this, so it is probably specific to the lock code. I can reproduce this on almost every attempt to click the clear all caches button.
Comments
Comment #1
omega8cc commented: Another example -- this one is from a D6 site (while the above is from D7):
Comment #2
omega8cc commented: More precisely, in D7 the cache clear button freezes when you click it again after it worked on the first click, while in D6 it always freezes.
Comment #3
pounard commented: Thanks for the detailed report; that's something I stumbled upon but never took the time to debug. It's very helpful, I'll look into it as soon as I can.
Comment #4
pounard commented: I am unable to reproduce the bug. I even ran three concurrent threads, each holding the variable lock for 5 seconds; no matter what happens, it works and always ends up releasing the lock very quickly.
Comment #5
pounard commented: Also tried with 4 concurrent cache clears.
Comment #6
pounard commented: OK, some important questions:
Comment #7
pounard commented: I ran the same test once again with the Predis driver, using 8 concurrent drush processes with random sleeps during variable creation and inside the lockAcquire and lockRelease mechanisms; I never managed to create a deadlock.
Edit: Got one by running 8 concurrent `drush cc all`! What happens is that a few of those threads sleep longer and longer, waiting for the others to give the various locks back:
The lockWait() function might be one of the reasons why this fails so badly. Drupal core's lock API is really wrong. Edit (2): Still trying; lockWait() is not guilty (it is almost never called except, sometimes, in variable_initialize()), and no cache clear ends up being so long that it breaks everything...
Comment #8
omega8cc commented: Here is my testing environment:
After looking at the list of post-2.2.4 commits, I think I could try PhpRedis master/HEAD again, even though I reverted to 2.2.4 because HEAD caused some major WTF moments in my previous tests with an older version of the Redis integration module.
Comment #9
omega8cc commented: In this environment I can reproduce an immediate freeze (vanilla core only, no contrib) after a single hit on the cache clear button in Pressflow 6.28, while on Drupal 7.23 it always freezes on the second hit (just *after* the first completes OK).
Comment #10
omega8cc commented: And for reference, my default config.
Comment #11
pounard commented: Thank you once again.
Comment #12
pounard commented: OK, I'm running:
I trust that PHP and Redis are not the problem; I would suspect PhpRedis. But first I'll run extensive tests of my implementation of the WATCH/EXEC block.
Comment #13
pounard commented: Manual testing of my WATCH/MULTI/EXEC block actually works fine, however the operations of the two concurrent threads end up interleaved. I'll now try with the same PhpRedis version.
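For readers unfamiliar with the pattern being tested here, an optimistic check-and-set with WATCH/MULTI/EXEC in PhpRedis generally looks like the following. This is a hedged sketch and not the module's actual code; the key name, owner token, and TTL are illustrative, and it requires a running Redis server.

```php
<?php
// Sketch of an optimistic lock acquire via WATCH/MULTI/EXEC (PhpRedis).
// $key, $owner and $ttl are illustrative names, not the module's own.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key   = 'example_:lock:menu_rebuild';
$owner = uniqid('', true);   // unique token identifying this page run
$ttl   = 30;

$redis->watch($key);          // EXEC will abort if $key changes meanwhile
$current = $redis->get($key); // returns false when the key does not exist
if ($current === false || $current === $owner) {
  // Queue the write; exec() returns false if another client touched $key
  // between WATCH and EXEC, in which case the acquire must be retried.
  $result = $redis->multi()->setex($key, $ttl, $owner)->exec();
  $acquired = ($result !== false);
} else {
  $redis->unwatch();          // someone else holds the lock; stop watching
  $acquired = false;
}
```

The `UNWATCH` in the else branch is the same command that shows up in the monitor logs later in this thread.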
Comment #14
pounard commented: More questions:
Comment #15
pounard commented: Your key symptom (the GET coming again and again) can only happen in the lockWait() method. There is no other possible way to reproduce this exact command so many times. I think that at some point in your environment, locks are not released as they should be, and another thread kicks in and ends up in an infinite waiting loop in lockWait(). This can happen only if you acquire a lock for an amount of time equal to or greater than the PHP execution time limit, the only case where it could end on a fatal error without running the shutdown handler that is supposed to release all locks. This would make every other concurrent PHP process find an already-acquired lock and run lockWait() for the maximum time, i.e. 30 seconds.
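The repeated GETs with steadily widening intervals in the monitor output at the top of this issue match a poll-and-back-off wait loop. A minimal sketch of that general pattern (function name and back-off constants are illustrative, not the module's actual code):

```php
<?php
// Poll-and-back-off wait: re-check the lock until it disappears or we give
// up. The growing sleep between polls explains the widening gaps between
// the GET commands seen in MONITOR. Illustrative sketch only.
function lock_wait(callable $lockExists, int $maxSeconds = 30): bool {
  $sleepUs  = 25000; // start polling every 25 ms
  $deadline = microtime(true) + $maxSeconds;
  while (microtime(true) < $deadline) {
    if (!$lockExists()) {  // one GET per iteration in a real backend
      return true;         // lock released, caller may proceed
    }
    usleep($sleepUs);
    // Back off geometrically, capped at 0.5 s between polls.
    $sleepUs = min((int) ($sleepUs * 1.25), 500000);
  }
  return false; // timed out: the 30-second spin described above
}
```

With a lock that is never released, this loop spins for the full timeout, which is exactly the flood of GETs the reporter captured.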
Comment #16
omega8cc commented: Just tested this again, with PhpRedis 2.2.2 this time; same problem. I will now try to downgrade Redis itself as well.
Comment #17
pounard commented: I don't think this is linked to PhpRedis or Redis; it is rather an application-level bug in the lock backend or Drupal core.
Comment #18
omega8cc commented: Nope, everything is the same with Redis 2.6.13 and PhpRedis 2.2.2 in my tests. I'm attaching the complete monitor output for two subsequent cache clears on a D7 site. Note that only the second cache clear produces the lines shown below. The interesting part is that the loop starts after the `UNWATCH` command.
Comment #19
omega8cc commented: That said, the cache clear on D7 doesn't hang completely; it just takes much longer on the second clear.
With D6, however (using PhpRedis 2.2.2 and Redis 2.6.13), the first cache clear is as slow as the second one in D7 but doesn't hang completely, whereas the second clear hangs until it hits the PHP timeout.
Comment #20
pounard commented: Do you have any AJAX requests running on your site, or any other kind of concurrent HTTP requests, while you clear your caches? Does the first cache clear finish cleanly, with no WSOD or anything of the sort? Do you have any errors or warnings in either the Drupal watchdog or the error log?
Comment #21
pounard commented: I definitely cannot reproduce it. I need answers to the questions I asked in #14.
If I recall correctly, I experienced it once or twice at home on another box with a different Linux distro and a buggy PhpRedis; I will try on that one, maybe tonight.
Comment #22
omega8cc commented: Here is where it hangs on the D6 site:
It hangs like this not only on cache clear, but also on submit in the modules admin section.
It is Debian Squeeze with vanilla Drupal core and only the Admin module/theme active. Maybe that is the culprit? It would be the only thing that comes to my mind, but it is only on the D6 site, while D7 uses the default Seven theme and its second cache clear is still always sloooow (though it doesn't hang completely).
[EDIT] Nope, it is the same in D6 with Garland in the admin.
> What is your PHP time limit?
max_execution_time = 300
max_input_time = 300
> What is your nginx CGI gateway timeout?
fastcgi_connect_timeout 60;
fastcgi_send_timeout 300;
fastcgi_read_timeout 300;
> Do you have relevant error messages in error.log?
There are no errors reported in the PHP-FPM backend.
> Do you have relevant error messages in Drupal watchdog?
There is nothing in the dblog/watchdog because the action never completes and just hits timeout.
Comment #23
omega8cc commented: In a few subsequent tests, the Redis monitor freezes here on a D6 cache clear (this is with default modes, not modified):
Not sure if this hints at anything helpful :/
Comment #24
pounard commented: They are helpful; in the worst case they help me know where the problem is not :) It should not freeze anywhere other than where it plays with locks. The #22 log is a lock release; without the return code of the EXEC command I cannot say whether it succeeded. The #23 log starts with a lock acquire, which seems successful, I guess, because even without any return value the following command is a menu links cache clear, so the menu rebuild actually happens. It sounds weird to have a bug anywhere else. So, if your site really freezes in the second log when the KEYS command is invoked, I guess it could be due to too large a packet being sent or received. I must explore this path too.
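One path worth noting about the halt at KEYS: Redis executes KEYS as a single O(N) pass that blocks the server while it runs, so a wildcard clear over a large bin stalls every other client for the duration. A non-blocking alternative is the incremental SCAN command; a hedged sketch (assuming Redis >= 2.8 and a PhpRedis build with SCAN support -- newer than the versions tested in this thread -- with an example key pattern, not the module's code):

```php
<?php
// Incremental wildcard deletion with SCAN instead of the blocking KEYS.
// Requires a running Redis server; the pattern below is an example.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY); // retry empty pages

$it = NULL;
while (($keys = $redis->scan($it, 'example_:cache_menu:*')) !== false) {
  if ($keys) {
    $redis->del($keys); // delete each small batch as it is found
  }
}
```

Each SCAN call only inspects a small slice of the keyspace, so other clients (including lock polling) keep making progress between batches.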
Comment #25
omega8cc commented: The weird part is that I can reproduce the same issue every time, in both D7 and D6, with various Redis and PhpRedis versions, and with essentially empty sites (a single node added, zero contrib modules, etc.).
Comment #26
omega8cc commented: I guess I should also include the Redis config used (mostly default, but why not), for the complete picture. The value for `maxmemory` is a placeholder, automatically tuned depending on the RAM available on the system.
Comment #27
omega8cc commented: OK, once again, here is what the Redis monitor displays and where it stays "frozen" while the site goes into its "spinning" mode until it hits the timeout; nothing more is displayed in the monitor after the last line with "KEYS" "d6.o1.v174q.nyc.host8.biz_:cache_menu:*", as shown below:
Meanwhile, the verbose log enabled in Redis shows only that there is an active connection and nothing else, until the last line says: "Closing idle client".
You can see from this log how long it takes until Redis gives up, while Nginx displays 504 Gateway Time-out because PHP-FPM timed out.
And this happens on every attempt to clear all caches in the Pressflow 6.28 site.
Comment #28
omega8cc commented: Now I have tried `drush cc all -d` on that D6 site a few times, and on the command line it performs exactly as it does via the browser (PHP-FPM) in D7 -- the first clear is fast with no issues, the second is slower with the same flood of lines as shown earlier -- but it doesn't hang!
This seems to prove that there is something specific to PHP-FPM plus D6, since only this combination produces the full "freeze" syndrome, while PHP-CLI for D6 and D7, and PHP-FPM for D7, always give a normal first clear and a slow second one, with the same flood of `lock:menu_rebuild` lines.
Comment #29
omega8cc commented: No idea what happened. I have re-installed Redis and PhpRedis from HEAD/master:
and used the same config, with custom modes added only in D6 (which may not play any role, though). [EDIT] Just turned them off and everything still just works!
And it no longer happens! I can click the clear all caches button many times in a row in D6 and D7 and it is always instant!
Go figure!
I'm going to close this giant ticket for now, until someone can reproduce it (hopefully not!).
Thank you for your assistance!
Comment #30
omega8cc commented: Well, it happens again after some time. I suspect it is something at the PHP level then, which is weird, but restarting redis-server and then php-fpm helps immediately. I will continue to debug this, but at this stage I believe it is not related to Redis, PhpRedis, or the integration module -- rather something on the php-fpm side (and possibly also php-cli, but for different reasons), e.g. the Redis cache "poisoned" by php-fpm, maybe opcode-cache related, etc. I will report anything I find to help others debug and avoid this major WTF experience.
Comment #31
omega8cc commented: It looks like these partially random issues were caused by having `tcp-keepalive` enabled in Redis. After disabling it, we couldn't reproduce the problem any longer.
Comment #32
omega8cc commented: Title updated to include the real condition behind this mess we have experienced.
Comment #33
pounard commented: Wow, that's very helpful! Thank you very much. I'll do some tests over the weekend and update the documentation accordingly. I'll try to see whether it's possible to patch the module so we can avoid asking people to change their Redis server configuration.
Comment #34
omega8cc commented: There must be something else/more, since now, after a few hours of not touching anything, so still with `tcp-keepalive` disabled in Redis, all those symptoms are back in full glory. Restarting php-fpm has no effect, so it is also not related to anything in the opcode cache. Only restarting Redis helps. I guess there are some other settings to modify, but we will start with the default, vanilla Redis config to see whether this is specific to our (almost default) config.
Comment #35
omega8cc commented: Correction: restarting redis-server is *not* enough. One must also delete the Redis dump.rdb file. This obviously suggests that the problem is caused by some mess stored in dump.rdb -- the question is what causes this mess and why. We tested redis-server restarts while gradually reverting redis.conf to the vanilla version and nothing helped. Only deleting dump.rdb helped.
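For reference, the two redis.conf settings being toggled throughout these tests are RDB snapshotting (`save`) and TCP keepalive. A cache-only sketch disabling both might look like this (an illustration, not the actual config used in this thread):

```conf
# redis.conf -- cache-only sketch: no persistence, no TCP keepalive probes
save ""           # disable RDB snapshots (equivalent to removing all "save" lines)
tcp-keepalive 0   # 0 disables SO_KEEPALIVE probes on client connections
```

With snapshots disabled, nothing ever lands in dump.rdb, so there is no stale state to delete between restarts.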
Comment #36
omega8cc commented: We have disabled all `save *` lines in redis.conf to use RAM only, and we will see what happens during the next few hours of testing.
Comment #37
omega8cc commented: Re-opening while we still debug this to determine WTF is happening here. After thinking a bit more, I suspect that having Redis persistence enabled (the Redis default), so that everything gets dumped from memory to dump.rdb, may silently affect basically all cache bins in various bad ways, because cache bins by definition should not use any kind of "permanent" storage with its own TTL management that is not directly controlled by the integration module or Drupal. It may be just the lock feature that manifests this problem in a spectacular way.
Comment #38
pounard commented: Hum... Me.. Needs.. Thinking...
These kinds of problems I deeply hate.
Comment #39
omega8cc commented: Further cross-testing revealed that with persistence disabled but `tcp-keepalive` enabled, the problem reappears after a fresh Redis restart. So it looks like both `tcp-keepalive` must be disabled (set to `0`) and persistence must be disabled (by commenting out all `save *` lines). We will report back if further testing proves this correct.
Comment #40
pounard commented: Thank you very much once again for all this. I am afraid the way the Redis module uses Redis is not good at all. I should go and ask questions on Stack Overflow and other forums about this. I have no time for it right now; I hope I will next week.
Comment #41
pounard commented: From what I read, I'm looking at the Redis side for a trail to find out what is happening.
Just putting in this post some random URLs I found, for later reading:
Comment #42
omega8cc commented: After a few more hours there are still absolutely no issues with both persistence (`save`) and `tcp-keepalive` disabled.
[EDIT] I think it is pretty understandable that when Redis is used as a backend for caching only, any settings that could interfere (and both persistence and `tcp-keepalive` will interfere) must be turned off. I can't imagine a workaround or fix for this requirement, since Drupal must control this backend 100% via the integration module. But obviously I don't know Redis internals well enough to turn this opinion into a fact :)
Comment #43
pounard commented: They interfere for sure, but they should not create such a freeze. This is something very weird.
I will document that as a first step towards fixing it, but I'll also check whether my locking algorithm has potential race conditions, which is very plausible.
The URLs I linked above, which I will start reading when I have dedicated time for it at work, contain good trails for looking into this.
Comment #44
omega8cc commented: Thank you for being so responsive, it is greatly appreciated! We at the Barracuda team look forward to continuing to contribute, in a small but hopefully helpful way, to this great project!
Comment #45
pounard commented: I have to admit that I more or less left the project in an abandoned state, since I didn't get many contributions (not forgetting those I did get from various people). But knowing this module is being used in real-life use cases encourages me to keep maintaining it! So I have to thank you for being so responsive too.
Comment #46
omega8cc commented: Looks like there are still more hidden issues. After several hours without any of the previous symptoms, the cache clear still works instantly on the D7 site on every click, but the D6 site is back to its weird "freeze on cache clear" state -- on every click, even with RAM-only storage and no `tcp-keepalive`. Only a redis-server restart helps. Uh-oh :/
Comment #47
omega8cc commented: Further tests confirmed (so far) that the non-standard mode settings shown below, enabled for the D6 site, could be a reason for the recurring problems in D6, even with both persistence and `tcp-keepalive` disabled:
Initially we tried these more aggressive modes in D6 with the intention of helping D6 *avoid* the "frozen" syndrome. But after confirming in a few hours-long tests that both persistence and `tcp-keepalive` must be disabled or the problem hits both D7 and D6 again, we did not disable these custom modes for D6, and, as reported above, D6 experienced the "frozen" syndrome again.
But now, after several more hours of further testing with both persistence and `tcp-keepalive` disabled and all modes left at their defaults (i.e. with the lines shown above removed), it no longer happens, at least not in the last 6 hours.
Comment #48
omega8cc commented: OK, so it didn't happen again on the D6 site only because the only URI visited was admin/settings/performance, to hit the cache clear button. Once we tried clicking around, editing a node, and then returning to admin/settings/performance to hit the button again, it froze again, while no custom modes were active and both persistence and `tcp-keepalive` were disabled.
What is worse, we tried the same steps with D7 (instead of just repeatedly hitting the cache clear button) and it experienced the same issue again, with the same flood of `lock:menu_rebuild` lines as before, while both persistence and `tcp-keepalive` were disabled.
Comment #49
pounard commented: This is a very weird issue; I have to reproduce it. I do some WATCH/MULTI/EXEC transactions in the cache set operation; maybe this is related.
Comment #50
pounard commented: OK, I reproduced it on my personal box. The symptom is there:
This code programmatically runs the first bit of the locking transaction:
What happens there is that this line:
returns the Redis instance instead of the value. This should never happen unless a pipeline or a MULTI/EXEC block has already been started, which makes me think that at some point a race condition happens and a MULTI or a pipeline is never closed.
I guess this is my fault; I have to find the broken piece of code now.
Comment #51
pounard commented: I'm not sure, but PHP, PhpRedis, and the use I made of them could all together be the culprit.
Let me explain. I often write:
This piece of algorithm is not viable, because `$id` is a string and `$client->get()` can return various types of output. First, it can return an empty string, in which case I don't get into the if statement because `"" != "some other string"`: this is wrong, because when there is no lock set I am therefore the legitimate owner.
Secondly, if the return value is not a string, there might be unexpected behavior due to PHP's implicit type casting; I should have used `===` or `!==` instead of `==` or `!=`.
I replaced those occurrences everywhere using:
And the bug disappeared from my machine. I don't know why; I did not find the real culprit here, but it just disappeared.
After that I also reworked the PhpRedis transaction handling, added more conditions where the UNWATCH command would be run, and removed some useless DISCARD calls (I don't know if this has any effect, but hey, you never know).
I'm not really sure what happens there. The bug just disappeared from my box. Worse even, when I switched back to the old code version, it did not reappear (WTF?).
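The loose-comparison hazard described in this comment can be reproduced in isolation. A small demonstration (the owner token is an example value in the format seen in the monitor logs; no Redis server involved):

```php
<?php
// Why '==' is dangerous when a Redis GET can return false (missing key),
// '' (empty value), or a string token. Values here are illustrative.
$owner = '429959810528796f7a47ef4.44059211';

// PhpRedis get() returns false when the key does not exist; we model
// that case with a literal here.
$value = false;

var_dump($value == $owner);   // false: false == non-empty string
var_dump(false == '');        // true: loose rules conflate the two cases
var_dump('' == '0');          // false: loose rules are also inconsistent
var_dump($value === $owner);  // false: strict compare is type-safe
var_dump(false === '');       // false: strict compare keeps them distinct
```

The `false == ''` case is the trap: "no lock key" and "empty lock value" become indistinguishable under `==`, which is exactly the kind of ambiguity `===` removes.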
Comment #52
pounard commented: Could you please test this patch?
Comment #53
omega8cc commented: Thanks, I will test this patch and report back. This issue is not easy to reproduce, and we have already been confused a few times, as documented above. You need to start with a clean Redis restart, then do some things like editing a node and visiting a few admin pages, and only then click the cache clear button a few times (on D7) or just once (on D6) while running the Redis monitor. Let's see how it goes!
Comment #54
omega8cc commented: The patch from #52 made no difference. Here are the exact steps to reproduce the freeze in D6 and the slowdown on the second clear in D7:
1. Restart redis-server and enter its monitor.
2. Visit the site as admin, edit some node.
3. Go to the performance page and hit "Clear cached data" in D6 or "Clear all caches" in D7.
You should get an instant freeze in D6, and the Redis monitor will halt at this line:
"KEYS" "prefix-foo:cache_menu:*"
In D7 the first hit on the "Clear all caches" button will work fine, but the second will kind-of-freeze while the monitor is flooded with lines:
Comment #55
omega8cc commented: The suspicious thing is that in both D6 and D7 it seems to be related to the `cache_menu` bin.
omega8cc commented: Well, I went ahead and excluded the cache_menu bin with:
And guess what? I can't reproduce the freeze any longer! Even with the latest patch reverted.
Wow. Now I understand even less than before.
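For context, excluding a bin in D7 usually means a per-bin cache class override in settings.php. A hypothetical sketch (the exact snippet used above was not quoted; paths and class names follow the 7.x Redis module's documented setup) that keeps cache_menu on the core database backend:

```php
// settings.php -- route caches to Redis but keep cache_menu on the DB backend.
$conf['cache_backends'][]    = 'sites/all/modules/redis/redis.autoload.inc';
$conf['cache_default_class'] = 'Redis_Cache';
// Hypothetical exclusion: leave the problematic bin on the database cache.
$conf['cache_class_cache_menu'] = 'DrupalDatabaseCache';
```

Per-bin `cache_class_<bin>` overrides always win over `cache_default_class`, so only the listed bin falls back to the database.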
Comment #57
omega8cc commented: Now I went even further and enabled `tcp-keepalive 60` -- and still no freeze syndrome with `cache_menu` excluded.
Comment #58
omega8cc commented: OK, why not re-enable the (default) persistence mode in Redis as the next step? Guess what? It made no difference; still no sign of the freeze syndrome or any slowdown. Everything just works if I exclude `cache_menu`.
I guess this bin needs some special attention? I don't know why, but the results are pretty clear.
Comment #59
pounard commented: I can see that. It's pretty weird; maybe it's due to the massive number of keys being stored, with extremely long names -- maybe at some point it tickles Redis' limits a bit. I don't know; I'll try and see.
Comment #60
omega8cc commented: OK, so I have applied the patch from #52 again, and we will see how it goes with `cache_menu` excluded over the next few hours.
Comment #61
omega8cc commented: Also note that all those tests ran on a vanilla site with no contrib, so the same syndrome related to `cache_menu` could theoretically affect other bins associated with contrib modules.
Comment #62
pounard commented: Hmm, maybe some chunks sent to Redis are too big; could you try this one, with cache_menu back in Redis?
Comment #63
pounard commented: Forget about my last patch; I don't see why this could hurt when dealing with only 2 or 3 menu entries...
Comment #64
omega8cc commented: Right, so on a more complex D6 site (the ManagingNews distro) it freezes on another bin:
"KEYS" "mn.o1.v174q.nyc.host8.biz_:cache_views:*"
I have also tested this with a more complex D7 site (Open Atrium 2.0), but it didn't halt on any bin while `cache_menu` is excluded.
Comment #65
omega8cc commented: Yeah, the last patch didn't make any difference.
Hmm, should I still see lines like this one (I mean the `lock:menu_rebuild` part):
+1384617719.718692 [0 127.0.0.1:33204] "SETEX" "mn.o1.v174q.nyc.host8.biz_:lock:menu_rebuild" "30" "429959810528796f7a47ef4.44059211"
after I have excluded the `cache_menu` bin from the Redis backend?
[EDIT] Locks are separate from bins, but still..
Comment #66
pounard commented: This is a lock, not a cache value. The value you see at the end is the unique identifier for the page run, which allows the lock to act as a mutex (the $owner variable). It is normal that you have this line.
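The owner-token scheme explained here can be sketched abstractly: release may only succeed for the client that acquired the lock. In the sketch below, a plain PHP array stands in for Redis, and the function names are hypothetical (in the real backend, acquisition would be a SETEX/SET NX EX with a TTL):

```php
<?php
// Owner-token mutex sketch: the stored token identifies the acquirer.
// $store is an in-memory stand-in for Redis, for illustration only.
function lock_acquire(array &$store, string $name, string $owner): bool {
  if (isset($store[$name])) {
    // Re-acquiring our own lock is allowed; anyone else is refused.
    return $store[$name] === $owner; // strict compare, as per comment #51
  }
  $store[$name] = $owner; // in Redis: SETEX with a TTL so crashes self-heal
  return true;
}

function lock_release(array &$store, string $name, string $owner): bool {
  if (($store[$name] ?? null) === $owner) {
    unset($store[$name]);
    return true;
  }
  return false; // another page run owns it; leave it alone
}
```

The token is what prevents one page run from releasing (or stealing) a lock held by another, which is the mutex property being discussed.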
Comment #67
omega8cc commented: To avoid the freeze syndrome on a D6 site (ManagingNews) I had to exclude three bins:
This doesn't look good.
Comment #68
omega8cc commented: Just tested this again, and both the D6-based ManagingNews and OpenAtrium 1 sites require all 3 bins listed above to be excluded to avoid the freeze syndrome, while the D7-based OpenAtrium 2 needs only `cache_menu` excluded. Weird, weird..
Comment #69
pounard commented: Says Yoda: weird it is, yes.
Comment #70
pounard commented: Don't worry, I did not forget this issue; I really don't have time for it right now. I'll continue as soon as I can.
Comment #71
omega8cc commented: No problem, I understand.
Comment #72
j0rd commented: I have a Redis install (PhpRedis) with around 250k keys after pre-caching what I needed. `drush cc all` seems to take roughly two minutes, where on memcache it took a couple of seconds. So I read over this thread; I don't believe my issue is related.
--
I've run into problems on D6 with the lock-related logic in the code that rebuilds menus.
I've found that Drupal code which uses locks doesn't always act properly on the return values, which can lead to tramples / race conditions.
Here are my old findings with regard to cache_menu rebuilds. Not sure if it ever got fixed, or whether it is even related to the problems you're having.
https://drupal.org/comment/6982994#comment-6982994
There are also race conditions in image_style file creation with regard to database locks. Again, not sure whether they've been fixed either.
https://drupal.org/comment/7367644#comment-7367644
So while your lock code might work just fine, the Drupal code which uses your locks has been implemented poorly and could be causing its own race conditions or tramples.
Not sure if this helps at all though.
---
PS: I'm in here because my `drush cc all` runs don't seem to clear caches like they did with memcache, and I'm looking into stale cache issues. No answers here, but hopefully you guys will figure this one out. I'll let you know if I run into this problem.
Comment #73
omega8cc commented: Well, I just tested this by applying and reverting this simple patch back and forth probably more than 30 times, and I can confirm that it fixes the problem in both D6 and D7. Wow!
Comment #74
pounard commented: @#73 wow, will look into that ASAP.
Comment #75
pounard commented: Fixed by mja in #2140897: cache_clear_all() not properly handled in PhpRedis.php.
See commit http://drupalcode.org/project/redis.git/commit/810cce5
Happily closing this one now; it has been a great deal of lost hair.
Comment #76
pounard commented: Release 7.x-2.6 is 'en route' and coming soon, containing this bugfix.
Comment #77
omega8cc commented: OK, but please note that I have also tested everything with the patch from #52, which is not included in 7.x-2.6. So, to document the exact code I have tested, attached is a diff between 7.x-2.6 and my test branch.
Comment #78
pounard commented: OK, so I will review this patch and take whatever needs to be taken.