Potential race condition on variable_set() due to merge query implementation [#910010]

I believe I watched this happen this morning, although don't have a reproducible test case yet.

process 1 - does a variable_set(), clears the cache
process 2 - does another variable_set() with a different variable
process 3 - gets a cache miss from the variables cache, the SELECT * is blocked on the SELECT ... FOR UPDATE  from process 2
process 4 - does another variable_set() - continues to block process 2.

This is excerbated by the fact that process 3 can also acquire a lock before hitting the SELECT * - which means any further requests which get a variable cache miss are waiting on that one process before they can continue executing. If you have something else going on which is causing UPDATE queries to take a long time, and a few are issued at once, this situation could last for some time.

I don't see any reason why we have to use SELECT ... FOR UPDATE here instead of SELECT ... LOCK IN SHARE MODE, the latter issues exactly the same write lock on the row, without blocking reads, so in this case process 2's SELECT * would issue immediately.

I can only see one reference to IN SHARE MODE on the original issue that introduced this - http://drupal.org/node/715108#comment-3093918 and that wasn't discussed further. Attaching a patch which does that for MySQL here.

Comment	File	Size	Author
#2	merge_in_share_mode.patch	4.11 KB	catch
#2
	merge_in_share_mode.patch	4.11 KB	catch

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Crell CreditAttribution: Crell commented 13 September 2010 at 15:11

You know, I'm really starting to hate merge queries...

Is this an issue on anything other than MySQL? Does SHARED MODE exist on other DBs?

Also, the first docblock has a whitespace error on the first line. :-)

Comment #2

catch

he/him

English

CreditAttribution: catch commented 13 September 2010 at 16:03

File	Size
merge_in_share_mode.patch	4.11 KB

postgresql has LOCK TABLE foo IN SHARE MODE - which is a write-only lock, but full table level. I couldn't find a SELECT .. FOR UPDATE in postgres but didn't look too hard.

A full table lock would actually be better than the current situation though - since a full table lock blocks both writes and reads, eventually letting a SELECT query through. In this case, you can have potentially infinite locking write queries (if issued against different or alternating variables), while reads are permanently blocked - since SELECT * FROM {variables}; is as blocked by a row level lock as it is by a table level one, but SELECT 1 FROM {variables} WHERE name = 'some_var'; won't be blocked unless it's the exact same variable each time.

Also, requests doing a lock_wait() will eventually get to PHP timeout, however PHP requests waiting on a SELECT .. FOR UPDATE are in mysql time, so they likely won't hit the PHP memory limit, might hit apache timeout though. The only thing that would stop this going on indefinitely is that to get to the point of issuing a variable_set(), the request has to get past variable_initialize(), so eventually the threads doing a variable_set() will finish, releasing the lock, and let a SELECT get through.

Variables may be the only table in core that does this, and also the vast majority of variable_set() will be updates (think the cache flush variables from cache.inc or css aggregation maps). So another option may be be to special case variable_set() to do UPDATE then INSERT and not use db_merge().

Cleaned up the whitespace.

Comment #3

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 September 2010 at 16:09

We decided to remove the lock completely in another issue.

Also, locks should be released as soon as the transaction is committed. So I guess you have been bitten by our poor-man transaction nesting implementation.

Comment #4

catch

he/him

English

CreditAttribution: catch commented 13 September 2010 at 16:11

OK postgres details here http://developer.postgresql.org/pgdocs/postgres/explicit-locking.html

SELECT FOR UPDATE appears to not block reads in postgresql. It also has SELECT FOR SHARE but I don't think that's relevant here. So this might be exclusive to MySQL.

Comment #5

Crell CreditAttribution: Crell commented 16 September 2010 at 18:43

Damien: Are you saying that #2 won't work? I'm not clear on the intricacies here...

Comment #6

catch

he/him

English

CreditAttribution: catch commented 17 September 2010 at 03:01

Priority:

Normal

» Major

Where's the issue that removes the lock?

Also I'm not sure if my original example is clear, but with the current implementation multiple locks can be claimed on the variables table at any one time via overlapping variable_set()

SELECT 1 FROM variables WHERE name = 'foo' FOR UPDATE;

does not block

SELECT 1 FROM {variables} WHERE name = 'bar' FOR UPDATE;

but both can block

SELECT * FROM variables;

That means you could have a potentially infinite combination of the first two, without the SELECT * ever getting to run. I'm bumping this to major, because although it's a rare edge condition, it can lead to an outage (or in the case I saw this, I think led to an outage being worse than it was otherwise while MySQL was having i/o issues). If there's an issue removing the lock entirely, and that patch is more viable than the one here and is likely to go in, then I'm fine with this being marked duplicate, but I'd like to read the other issue before we do that and copy some of the stuff from here over there.

Comment #7

Crell CreditAttribution: Crell commented 23 September 2010 at 22:36

Let's revisit this after #844186: Clarify merge queries lands, as it's a total rewrite of merge queries. :-)

Comment #8

catch

he/him

English

CreditAttribution: catch commented 24 September 2010 at 02:30

Status:

Needs review

» Needs work

That issue is in, and is still using SELECT .. FOR UPDATE. However the patch obviously won't apply now.

Comment #9

Crell CreditAttribution: Crell commented 13 November 2010 at 07:22

Priority:

Major

» Normal

Talking with webchick, this is enough of an edge case that it's not a release blocker. There are potential fixes that are not API breaks, either, so it could be fixed post-7.0. Therefore marking down to normal.

Comment #10

catch

he/him

English

CreditAttribution: catch commented 13 November 2010 at 11:22

Priority:

Normal

» Major

Major issues by definition can be fixed post 7.0. But things which take your site down are still major IMO.

Comment #11

Anonymous (not verified) CreditAttribution: Anonymous commented 20 November 2010 at 00:57

maybe i'm just doing it wrong, but on a stock mysql 5.1 on ubuntu 10.10, i can't reproduce the read-lock.

if a cache miss in variable_initialize() was really blocked by the merge query implementation, then shouldn't i be able to take a site down by doing this, and just leaving it uncommitted:

mysql> TRUNCATE TABLE cache_bootstrap;
mysql> BEGIN;
mysql> SELECT 1 FROM variable WHERE name = 'site_name' FOR UPDATE;

but this doesn't hold up the next request - the select in variable_initialize() returns straight away. catch, what tx_isolation value do you have?

i have NFI how this translates to postgres.

Comment #12

catch

he/him

English

CreditAttribution: catch commented 15 January 2011 at 16:12

Priority:	Major	» Normal
Status:	Needs work	» Postponed (maintainer needs more info)

Hmm, interesting. I need to get back to this, but for now I don't think this is major, and I have other variable patches around that would make this irrelevant.

Comment #13

catch

he/him

English

CreditAttribution: catch commented 18 April 2011 at 14:11

Version:	7.x-dev	» 8.x-dev
Priority:	Normal	» Major
Status:	Postponed (maintainer needs more info)	» Active

http://www.engineyard.com/blog/2011/3-common-rails-mysql-mistakes/

"2. Using SELECT FOR UPDATE".

Putting this back in the active queue until we have conclusive data either way.

Comment #14

catch

he/him

English

CreditAttribution: catch commented 19 April 2011 at 06:58

At least on my localhost the following is true:

Ran these from the command line with two separate sessions:

Session 1:
MariaDB [d7]> BEGIN;
Query OK, 0 rows affected (0.00 sec)

MariaDB [d7]> SELECT * FROM variable WHERE name = 'site_name' FOR UPDATE;
+-----------+----------------------+
| name      | value                |
+-----------+----------------------+
| site_name | s:12:"Site-Install"; |
+-----------+----------------------+
1 row in set (0.00 sec)


Session 2:
MariaDB [d7]> SELECT value FROM variable WHERE name = 'site_name';
+----------------------+
| value                |
+----------------------+
| s:12:"Site-Install"; |
+----------------------+
1 row in set (0.00 sec)

(returns immediately)

However:

Session 1:
MariaDB [d7]> BEGIN;
Query OK, 0 rows affected (0.00 sec)

MariaDB [d7]> SELECT * FROM variable WHERE name = 'site_name' FOR UPDATE;
+-----------+----------------------+
| name      | value                |
+-----------+----------------------+
| site_name | s:12:"Site-Install"; |
+-----------+----------------------+
1 row in set (0.00 sec)

Session 2:

MariaDB [d7]> SELECT value FROM variable WHERE name = 'site_name' FOR UPDATE;

Session 1:
COMMIT;

Session 2:
+----------------------+
| value                |
+----------------------+
| s:12:"Site-Install"; |
+----------------------+
1 row in set (12.08 sec)

The SELECT .. FOR UPDATE is locked until transaction commit.

Looking a bit further, this appears to depend on the transaction isolation type - http://dev.mysql.com/doc/refman/5.0/en/set-transaction.html#isolevel_rea...

I don't have time to test all combinations of this today, but it looks like while this may block a straight SELECT with one of these configurations, you'd have to set that on purpose to trigger this.

Note a similar issue came up with the history table in the memcache queue #1080882: Pages served by MemCacheDrupal hold transactions open longer than the same pages served by DrupalDatabaseCache.

Comment #15

catch

he/him

English

CreditAttribution: catch commented 1 June 2011 at 06:01

Priority:	Major	» Normal
Status:	Active	» Postponed (maintainer needs more info)

Moving to normal/needs more info, not sure there's anything much to do here now at least until there are better steps to reproduce.

Comment #16

Damien Tournoud CreditAttribution: Damien Tournoud commented 28 June 2011 at 18:48

Status:

Postponed (maintainer needs more info)

» Closed (cannot reproduce)

In the standard isolation mode, SELECT FOR UPDATE clearly doesn't block SELECT. So let's close this (feel free to reopen if you bump into this again).