Implement a locking framework for long operations

Damien Tournoud - April 27, 2008 - 11:30
Project:Drupal
Version:6.x-dev
Component:base system
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs review
Issue tags:drupal.org upgrade
Description

Drupal currently face several concurrency issues (see #249185 for variable caching, #238760 for menu rebuilding, ...). All relatively long operations (cache building, ...) are potentially affected by two different types of issues:

  1. database inconsistency: when the operation rebuilds a whole table, using the DELETE-then-INSERT pattern, requests served during the operation will see an inconsistent table (to demonstrate the issue, simply reload twice the module page in Drupal 6.x/7.x);
  2. race conditions: several operations can run at the same time, thus leading to unneeded duplication of effort.

The first issue can be solved by reducing the window of inconsistency (that's what is suggested in #238760 for session_write(), or in #230029 for node_save()), or by using transactions (that's one of the point of the new database layer).

The second issue cannot be solved by simply using transactions. You need some sort of locking (or semaphores), as discussed in #248739 and #238760. Without locking, these long operations can lead to performance hits (because several instances of the operation are run in parallel), and to space hits (because the result of the operation is saved several times in the database, thus in its binary log). About this last point, one of the busiest Drupal site in France (France 24) was forced to disabled locale caching because its invalidation lead to several hundred megabytes of binary log data.

The attached patch implements a pluggable soft locking framework for Drupal. Two points worth noting:

  • Two implementations are available: an APC-based one, which is fast but implies that there is an unique web node; a slower database-based one, that should work across a cluster of nodes. High-performances implementations for a cluster could be made, for example based on Memcached or on a multicast bus.
  • All long standing operations should be locked: the patch only demonstrates how to do this for menu_rebuild() and locale() cache refresh. Several locking strategies can be considered: acquire the lock and give up the operation if it fails (that's the implemented strategy for locale cache refresh), acquire the lock and wait for the other operation to complete (menu_rebuild()), or acquire the lock and retry acquiring until success.
AttachmentSizeStatusTest resultOperations
core-lock-patch-0.patch11.02 KBIdleUnable to apply patch core-lock-patch-0.patchView details | Re-test

#1

Damien Tournoud - April 27, 2008 - 12:51

As a side note: the APC-based implementation requires at least APC 3.0.17, because of a bug of apc_add() (bug #12499).

#2

kbahey - April 27, 2008 - 21:00
Status:needs review» active

We can't have a dependency in core on APC. Not all PHP installations have it, so we can't rely on that.

So, this leaves the db implementation as the candidate for core.

This uses the variable table as a lock. I don't like this idea. Creating a new table for that is better than overloading the variable table doubling as a lock mechanism. And if you make this table cache_lock, which inherits the schema from "cache", it would automatically be eligible for use in APC, memcached, and XCache if you have the cache_router module and assign a bin for cache_lock.

#3

kbahey - April 27, 2008 - 21:01
Status:active» needs work

Setting this to CNW.

#4

lilou - August 30, 2008 - 19:06

Patch still applied.

#5

arhak - August 31, 2008 - 06:11

subscribing

#6

arhak - August 31, 2008 - 06:19

BTW broken link to "the new database layer" which is actually linking this issue #251792

#7

dmitrig01 - November 16, 2008 - 20:17
Status:needs work» needs review

@kbahey i think that various backends can be programmed including one that (as I understand) is already in there that don't depend on APC.

#8

BartVB - November 16, 2008 - 20:27
Status:needs review» needs work

Nice proposal!

Some remarks though: What happens if menu_rebuild() never finishes, for instance because Apache segfaults? When that happens the whole site will be down until the admin manually removes the lock. Could be solved by specifying a timeout with lock_acquire()

Is a global $locks variable necessary? Maybe use a class for this? Not sure what the policy on globals is in Drupal core.

What is $lock_id for?

Using cache_router or somesuch for this sounds like a great plan, using the DB for locking on a (very) busy site won't be fun :D

#9

Damien Tournoud - January 9, 2009 - 22:58

#251792: Implement a locking framework for long operations was a duplicate. As identified in that bug, those cache rebuilding operation can simply kill any moderate traffic Drupal 6 site. This is probably one of the biggest scalability issue we have now.

#10

slantview - January 9, 2009 - 23:02

While I agree in principal there is one thing I don't like:

--- includes/menu.inc 23 Apr 2008 20:01:47 -0000 1.269
+++ includes/menu.inc 27 Apr 2008 10:50:51 -0000
@@ -1636,6 +1636,11 @@ function menu_cache_clear_all() {
  * is different and leaves stale data in the menu tables.
  */
function menu_rebuild() {
+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for the other request to do our work
+    lock_wait('menu_rebuild');
+    return;
+  }

I don't think we should wait on a lock for menu_rebuild() just skip the rebuild since someone else is already handling it. The other thing is that I think doing a menu_cache_clear_all() before we even run the db operations is dumb. I think you should lock, rebuild from db, clear the cache and set it at the same time so at least other clients will get the stale menu and not try to rebuild every time.

#11

Damien Tournoud - January 9, 2009 - 23:08

@slantview: the idea was that during the menu rebuild the whole menu table is in an inconsistent state: something calling menu_rebuild() is also likely to query the menu table right after that (especially in menu administration pages). Plus, a menu_rebuild() should be a fairly quick operation (supposing that we don't run hundreds of them in parallel... of course).

I like your suggestion to clear the cache *after* the operation, and I think it's quite complementary: if a request has called menu_rebuild() explicitly, it should wait for the other rebuild to complete, but if a request has not called menu_rebuild(), it should use the old cached data.

#12

pwolanin - January 10, 2009 - 00:55

As much as it might look like an APi change, I think we really need to backport some version of the to 6.x to deal with several painful race conditions.

#13

pwolanin - January 12, 2009 - 15:24

This looks like a poor choice:

$lock_id = md5(uniqid());

Also, it seems a little wonky to have a bunch of extra global variables here - is there a way to avoid that?

#14

moshe weitzman - January 14, 2009 - 02:53

Very nice work. subscribe.

#15

slantview - January 14, 2009 - 07:15
Status:needs work» needs review

I worked on the patch a little and implemented for www.divx.com. I also am including a D7.x-dev patch and a patch for D6.8 since that is what we are running on our site in case we want to back port this.

I didn't like having an APC specific technology lock.inc, so I renamed lock-db.inc to lock.inc, and changed the get'ers and set'ers to cache_get and cache_set. Now we have a replaceable lock_inc variable so we can replace the locks with whatever we want (including nothing) if we are so inclined. The cache_get and cache_set allow us to use any backend technology that we want for storage for the locking system.

I added a new table into system.install $schema called "cache_lock" that allows us to use database storage by default, but easily replace it with something else.

We use my module "Cache Router" for storage on divx.com and simply replace the cache_lock bin with a memcache backed bin and we added an additional bin to the memcached servers.

All in all, I really believe that this is helping our site. We can now clear cache, save menus and have them rebuilt, etc even at peak traffic periods and there is little to no impact on performance. Previously we would have the site crawling if we did any of that stuff.

There is something else I should note too so this patch doesn't get all the praise for fixing our specific problems. We also had previously converted all the tables over to INNODB, and we converted the menu_router table back to MyISAM as well as all of the search tables. Everything else is running INNODB.

Best,

Steve Rude

AttachmentSizeStatusTest resultOperations
core-lock-patch-D6-01.patch5.93 KBIdleFailed: Failed to apply patch.View details | Re-test
core-lock-patch-D7-01.patch6.65 KBIdleFailed: Failed to install HEAD.View details | Re-test

#16

slantview - January 14, 2009 - 07:25

Aww crap, forgot the -N flag on diff.

Here is a re-roll including the lock.inc. Kinda important. :)

AttachmentSizeStatusTest resultOperations
core-lock-patch-D7-02.patch9.24 KBIdleFailed: Failed to apply patch.View details | Re-test
core-lock-patch-D6-02.patch8.49 KBIdleFailed: Failed to apply patch.View details | Re-test

#17

Damien Tournoud - January 14, 2009 - 10:06

We need a fix for D6, and I'm not convinced it will be the same for D7.

Your usage of cache_get/cache_set simply will work; but doesn't offer enough guarantee:

<?php
+    // try to acquire the lock
+    $lock = cache_get($name, 'cache_lock');
+    if (
$lock === FALSE) {
+     
$locks[$name] = TRUE;
+     
cache_set($name, $locks[$name], 'cache_lock', CACHE_TEMPORARY);
+      return
TRUE;
+    }
+    else {
+      return
FALSE;
+    }
?>

^ This is asynchronous. Under a load of several hundred requests per second split on several web nodes, who knows how many process can acquire this lock?

Compare to my DB implementation in the original patch:

<?php
+    // try to acquire the lock
+    if (@db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", 'lock:' . $name, $lock_id)) {
+     
$locks[$name] = TRUE;
+      return
TRUE;
+    }
+    else {
+      return
FALSE;
+    }
?>

^ This is perfectly atomic.

I fully agree that we need to link cache and lock (it simply makes sense). But, we need to add new primitives to cache.inc for this, that alternative cache implementation may not implement: this would be an API change.

So, would other agree to go for this imperfect cache_get()/cache_set() solution for D6 and implement something better in D7?

#18

Damien Tournoud - January 14, 2009 - 10:08

@pwolanin: as you have seen, this is a very old patch, before I got fully involved in core development. There are also lots of code style errors in there.

#19

Gábor Hojtsy - January 14, 2009 - 10:11

Damien says this will most probably be needed for the d.o upgrade so we can help performance and avoid some issues.

#20

Damien Tournoud - January 14, 2009 - 10:24

After a discussion with Gábor, we agreed that for D6 having a (pluggable) implementation defaulting to an (atomic) db lock would be better. I makes little sense to replace a race condition by another.

If you want to use memcache as your cache implementation, simply implement the (atomic) memcache lock implementation that match your needs.

#21

Damien Tournoud - January 14, 2009 - 10:33
Version:7.x-dev» 6.x-dev

The implementations for D6 and D7 will differ much. I'm dedicating this issue to D6, and I opened a new issue with a proposal for D7: #358663: Implement a locking framework and make caching pluggable.

#22

Damien Tournoud - January 14, 2009 - 10:37
Category:task» bug report

By the way, this one is a bug.

#23

Jeremy - January 28, 2009 - 22:08

Subscribing.

#24

Dries - January 28, 2009 - 22:10

We can solve #238760: menu_router table truncated and site goes down without using locks or API changes. Instead of flushing the menu cache in request A, and rebuilding it in request B, we can rebuild the menu cache in request A and replace the old menu -- I think.

#25

pwolanin - February 5, 2009 - 22:09

@Dries - I don't think you are correct - we are deleting and rebuilding the menu in the same request in most cases. The problem is a race condition. Since we are using a variable whose value is cached for the duration of the request.

We also have a period of time where one process may be trying to build a new menu, and another process sets the flag that a rebuild is needed since the router is missing.

We could read/update the variables table directly to help eliminate the cached variable race condition, but I'm not sure we can solve it cleanly wihout APi changes.

I have been seeing this error even on a no-load localhost, so I think it's pretty critical to fix.

#26

pwolanin - February 9, 2009 - 14:10

per: http://drupal.org/node/369548#comment-1245494

I think we could also help the menu rebuild issue some by not trying to cache the whole router as a serialized blob.

#27

pwolanin - February 10, 2009 - 01:17

Here's a start which attempts to do away with caching the big router blob. That alone should help reduce the race conditions by speeding the process.

AttachmentSizeStatusTest resultOperations
no-blob-251792-27.patch1.4 KBIdleFailed: Failed to apply patch.View details | Re-test

#28

pwolanin - February 10, 2009 - 01:18

whoops - that patch missed menu.inc

AttachmentSizeStatusTest resultOperations
no-blob-251792-28.patch5.85 KBIdleFailed: Failed to apply patch.View details | Re-test

#29

pwolanin - February 10, 2009 - 03:05

And a full db lock implementation - building off Damien's original, but simplifying it as much as possible.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-29.patch11.36 KBIdleFailed: Failed to apply patch.View details | Re-test

#30

pwolanin - February 10, 2009 - 03:27

ok, so to check the effect, I enabled devel module (plus minor fix to devel_menu_link_alter())

I allowed anonymous users to access this devel path which forces a menu rebuild:

ab -n100 -c5 http://127.0.0.1/drupal-6/devel/menu/reset

With the patch, I got just three 404 errors in the log, I assume from when a request came in and the router table was empty.

Without the patch, there are 87 pages of SQL errors in the dblog, from running the same ab. The errors have the form:

Duplicate entry 'taxonomy/autocomplete' for key 1 query: INSERT INTO menu_router (path, load_functions, to_arg_functions, access_callback, access_arguments, page_callback, page_arguments, fit, number_parts, tab_parent, tab_root, title, title_callback, title_arguments, type, block_callback, description, position, weight, file) VALUES ('taxonomy/autocomplete', '', '', 'user_access', 'a:1:{i:0;s:14:\"access content\";}', 'taxonomy_autocomplete', 'a:0:{}', 3, 2, '', 'taxonomy/autocomplete', 'Autocomplete taxonomy', 't', '', 4, '', '', '', 0, 'modules/taxonomy/taxonomy.pages.inc') in /www/drupal-6/includes/menu.inc on line 2388

#31

pwolanin - February 10, 2009 - 03:45

whoops - patch missed bootstrap.inc. Also fixing code comments, typos, etc.

Also, the removal of the router blob caching can certainly be in a separate patch, but I think reading/writing that much data is contributing to the race condition in menu rebuilds, and hence related to the central issue of concern here.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-31.patch12.01 KBIdleFailed: Failed to apply patch.View details | Re-test

#32

moshe weitzman - February 10, 2009 - 14:17

Looks great. I noticed a typo in a function name: local_break($name, $now);

#33

pwolanin - February 10, 2009 - 14:26

With that typo fixed.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-33.patch12.01 KBIdleFailed: Failed to apply patch.View details | Re-test

#34

pwolanin - February 10, 2009 - 14:53

added more doxygen.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-34.patch12.27 KBIdleFailed: Failed to apply patch.View details | Re-test

#35

Damien Tournoud - February 10, 2009 - 22:13

Two nitpicks:

<?php
+/**
+ *  Helper function. Get this request's unique id.
+ */
?>

and

<?php
+/**
+ * Helper function to store the menu router if we have it in memory..
+ */
?>

The lock part looks very good. I'm a little bit worried about changes to the menu rebuild mechanism, especially because some API changes are involved:

<?php
drupal_alter('menu_link', $item, $menu);
drupal_alter('menu_link', $item);
?>

#36

pwolanin - February 10, 2009 - 23:04

Yes - it's a minor API change, but I honestly have no idea why we are passing $menu into that hook.

The few contrib modules that implement it with both params do not use the $menu:

http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/i18n/i18nme...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/minutes/min...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/devel/devel...

So, we could avoid breaking them outright (see attached patch), but it's still a minor API change since $menu may now be NULL.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-36.patch12.23 KBIdleFailed: Failed to apply patch.View details | Re-test

#37

pwolanin - February 11, 2009 - 00:23

neclimdul points out to me: it's a stretch but you could want to lock, release and then lock again. the second lock would fail.

So, I think using a global is the only sensible thing here (back to something closer to Damien's original).

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-37.patch12.34 KBIdleFailed: Failed to apply patch.View details | Re-test

#38

bjaspan - February 11, 2009 - 01:11

I'm really glad to see this happening. I've needed lock capability in Drupal in the past.

Comments upon first reading the patch; note that I have not read any of the history in this issue, as no one else will have when they read the code after it is committed.

Locking semantics are always subtle and need to be documented. lock.inc should include a @defgroup with an overview and theory of operation. Among other things, state clearly that this is a cooperative, advisory lock system.

The answer to all my comments below may be "the patch is correct." If so, my comment becomes, "please add a comment explaining why."

<?php
+function lock_acquire($name, $timeout = 10.0) {
+  global
$locks;
+
+  if (isset(
$locks[$name])) {
+    return (bool)
db_result(db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id()));
?>

I think this says: If I previously acquired a lock, check the db anyway because someone else might have broken it. In that case, why wouldn't you also clear $locks[$name]?

<?php
+    return (bool)@db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());
?>

Are we guaranteed that db_query("UPDATE") returns rows modified?

In the unlikely case that $expire equals the previously set time, this will return false even though the lock is held for the requested amount of time.

<?php
+  if (!lock_acquire('menu_rebuild')) {
+   
// Wait for another request to do our work
+    lock_wait('menu_rebuild');
+    return;
+  }
+
?>

This looks odd. We just waited up to 20 seconds for the lock to clear. It might be available, but we aren't going to try to grab it and do the work? Doesn't our caller deserve to know that we failed? I see something in the phpdoc for menu_rebuild() that it may leave the operation for the next page request, but it says that happens only during update.php or install.php.

#39

pwolanin - February 11, 2009 - 02:30

First pass at addressing Barry's comments.

AttachmentSizeStatusTest resultOperations
lock-no-blob-251792-39.patch14.59 KBIdleFailed: Failed to apply patch.View details | Re-test

#40

slantview - February 13, 2009 - 22:20

I reviewed this patch and it looks mostly good to me. I would be willing to test this out on divx.com before it goes into core if we can get it to patch (reviewed and tested by community) state.

The only thing I saw was that 1 second seems like a very long amount of time to wait. I was thinking 100ms might be a little more realistic. Typically a 200ms page load time is what I consider "good" for most pages, however a 2-4 second page load time would be what I consider "unacceptable". I know these are preferences, but I would like to hear others input on this. Is it too much to ask for a usleep and microseconds instead of sleep(1). One second is quite a long time in web page render time. I know that this is a sensitive topic since too many checks can end up with many, many checks and be heavy on a processor, but 1 second could cause many, many 404s while this is rebuilding and that is almost worse in my mind. If you are doing 400 requests per second or more, this could end up with hundreds or thousands of 404s while rebuilding. This can be especially damaging if the request is done by a web page crawler (e.g. google) and now you have a valid page that is showing as 404 in a search engine.

The other thing is that I don't believe we should accept this patch without a test set that includes verifying that locking actually works for acquire, release, wait and timeout. Also that the lock cannot be taken by two systems simultaneously (obviously).

Thoughts?

#41

pwolanin - February 16, 2009 - 16:38

@slantview - note that this sleep() whoud not affect most page views - only when there is a blocking lock.

usleep() only works on Widows as of PHP 5.0.0, so we could considering using it for D7, but not D6 (which is the target of this patch).
http://www.php.net/manual/en/function.usleep.php

#42

hass - February 19, 2009 - 07:20

Subscribe

#43

slantview - February 19, 2009 - 20:09

@pwolanin - I do understand it would not affect most page views, but if that blocking lock happens regularly (e.g. menu rebuilding) than it could affect a lot more page views than you would want to have to sleep for 1 second while we wait for the rebuild. I don't think it's a show stopper, but could be filed under "nice to have".

#44

pwolanin - February 23, 2009 - 01:07

@slantview - the alternative (current broekn situation) is that all those page views ALSO try to execute a menu rebuild which is slow and costly and probably takes more than the 1 sec.

#45

slantview - February 23, 2009 - 20:57

@pwolanin - I understand the current situation, that is the one that broke our site for launch. I don't want to be a PITA, so I'll drop the issue, it may be something we want to test and adjust accordingly. I'll try to put in some time testing it and see if it really makes a difference.

As far as the windows issue, we could add something like if (function_exists('usleep')) { if we really thought there would be a gain on systems that supported it.

I'll do some testing and see if it actually matters.

#46

torbjorn - February 24, 2009 - 19:39

Subscribing.

#47

natrio - February 27, 2009 - 02:08

subscribing

#48

EgbertB - March 24, 2009 - 12:26

subscribing

#49

pwolanin - March 24, 2009 - 13:07

We should separate out the code of eliminate the caching of the menu router - that should be faster bug fix to get in and will be helpful regardless for performance.

#50

Damien Tournoud - March 24, 2009 - 23:49

<?php
+function lock_renew($name, $timeout = 20.0) {
+  if (
lock_acquire($name)) {
+    list(
$usec, $sec) = explode(' ', microtime());
+   
$expire = (float)$usec + (float)$sec + $timeout;
+   
db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());
+    return (bool)
db_affected_rows();
+  }
+  return
FALSE;
+}
?>

The lock_acquire() doesn't seem to be necessary here, because the UPDATE query is atomic.

<?php
+function lock_wait($name, $delay = 30) {
[...]
+     
lock_break($name, $now);
[...]
+}
+
+function
lock_break($name, $now) {
db_query("DELETE FROM {semaphore} WHERE name = '%s' AND expire < %f", $name, $now);
+}
?>

This is probably prone to race conditions. Why not simply get the value of the lock in lock_wait() and do a lock_break($name, $value)?

#51

beltofte - April 11, 2009 - 09:59

subscribe

#52

pwolanin - April 11, 2009 - 16:35
Status:needs review» needs work

#53

Starminder - April 14, 2009 - 17:35

subscribe

#54

ismaiel - April 15, 2009 - 10:34

subscribe

#55

Pedro Lozano - April 16, 2009 - 22:31

subscribe, I'll try to review.

#56

pwolanin - April 17, 2009 - 01:33

needs a re-roll to remove the menu router stuff, plus to take Damien's comments into consideration.

#57

SocialNicheGuru - April 17, 2009 - 12:54

subscribing

#58

pwolanin - April 18, 2009 - 13:42

Here's a straight re-roll removing the menu router elements (which are at http://drupal.org/node/317775#comment-1438040) but not yet taking Damien's comments into account.

AttachmentSizeStatusTest resultOperations
lock-251792-58-D6.patch8.1 KBIgnoredNoneNone

#59

Turkish Delight - April 21, 2009 - 03:55

subscribing

#60

Starminder - April 21, 2009 - 19:06

subscribe

#61

pwolanin - April 21, 2009 - 19:21
Status:needs work» needs review

This takes account of Damien's comments, but we should probably add the locking for the locale rebuild too.

AttachmentSizeStatusTest resultOperations
lock-251792-60-D6.patch9.36 KBIgnoredNoneNone

#62

pwolanin - April 21, 2009 - 20:06

after more discussion with Damien about making the API very clear.

AttachmentSizeStatusTest resultOperations
lock-251792-62-D6.patch9.51 KBIgnoredNoneNone

#63

pwolanin - April 21, 2009 - 20:14

adding return (bool)db_affected_rows(); to function lock_may_be_available()

AttachmentSizeStatusTest resultOperations
lock-251792-63-D6.patch9.53 KBIgnoredNoneNone

#64

Damien Tournoud - April 21, 2009 - 21:28

I stress-tested that patch, and it appears to be rock solid.

I added back the locale cache rebuild locking that I originally implemented in the original patch, and tested that. Here is an extract of the test log... you see that nearly 80 requests failed to acquire a lock, and that only one of them did the actual rebuilding.

Seems ok to go in for me... I did the last round of review with Peter on the IRC, and the API appears very sane (we will build on that for D7 for sure).

AttachmentSizeStatusTest resultOperations
lock-251792-64-D6.patch6.16 KBIgnoredNoneNone

#65

Damien Tournoud - April 21, 2009 - 21:47

And with the lock.inc file ;)

AttachmentSizeStatusTest resultOperations
lock-251792-65-D6.patch11.51 KBIgnoredNoneNone

#66

webchick - April 21, 2009 - 22:02

#67

pwolanin - April 22, 2009 - 02:49
Version:6.x-dev» 7.x-dev

straight port to D7 - lock.inc not DBTNGified yet.

AttachmentSizeStatusTest resultOperations
lock-251792-65-D7.patch10.9 KBIdleFailed: Failed to apply patch.View details | Re-test

#68

SocialNicheGuru - April 22, 2009 - 12:47

I have been trying to figure out why my install is slow.

I found the following issues:

Caching the entire menu, http://drupal.org/node/317775#comment-1492104

Implement a locking framework for long operations: http://drupal.org/node/251792#comment-1503770

But the system.install system_update_6050 function gets redefined in each

I mention this because they both might get committed which might cause problems.

Are they incompatible approaches?

Chris

#69

pwolanin - April 22, 2009 - 15:31

@activelyOUT - don't worry, whichever is committed first (probably the router patch) we'll re-roll the other to bump the update number.

#70

pwolanin - April 22, 2009 - 16:30

On the assumption that this will get into D6 (hence it there will be a 6xxx update function to add the table) this patch omits an update function for D7, so you'll need a fresh install to test.

We might need to add locking to the modules_rebuild too - on D7 I can get a PDO error by reloading quickly the admin/build/modules page even with this patch:

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'themes/garland/garland.info' for key 1 in system_theme_data() (line 1167 of /Users/Shared/www/drupal-7/modules/system/system.module).

AttachmentSizeStatusTest resultOperations
lock-251792-70-D7.patch10.18 KBIdleFailed: Failed to apply patch.View details | Re-test

#71

pwolanin - April 22, 2009 - 16:41

Updated D6 patch a little to match some tweaks in the D7 patch.

AttachmentSizeStatusTest resultOperations
lock-251792-71-D6.patch10.64 KBIgnoredNoneNone

#72

moshe weitzman - April 25, 2009 - 15:34
Status:needs review» reviewed & tested by the community

I tested the D7 patch and code reviewed. My one pause regarded the different approaches that locale and menu_rebuild take toward acquiring locks. More investigation showed that locale is implementing a 'i'll take a lock if available, otherwise proceed' approach. menu is taking a 'i need a lock and will wait for it' approach. so this patch nicely illustrates the flexibility of the API. RTBC.

#73

Berdir - April 25, 2009 - 20:59

+    if (!db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id())->fetchField()) {

That query is only 50% converted to DBTNG :)

#74

pwolanin - April 26, 2009 - 14:22

This patch fixes that query to use proper DBTNG placeholders.

AttachmentSizeStatusTest resultOperations
lock-251792-74-D7.patch10.23 KBIdleFailed: Failed to apply patch.View details | Re-test

#75

Gerhard Killesreiter - April 27, 2009 - 13:07

Dries requested infos on how the lack of a locking framework effects drupal.org.

The answer is: Currently not in a visible way. Last year we had a serious bug that triggered a race condition and took the site down several times. The actual problem was in memcached, but I think the problem had been mich easier to debug had the original user locked the cache insert.

I think that it is important that we can protect Drupal installs against race conditions.

#76

moshe weitzman - April 27, 2009 - 13:53

Well, the lock patch only kicks in when you enable a module or do some locale operation. enable/disable modules is much more frequent on a new site than a mature one like drupal.org. further, we know what modules to use whereas most sites try lots of modules out. thirdly, when we do try stuff out we do so on dev boxes so drupal.org would not feel those delays. drupal.org is one of the worst sites for judging the benefits of this patch.

#77

slantview - April 28, 2009 - 05:42

There are other things that cause the locks to kick in such as using the book module and updating a node, and any site that updates one of the menus at any point. This could be rare or frequent depending on the site. Unfortunately, you won't see it on d.o very much. We might feel the pain when we finally cutover to the new design however.

#78

pwolanin - April 28, 2009 - 12:17

@slantview - I expect there are indeed other places we will want to apply this sort of locking, but we at least need this code in place to move forward.

#79

c960657 - April 29, 2009 - 18:58
Status:reviewed & tested by the community» needs review

Nice work.

+    $lock_id = uniqid(mt_rand());
Note that uniqid() is rather slow. I'm not sure what could be used instead – perhaps $_SERVER['REMOTE_ADDR'] . $_SERVER['REMOTE_PORT'] . $_SERVER['SERVER_ADDR'] ?

+function lock_may_be_available($name) {
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+  if (!$lock) {
+    return TRUE;
+  }
+  list($usec, $sec) = explode(' ', microtime());
+  $now = (float)$usec + (float)$sec;
+  if ($now > $lock['expire']) {
+    return (bool)db_delete('semaphore')
+      ->condition('name', $name)
+      ->condition('value', $lock['value'])
+      ->execute();
+  }
+  return FALSE;
+}
Isn't there a race condition hidden here? If another thread renews an expired semaphore between the SELECT and the db_delete(), that semaphore is deleted. Adding ->condition('value', $lock['value']) will fix that, I think.

+    'indexes' => array('expire' => array('expire')),
What is the purpose of this index? It isn't used by any queries in the patch.

AFAICT the few lines in lock_init() could just as well be done inside the if block in _lock_id(). This would prevent lock_release_all() from being called in every non-cached request.

+function lock_wait($name, $delay = 30) {
+
+  while ($delay--) {
Nit: Functions should not begin with an empty line.

I am not sure I understand whether this is intended to go into D7 also, but here are some D7-only comments:

+    list($usec, $sec) = explode(' ', microtime());
+    $expire = (float)$usec + (float)$sec + $timeout;
As of PHP 5.0 you can just write $expire = microtime(TRUE) + $timeout.

+  static $lock_id;
This doesn't use the new static caching API. Wouldn't that be useful?

#80

c960657 - April 29, 2009 - 18:59
Status:needs review» needs work

#81

pwolanin - April 29, 2009 - 20:13

@c960657 - from those links seems we should just set more_entropy to TRUE. Since this is only called once per page load, I don't think we need to worry too much about speed.

re: the possible race for db_delete, I assume you meant ->condition('expire', $lock['expire']) rather than ->condition('value', $lock['value'])?

For D7, we likely should uses the new static cache API.

#82

c960657 - April 29, 2009 - 20:17

re: the possible race for db_delete, I assume you meant ->condition('expire', $lock['expire']) rather than ->condition('value', $lock['value'])?

I did :-)

#83

Damien Tournoud - April 29, 2009 - 20:49

@c960657: indeed we need a condition on both the expire and the lock value.

#84

pwolanin - May 4, 2009 - 02:08
Status:needs work» needs review

I think this addresses the concerns in #79.

AttachmentSizeStatusTest resultOperations
lock-251792-84-D7.patch8.76 KBIdleFailed: Failed to install HEAD.View details | Re-test

#85

pwolanin - May 4, 2009 - 02:10

oops - somehow the diff missed the schema.

AttachmentSizeStatusTest resultOperations
lock-251792-85-D7.patch10.28 KBIdleFailed: Failed to apply patch.View details | Re-test

#86

moshe weitzman - May 13, 2009 - 17:02

If we can move code out of lock_init(), thats desirable. That way lock.inc will only get included and parsed when it is needed. or maybe thats gonna happen anyway.

Code looks good to me, and tests pass.

#87

c960657 - May 13, 2009 - 20:42

Assume there is a lock that is very “popular”, i.e. it is often aquired as soon as it has been released. If the lock is held when lock_wait() is called, the function keeps waiting until no thread has the lock, i.e. other threads may acquire and release the lock several times before the function returns. Wouldn't it be more useful if it returned already when the lock_id that held the lock, when the function was entered, has been released, even if a new lock has been acquired by another thread (in other words, select 'value' at the top of the function and return as soon as that value has disappeared from the database)? This would be useful in events where you want to make sure that a given function has been called at least once by some thread before returning, e.g. use cases like menu_rebuild().

+      'value' => array(
+        'description' => 'A value for the semaphore.',
I suggest the name “lock_id” for this column.

function menu_rebuild() {
-  variable_del('menu_rebuild_needed');
+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    lock_wait('menu_rebuild');
+    return FALSE;
+  }
If another thread has the lock and has almost finished rebuilding the menu when the method is called, we need to build the menu once more when the lock has been released, right? I.e. something like this:
+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    lock_wait('menu_rebuild');
+    if (!lock_acquire('menu_rebuild')) {
+      lock_wait('menu_rebuild');
+      return FALSE;
+    }
+  }
Also, the new meaning of the return value is not documented.

+function lock_acquire($name, $timeout = 20.0) {
+  global $locks;
+
+  if (isset($locks[$name])) {
+    // We acquired a lock before, so check whether our lock is intact.
+    if (!db_query("SELECT 1 FROM {semaphore} WHERE name = :name AND value = :value", array(':name' => $name, ':value' => _lock_id()))->fetchField()) {
Would it be better to simply call lock_renew() here? Or trigger an error? It seems a bit odd trying to acquire a semaphore you already hold (or that you previously held but did not release).

-  _menu_clear_page_cache();
+  if (lock_renew('menu_rebuild')) {
+    _menu_navigation_links_rebuild($menu);
+    ...
+  }
I wonder whether we should just keep on running. Otherwise we cannot guarantee that the tasks inside the if block has been done, before we return. If the lock has expired, we may want to report it using watchdog.

#88

markus_petrux - May 14, 2009 - 09:36

I would be interested in (and would be glad to test) the D6 port of this feature, as stated in Duplicate entry in menu_router: ... - comment #49.

PS: Subscribe.

#89

donquixote - May 15, 2009 - 08:42

subscribe

#90

EvanDonovan - May 15, 2009 - 20:58

I also would like to test the D6 port of this feature, as per #246653: Duplicate entry in menu_router: Different symptoms, same issue & #238760: menu_router table truncated and site goes down. Which patch on here is good to test for D6?

I'm currently using the patch in #246653-47: Duplicate entry in menu_router: Different symptoms, same issue, so that panels and views can work properly for me, but it is a non-optimal solution.

#92

System Message - June 3, 2009 - 01:20
Status:needs review» needs work

The last submitted patch failed testing.

#93

aaron - June 3, 2009 - 13:34

subscribe

#94

aaron - June 3, 2009 - 16:26

To anyone contemplating #91, this script introduces huge security holes [and is thus unpublished now chx]: for instance, you don't want a robot to discover the page and begin hitting up the database with a bunch of menu rebuilds. The easiest way to accomplish this is to revisit the Modules page (at admin/build/modules). Or you could make a quick module and do something like the following, even adding a cron job to rebuild the menu periodically (making sure not to activate it until 3AM in the morning or other non-peak time). Then you can just disable this module when and if a patch fixing the original problem goes through.

I personally don't recommend using either this approach or the approach in #91. If you are unable to visit administrative pages in the first place, just log in as user 1. And then, again, just visit /admin/build/modules, which will rebuild your menu system.

<?php
function lock_bandaid_menu() {
 
$items['admin/build/lock-bandaid'] = array(
   
'page callback' => 'lock_bandaid_page',
   
'access callback' => 'user_access',
   
'access arguments' => array('administer menu'),
   
'title' => 'Menu Bandaid',
   
'description' => 'Rebuild the menu system.'
 
);
  return
$items;
}

function
lock_bandaid_page() {
 
menu_rebuild();
  return
t('Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')));
}

function
lock_bandaid_cron() {
  if (
variable_get('lock_bandaid_cron_timestamp', 0) <= time()) {
   
menu_rebuild();
   
watchdog('lock_bandaid', 'Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')), WATCHDOG_NOTICE, l(t('Menu Bandaid'), 'admin/build/lock-bandaid'));
   
variable_set('lock_bandaid_cron_timestamp', time() + 3600 * 24);
  }
}
?>

I haven't actually tested this, but just offer it as a more secure alternative to the former.

#95

EvanDonovan - June 4, 2009 - 21:40

@aaron: I get what you mean. But if the {menu_rebuild} table is completely trashed, you won't be able to visit admin/build/modules, nor will you be able to visit the login page to log in as uid 1.

I think the best solution to the concern you raise is to delete the menu_rebuild.php from your server & re-upload it if needed.

Using .htpassword to protect the file would be a more advanced solution.

#96

arhak - June 10, 2009 - 16:14

I agree with #95, since it's an infallible and quick solution
always taking into account that the script shouldn't remain in the live site much longer that needed (and it will be only needed once on production sites, maybe on development site it might be more actively required)

I also used to use another solution, which consisted on having an SQL query at hand to fix the menu_router entry for admin/build/modules and then be able to navigate to that page which would rebuild the menu_router (analogous for devel's functions), but in long time terms I used to forgot to carry on with that query, but always remember how to recreate a quick php script invoking the menu rebuild.

#97

Deltoid - June 22, 2009 - 11:02
Version:7.x-dev» 6.12
Component:base system» menu system

I have the same issue here with 6.12. I've tried to implement the most recent patch for D6 from #71 but I always get an error when applying the patch. If I disable Organic Groups the problem goes away since OG rebuilds the menu_router table very aggressively. Is there or will there be an updated D6 patch to fix this?

#98

tstoeckler - June 22, 2009 - 23:16
Version:6.12» 7.x-dev
Component:menu system» base system

#99

mark_r - June 27, 2009 - 23:40

subscribe

#100

arhak - July 3, 2009 - 15:26

IMO the whole "renew" thing is unnatural, it's odd code.

OTOH why attempting to wait?
what wait for?
waiting for 20 secs when timeout probably is set to 30 secs?

IMHO it should be attempt to lock the semaphore down
if got it do the job
else don't do it (maybe write it down to make another attempt in subsequent requests)

I have implemented a locking module for Drupal 6.x, right now I'm testing it and tuning it, but it works by now, everything seems to be fine having 10-50 pages making AJAX requests every 10 seconds to the same semaphore for a (real use case) job
that seems to be a lot of requests

nevertheless, I don't think this would be the ideal solution to the problem. It is just a solution. And it might be it for a while.

there are some known issues to think about and discuss, but they won't show up in a long while (for example, integer auto-increments stands for 4 thousands millions of rows, which will be a long while until getting there)

would you like to review this module of mine?
should I upload it to this thread?
I think it might (at least) light you up (if not truly useful)

#101

EvanDonovan - July 3, 2009 - 15:56

@arhak:
So if I'm understanding you correctly, you're creating a 6.x module to deal with the locking on the menu_rebuild? Or is it just a generalized locking framework, that could possibly be applied to this specific issue?

It might be best if you posted it as a project on Drupal.org (if you have CVS access). That way, we could have a more extended discussion about it over there, rather than lengthening this thread even further. Also, having it as an actual project, rather than an attachment in a long issue, means that people like me who would appreciate a locking solution for Drupal right now would be better able to benefit from it. (Since patches could be applied to it, new versions released, etc.)

Anyway, it's great that you're working on this!

#102

Deltoid - July 3, 2009 - 21:56

I'm not sure this has any bearing on the surrounding issue, but I thought it would be better to post it than not.

What I have found is that, after a number of hours testing and general messing about with various modules that I have running on my development site, this problem goes away and the speed of my site returns to normal if I disable the "OG Content Type Admin" module (6.x-1.0.) - http://drupal.org/project/og_content_type_admin.

I have a bunch of modules running (Views, OG, Panels, CCK, etc.) and don't seem to have any issues unless I enable this module. Anyone else run this module and see the issue disapper if disabled?

#103

irakli - July 8, 2009 - 12:27

#104

donquixote - July 8, 2009 - 12:47

#103, #10:
And return false if the menu rebuilding was skipped. The info might be useful to whatever module triggered the menu rebuilding.

#105

markus_petrux - July 11, 2009 - 14:41

I'm not sure if it has been discussed elsewhere, but there's another approach to minimize these lengthy operations.

Many places in core and contrib modules trigger a full menu rebuild just to update a few paths, or even just update, add or remove one single path. It looks like the menu system could be extended in some way, so that it is possible to just rebuild a part of the whole thing.

For example, I'm CCK and I need to rebuild the menu when the label of a field has been changed, just because the label is used for subtabs in the admin/content side. Well, it would be nice if there was a method to just rebuild admin/content/*, or even just the link it being updated.

And so on... I'm sure many places in core (as well as in contrib) could take advantage of *partial* menu rebuilds.

#106

donquixote - July 11, 2009 - 15:38

@markus_petrux,
please check Rewrite the function menu_rebuild. The topic starter is a bit too simplified, better read the replies.

Things to consider:
- It has been said that menu_links can grow very big, and should not be loaded in memory all at once.
- I don't know how big menu_router can become, but the current menu_rebuild algorithm does use PHP arrays in the size of the menu_router table. So I guess it is not a problem.

There is a bit of a controversy how this fits in the roadmap for the menu system. chx has argued that other issues should be fixed first, to clean up the db structure:
- #484234: Big task cleanup: remove type and tasks from hook_menu
- #191360: UB Usability: menu parent choosers
- Look into your DB, table "menu_router", column "tab_root", which associates different menu_router rows with each other. The idea (as I understand) is to remove this column from menu_router, and store the association somewhere else (in menu_links, or in a separate table).
- On the long run, the menu system should get a CRUD interface. I don't know how long this will take, and I don't know the implications on existing modules.

My opinion:
I think an optimization of menu_rebuild is possible without a change in the DB structure. Later, when the above issues are fixed, we can still do whatever is in the roadmap.

#107

pwolanin - July 12, 2009 - 00:10

A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure.

#108

pwolanin - July 12, 2009 - 01:07

@c960657 re: #87. Since a semaphore need not be a lock, we choose the generic name 'value'

re: "It seems a bit odd trying to acquire a semaphore you already hold (or that you previously held but did not release). "

The point of this is that some other thread may have broken/stolen the lock, so you need to check.

#109

pwolanin - July 12, 2009 - 15:17
Status:needs work» needs review

re: #87, putting the lock_renew code into lock_acquire makes sense to me - makes for even simpler code.

lock_wait() does have the problem described - that a process my wait a long time for a popular lock. So, maybe that needs to be documented? Adding new code comments and doxygen.

Inserting the navigation links is a somewhat separate operation, but yes, I see how an extreme situation where doing the menu rebuild took a long time and someone did something dumb like put that call in index.php you might fail to get the links rebuilt. In this re-roll I just extended the lock time a little and don't try to renew in menu_rebuild().

NOTE - I still think this is critical to backport to D6, so there is not an update function. You must do a fresh D7 install to create the table.

AttachmentSizeStatusTest resultOperations
lock-251792-109-D7.patch10.24 KBIdleFailed: Failed to apply patch.View details | Re-test

#110

pwolanin - July 12, 2009 - 21:25

Damien pointed out a minor typo, and also suggested that we should take an optimistic approach to acquiring the lock - that way we avoid an extra select query in most cases (since we expect the lock generally to be available).

AttachmentSizeStatusTest resultOperations
lock-251792-110-D7.patch10.55 KBIdleFailed: Failed to apply patch.View details | Re-test

#111

pwolanin - July 12, 2009 - 21:48

Logic is not quite right in the catch {} block. This should fix it.

AttachmentSizeStatusTest resultOperations
lock-251792-111-D7.patch10.53 KBIdleFailed: Failed to apply patch.View details | Re-test

#112

c960657 - July 12, 2009 - 21:54

@c960657 re: #87. Since a semaphore need not be a lock, we choose the generic name 'value'

The word "lock" is used all over in the patch. The column is used to hold the request id returned by _lock_id(), so I think the column name should reflect that.

#114

pwolanin - July 12, 2009 - 22:45

@c960657 -yes, lock is all over the patch, but the 'semaphore' table is intended to be generic and NOT limited to use by this lock API.

#115

c960657 - July 13, 2009 - 20:51

Ah, okay. If the table is intended to hold more than just semaphores but various variables in general (as suggested by the table description), I suggest using another name, perhaps “variable_volatile”. In computer science, the word “semaphore” has a very specific meaning that doesn't match the current description IMO.

+  $schema['semaphore'] = array(
+    'description' => 'Table for storing locks and other flags that should not be cached by the variable system.',

#116

pwolanin - July 13, 2009 - 21:22

Well, this is not necessarily volatile as much as it's non-cached. While I realize this is a bit of an abuse of the term semaphore, 'flag' is already used for other things. We could call it 'mutex' but that's still more specific than it's intended scope.

How about 'message' instead? I don't really care much.

#117

Dave Reid - July 13, 2009 - 21:37

-1 to {semaphore}. Discussed on irc for {lock}. It doesn't conflict with any current module namespaces, isn't a SQL reserved word, and since I don't see this being generalized to warrant a general table name.

#118

pwolanin - July 13, 2009 - 22:01

webchick says 'semaphore' is fine - so here's a re-roll just trying to clarify the table description. Only that one line changed.

AttachmentSizeStatusTest resultOperations
lock-251792-117-D7.patch10.56 KBIdleFailed: Failed to apply patch.View details | Re-test

#119

pwolanin - July 13, 2009 - 22:04

re: #117 - eaton later pointed out that 'lock' is an Oracle reserved word.

#120

webchick - July 13, 2009 - 22:04

Also, although lock is not a SQL-99 reserved keyword, it is in certain DBMSes.

#121

Deltoid - July 14, 2009 - 10:12

D6 version of this patch?

#122

jmpalomar - July 14, 2009 - 11:53

Subscribing.

#123

markus_petrux - July 14, 2009 - 12:42

@pwolanin at #107: "A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure."

I know, but I was just trying to see this from a certain distance. Aside from this locking framework, which is a nice addition by itself, I think that maybe part of the menu system in Drupal could be re-worked to allow partial rebuilds. For example, when I launch a menu rebuild because I have added a link to admin/build/foo/bar, only that part of the menu could be updated. To allow other modules change that behavior, a new hook could be added to allow other say so. And well, this is just an idea. But I think it would worth thinking a bit more about it because even with a locking framework, the whole menu rebuild may take a lot of time, while other requests wait, and requests waiting may mean more Apache threads, more database connections, poor UX unless we can offer them cookies and tee or coffee, etc. :)

#124

pwolanin - July 14, 2009 - 13:29

@markus - we save new links as a convenience - they are fundamentally different then the main work in the menu rebuild, which is to rebuild the router. LINKS != ROUTER ITEMS

#125

markus_petrux - July 14, 2009 - 13:42

Sure, I know the difference, but the concern about the increased number of threads waiting and consuming Apache/PHP/memory resources still stands.

The locking framework helps to prevent race conditions that cause duplicate errors in DB, but this does not affect on time time required to rebuild the whole menu router, and the stuff related to links.

Anyway, my aim was just to mention this. Even when locks are used, rebuilding the whole thing is overkill when only a portion of the menu will really change, so there may be a way to optimize this, I think.

#126

EvanDonovan - July 14, 2009 - 14:49

markus_petrux:
In #512962-25: Rewrite the function menu_rebuild (EDIT: esp. menu_router_build) chx suggests that after the way that menu local tasks are handled has been rewritten (see #484234: Big task cleanup: remove type and tasks from hook_menu), there might be the possibility of using CRUD operations for handling menu router items rather than the draconian system currently in place. I like this suggestion, too, but unfortunately lack the technical skills to work on it.

#127

Damien Tournoud - July 14, 2009 - 15:08
Status:needs review» reviewed & tested by the community

Let's get this in!

#128

febbraro - July 16, 2009 - 21:07

Here is a re-roll for D6 (based on 6.13) the system update hook number was in conflict with recent updates.

AttachmentSizeStatusTest resultOperations
lock-251792-128-D6.patch9.98 KBIgnoredNoneNone

#129

pwolanin - July 16, 2009 - 21:11

Thanks Frank, though the logic changed a bit in the latest D7, so we should mimic that.

#130

febbraro - July 16, 2009 - 22:32

Thanks for pointing that out Peter.

Re-rolled after bloodying up my eyes trying to tease apart the differences, but I think I got it all. The only part I'm unsure about is the optimistic locking part (and using error suppression for D6 instead of catching a PDOException) so maybe pay special attention there.

AttachmentSizeStatusTest resultOperations
lock-251792-130-D6.patch10.01 KBIgnoredNoneNone

#131

pwolanin - July 17, 2009 - 00:49

Great! thanks. I will take a look and compare.

#132

pwolanin - July 17, 2009 - 00:58

the D6 looks pretty good, but needs to use the PHP 4.4 compatible calls to microtime:

+    list($usec, $sec) = explode(' ', microtime());
+    $expire = (float)$usec + (float)$sec + $timeout;

#133

sun - July 17, 2009 - 02:26
Status:reviewed & tested by the community» needs work

+ * requests should try to acquire a lock before proceeding.  For example,

No double spaces in comments please. It's the current standard, although grammatically correct.

+function lock_init() {
+  global $locks;

Why not a drupal_static() instead here? The entire static system was designed to (also) replace most of our globals. -1 for introducing new ones.

Additionally, and after reading a bit more of this patch, this function should probably reset each and everything, because we are in a new request.

+function _lock_id() {
+  static $lock_id;
+
+  if (!isset($lock_id)) {
+    // Assign a unique id.
+    $lock_id = uniqid(mt_rand(), TRUE);
+    // We only register a shutdown function if a lock is used.
+    register_shutdown_function('lock_release_all', $lock_id);
+  }
+  return $lock_id;
+}

This looks like there can be only one long running process in one request. 3D ?

+ * @return TRUE if the lock was acquired, FALSE if it failed.
+ */
+function lock_acquire($name, $timeout = 30.0) {

(and elsewhere) Description of @return should be on the next line, just like @param.

+function lock_may_be_available($name) {

I'm worried about the namespace of those functions. Having worked in many integration projects, it was really helpful that most of Drupal's functions are prefixed with 'drupal_' or some other, presumably unique namespace.

#134

Josh Waihi - July 17, 2009 - 02:42
Status:needs work» reviewed & tested by the community

subscribing

#135

pwolanin - July 17, 2009 - 02:47

@sun - are you reading the D6 or the D7 patch?

The D7 one uses:

+function _lock_id() {
+  $lock_id = &drupal_static(__FUNCTION__);

#136

pwolanin - July 17, 2009 - 02:53
Status:reviewed & tested by the community» needs work

Re-roll of D6 patch. Cleaned up code comments per sun.

@sun - no, there can potentially be multiple locks held by the same process (each with a different name), though in general seems like something you'd avoid. A single lock (single name) can only be held by one process.

I don't understand your comment about resetting everything in function _lock_id()

AttachmentSizeStatusTest resultOperations
lock-251792-136-D6.patch10.89 KBIgnoredNoneNone

#137

pwolanin - July 17, 2009 - 02:59
Status:needs work» needs review

fixed code comments for D7 version.

@sun - re: namespace issues. menu.inc --> menu_X() functions, lock.inc --> lock_X() functions.

I don't really care a lot if Angie or Dries want to add 'drupal_' to these new functions, but we are essentially defining a pluggable interface (like cache_get/cache_set) so I'd prefer to stay with the more generic names. We so far I think we never make any drupal_X() functions pluggable.

AttachmentSizeStatusTest resultOperations
lock-251792-137-D7.patch10.58 KBIdleFailed: Failed to apply patch.View details | Re-test

#138

sun - July 17, 2009 - 13:07

Thanks for re-rolling.

oh - ok, I think the introduction and explanation in @defgroup needs some improvement and clarification. I first understood that this was a locking system that works within one request, allowing to prevent duplicate invocations (and thereby resetting everything in the next request would have made sense). Reading a bit more through the actual functions, it seems there are two different types of locking, and developers have to invoke the wanted one (somehow). That should be clarified... or if I'm still not getting it right, then the entire explanation in @defgroup could use an overhaul. Possibly also providing example snippets using @code and @endcode.

Furthermore, I wondered why a locking system in lock.inc uses a table named 'semaphore'...?

We should additionally reserve the "lock" namespace in contrib. In case we keep the table name, reserving the "semaphore" namespace wouldn't hurt either. (just like we did for http://drupal.org/project/field)

#139

pwolanin - July 17, 2009 - 13:15

@sun - this entire patch is about INTER-process locking. there is nothing at all about INTRA-process locking. Can you point to the part of the doc you find confusing? There is only one kind of lock and only one function to acquire one in the patch.

The 'semaphore' table is meant to be somewhat generic and available for other sorts of locking implementations.

#140

febbraro - July 17, 2009 - 13:39

Peter, Thanks for the PHP4 compatibility pointer, the patch in #136 applies cleanly and works for me against 6.13.

Thanks for the re-roll.

#141

Josh Waihi - July 17, 2009 - 22:45
Status:needs review» needs work

shouldn't $retry be set to TRUE at the start?

+    // Optimistically try to acquire the lock, then
+    // retry once if it fails.
+    $expire = microtime(TRUE) + $timeout;
+    $retry = FALSE;
+    do {
+      try {
+        db_insert('semaphore')
+          ->fields(array(
+            'name' => $name,
+            'value' => _lock_id(),
+            'expire' => $expire,
+          ))
+          ->execute();
+        $locks[$name] = TRUE;
+        $retry = FALSE;
+      }
+      catch (PDOException $e) {
+        // Suppress the error, and if this is our first pass through the loop,
+        // try to break the lock in case it's expired.
+        $retry = $retry ? FALSE : lock_may_be_available($name);
+      }
+    } while ($retry);

And can we please not introduce db_query into core when we don't need to:
+function lock_may_be_available($name) {
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+  if (!$lock) {
+    return TRUE;
+  }

I like the name 'lock' or 'locks' for a table name rather than 'semaphore' - its easier for people to understand.

#142

Damien Tournoud - July 17, 2009 - 22:49

shouldn't $retry be set to TRUE at the start?

No.

And can we please not introduce db_query into core when we don't need to:
+function lock_may_be_available($name) {
+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+ if (!$lock) {
+ return TRUE;
+ }

Erm. What??

I like the name 'lock' or 'locks' for a table name rather than 'semaphore' - its easier for people to understand.

Semaphore is the perfect technical term for that. Semaphores can be used for other thing then purely locking.

#143

Josh Waihi - July 17, 2009 - 23:35

I thought db_select would have been the preferred function over db_query?

#144

Damien Tournoud - July 17, 2009 - 23:40
Status:needs work» reviewed & tested by the community

I thought db_select would have been the preferred function over db_query?

No (or at least not yet). db_query() is recommended (but not mandatory) for all the simple static queries, while db_select() is mandatory for all dynamic queries.

There have been enough bikeshed on this issue. It has been marked RTBC two times already. Let's this third time be the one.

#145

sun - July 18, 2009 - 01:01
Status:reviewed & tested by the community» needs work

Given that I see many Drupal rockstars on this issue supporting the patch, I'm not questioning or trying to understand the reasoning behind the implementation.

However, the @defgroup really needs quite some clarification. At the very least, please add example snippets that outline how to use this locking system. Furthermore, the concept of having to lock certain operations is a new territory for many developers, so we really need to make sure the documentation takes the audience into account. So let's explain it, like you explain something entirely new to someone else.

Aside from that, the code of the actual functions could use some more inline comments about reasoning. For example:

+    do {
+      try {
+        db_insert('semaphore')
+          ->fields(array(
+            'name' => $name,
+            'value' => _lock_id(),
+            'expire' => $expire,
+          ))
+          ->execute();
+        $locks[$name] = TRUE;
+        $retry = FALSE;
+      }
+      catch (PDOException $e) {
+        // Suppress the error, and if this is our first pass through the loop,
+        // try to break the lock in case it's expired.
+        $retry = $retry ? FALSE : lock_may_be_available($name);
+      }
+    } while ($retry);

Wow. That really looks like some really dirty code. Most probably, we simply don't have another way to do stuff like that currently. But if that is really the case, there should really be a comment explaining why we need to do it in an awkward way like that.

+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

Small nitpick: Why not using single quotes here?

+function lock_wait($name, $delay = 30) {
+  while ($delay--) {
+    sleep(1);
+    if (lock_may_be_available($name)) {
+      return FALSE;
+    }
+  }
+  return TRUE;
+}

Every. Single. Line. Of. This. Function. Definitely. Needs. A. Comment.

Lastly, regarding menu_rebuild() we might have to turn the action into a "real" trigger/action, to solve issues like #306316: Regression: node_types_rebuild should rebuild menu... (separate issue though)

#146

Josh Waihi - July 18, 2009 - 01:33

+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

so PDO inserts the the quotes around :name?

#147

pwolanin - July 18, 2009 - 02:04
Status:needs work» needs review

@sun - we use double quotes for all DB queries as a standard afaik (though that may be a < D7 carry-over).

substantially increased code comments in the attached.

AttachmentSizeStatusTest resultOperations
lock-251792-146-D7.patch12.77 KBIdleFailed: Failed to apply patch.View details | Re-test

#148

dmitrig01 - July 18, 2009 - 02:16
Status:needs review» needs work

<?php
+ * this is an APi intested to support alternative implmentation, code using
spelling
+ capitalization

+    // We acquired a lock before, so try to renew it.
grammar

+    // This function shoudl only be called by a request that faled to get a
spelling

+    // be avaiable in menu_execute_active_handler() resulting in a 404.
spelling
?>

Schema code style is wrong too (end parens should be on a new line)

#149

pwolanin - July 18, 2009 - 02:37
Status:needs work» needs review

With code comment fixes from sun and dimitrig01

AttachmentSizeStatusTest resultOperations
lock-251792-147-D7.patch13.08 KBIdleFailed: Failed to apply patch.View details | Re-test

#150

dmitrig01 - July 18, 2009 - 02:53
Status:needs review» needs work

This is re: sun's current explanation (http://drupalbin.com/10524)

[7:46pm] dmitrig01: * then delete the current data in the database to insert the new afterwards. <-- new what
[7:46pm] dmitrig01: This is a cooperative, advisory lock system. Any long-running operation, <-- what is?
[7:46pm] dmitrig01: which could potentially be attempted in parallel by multiple requests *NEEDS A COMMA* should
[7:47pm] dmitrig01: try to acquire a lock before proceeding. <-- what's that?
[7:47pm] dmitrig01: To use this API, pick a unique name for the lock. <-- that sounds like that's the *only* thing you need to do
[7:48pm] dmitrig01: the thing never explains the concept of locks, it just jumps straight from the problem to "thisis a lock api"
[7:48pm] dmitrig01: and, "This is a cooperative, advisory lock system" makes no sense to me
[7:49pm] dmitrig01: explain the concept of acquiring and releasing locks, then say that every lock also needs a unique ID, then show the function example
[7:49pm] dmitrig01: * A function that has acquired a lock may attempt to renew a lock (extend the <-- why would it want to do that
[7:50pm] dmitrig01: * Failure to renew a lock is indicative that another request has acquired <-- that just flew straight over my head. why/how would a different request acquire a lock if the first already has acquired it?
[7:50pm] dmitrig01: * it may call lock_wait() <-- and what would that do?

#151

sun - July 18, 2009 - 04:00

Thanks for re-rolling. The docs + comments make much more sense to me now! :)

In addition to dmitrig01's review, a few more nitpicks:

+    $success = (bool)db_update('semaphore')

(and elsewhere) There should be space between the casting statement and the value.

+    // Optimistically try to acquire the lock, then
+    // retry once if it fails.

This comment doesn't seem to wrap at 80 chars.

+        // then $retry is FALSE.  In this case, the insert must have failed

Please remove the double space in comment.

+        // Since this will break the lock in case it's expired.

(possibly elsewhere) We try to avoid abbreviations in comments, i.e. "it is expired".

+    // This function should only be called by a request that faled to get a
+    // lock, so we sleep first to given the parallel request a chance to finish
+    // and release the lock.

This function should only be called rather belongs into the PHPDoc description.

Typo in "faled" and "given".

+    if (lock_may_be_available($name)) {
+      // No longer need to wait.

Given the updated docs, the comment in here can probably vanish.

Though: Wouldn't it make sense to rename that function to lock_is_released() ?

+function lock_release($name) {
...
+  db_delete('semaphore')
+    ->condition('name', $name)
+    ->condition('value', _lock_id())

This cries for unit tests. (which should actually be very easy to implement)

+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    // We choose to block here since otherwise the router item may not
+    // be available in menu_execute_active_handler() resulting in a 404.
+    lock_wait('menu_rebuild');
+    return FALSE;
+  }

Maybe that entire menu_rebuild() update should be deferred to another issue... anyway, I'd guess it would make more sense to inject a drupal_goto() in here, redirecting the user to the very same path he tried to access previously, continuously doing it all over again until the lock is released... Because, displaying a 403/404 to admins on well-known pages is a major WTF.

+  else {
+    variable_del('menu_rebuild_needed');
+  }
+  lock_release('menu_rebuild');

Shouldn't the lock be released within the else ?

+    'primary key' => array('name'),
+    );

Wrong indentation.

#152

pwolanin - July 18, 2009 - 13:24
Status:needs work» needs review

@sun -

we call lock_wait(), so we should not see the 404. We *don't* want a looping goto - that's extremely bad..

The function name 'lock_may_be_available' describes EXACTLY the purpose of the function - see above discussion. it should not change.

The lock_release() MUST NOT be in the else - otherwise we would possibly be blocking out the next page request which needs to rebuild again (though ideally our shutdown function would take care of that).

@dimitri - the unique ID is an internal implementation detail that the user of the API doesn't need to know anything about. I don't think we need to explain the concept of locks much more - I added one more sentence about it.

@dimitri - this is not hardware locking - lock expiration or even bad coding could cause the current request's lock to be acquired by another request and then unavailable to be renewed.

added a little more explanation in addition to WS/spelling corrections from sun.

AttachmentSizeStatusTest resultOperations
lock-251792-152-D7.patch13.22 KBIdleFailed: 12071 passes, 0 fails, 1 exceptionView details | Re-test

#153

sun - July 18, 2009 - 17:17

Thanks for incorporating the improvements!

The lock_release() MUST NOT be in the else - otherwise we would possibly be blocking out the next page request which needs to rebuild again (though ideally our shutdown function would take care of that).

If it is that special, then it would make much sense to put this reasoning in code then.

The only other remaining issue is to add some tests for this functionality then.

#154

pwolanin - July 18, 2009 - 17:34

The logic is simple - if you acquire a lock you should release it when you are done. I guess that's not in the top block anywhere - adding one more sentence.

AttachmentSizeStatusTest resultOperations
lock-251792-154-D7.patch13.36 KBIdleFailed: Failed to apply patch.View details | Re-test

#155

pwolanin - July 19, 2009 - 18:18

Ok, so I have to conceded that adding tests was useful - turns out that the float to string and back casting would lead to intermittent failures when trying to break a lock since we were doing a = comparison. Altered the code so make sure the timeout is at least 0.001 seconds and then do a <= compare after adding 0.0001.

AttachmentSizeStatusTest resultOperations
lock-251792-155-D7.patch18.76 KBIdleFailed: Failed to apply patch.View details | Re-test

#156

sun - July 19, 2009 - 20:00

Awesome!

Final issues: Can you squeeze some blank lines into the test (where logical "chapters" start/end)? Also, one of the last assertion messages contains the word/typo "befine".

#157

pwolanin - July 19, 2009 - 20:09

code comment fixes per #156

AttachmentSizeStatusTest resultOperations
lock-251792-156-D7.patch18.87 KBIdleFailed: Failed to install HEAD.View details | Re-test

#158

sun - July 19, 2009 - 22:15
Status:needs review» reviewed & tested by the community

Thank you! Ready to go.

#159

markus_petrux - July 19, 2009 - 23:35

Minor issue. There are 3 indentation spaces here:

<?php
+   return (bool) db_delete('semaphore')
+      ->
condition('name', $name)
+      ->
condition('value', $lock['value'])
+      ->
condition('expire', 0.0001 + $expire, '<=')
+      ->
execute();
?>

#160

pwolanin - July 19, 2009 - 23:45

fixed indent

AttachmentSizeStatusTest resultOperations
lock-251792-160-D7.patch18.87 KBIdleFailed: Failed to install HEAD.View details | Re-test

#161

Josh Waihi - July 20, 2009 - 04:59

go go gadget commit!

#162

sun - July 20, 2009 - 04:59

+    'fields' => array(
+      'name' => array(
+        'description' => 'Primary Key: Unique name.',
+        'type' => 'varchar',
+        'length' => 255,
+        'not null' => TRUE,
+        'default' => ''),
+      'value' => array(
+        'description' => 'A value for the semaphore.',
+        'type' => 'varchar',
+        'length' => 255,
+        'not null' => TRUE,
+        'default' => ''),
+      'expire' => array(
+        'description' => 'A Unix timestamp with microseconds indicating when the semaphore should expire.',
+        'type' => 'float',
+        'size' => 'big',
+        'not null' => TRUE),
+      ),

Now really the last one! The closing array braces are slightly misplaced here.

#163

febbraro - July 20, 2009 - 12:48

Re-roll of D6 version to take the float/string conversion into account. Also incorporated all of the lock.inc doc updates.

AttachmentSizeStatusTest resultOperations
lock-251792-163-D6.patch12.88 KBIgnoredNoneNone

#164

pwolanin - July 20, 2009 - 13:01

ok, this should fix the spacing on the schema.

AttachmentSizeStatusTest resultOperations
lock-251792-164-D7.patch18.89 KBIdleFailed: Failed to apply patch.View details | Re-test

#165

coltrane - July 29, 2009 - 17:23

Patch still applies with offset, minor code comments follow, so patch is a soft reroll of 164.

Line 21 and 22 of lock.inc, "obtaining" and "operation" is misspelled:

By obtaiing a lock, one request notifies any other requests that a specific opertation is in progress which must not be executed in parallel.

Line 50 of lock.inc: extra space in "page request to proceed on the assumption that a parallel request completed" before "assumption"

AttachmentSizeStatusTest resultOperations
lock-251792-165-D7.patch18.89 KBIdleFailed: Failed to apply patch.View details | Re-test

#166

awry - August 3, 2009 - 16:43

subscribe

#167

coltrane - August 3, 2009 - 17:48
Status:reviewed & tested by the community» needs review

Couple more minor double-spacing on new sentences:

+ * that the operation in question be complete.  After lock_wait() returns,

+ * by setting the 'lock_inc' variable to an alternate include filepath.  Since

+    // request acquired the lock and set a new expire time.  We add a small

+ * specified delay in seconds is reached.  This should not be used with locks

#168

coltrane - August 3, 2009 - 17:49
Status:needs review» needs work

Oops, should be CNW even though it's minor.

#169

TC44 - August 3, 2009 - 20:11

Subscribing. Will test the latest D6 patch.

#170

pwolanin - August 4, 2009 - 03:01
Status:needs work» reviewed & tested by the community

Removed double spaces.

AttachmentSizeStatusTest resultOperations
lock-251792-170-D7.patch17.63 KBIdleFailed: Failed to apply patch.View details | Re-test

#171

xtfer - August 4, 2009 - 04:50

I have this running successfully on a patched 6.13 production environment for two weeks without problems - and our menu router errors have disappeared too.

#172

TC44 - August 5, 2009 - 01:52

Just wanted to report that I had one instance of menu router errors today after patching with #163 D6 patch (happened while saving a computed field). This is still far less than prior to patching, where I was getting them frequently. The instance today may have been pilot error on my part though.

#173

febbraro - August 5, 2009 - 18:28

@TC44 Are you using admin_menu by any chance?

For everyone else, there is a "problem" with this locking mechanism and admin_menu that makes this not bulletproof as best I can tell.

admin_menu, in _admin_menu_rebuild_links() calls menu_router_build() directly not menu_rebuild() since it needs the menu data structure. This bypasses the locking mechanism and any concurrent request to rebuild the menu will result in the duplicate key error on insertion into menu_router. This does not have anything to do with the locking mechanism per se, however this patch contains the lock implementation that goes into menu_rebuild()

Questions: Should the lock protection be moved to help prevent this scenario? Should there be a different lock around the DELETE/INSERT statements inside _menu_router_rebuild() themselves? A different change all together? Should everyone just ignore me?

I know we can't control how modules are implemented, but we probably should put protections in where possible.

#174

TC44 - August 5, 2009 - 18:46

@fabbraro,

I'm not. Is it possible that the errors would show if drupal choked on my computed field code? It only happened the one time when I was saving, so I'm suspecting that was the case, since I was creating a computed field at the time, and it happened on the save.

Hasn't happened today at all, so I think the patch is working as it should, and maybe my case was just becuase of the bad comp field code?

#175

pwolanin - August 5, 2009 - 20:11

@febbraro - I think you are mistaken. Note that there is a param to menu_router_build() that defaults to FALSE. in this case, admin_menu will only get the already-built router array. admin_menu only calls this after a regular menu_rebuild() call completes, so logically the router will always be in memory already. Note - this is all by design.

http://api.drupal.org/api/function/menu_router_build/6

#176

TC44 - August 5, 2009 - 21:48

I just got menu router errors again while viewing admin/build/modules. (it happened after testing a patch for custom breadcrumbs, which was spitting some other errors - not sure if it's related, but I'm going to uninstall cb and check again).

#177

febbraro - August 5, 2009 - 22:01

@pwolanin: I'm probably missing something here (it's been a long day)

Is it possible that:
* request 1 calls menu_rebuild()
* which calls menu_router_rebuild()
* which triggers admin_menu_menu_alter()
* and sets admin_menu_rebuild_links to true
* starts to rebuild the router
* just as request 2 comes in for a page which calls admin_menu_footer()
* which checks admin_menu_rebuild_links and determines that it is supposed to rebuild the menu
* (but the first request is not done yet so there is no already build router array)
* then you will have two requests in _menu_router_rebuild() battling it out for control of the universe.

Like I said, I could just have crap for brains right now, but I don't see how it is protected against in every case. (or maybe I'm just trying to rationalize a scenario that makes me see why I'm still getting the error)

Thanks for putting up with my theorizing.

#178

pwolanin - August 5, 2009 - 23:54

@febbraro - in general, that's possible - but admin_menu in specific is coded so that it will not do this. In general, _menu_router_rebuild() is supposed to be private and should not be called. Essentially the scenario you outline is what's often happening now fairly often without the lock - so you will get a pile of SQL errors.

#179

webchick - August 6, 2009 - 07:19

Wow. I have to say, this is a really excellent patch. Even though it's 3am I was able to follow along with the code well and understand everything it's doing. It's extremely well-documented! I am hard-pressed to find anything wrong with it at all (I mentioned a few minor things below).

However, Dries in his replies here has seemed non-plussed about committing it. I'm not sure I feel comfortable doing so when he's not around. :\ Anyone have any other sources that would show that Dries changed his mind since #24 and #75?

Here are my comments, at any rate. They're all pretty minor, so I'll leave at RTBC. I haven't read the entire issue yet, so apologies if this repeats anything.

+++ includes/lock.inc
@@ -0,0 +1,252 @@
+function lock_init() {

I guess this could potentially be interpreted as a hook_init() if you had a "lock" module installed. Maybe rename to lock_initiate() or lock_start() or something. Probably not a big deal though.

+++ includes/lock.inc
@@ -0,0 +1,252 @@
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

Query should use single quotes, not double quotes.

+++ modules/simpletest/tests/lock.test
@@ -0,0 +1,58 @@
+    // Let another request acquire the lock.
+    $this->drupalGet('system-test/lock-acquire');
+    $this->assertText($lock_acquired, t('Lock acquired by the other request.'), t('Lock'));
+    // The other request finished and should have released its lock.
+    $this->assertTrue(lock_acquire('system_test_lock_acquire'), t('Lock acquired by this request.'), t('Lock'));
+    // This request holds the lock, the other request cannot get it.
+    $this->drupalGet('system-test/lock-acquire');
+    $this->assertText($lock_not_acquired, t('Lock not acquired by the other request.'), t('Lock'));
+    lock_release('system_test_lock_acquire');

Could you clarify this just a bit more? I don't quite follow how the first time system-test/lock-acquire is requested it completes the lock and the second time it doesn't.

Also, menu rebuild is definitely a good use case for this. Is cron another?

This review is powered by Dreditor.

#180

pwolanin - August 6, 2009 - 11:42

@webchick - thanks for the feedback. Did we switch the standard for quoting queries?

Cron, block rehash, and several other places in core would certainly be candidates for converting to this or uding this - So, I'd expect a few small follow-up patches. The two conversions here are proof-of-function and also two of the most serious problems where we need the locking.

The clarity of commenting is likely due to thorough vetting above by sun :-)

#181

Berdir - August 6, 2009 - 11:54

Another problem where this would *really* help is #412808: Handling of missing files and functions inside the registry, that is currently a *huge* DX issue and it would be great if we could safely rebuild the registry in such a case.

#182

donquixote - August 6, 2009 - 12:05

I think febbraro made a very valid point.

The protections built into admin_menu (the current version) usually prevent simultaneous menu rebuilds. Admin menu relies on menu_rebuild always being called before admin_menu_build, so the menu should be already in cache. But that's not what we call a robust solution!

menu_router_build is a public function, and as such it should have its own locking protection.

That said, things around menu_rebuild might see some big changes anyway:
- in D7, menu_router_build does no longer write anything to the DB.
- the new admin_menu does not use menu_router_build anymore (see http://drupal.org/node/334120).
- in D6, menu_router_build erases the entire menu_router table, then inserts the new contents. Usually that's totally overkill, and it leaves the system with an empty menu_router table for some seconds.
- in D7 this process happens in menu_router_save, which is called directly in menu_rebuild. It is faster, because the inserts are done in one big query.
- There is a D6 patch out there (see http://drupal.org/node/512962#comment-1870358) which replaces this with a diff-based table rebuilding (scan for differences, write the differences to the DB. In many cases, this means NO write operations). This patch is nice for itself, but there might be some conflicts with the things done in D7.
- The other part of menu_rebuild (_menu_navigation_links_rebuild) needs a speed tweak as well, it's far too slow.

For the future I could imagine a hook_menu_router_update($menu, $changes), that would be invoked with every menu_rebuild, and could be implemented by any module to "listen" to menu router changes. _menu_navigation_links_rebuild could then be moved into a hook implementation in menu.module. Anything about menu_links can then be moved out of core menu.inc.

#183

pwolanin - August 6, 2009 - 13:02

http://drupal.org/node/2497 and all D6 code suggests that we use "" for SQL?

#184

Damien Tournoud - August 6, 2009 - 13:05

@pwolanin: for D7, we are trying to stick with single quotes, now that there is no need for double quotes anymore. That should be documented.

#185

pwolanin - August 6, 2009 - 18:54

re-roll with the very minor changes suggested by webchick

AttachmentSizeStatusTest resultOperations
lock-251792-184-D7.patch17.66 KBIdleFailed: Failed to apply patch.View details | Re-test

#186

TC44 - August 8, 2009 - 04:25

Getting massive amounts of duplicate entry/menu router errors again today, even with #163 D6 patch. Seem to get more frequent over time (and happen now when even just viewing a node).

#187

webchick - August 8, 2009 - 13:38

Can anyone who worked on this patch respond to TC44's test results? Could they be related to this patch, or are they more likely caused by something else?

#188

donquixote - August 8, 2009 - 15:18

@TC44 (#186)
If you have xdebug installed, could you check where and how often menu_router_build() is called during the offending request? If you find menu_router_build being called by admin_menu, can you check execution time for this function?

My guess is that it's the problem explained in #177. As said, menu_router_build in D6 needs special protection, in addition to the protection for menu_rebuild. In D7 that's less of a problem, because the harmful things happen in _menu_router_save which is called directly by menu_rebuild.

#189

pwolanin - August 8, 2009 - 16:59

@TC44 - if you are seeing this just viewing a node, then you almost certainly have either hacked core or have a really bad contrib module that's doing something it shouldn't. Can you diff your core install to CVS or a clean download?

Perhaps some contrib module is calling into one of the internal functions directly? We could, perhaps, add code to acquire the a lock at each step, though this is generally unneede overhead for anyone using the APi right.

#190

pwolanin - August 8, 2009 - 17:20

admin_menu 6.x-1.x only calls this in hook_footer(), though it uses at this point a variable for passing the state, so there are some possible issues in terms of variable caching. admin_menu 6.x-3.x is doing somehting odd in terms of examining the path - are you using that module?

As above - the D7 patch makes some API changes by splitting the build form the save that should make this a non-issue regardless for D7.

#191

TC44 - August 8, 2009 - 19:33

@donquixote, @webchick, @pwolanin,

Thanks for the responses.

I do find that it happens more when I'm updating a view, then viewing a node. I don't have xdebug installed, but I can install and do some checking on that. I can verify that core is not hacked on my installation. I think I've kept it as clean as possible with the exeption of an exposed form filters override and theme overrides, and the patch applied here.

I did read another post related to this where the user found it happened after reinstalling views, after doing testing by reinststalling modules one by one.. but I haven't done a step by step reinstall to test in the same manner. I do make use of views and cck quite heavily though. I also usually have both firefox and safari open, so I'm sometimes doing quick refreshes in both, which sometimes leads to the errors.

Just to be clear, I was getting probably the same frequency now as before patching. The patch did seem to stop the errors for the first day though. It may very well be related to a specific module, but I'm not sure which one. I do vigorous backups at almost every step, so I could always restore the db from various points and test, but that could be a long process.

Let me figure out how to install xdebug so I can provide some more info..

Thanks

#192

pwolanin - August 8, 2009 - 19:49

A, so it likely whatever Views is doing - but in any cased that's D6 not D7.

#193

TC44 - August 8, 2009 - 20:25

I don't want to say that for sure, as it happens sometimes when saving regular nodes as well, so it views could be a part of it, but not necessarily the cause.

#194

donquixote - August 8, 2009 - 21:15

I can imagine that views ajax calls generally cause more concurrency than the usual plain requests. On the other hand, a views ajax call should not call menu_router_build().

@TC44:
If you want to get some quick results without xdebug, you could add a drupal_set_message() into your menu_router_build(), and let it print the current $_GET['q'] and debug_backtrace(), and what else comes to your mind.
Unlike print_r or var_dump, drupal_set_message() will not kill your ajax calls. Instead, the message is buffered for the next chance to be printed as a regular drupal message.

Anyway, I think it's quite obvious that menu_router_build in D6 needs a lock.

----

Looking for menu_rebuild or menu_router_build in D6 views:

- views_ui_edit_view_form_submit() has a menu_rebuild()
- views_ui_enable_page() has a menu_rebuild()
- views_ui_disable_page() has a menu_rebuild()
- view::delete() has a menu_rebuild()

and all the rest of views does a lot of menu-related things.

if the patch does its job right, menu_rebuild() should be generally protected. But they can still be disrupted by unprotected calls to menu_router_build(), most likely caused by admin_menu.module, as pointed out above.

#195

sun - August 8, 2009 - 23:06

Since there seems to be some confusion about admin_menu in this issue: Yes, 1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out). 3.x does not invoke menu_rebuild() anymore (aside from its built-in menu link to trigger a menu rebuild) and instead 100% relies on the regular menu system to output the menu links.

#196

donquixote - August 9, 2009 - 02:59

sun (#195):
"1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out)."

It invoked menu_router_build, not menu_rebuild.
And yes, in a usual request, menu_router_build will be called by a regular menu_rebuild, before admin_menu calls menu_router_build for a second time.
But in #177 you can read how it can happen that menu_router_build() is called without a previous call to menu_router.

For those who don't believe it, have a look into the code.

<?php
function admin_menu_footer($main = 0) {
  if (!
user_access('access administration menu') || admin_menu_suppress(FALSE)) {
    return;
  }

 
// Check for the flag indicating that we need to rebuild.
 
if (variable_get('admin_menu_rebuild_links', FALSE)) {
   
module_load_include('inc', 'admin_menu');
   
_admin_menu_rebuild_links();
   
variable_del('admin_menu_rebuild_links');
  }
?>

Usually, 'admin_menu_rebuild_links' has been set to 'true' in the same request. But, in some cases it might be that another request set this variable to true. This is where we get into trouble.

<?php
function _admin_menu_rebuild_links() {
 
// Get the newly rebuilt menu.
 
$menu = menu_router_build();
?>

As we see, menu_router_build is called with no further if-clauses.

And yes, the new admin_menu works differently. But still, as menu_router_build is a public function, it should be lock protected, don't you think?

-----------

pwolanin (#178):
"In general, _menu_router_rebuild() is supposed to be private and should not be called."

Yes, but menu_router_build() is public.

-------

sun (#195):
You are right that the new version of admin_menu will solve this issue (it might cause other side effects, I don't know). If it turns out that no other D6 module relies on the current way menu_router_build() works in D6, we could backport the change from D7, which turns menu_router_build into a peaceful function. If we do this, we don't need a lock any more for menu_router_build, the lock for menu_rebuild will be enough.

#197

TC44 - August 14, 2009 - 22:40

Just a follow up.. I'm suspecting my issues were related to the "Content Type Administration by Organic Group" module.

I didn't get to xdebug yet, but I've just implemented #4 change as suggested here: http://drupal.org/node/374730

No further menu router errors so far, but it's only been 30min of testing. Still, they were happening approx every 5th node view, node edit or view edit prior.

#198

EvanDonovan - August 17, 2009 - 16:16

Yes, there is a known issue with the Content Type Administration by Organic Group contributed module that will render your site virtually unusable. Please read #330135: menu_router_build call in hook_init is a serious performance hit. I would strongly recommend that until that issue is resolved, that you disable the module and test again.

#199

TC44 - August 17, 2009 - 17:19

@EvanDonovan,

Thanks for the reply and the link. I did read that thread previously. I've been running perfectly since patching without a single memery router error since. I've been working on my site and testing fairly heavily over the weekend, so I think this is a good sign. So for CTABOG, the D6 patch provided in this thread plus change #4 here http://drupal.org/node/374730 has so far solved the problem for me. And yes, in my case CTABOG was certainly the cause of my problems.

#200

Dries - August 17, 2009 - 20:35
Status:reviewed & tested by the community» fixed

Alright, I spent another 30 minutes looking at this patch and decided to commit it as is. It is solid code and documentation, despite the fact that I might have done some things slightly differently. I don't think one would have been better than the other, so I didn't feel like holding this back. Thanks for all the work.

I'd like to see us add an entry to CHANGELOG.txt though.

#201

moshe weitzman - August 17, 2009 - 21:25
Version:7.x-dev» 6.x-dev
Status:fixed» needs work

Awesome work! Can we get a final D6 patch for Gabor? This is murdering D6 sites.

#202

febbraro - August 17, 2009 - 21:42

@moshe

The D6 patch from #163 is still working for core
http://drupal.org/node/251792#comment-1831346

Or should it contain the extra lock as mentioned in #196 to protect against rouge modules? :p
http://drupal.org/node/251792#comment-1903500

Was there something else that needed to be included?

#203

pwolanin - August 17, 2009 - 23:08
Status:needs work» needs review

#204

moshe weitzman - August 18, 2009 - 03:09

Could someone start D7 patches for using the API in "Cron, block rehash, and several other places in core ...". Put links here so we can help with reviews.

#206

c960657 - August 19, 2009 - 19:11

#207

mrfelton - September 16, 2009 - 08:32

subscribing. In need of the a final D6 port.

#208

gpk - September 22, 2009 - 11:15

#209

gpk - September 22, 2009 - 11:21

#210

pixelpreview - September 23, 2009 - 10:25

subscribing for final patch to d6

#211

azinck - September 25, 2009 - 19:11

subscribe -- looking for the D6 patch

#212

Starminder - September 29, 2009 - 12:00

subscribe - please needs some help on this one

#213

andypost - October 1, 2009 - 01:39

#163 changes D6 schema so need a Gabor decision about backporting.

#214

gpk - October 1, 2009 - 10:38

Apart from minor code style issues, the patch at #163 (if it still applies) should still work for D6, as it did on August 17 (#202).

IMO the problem with rogue modules (#202) can be addressed if and when it happens. #163 is a big enough improvement to make this core-worthy.

What might help Gabor, apart from code style fixes, is people confirming that this works on a variety of live sites and doesn't cause any other problems. #202 suggests it works OK, problems *were* reported at #186 but did turn out to be a rogue module (#198, #199 - Content Type Administration by Organic Group, this has now been fixed #374730: Site down due to this module?). I will attempt to get this running on a live site shortly.

#215

gpk - October 1, 2009 - 10:34

Adding tag per #200.

#216

gpk - October 1, 2009 - 10:46
Version:6.x-dev» 7.x-dev
Status:needs review» needs work

Unless I'm mistaken, 7.x system.install is missing the update routine to install the new {semaphore} table.

#217

Berdir - October 1, 2009 - 10:49
Version:7.x-dev» 6.x-dev
Status:needs work» needs review

*If* this is added to D6, D7 does not need a update function.

#218

gpk - October 1, 2009 - 11:00

I would say, *if* this is added to D6, and *if* we require everyone to upgrade to the final 6.x release before upgrading to 7.x (which we usually don't AFAIK, although it is often recommended), then D7 does not need an update function..

#219

andypost - October 1, 2009 - 11:12
Version:6.x-dev» 7.x-dev

Common practice is add D7 upgrade path, then try to backport it o D6, then change D7 upgrade path

#220

System Message - October 1, 2009 - 11:30
Status:needs review» needs work

The last submitted patch failed testing.

#221

EvanDonovan - October 1, 2009 - 19:07

@gpk: I have used the patch from #214 on a large (50,000+ nodes) Drupal site, but only on the test install, not the one that was actively being visited.

I have done various things to force menu_rebuild, such as saving the modules page or saving views. I have also done these things more than once at the same time to see if I could cause the database to get into an inconsistent state. I was not able to do so.

However, I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock. If it gets committed to Drupal 6 core, though, I will use it, simply to avoid having to keep applying my hack. I'll just have to see if the performance implications are really bad - especially in cases such as when people try to save multiple views at once.

#222

nonsie - October 1, 2009 - 22:14

Subscribe

#223

gpk - October 2, 2009 - 14:17

@221:
>I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock

I've still not used it in anger ... yes, surprised to hear that it seems slower, as you say possibly your table lock doesn't last for as long as the 'menu_rebuild' lock which does more than just lock {menu_router}. And again lock_wait() only checks the lock every second, perhaps it should do so more often.

>see if I could cause the database to get into an inconsistent state. I was not able to do so.
Looking at the code is there a theoretical opportunity for rows to be left out of {menu_router}?

Suppose I have a (disabled) module aaa that defines a menu path.

1. Some other operation takes place that causes a menu rebuild
2. I enable module aaa and the menu rebuild then invoked happens to clash with the one from 1.
3. I don't acquire the lock in menu_rebuild() and so I have to wait (for up to 30s) for the lock to become available, i.e. for the other menu_rebuild to finish
4. My menu_rebuild() returns, without having done anything (i.e. the assumption is that the other rebuild has done what is necessary).

But - will hook aaa_menu() have been invoked? The list of menu hooks to call is determined early in http://api.drupal.org/api/function/menu_router_build/6, which is itself called right at the top of http://api.drupal.org/api/function/menu_rebuild/6

Thoughts?

#224

Viybel - October 13, 2009 - 11:14

Patch #163 updated for Drupal 6.14: with incremented function system_update_6054()

Don't forget to run update.php after rolling the patch.

AttachmentSizeStatusTest resultOperations
lock-251792-224-D6.patch12.88 KBIgnoredNoneNone

#225

Mike_Waters - October 8, 2009 - 13:56

FYI for those interested, a pressflow branch with this patch has been created. See https://answers.launchpad.net/pressflow/+question/85119
(Thanks go to David Strauss from fourkitchens)

Is this locking system going to be included in a future revision of D6, or will we have to wait until D7 to see it?
There is an item on the D6 issue queue, but it has been marked as a duplicate of this (which is not in the D6 issue queue)

#226

adamsohn - October 12, 2009 - 16:04

For the record, I started having this problem, found this issue queue yesterday and I don't have access to the command line on the server with the site. So I downloaded Drupal 6.14, put in my local directory with wamp. Then I downloaded Eclipse which made it easy to do the patch in #224.

I don't know anything about anything when it comes to web development. I so proud of myself right now. Didn't know I could do all that. Thanks guys you're all a big help.

BTW, my site, since I added the patch, is running so much faster!

#227

Carlos Miranda Levy - October 12, 2009 - 19:36

Works like a charm.

Finally fixed with it what was driving me insane (although not sure what was more insane, the modules page slow speed or the dozens of open threads on how to fix it...)

Might want to mention here to run update.php after installing the patch :)

#228

achilles085 - October 15, 2009 - 05:58

hope others may find this thread easily

there are other thread regarding errors in menu_router but this is the only patch worked for me.

Thanks for the patch...it works great!

#229

gaele - October 20, 2009 - 20:57

Re: patch in #224:

After the patch and before hitting 'Update' in update.php I get big red error messages like this:

Table 'drupal.semaphore' doesn't exist query: SELECT expire, value FROM semaphore WHERE name = 'locale_cache_nl' in /includes/lock.inc at line 154.

I know where it's coming from, but this will scare most users.

#230

JamesK - October 20, 2009 - 22:31

Did you run update.php as people have mentioned I don't even know how many times?

#231

Carlos Miranda Levy - October 24, 2009 - 09:49

Table semaphore is created by the patch when you run update.php

#232

Carlos Miranda Levy - October 24, 2009 - 10:09

Is it safe to use this patch on Drupal 6 with Cacherouter (cache_router - http://www.drupal.org/project/cache_router ) module enabled?

#233

SamRose - October 24, 2009 - 18:12

tested patch at http://drupal.org/node/251792#comment-2125124 working on multiple drupal 6.14 sites

#234

JamesK - October 26, 2009 - 18:56
Version:7.x-dev» 6.14
Status:needs work» reviewed & tested by the community

I think ratbc is justified. It would be great to get this committed.

#235

gaele - October 27, 2009 - 16:49

Re #230: this error message appears *while running* update.php, on the pages before the actual update happens.

#236

pimousse98 - November 9, 2009 - 19:52

@gaele: you might want to put your site in maintenance mode while updating modules/applying patches.

Everyone else: do we know if this patch will be included in a future release? just judging by the number of threads about menu_router, it would be worthwhile.

#237

gpk - November 10, 2009 - 11:26

@230, 236: I think gaele's point is that when an administrator upgrades their site they will see the error message during the update process, and this will worry some of them.

#238

andypost - November 12, 2009 - 04:46
Version:6.14» 6.x-dev
Issue tags:-needs CHANGELOG.txt entry

Upgrade path patch for D7 posted in #563106: Cannot upgrade from Drupal 6 to Drupal 7 - meta issue #42

CHANGELOG.txt entry already in under Task handling - this should be added for on-going D6 patch too.

#239

chrism2671 - November 12, 2009 - 15:02

I applied this patch under D6 but it doesn't seem to have helped.

Does this apply specifically to MyISAM tables as I altered the menu tables to InnoDB to try and reduce locking problems.

#240

System Message - November 19, 2009 - 11:21
Status:reviewed & tested by the community» needs review

chrism2671 requested that failed test be re-tested.

#241

pwolanin - November 19, 2009 - 15:22

For Drupal 6: usleep() doesn't work on windows for PHP < 5.0 Thus, sleep() is the only option and we must lock for increments of 1 sec. For D7, we could possibly switch to usleep() so that the (for example) we retry every 100 ms.

 
 

Drupal is a registered trademark of Dries Buytaert.