Drupal currently face several concurrency issues (see #249185 for variable caching, #238760 for menu rebuilding, ...). All relatively long operations (cache building, ...) are potentially affected by two different types of issues:

  1. database inconsistency: when the operation rebuilds a whole table, using the DELETE-then-INSERT pattern, requests served during the operation will see an inconsistent table (to demonstrate the issue, simply reload twice the module page in Drupal 6.x/7.x);
  2. race conditions: several operations can run at the same time, thus leading to unneeded duplication of effort.

The first issue can be solved by reducing the window of inconsistency (that's what is suggested in #238760 for session_write(), or in #230029 for node_save()), or by using transactions (that's one of the point of the new database layer).

The second issue cannot be solved by simply using transactions. You need some sort of locking (or semaphores), as discussed in #248739 and #238760. Without locking, these long operations can lead to performance hits (because several instances of the operation are run in parallel), and to space hits (because the result of the operation is saved several times in the database, thus in its binary log). About this last point, one of the busiest Drupal site in France (France 24) was forced to disabled locale caching because its invalidation lead to several hundred megabytes of binary log data.

The attached patch implements a pluggable soft locking framework for Drupal. Two points worth noting:

  • Two implementations are available: an APC-based one, which is fast but implies that there is an unique web node; a slower database-based one, that should work across a cluster of nodes. High-performances implementations for a cluster could be made, for example based on Memcached or on a multicast bus.
  • All long standing operations should be locked: the patch only demonstrates how to do this for menu_rebuild() and locale() cache refresh. Several locking strategies can be considered: acquire the lock and give up the operation if it fails (that's the implemented strategy for locale cache refresh), acquire the lock and wait for the other operation to complete (menu_rebuild()), or acquire the lock and retry acquiring until success.
CommentFileSizeAuthor
#286 251792_track.patch636 bytesandypost
#224 lock-251792-224-D6.patch12.88 KBViybel
#185 lock-251792-184-D7.patch17.66 KBpwolanin
#170 lock-251792-170-D7.patch17.63 KBpwolanin
#165 lock-251792-165-D7.patch18.89 KBcoltrane
#164 lock-251792-164-D7.patch18.89 KBpwolanin
#163 lock-251792-163-D6.patch12.88 KBfebbraro
#160 lock-251792-160-D7.patch18.87 KBpwolanin
#157 lock-251792-156-D7.patch18.87 KBpwolanin
#155 lock-251792-155-D7.patch18.76 KBpwolanin
#154 lock-251792-154-D7.patch13.36 KBpwolanin
#152 lock-251792-152-D7.patch13.22 KBpwolanin
#149 lock-251792-147-D7.patch13.08 KBpwolanin
#147 lock-251792-146-D7.patch12.77 KBpwolanin
#137 lock-251792-137-D7.patch10.58 KBpwolanin
#136 lock-251792-136-D6.patch10.89 KBpwolanin
#130 lock-251792-130-D6.patch10.01 KBfebbraro
#128 lock-251792-128-D6.patch9.98 KBfebbraro
#118 lock-251792-117-D7.patch10.56 KBpwolanin
#111 lock-251792-111-D7.patch10.53 KBpwolanin
#110 lock-251792-110-D7.patch10.55 KBpwolanin
#109 lock-251792-109-D7.patch10.24 KBpwolanin
menu_rebuild.php_.txt313 bytesEvanDonovan
#85 lock-251792-85-D7.patch10.28 KBpwolanin
#84 lock-251792-84-D7.patch8.76 KBpwolanin
#74 lock-251792-74-D7.patch10.23 KBpwolanin
#71 lock-251792-71-D6.patch10.64 KBpwolanin
#70 lock-251792-70-D7.patch10.18 KBpwolanin
#67 lock-251792-65-D7.patch10.9 KBpwolanin
#65 lock-251792-65-D6.patch11.51 KBDamien Tournoud
#64 lock-251792-64-D6.patch6.16 KBDamien Tournoud
#63 lock-251792-63-D6.patch9.53 KBpwolanin
#62 lock-251792-62-D6.patch9.51 KBpwolanin
#61 lock-251792-60-D6.patch9.36 KBpwolanin
#58 lock-251792-58-D6.patch8.1 KBpwolanin
#39 lock-no-blob-251792-39.patch14.59 KBpwolanin
#37 lock-no-blob-251792-37.patch12.34 KBpwolanin
#36 lock-no-blob-251792-36.patch12.23 KBpwolanin
#34 lock-no-blob-251792-34.patch12.27 KBpwolanin
#33 lock-no-blob-251792-33.patch12.01 KBpwolanin
#31 lock-no-blob-251792-31.patch12.01 KBpwolanin
#29 lock-no-blob-251792-29.patch11.36 KBpwolanin
#28 no-blob-251792-28.patch5.85 KBpwolanin
#27 no-blob-251792-27.patch1.4 KBpwolanin
#16 core-lock-patch-D7-02.patch9.24 KBslantview
#16 core-lock-patch-D6-02.patch8.49 KBslantview
#15 core-lock-patch-D6-01.patch5.93 KBslantview
#15 core-lock-patch-D7-01.patch6.65 KBslantview
core-lock-patch-0.patch11.02 KBDamien Tournoud
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Damien Tournoud’s picture

As a side note: the APC-based implementation requires at least APC 3.0.17, because of a bug of apc_add() (bug #12499).

kbahey’s picture

Status: Needs review » Active

We can't have a dependency in core on APC. Not all PHP installations have it, so we can't rely on that.

So, this leaves the db implementation as the candidate for core.

This uses the variable table as a lock. I don't like this idea. Creating a new table for that is better than overloading the variable table doubling as a lock mechanism. And if you make this table cache_lock, which inherits the schema from "cache", it would automatically be eligible for use in APC, memcached, and XCache if you have the cache_router module and assign a bin for cache_lock.

kbahey’s picture

Status: Active » Needs work

Setting this to CNW.

lilou’s picture

Patch still applied.

arhak’s picture

subscribing

arhak’s picture

BTW broken link to "the new database layer" which is actually linking this issue #251792

dmitrig01’s picture

Status: Needs work » Needs review

@kbahey i think that various backends can be programmed including one that (as I understand) is already in there that don't depend on APC.

BartVB’s picture

Status: Needs review » Needs work

Nice proposal!

Some remarks though: What happens if menu_rebuild() never finishes, for instance because Apache segfaults? When that happens the whole site will be down until the admin manually removes the lock. Could be solved by specifying a timeout with lock_acquire()

Is a global $locks variable necessary? Maybe use a class for this? Not sure what the policy on globals is in Drupal core.

What is $lock_id for?

Using cache_router or somesuch for this sounds like a great plan, using the DB for locking on a (very) busy site won't be fun :D

Damien Tournoud’s picture

#251792: Implement a locking framework for long operations was a duplicate. As identified in that bug, those cache rebuilding operation can simply kill any moderate traffic Drupal 6 site. This is probably one of the biggest scalability issue we have now.

slantview’s picture

While I agree in principal there is one thing I don't like:

--- includes/menu.inc	23 Apr 2008 20:01:47 -0000	1.269
+++ includes/menu.inc	27 Apr 2008 10:50:51 -0000
@@ -1636,6 +1636,11 @@ function menu_cache_clear_all() {
  * is different and leaves stale data in the menu tables.
  */
 function menu_rebuild() {
+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for the other request to do our work
+    lock_wait('menu_rebuild');
+    return;
+  }

I don't think we should wait on a lock for menu_rebuild() just skip the rebuild since someone else is already handling it. The other thing is that I think doing a menu_cache_clear_all() before we even run the db operations is dumb. I think you should lock, rebuild from db, clear the cache and set it at the same time so at least other clients will get the stale menu and not try to rebuild every time.

Damien Tournoud’s picture

@slantview: the idea was that during the menu rebuild the whole menu table is in an inconsistent state: something calling menu_rebuild() is also likely to query the menu table right after that (especially in menu administration pages). Plus, a menu_rebuild() should be a fairly quick operation (supposing that we don't run hundreds of them in parallel... of course).

I like your suggestion to clear the cache *after* the operation, and I think it's quite complementary: if a request has called menu_rebuild() explicitly, it should wait for the other rebuild to complete, but if a request has not called menu_rebuild(), it should use the old cached data.

pwolanin’s picture

As much as it might look like an APi change, I think we really need to backport some version of the to 6.x to deal with several painful race conditions.

pwolanin’s picture

This looks like a poor choice:

$lock_id = md5(uniqid());

Also, it seems a little wonky to have a bunch of extra global variables here - is there a way to avoid that?

moshe weitzman’s picture

Very nice work. subscribe.

slantview’s picture

Status: Needs work » Needs review
FileSize
6.65 KB
5.93 KB

I worked on the patch a little and implemented for www.divx.com. I also am including a D7.x-dev patch and a patch for D6.8 since that is what we are running on our site in case we want to back port this.

I didn't like having an APC specific technology lock.inc, so I renamed lock-db.inc to lock.inc, and changed the get'ers and set'ers to cache_get and cache_set. Now we have a replaceable lock_inc variable so we can replace the locks with whatever we want (including nothing) if we are so inclined. The cache_get and cache_set allow us to use any backend technology that we want for storage for the locking system.

I added a new table into system.install $schema called "cache_lock" that allows us to use database storage by default, but easily replace it with something else.

We use my module "Cache Router" for storage on divx.com and simply replace the cache_lock bin with a memcache backed bin and we added an additional bin to the memcached servers.

All in all, I really believe that this is helping our site. We can now clear cache, save menus and have them rebuilt, etc even at peak traffic periods and there is little to no impact on performance. Previously we would have the site crawling if we did any of that stuff.

There is something else I should note too so this patch doesn't get all the praise for fixing our specific problems. We also had previously converted all the tables over to INNODB, and we converted the menu_router table back to MyISAM as well as all of the search tables. Everything else is running INNODB.

Best,

Steve Rude

slantview’s picture

Aww crap, forgot the -N flag on diff.

Here is a re-roll including the lock.inc. Kinda important. :)

Damien Tournoud’s picture

We need a fix for D6, and I'm not convinced it will be the same for D7.

Your usage of cache_get/cache_set simply will work; but doesn't offer enough guarantee:

+    // try to acquire the lock
+    $lock = cache_get($name, 'cache_lock');
+    if ($lock === FALSE) {
+      $locks[$name] = TRUE;
+      cache_set($name, $locks[$name], 'cache_lock', CACHE_TEMPORARY);
+      return TRUE;
+    }
+    else {
+      return FALSE;
+    }

^ This is asynchronous. Under a load of several hundred requests per second split on several web nodes, who knows how many process can acquire this lock?

Compare to my DB implementation in the original patch:

+    // try to acquire the lock
+    if (@db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", 'lock:' . $name, $lock_id)) {
+      $locks[$name] = TRUE;
+      return TRUE;
+    }
+    else {
+      return FALSE;
+    }

^ This is perfectly atomic.

I fully agree that we need to link cache and lock (it simply makes sense). But, we need to add new primitives to cache.inc for this, that alternative cache implementation may not implement: this would be an API change.

So, would other agree to go for this imperfect cache_get()/cache_set() solution for D6 and implement something better in D7?

Damien Tournoud’s picture

@pwolanin: as you have seen, this is a very old patch, before I got fully involved in core development. There are also lots of code style errors in there.

Gábor Hojtsy’s picture

Issue tags: +drupal.org upgrade

Damien says this will most probably be needed for the d.o upgrade so we can help performance and avoid some issues.

Damien Tournoud’s picture

After a discussion with Gábor, we agreed that for D6 having a (pluggable) implementation defaulting to an (atomic) db lock would be better. I makes little sense to replace a race condition by another.

If you want to use memcache as your cache implementation, simply implement the (atomic) memcache lock implementation that match your needs.

Damien Tournoud’s picture

Version: 7.x-dev » 6.x-dev

The implementations for D6 and D7 will differ much. I'm dedicating this issue to D6, and I opened a new issue with a proposal for D7: #358663: Implement a locking framework and make caching pluggable.

Damien Tournoud’s picture

Category: task » bug

By the way, this one is a bug.

Jeremy’s picture

Subscribing.

Dries’s picture

We can solve #238760: menu_router table truncated and site goes down without using locks or API changes. Instead of flushing the menu cache in request A, and rebuilding it in request B, we can rebuild the menu cache in request A and replace the old menu -- I think.

pwolanin’s picture

@Dries - I don't think you are correct - we are deleting and rebuilding the menu in the same request in most cases. The problem is a race condition. Since we are using a variable whose value is cached for the duration of the request.

We also have a period of time where one process may be trying to build a new menu, and another process sets the flag that a rebuild is needed since the router is missing.

We could read/update the variables table directly to help eliminate the cached variable race condition, but I'm not sure we can solve it cleanly wihout APi changes.

I have been seeing this error even on a no-load localhost, so I think it's pretty critical to fix.

pwolanin’s picture

per: http://drupal.org/node/369548#comment-1245494

I think we could also help the menu rebuild issue some by not trying to cache the whole router as a serialized blob.

pwolanin’s picture

FileSize
1.4 KB

Here's a start which attempts to do away with caching the big router blob. That alone should help reduce the race conditions by speeding the process.

pwolanin’s picture

FileSize
5.85 KB

whoops - that patch missed menu.inc

pwolanin’s picture

FileSize
11.36 KB

And a full db lock implementation - building off Damien's original, but simplifying it as much as possible.

pwolanin’s picture

ok, so to check the effect, I enabled devel module (plus minor fix to devel_menu_link_alter())

I allowed anonymous users to access this devel path which forces a menu rebuild:

ab -n100 -c5 http://127.0.0.1/drupal-6/devel/menu/reset

With the patch, I got just three 404 errors in the log, I assume from when a request came in and the router table was empty.

Without the patch, there are 87 pages of SQL errors in the dblog, from running the same ab. The errors have the form:

Duplicate entry 'taxonomy/autocomplete' for key 1 query: INSERT INTO menu_router (path, load_functions, to_arg_functions, access_callback, access_arguments, page_callback, page_arguments, fit, number_parts, tab_parent, tab_root, title, title_callback, title_arguments, type, block_callback, description, position, weight, file) VALUES ('taxonomy/autocomplete', '', '', 'user_access', 'a:1:{i:0;s:14:\"access content\";}', 'taxonomy_autocomplete', 'a:0:{}', 3, 2, '', 'taxonomy/autocomplete', 'Autocomplete taxonomy', 't', '', 4, '', '', '', 0, 'modules/taxonomy/taxonomy.pages.inc') in /www/drupal-6/includes/menu.inc on line 2388
pwolanin’s picture

FileSize
12.01 KB

whoops - patch missed bootstrap.inc. Also fixing code comments, typos, etc.

Also, the removal of the router blob caching can certainly be in a separate patch, but I think reading/writing that much data is contributing to the race condition in menu rebuilds, and hence related to the central issue of concern here.

moshe weitzman’s picture

Looks great. I noticed a typo in a function name: local_break($name, $now);

pwolanin’s picture

FileSize
12.01 KB

With that typo fixed.

pwolanin’s picture

FileSize
12.27 KB

added more doxygen.

Damien Tournoud’s picture

Two nitpicks:

+/**
+ *  Helper function. Get this request's unique id.
+ */

and

+/**
+ * Helper function to store the menu router if we have it in memory..
+ */

The lock part looks very good. I'm a little bit worried about changes to the menu rebuild mechanism, especially because some API changes are involved:

-  drupal_alter('menu_link', $item, $menu);
+  drupal_alter('menu_link', $item);
pwolanin’s picture

FileSize
12.23 KB

Yes - it's a minor API change, but I honestly have no idea why we are passing $menu into that hook.

The few contrib modules that implement it with both params do not use the $menu:

http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/i18n/i18nme...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/minutes/min...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/devel/devel...

So, we could avoid breaking them outright (see attached patch), but it's still a minor API change since $menu may now be NULL.

pwolanin’s picture

FileSize
12.34 KB

neclimdul points out to me: it's a stretch but you could want to lock, release and then lock again. the second lock would fail.

So, I think using a global is the only sensible thing here (back to something closer to Damien's original).

bjaspan’s picture

I'm really glad to see this happening. I've needed lock capability in Drupal in the past.

Comments upon first reading the patch; note that I have not read any of the history in this issue, as no one else will have when they read the code after it is committed.

Locking semantics are always subtle and need to be documented. lock.inc should include a @defgroup with an overview and theory of operation. Among other things, state clearly that this is a cooperative, advisory lock system.

The answer to all my comments below may be "the patch is correct." If so, my comment becomes, "please add a comment explaining why."

+function lock_acquire($name, $timeout = 10.0) {
+  global $locks;
+
+  if (isset($locks[$name])) {
+    return (bool)db_result(db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id()));

I think this says: If I previously acquired a lock, check the db anyway because someone else might have broken it. In that case, why wouldn't you also clear $locks[$name]?

+    return (bool)@db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());

Are we guaranteed that db_query("UPDATE") returns rows modified?

In the unlikely case that $expire equals the previously set time, this will return false even though the lock is held for the requested amount of time.

+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request to do our work
+    lock_wait('menu_rebuild');
+    return;
+  }
+

This looks odd. We just waited up to 20 seconds for the lock to clear. It might be available, but we aren't going to try to grab it and do the work? Doesn't our caller deserve to know that we failed? I see something in the phpdoc for menu_rebuild() that it may leave the operation for the next page request, but it says that happens only during update.php or install.php.

pwolanin’s picture

FileSize
14.59 KB

First pass at addressing Barry's comments.

slantview’s picture

I reviewed this patch and it looks mostly good to me. I would be willing to test this out on divx.com before it goes into core if we can get it to patch (reviewed and tested by community) state.

The only thing I saw was that 1 second seems like a very long amount of time to wait. I was thinking 100ms might be a little more realistic. Typically a 200ms page load time is what I consider "good" for most pages, however a 2-4 second page load time would be what I consider "unacceptable". I know these are preferences, but I would like to hear others input on this. Is it too much to ask for a usleep and microseconds instead of sleep(1). One second is quite a long time in web page render time. I know that this is a sensitive topic since too many checks can end up with many, many checks and be heavy on a processor, but 1 second could cause many, many 404s while this is rebuilding and that is almost worse in my mind. If you are doing 400 requests per second or more, this could end up with hundreds or thousands of 404s while rebuilding. This can be especially damaging if the request is done by a web page crawler (e.g. google) and now you have a valid page that is showing as 404 in a search engine.

The other thing is that I don't believe we should accept this patch without a test set that includes verifying that locking actually works for acquire, release, wait and timeout. Also that the lock cannot be taken by two systems simultaneously (obviously).

Thoughts?

pwolanin’s picture

@slantview - note that this sleep() whoud not affect most page views - only when there is a blocking lock.

usleep() only works on Widows as of PHP 5.0.0, so we could considering using it for D7, but not D6 (which is the target of this patch).
http://www.php.net/manual/en/function.usleep.php

hass’s picture

Subscribe

slantview’s picture

@pwolanin - I do understand it would not affect most page views, but if that blocking lock happens regularly (e.g. menu rebuilding) than it could affect a lot more page views than you would want to have to sleep for 1 second while we wait for the rebuild. I don't think it's a show stopper, but could be filed under "nice to have".

pwolanin’s picture

@slantview - the alternative (current broekn situation) is that all those page views ALSO try to execute a menu rebuild which is slow and costly and probably takes more than the 1 sec.

slantview’s picture

@pwolanin - I understand the current situation, that is the one that broke our site for launch. I don't want to be a PITA, so I'll drop the issue, it may be something we want to test and adjust accordingly. I'll try to put in some time testing it and see if it really makes a difference.

As far as the windows issue, we could add something like if (function_exists('usleep')) { if we really thought there would be a gain on systems that supported it.

I'll do some testing and see if it actually matters.

torbjorn’s picture

Subscribing.

natrio’s picture

subscribing

EgbertB’s picture

subscribing

pwolanin’s picture

We should separate out the code of eliminate the caching of the menu router - that should be faster bug fix to get in and will be helpful regardless for performance.

Damien Tournoud’s picture

+function lock_renew($name, $timeout = 20.0) {
+  if (lock_acquire($name)) {
+    list($usec, $sec) = explode(' ', microtime());
+    $expire = (float)$usec + (float)$sec + $timeout;
+    db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());
+    return (bool)db_affected_rows();
+  }
+  return FALSE;
+}

The lock_acquire() doesn't seem to be necessary here, because the UPDATE query is atomic.

+function lock_wait($name, $delay = 30) {
[...]
+      lock_break($name, $now);
[...]
+}
+
+function lock_break($name, $now) {
+  db_query("DELETE FROM {semaphore} WHERE name = '%s' AND expire < %f", $name, $now);
+}

This is probably prone to race conditions. Why not simply get the value of the lock in lock_wait() and do a lock_break($name, $value)?

beltofte’s picture

subscribe

pwolanin’s picture

Status: Needs review » Needs work
Starminder’s picture

subscribe

ismaiel’s picture

subscribe

Pedro Lozano’s picture

subscribe, I'll try to review.

pwolanin’s picture

needs a re-roll to remove the menu router stuff, plus to take Damien's comments into consideration.

SocialNicheGuru’s picture

subscribing

pwolanin’s picture

FileSize
8.1 KB

Here's a straight re-roll removing the menu router elements (which are at http://drupal.org/node/317775#comment-1438040) but not yet taking Damien's comments into account.

Turkish Delight’s picture

subscribing

Starminder’s picture

subscribe

pwolanin’s picture

Status: Needs work » Needs review
FileSize
9.36 KB

This takes account of Damien's comments, but we should probably add the locking for the locale rebuild too.

pwolanin’s picture

FileSize
9.51 KB

after more discussion with Damien about making the API very clear.

pwolanin’s picture

FileSize
9.53 KB

adding return (bool)db_affected_rows(); to function lock_may_be_available()

Damien Tournoud’s picture

FileSize
6.16 KB

I stress-tested that patch, and it appears to be rock solid.

I added back the locale cache rebuild locking that I originally implemented in the original patch, and tested that. Here is an extract of the test log... you see that nearly 80 requests failed to acquire a lock, and that only one of them did the actual rebuilding.

Seems ok to go in for me... I did the last round of review with Peter on the IRC, and the API appears very sane (we will build on that for D7 for sure).

Damien Tournoud’s picture

FileSize
11.51 KB

And with the lock.inc file ;)

webchick’s picture

pwolanin’s picture

Version: 6.x-dev » 7.x-dev
FileSize
10.9 KB

straight port to D7 - lock.inc not DBTNGified yet.

SocialNicheGuru’s picture

I have been trying to figure out why my install is slow.

I found the following issues:

Caching the entire menu, http://drupal.org/node/317775#comment-1492104

Implement a locking framework for long operations: http://drupal.org/node/251792#comment-1503770

But the system.install system_update_6050 function gets redefined in each

I mention this because they both might get committed which might cause problems.

Are they incompatible approaches?

Chris

pwolanin’s picture

@activelyOUT - don't worry, whichever is committed first (probably the router patch) we'll re-roll the other to bump the update number.

pwolanin’s picture

FileSize
10.18 KB

On the assumption that this will get into D6 (hence it there will be a 6xxx update function to add the table) this patch omits an update function for D7, so you'll need a fresh install to test.

We might need to add locking to the modules_rebuild too - on D7 I can get a PDO error by reloading quickly the admin/build/modules page even with this patch:

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'themes/garland/garland.info' for key 1 in system_theme_data() (line 1167 of /Users/Shared/www/drupal-7/modules/system/system.module).
pwolanin’s picture

FileSize
10.64 KB

Updated D6 patch a little to match some tweaks in the D7 patch.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

I tested the D7 patch and code reviewed. My one pause regarded the different approaches that locale and menu_rebuild take toward acquiring locks. More investigation showed that locale is implementing a 'i'll take a lock if available, otherwise proceed' approach. menu is taking a 'i need a lock and will wait for it' approach. so this patch nicely illustrates the flexibility of the API. RTBC.

Berdir’s picture

+    if (!db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id())->fetchField()) {

That query is only 50% converted to DBTNG :)

pwolanin’s picture

FileSize
10.23 KB

This patch fixes that query to use proper DBTNG placeholders.

Gerhard Killesreiter’s picture

Dries requested infos on how the lack of a locking framework effects drupal.org.

The answer is: Currently not in a visible way. Last year we had a serious bug that triggered a race condition and took the site down several times. The actual problem was in memcached, but I think the problem had been mich easier to debug had the original user locked the cache insert.

I think that it is important that we can protect Drupal installs against race conditions.

moshe weitzman’s picture

Well, the lock patch only kicks in when you enable a module or do some locale operation. enable/disable modules is much more frequent on a new site than a mature one like drupal.org. further, we know what modules to use whereas most sites try lots of modules out. thirdly, when we do try stuff out we do so on dev boxes so drupal.org would not feel those delays. drupal.org is one of the worst sites for judging the benefits of this patch.

slantview’s picture

There are other things that cause the locks to kick in such as using the book module and updating a node, and any site that updates one of the menus at any point. This could be rare or frequent depending on the site. Unfortunately, you won't see it on d.o very much. We might feel the pain when we finally cutover to the new design however.

pwolanin’s picture

@slantview - I expect there are indeed other places we will want to apply this sort of locking, but we at least need this code in place to move forward.

c960657’s picture

Status: Reviewed & tested by the community » Needs review

Nice work.

+    $lock_id = uniqid(mt_rand());

Note that uniqid() is rather slow. I'm not sure what could be used instead – perhaps $_SERVER['REMOTE_ADDR'] . $_SERVER['REMOTE_PORT'] . $_SERVER['SERVER_ADDR'] ?

+function lock_may_be_available($name) {
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+  if (!$lock) {
+    return TRUE;
+  }
+  list($usec, $sec) = explode(' ', microtime());
+  $now = (float)$usec + (float)$sec;
+  if ($now > $lock['expire']) {
+    return (bool)db_delete('semaphore')
+      ->condition('name', $name)
+      ->condition('value', $lock['value'])
+      ->execute();
+  }
+  return FALSE;
+}

Isn't there a race condition hidden here? If another thread renews an expired semaphore between the SELECT and the db_delete(), that semaphore is deleted. Adding ->condition('value', $lock['value']) will fix that, I think.

+    'indexes' => array('expire' => array('expire')),

What is the purpose of this index? It isn't used by any queries in the patch.

AFAICT the few lines in lock_init() could just as well be done inside the if block in _lock_id(). This would prevent lock_release_all() from being called in every non-cached request.

+function lock_wait($name, $delay = 30) {
+
+  while ($delay--) {

Nit: Functions should not begin with an empty line.

I am not sure I understand whether this is intended to go into D7 also, but here are some D7-only comments:

+    list($usec, $sec) = explode(' ', microtime());
+    $expire = (float)$usec + (float)$sec + $timeout;

As of PHP 5.0 you can just write $expire = microtime(TRUE) + $timeout.

+  static $lock_id;

This doesn't use the new static caching API. Wouldn't that be useful?

c960657’s picture

Status: Needs review » Needs work
pwolanin’s picture

@c960657 - from those links seems we should just set more_entropy to TRUE. Since this is only called once per page load, I don't think we need to worry too much about speed.

re: the possible race for db_delete, I assume you meant ->condition('expire', $lock['expire']) rather than ->condition('value', $lock['value'])?

For D7, we likely should uses the new static cache API.

c960657’s picture

re: the possible race for db_delete, I assume you meant ->condition('expire', $lock['expire']) rather than ->condition('value', $lock['value'])?

I did :-)

Damien Tournoud’s picture

@c960657: indeed we need a condition on both the expire and the lock value.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
8.76 KB

I think this addresses the concerns in #79.

pwolanin’s picture

FileSize
10.28 KB

oops - somehow the diff missed the schema.

moshe weitzman’s picture

If we can move code out of lock_init(), thats desirable. That way lock.inc will only get included and parsed when it is needed. or maybe thats gonna happen anyway.

Code looks good to me, and tests pass.

c960657’s picture

Assume there is a lock that is very “popular”, i.e. it is often aquired as soon as it has been released. If the lock is held when lock_wait() is called, the function keeps waiting until no thread has the lock, i.e. other threads may acquire and release the lock several times before the function returns. Wouldn't it be more useful if it returned already when the lock_id that held the lock, when the function was entered, has been released, even if a new lock has been acquired by another thread (in other words, select 'value' at the top of the function and return as soon as that value has disappeared from the database)? This would be useful in events where you want to make sure that a given function has been called at least once by some thread before returning, e.g. use cases like menu_rebuild().

+      'value' => array(
+        'description' => 'A value for the semaphore.',

I suggest the name “lock_id” for this column.

 function menu_rebuild() {
-  variable_del('menu_rebuild_needed');
+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    lock_wait('menu_rebuild');
+    return FALSE;
+  }

If another thread has the lock and has almost finished rebuilding the menu when the method is called, we need to build the menu once more when the lock has been released, right? I.e. something like this:

+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    lock_wait('menu_rebuild');
+    if (!lock_acquire('menu_rebuild')) {
+      lock_wait('menu_rebuild');
+      return FALSE;
+    }
+  }

Also, the new meaning of the return value is not documented.

+function lock_acquire($name, $timeout = 20.0) {
+  global $locks;
+
+  if (isset($locks[$name])) {
+    // We acquired a lock before, so check whether our lock is intact.
+    if (!db_query("SELECT 1 FROM {semaphore} WHERE name = :name AND value = :value", array(':name' => $name, ':value' => _lock_id()))->fetchField()) {

Would it be better to simply call lock_renew() here? Or trigger an error? It seems a bit odd trying to acquire a semaphore you already hold (or that you previously held but did not release).

-  _menu_clear_page_cache();
+  if (lock_renew('menu_rebuild')) {
+    _menu_navigation_links_rebuild($menu);
+    ...
+  }

I wonder whether we should just keep on running. Otherwise we cannot guarantee that the tasks inside the if block has been done, before we return. If the lock has expired, we may want to report it using watchdog.

markus_petrux’s picture

I would be interested in (and would be glad to test) the D6 port of this feature, as stated in Duplicate entry in menu_router: ... - comment #49.

PS: Subscribe.

donquixote’s picture

subscribe

EvanDonovan’s picture

I also would like to test the D6 port of this feature, as per #246653: Duplicate entry in menu_router: Different symptoms, same issue & #238760: menu_router table truncated and site goes down. Which patch on here is good to test for D6?

I'm currently using the patch in #246653-47: Duplicate entry in menu_router: Different symptoms, same issue, so that panels and views can work properly for me, but it is a non-optimal solution.

Status: Needs review » Needs work

The last submitted patch failed testing.

aaron’s picture

subscribe

aaron’s picture

To anyone contemplating #91, this script introduces huge security holes [and is thus unpublished now chx]: for instance, you don't want a robot to discover the page and begin hitting up the database with a bunch of menu rebuilds. The easiest way to accomplish this is to revisit the Modules page (at admin/build/modules). Or you could make a quick module and do something like the following, even adding a cron job to rebuild the menu periodically (making sure not to activate it until 3AM in the morning or other non-peak time). Then you can just disable this module when and if a patch fixing the original problem goes through.

I personally don't recommend using either this approach or the approach in #91. If you are unable to visit administrative pages in the first place, just log in as user 1. And then, again, just visit /admin/build/modules, which will rebuild your menu system.

function lock_bandaid_menu() {
  $items['admin/build/lock-bandaid'] = array(
    'page callback' => 'lock_bandaid_page',
    'access callback' => 'user_access',
    'access arguments' => array('administer menu'),
    'title' => 'Menu Bandaid',
    'description' => 'Rebuild the menu system.'
  );
  return $items;
}

function lock_bandaid_page() {
  menu_rebuild();
  return t('Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')));
}

function lock_bandaid_cron() {
  if (variable_get('lock_bandaid_cron_timestamp', 0) <= time()) {
    menu_rebuild();
    watchdog('lock_bandaid', 'Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')), WATCHDOG_NOTICE, l(t('Menu Bandaid'), 'admin/build/lock-bandaid'));
    variable_set('lock_bandaid_cron_timestamp', time() + 3600 * 24);
  }
}

I haven't actually tested this, but just offer it as a more secure alternative to the former.

EvanDonovan’s picture

@aaron: I get what you mean. But if the {menu_rebuild} table is completely trashed, you won't be able to visit admin/build/modules, nor will you be able to visit the login page to log in as uid 1.

I think the best solution to the concern you raise is to delete the menu_rebuild.php from your server & re-upload it if needed.

Using .htpassword to protect the file would be a more advanced solution.

arhak’s picture

I agree with #95, since it's an infallible and quick solution
always taking into account that the script shouldn't remain in the live site much longer that needed (and it will be only needed once on production sites, maybe on development site it might be more actively required)

I also used to use another solution, which consisted on having an SQL query at hand to fix the menu_router entry for admin/build/modules and then be able to navigate to that page which would rebuild the menu_router (analogous for devel's functions), but in long time terms I used to forgot to carry on with that query, but always remember how to recreate a quick php script invoking the menu rebuild.

Delty’s picture

Version: 7.x-dev » 6.12
Component: base system » menu system

I have the same issue here with 6.12. I've tried to implement the most recent patch for D6 from #71 but I always get an error when applying the patch. If I disable Organic Groups the problem goes away since OG rebuilds the menu_router table very aggressively. Is there or will there be an updated D6 patch to fix this?

tstoeckler’s picture

Version: 6.12 » 7.x-dev
Component: menu system » base system
mark_r’s picture

subscribe

arhak’s picture

IMO the whole "renew" thing is unnatural, it's odd code.

OTOH why attempting to wait?
what wait for?
waiting for 20 secs when timeout probably is set to 30 secs?

IMHO it should be attempt to lock the semaphore down
if got it do the job
else don't do it (maybe write it down to make another attempt in subsequent requests)

I have implemented a locking module for Drupal 6.x, right now I'm testing it and tuning it, but it works by now, everything seems to be fine having 10-50 pages making AJAX requests every 10 seconds to the same semaphore for a (real use case) job
that seems to be a lot of requests

nevertheless, I don't think this would be the ideal solution to the problem. It is just a solution. And it might be it for a while.

there are some known issues to think about and discuss, but they won't show up in a long while (for example, integer auto-increments stands for 4 thousands millions of rows, which will be a long while until getting there)

would you like to review this module of mine?
should I upload it to this thread?
I think it might (at least) light you up (if not truly useful)

EvanDonovan’s picture

@arhak:
So if I'm understanding you correctly, you're creating a 6.x module to deal with the locking on the menu_rebuild? Or is it just a generalized locking framework, that could possibly be applied to this specific issue?

It might be best if you posted it as a project on Drupal.org (if you have CVS access). That way, we could have a more extended discussion about it over there, rather than lengthening this thread even further. Also, having it as an actual project, rather than an attachment in a long issue, means that people like me who would appreciate a locking solution for Drupal right now would be better able to benefit from it. (Since patches could be applied to it, new versions released, etc.)

Anyway, it's great that you're working on this!

Delty’s picture

I'm not sure this has any bearing on the surrounding issue, but I thought it would be better to post it than not.

What I have found is that, after a number of hours testing and general messing about with various modules that I have running on my development site, this problem goes away and the speed of my site returns to normal if I disable the "OG Content Type Admin" module (6.x-1.0.) - http://drupal.org/project/og_content_type_admin.

I have a bunch of modules running (Views, OG, Panels, CCK, etc.) and don't seem to have any issues unless I enable this module. Anyone else run this module and see the issue disapper if disabled?

irakli’s picture

donquixote’s picture

#103, #10:
And return false if the menu rebuilding was skipped. The info might be useful to whatever module triggered the menu rebuilding.

markus_petrux’s picture

I'm not sure if it has been discussed elsewhere, but there's another approach to minimize these lengthy operations.

Many places in core and contrib modules trigger a full menu rebuild just to update a few paths, or even just update, add or remove one single path. It looks like the menu system could be extended in some way, so that it is possible to just rebuild a part of the whole thing.

For example, I'm CCK and I need to rebuild the menu when the label of a field has been changed, just because the label is used for subtabs in the admin/content side. Well, it would be nice if there was a method to just rebuild admin/content/*, or even just the link it being updated.

And so on... I'm sure many places in core (as well as in contrib) could take advantage of *partial* menu rebuilds.

donquixote’s picture

@markus_petrux,
please check Rewrite the function menu_rebuild. The topic starter is a bit too simplified, better read the replies.

Things to consider:
- It has been said that menu_links can grow very big, and should not be loaded in memory all at once.
- I don't know how big menu_router can become, but the current menu_rebuild algorithm does use PHP arrays in the size of the menu_router table. So I guess it is not a problem.

There is a bit of a controversy how this fits in the roadmap for the menu system. chx has argued that other issues should be fixed first, to clean up the db structure:
- #484234: Big task cleanup: remove type and tasks from hook_menu
- #191360: UB Usability: menu parent choosers
- Look into your DB, table "menu_router", column "tab_root", which associates different menu_router rows with each other. The idea (as I understand) is to remove this column from menu_router, and store the association somewhere else (in menu_links, or in a separate table).
- On the long run, the menu system should get a CRUD interface. I don't know how long this will take, and I don't know the implications on existing modules.

My opinion:
I think an optimization of menu_rebuild is possible without a change in the DB structure. Later, when the above issues are fixed, we can still do whatever is in the roadmap.

pwolanin’s picture

A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure.

pwolanin’s picture

@c960657 re: #87. Since a semaphore need not be a lock, we choose the generic name 'value'

re: "It seems a bit odd trying to acquire a semaphore you already hold (or that you previously held but did not release). "

The point of this is that some other thread may have broken/stolen the lock, so you need to check.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
10.24 KB

re: #87, putting the lock_renew code into lock_acquire makes sense to me - makes for even simpler code.

lock_wait() does have the problem described - that a process my wait a long time for a popular lock. So, maybe that needs to be documented? Adding new code comments and doxygen.

Inserting the navigation links is a somewhat separate operation, but yes, I see how an extreme situation where doing the menu rebuild took a long time and someone did something dumb like put that call in index.php you might fail to get the links rebuilt. In this re-roll I just extended the lock time a little and don't try to renew in menu_rebuild().

NOTE - I still think this is critical to backport to D6, so there is not an update function. You must do a fresh D7 install to create the table.

pwolanin’s picture

FileSize
10.55 KB

Damien pointed out a minor typo, and also suggested that we should take an optimistic approach to acquiring the lock - that way we avoid an extra select query in most cases (since we expect the lock generally to be available).

pwolanin’s picture

FileSize
10.53 KB

Logic is not quite right in the catch {} block. This should fix it.

c960657’s picture

@c960657 re: #87. Since a semaphore need not be a lock, we choose the generic name 'value'

The word "lock" is used all over in the patch. The column is used to hold the request id returned by _lock_id(), so I think the column name should reflect that.

pwolanin’s picture

@c960657 -yes, lock is all over the patch, but the 'semaphore' table is intended to be generic and NOT limited to use by this lock API.

c960657’s picture

Ah, okay. If the table is intended to hold more than just semaphores but various variables in general (as suggested by the table description), I suggest using another name, perhaps “variable_volatile”. In computer science, the word “semaphore” has a very specific meaning that doesn't match the current description IMO.

+  $schema['semaphore'] = array(
+    'description' => 'Table for storing locks and other flags that should not be cached by the variable system.',
pwolanin’s picture

Well, this is not necessarily volatile as much as it's non-cached. While I realize this is a bit of an abuse of the term semaphore, 'flag' is already used for other things. We could call it 'mutex' but that's still more specific than it's intended scope.

How about 'message' instead? I don't really care much.

Dave Reid’s picture

-1 to {semaphore}. Discussed on irc for {lock}. It doesn't conflict with any current module namespaces, isn't a SQL reserved word, and since I don't see this being generalized to warrant a general table name.

pwolanin’s picture

FileSize
10.56 KB

webchick says 'semaphore' is fine - so here's a re-roll just trying to clarify the table description. Only that one line changed.

pwolanin’s picture

re: #117 - eaton later pointed out that 'lock' is an Oracle reserved word.

webchick’s picture

Also, although lock is not a SQL-99 reserved keyword, it is in certain DBMSes.

Delty’s picture

D6 version of this patch?

jmpalomar’s picture

Subscribing.

markus_petrux’s picture

@pwolanin at #107: "A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure."

I know, but I was just trying to see this from a certain distance. Aside from this locking framework, which is a nice addition by itself, I think that maybe part of the menu system in Drupal could be re-worked to allow partial rebuilds. For example, when I launch a menu rebuild because I have added a link to admin/build/foo/bar, only that part of the menu could be updated. To allow other modules change that behavior, a new hook could be added to allow other say so. And well, this is just an idea. But I think it would worth thinking a bit more about it because even with a locking framework, the whole menu rebuild may take a lot of time, while other requests wait, and requests waiting may mean more Apache threads, more database connections, poor UX unless we can offer them cookies and tee or coffee, etc. :)

pwolanin’s picture

@markus - we save new links as a convenience - they are fundamentally different then the main work in the menu rebuild, which is to rebuild the router. LINKS != ROUTER ITEMS

markus_petrux’s picture

Sure, I know the difference, but the concern about the increased number of threads waiting and consuming Apache/PHP/memory resources still stands.

The locking framework helps to prevent race conditions that cause duplicate errors in DB, but this does not affect on time time required to rebuild the whole menu router, and the stuff related to links.

Anyway, my aim was just to mention this. Even when locks are used, rebuilding the whole thing is overkill when only a portion of the menu will really change, so there may be a way to optimize this, I think.

EvanDonovan’s picture

markus_petrux:
In #512962-25: Optimize menu_router_build() / _menu_router_save() chx suggests that after the way that menu local tasks are handled has been rewritten (see #484234: Big task cleanup: Remove type and tasks from hook_menu()), there might be the possibility of using CRUD operations for handling menu router items rather than the draconian system currently in place. I like this suggestion, too, but unfortunately lack the technical skills to work on it.

Damien Tournoud’s picture

Status: Needs review » Reviewed & tested by the community

Let's get this in!

febbraro’s picture

FileSize
9.98 KB

Here is a re-roll for D6 (based on 6.13) the system update hook number was in conflict with recent updates.

pwolanin’s picture

Thanks Frank, though the logic changed a bit in the latest D7, so we should mimic that.

febbraro’s picture

FileSize
10.01 KB

Thanks for pointing that out Peter.

Re-rolled after bloodying up my eyes trying to tease apart the differences, but I think I got it all. The only part I'm unsure about is the optimistic locking part (and using error suppression for D6 instead of catching a PDOException) so maybe pay special attention there.

pwolanin’s picture

Great! thanks. I will take a look and compare.

pwolanin’s picture

the D6 looks pretty good, but needs to use the PHP 4.4 compatible calls to microtime:

+    list($usec, $sec) = explode(' ', microtime());
+    $expire = (float)$usec + (float)$sec + $timeout;
sun’s picture

Status: Reviewed & tested by the community » Needs work
+ * requests should try to acquire a lock before proceeding.  For example,

No double spaces in comments please. It's the current standard, although grammatically correct.

+function lock_init() {
+  global $locks;

Why not a drupal_static() instead here? The entire static system was designed to (also) replace most of our globals. -1 for introducing new ones.

Additionally, and after reading a bit more of this patch, this function should probably reset each and everything, because we are in a new request.

+function _lock_id() {
+  static $lock_id;
+
+  if (!isset($lock_id)) {
+    // Assign a unique id.
+    $lock_id = uniqid(mt_rand(), TRUE);
+    // We only register a shutdown function if a lock is used.
+    register_shutdown_function('lock_release_all', $lock_id);
+  }
+  return $lock_id;
+}

This looks like there can be only one long running process in one request. 3D ?

+ * @return TRUE if the lock was acquired, FALSE if it failed.
+ */
+function lock_acquire($name, $timeout = 30.0) {

(and elsewhere) Description of @return should be on the next line, just like @param.

+function lock_may_be_available($name) {

I'm worried about the namespace of those functions. Having worked in many integration projects, it was really helpful that most of Drupal's functions are prefixed with 'drupal_' or some other, presumably unique namespace.

Josh Waihi’s picture

Status: Needs work » Reviewed & tested by the community

subscribing

pwolanin’s picture

@sun - are you reading the D6 or the D7 patch?

The D7 one uses:

+function _lock_id() {
+  $lock_id = &drupal_static(__FUNCTION__);
pwolanin’s picture

Status: Reviewed & tested by the community » Needs work
FileSize
10.89 KB

Re-roll of D6 patch. Cleaned up code comments per sun.

@sun - no, there can potentially be multiple locks held by the same process (each with a different name), though in general seems like something you'd avoid. A single lock (single name) can only be held by one process.

I don't understand your comment about resetting everything in function _lock_id()

pwolanin’s picture

Status: Needs work » Needs review
FileSize
10.58 KB

fixed code comments for D7 version.

@sun - re: namespace issues. menu.inc --> menu_X() functions, lock.inc --> lock_X() functions.

I don't really care a lot if Angie or Dries want to add 'drupal_' to these new functions, but we are essentially defining a pluggable interface (like cache_get/cache_set) so I'd prefer to stay with the more generic names. We so far I think we never make any drupal_X() functions pluggable.

sun’s picture

Thanks for re-rolling.

oh - ok, I think the introduction and explanation in @defgroup needs some improvement and clarification. I first understood that this was a locking system that works within one request, allowing to prevent duplicate invocations (and thereby resetting everything in the next request would have made sense). Reading a bit more through the actual functions, it seems there are two different types of locking, and developers have to invoke the wanted one (somehow). That should be clarified... or if I'm still not getting it right, then the entire explanation in @defgroup could use an overhaul. Possibly also providing example snippets using @code and @endcode.

Furthermore, I wondered why a locking system in lock.inc uses a table named 'semaphore'...?

We should additionally reserve the "lock" namespace in contrib. In case we keep the table name, reserving the "semaphore" namespace wouldn't hurt either. (just like we did for http://drupal.org/project/field)

pwolanin’s picture

@sun - this entire patch is about INTER-process locking. there is nothing at all about INTRA-process locking. Can you point to the part of the doc you find confusing? There is only one kind of lock and only one function to acquire one in the patch.

The 'semaphore' table is meant to be somewhat generic and available for other sorts of locking implementations.

febbraro’s picture

Peter, Thanks for the PHP4 compatibility pointer, the patch in #136 applies cleanly and works for me against 6.13.

Thanks for the re-roll.

Josh Waihi’s picture

Status: Needs review » Needs work

shouldn't $retry be set to TRUE at the start?

+    // Optimistically try to acquire the lock, then
+    // retry once if it fails.
+    $expire = microtime(TRUE) + $timeout;
+    $retry = FALSE;
+    do {
+      try {
+        db_insert('semaphore')
+          ->fields(array(
+            'name' => $name,
+            'value' => _lock_id(),
+            'expire' => $expire,
+          ))
+          ->execute();
+        $locks[$name] = TRUE;
+        $retry = FALSE;
+      }
+      catch (PDOException $e) {
+        // Suppress the error, and if this is our first pass through the loop,
+        // try to break the lock in case it's expired.
+        $retry = $retry ? FALSE : lock_may_be_available($name);
+      }
+    } while ($retry);

And can we please not introduce db_query into core when we don't need to:

+function lock_may_be_available($name) {
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+  if (!$lock) {
+    return TRUE;
+  }

I like the name 'lock' or 'locks' for a table name rather than 'semaphore' - its easier for people to understand.

Damien Tournoud’s picture

shouldn't $retry be set to TRUE at the start?

No.

And can we please not introduce db_query into core when we don't need to:
+function lock_may_be_available($name) {
+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+ if (!$lock) {
+ return TRUE;
+ }

Erm. What??

I like the name 'lock' or 'locks' for a table name rather than 'semaphore' - its easier for people to understand.

Semaphore is the perfect technical term for that. Semaphores can be used for other thing then purely locking.

Josh Waihi’s picture

I thought db_select would have been the preferred function over db_query?

Damien Tournoud’s picture

Status: Needs work » Reviewed & tested by the community

I thought db_select would have been the preferred function over db_query?

No (or at least not yet). db_query() is recommended (but not mandatory) for all the simple static queries, while db_select() is mandatory for all dynamic queries.

There have been enough bikeshed on this issue. It has been marked RTBC two times already. Let's this third time be the one.

sun’s picture

Status: Reviewed & tested by the community » Needs work

Given that I see many Drupal rockstars on this issue supporting the patch, I'm not questioning or trying to understand the reasoning behind the implementation.

However, the @defgroup really needs quite some clarification. At the very least, please add example snippets that outline how to use this locking system. Furthermore, the concept of having to lock certain operations is a new territory for many developers, so we really need to make sure the documentation takes the audience into account. So let's explain it, like you explain something entirely new to someone else.

Aside from that, the code of the actual functions could use some more inline comments about reasoning. For example:

+    do {
+      try {
+        db_insert('semaphore')
+          ->fields(array(
+            'name' => $name,
+            'value' => _lock_id(),
+            'expire' => $expire,
+          ))
+          ->execute();
+        $locks[$name] = TRUE;
+        $retry = FALSE;
+      }
+      catch (PDOException $e) {
+        // Suppress the error, and if this is our first pass through the loop,
+        // try to break the lock in case it's expired.
+        $retry = $retry ? FALSE : lock_may_be_available($name);
+      }
+    } while ($retry);

Wow. That really looks like some really dirty code. Most probably, we simply don't have another way to do stuff like that currently. But if that is really the case, there should really be a comment explaining why we need to do it in an awkward way like that.

+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

Small nitpick: Why not using single quotes here?

+function lock_wait($name, $delay = 30) {
+  while ($delay--) {
+    sleep(1);
+    if (lock_may_be_available($name)) {
+      return FALSE;
+    }
+  }
+  return TRUE;
+}

Every. Single. Line. Of. This. Function. Definitely. Needs. A. Comment.

Lastly, regarding menu_rebuild() we might have to turn the action into a "real" trigger/action, to solve issues like #306316: Regression: node_types_rebuild should rebuild menu... (separate issue though)

Josh Waihi’s picture

+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

so PDO inserts the the quotes around :name?

pwolanin’s picture

Status: Needs work » Needs review
FileSize
12.77 KB

@sun - we use double quotes for all DB queries as a standard afaik (though that may be a < D7 carry-over).

substantially increased code comments in the attached.

dmitrig01’s picture

Status: Needs review » Needs work
+ * this is an APi intested to support alternative implmentation, code using
spelling + capitalization

+    // We acquired a lock before, so try to renew it.
grammar

+    // This function shoudl only be called by a request that faled to get a
spelling

+    // be avaiable in menu_execute_active_handler() resulting in a 404.
spelling

Schema code style is wrong too (end parens should be on a new line)

pwolanin’s picture

Status: Needs work » Needs review
FileSize
13.08 KB

With code comment fixes from sun and dimitrig01

dmitrig01’s picture

Status: Needs review » Needs work

This is re: sun's current explanation (http://drupalbin.com/10524)

[7:46pm] dmitrig01: * then delete the current data in the database to insert the new afterwards. <-- new what
[7:46pm] dmitrig01: This is a cooperative, advisory lock system. Any long-running operation, <-- what is?
[7:46pm] dmitrig01: which could potentially be attempted in parallel by multiple requests *NEEDS A COMMA* should
[7:47pm] dmitrig01: try to acquire a lock before proceeding. <-- what's that?
[7:47pm] dmitrig01: To use this API, pick a unique name for the lock. <-- that sounds like that's the *only* thing you need to do
[7:48pm] dmitrig01: the thing never explains the concept of locks, it just jumps straight from the problem to "thisis a lock api"
[7:48pm] dmitrig01: and, "This is a cooperative, advisory lock system" makes no sense to me
[7:49pm] dmitrig01: explain the concept of acquiring and releasing locks, then say that every lock also needs a unique ID, then show the function example
[7:49pm] dmitrig01: * A function that has acquired a lock may attempt to renew a lock (extend the <-- why would it want to do that
[7:50pm] dmitrig01: * Failure to renew a lock is indicative that another request has acquired <-- that just flew straight over my head. why/how would a different request acquire a lock if the first already has acquired it?
[7:50pm] dmitrig01: * it may call lock_wait() <-- and what would that do?
sun’s picture

Thanks for re-rolling. The docs + comments make much more sense to me now! :)

In addition to dmitrig01's review, a few more nitpicks:

+    $success = (bool)db_update('semaphore')

(and elsewhere) There should be space between the casting statement and the value.

+    // Optimistically try to acquire the lock, then
+    // retry once if it fails.

This comment doesn't seem to wrap at 80 chars.

+        // then $retry is FALSE.  In this case, the insert must have failed

Please remove the double space in comment.

+        // Since this will break the lock in case it's expired.

(possibly elsewhere) We try to avoid abbreviations in comments, i.e. "it is expired".

+    // This function should only be called by a request that faled to get a
+    // lock, so we sleep first to given the parallel request a chance to finish
+    // and release the lock.

This function should only be called rather belongs into the PHPDoc description.

Typo in "faled" and "given".

+    if (lock_may_be_available($name)) {
+      // No longer need to wait.

Given the updated docs, the comment in here can probably vanish.

Though: Wouldn't it make sense to rename that function to lock_is_released() ?

+function lock_release($name) {
...
+  db_delete('semaphore')
+    ->condition('name', $name)
+    ->condition('value', _lock_id())

This cries for unit tests. (which should actually be very easy to implement)

+  if (!lock_acquire('menu_rebuild')) {
+    // Wait for another request that is already doing this work.
+    // We choose to block here since otherwise the router item may not 
+    // be available in menu_execute_active_handler() resulting in a 404.
+    lock_wait('menu_rebuild');
+    return FALSE;
+  }

Maybe that entire menu_rebuild() update should be deferred to another issue... anyway, I'd guess it would make more sense to inject a drupal_goto() in here, redirecting the user to the very same path he tried to access previously, continuously doing it all over again until the lock is released... Because, displaying a 403/404 to admins on well-known pages is a major WTF.

+  else {
+    variable_del('menu_rebuild_needed');
+  }
+  lock_release('menu_rebuild');

Shouldn't the lock be released within the else ?

+    'primary key' => array('name'),
+    );

Wrong indentation.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
13.22 KB

@sun -

we call lock_wait(), so we should not see the 404. We *don't* want a looping goto - that's extremely bad..

The function name 'lock_may_be_available' describes EXACTLY the purpose of the function - see above discussion. it should not change.

The lock_release() MUST NOT be in the else - otherwise we would possibly be blocking out the next page request which needs to rebuild again (though ideally our shutdown function would take care of that).

@dimitri - the unique ID is an internal implementation detail that the user of the API doesn't need to know anything about. I don't think we need to explain the concept of locks much more - I added one more sentence about it.

@dimitri - this is not hardware locking - lock expiration or even bad coding could cause the current request's lock to be acquired by another request and then unavailable to be renewed.

added a little more explanation in addition to WS/spelling corrections from sun.

sun’s picture

Thanks for incorporating the improvements!

The lock_release() MUST NOT be in the else - otherwise we would possibly be blocking out the next page request which needs to rebuild again (though ideally our shutdown function would take care of that).

If it is that special, then it would make much sense to put this reasoning in code then.

The only other remaining issue is to add some tests for this functionality then.

pwolanin’s picture

FileSize
13.36 KB

The logic is simple - if you acquire a lock you should release it when you are done. I guess that's not in the top block anywhere - adding one more sentence.

pwolanin’s picture

FileSize
18.76 KB

Ok, so I have to conceded that adding tests was useful - turns out that the float to string and back casting would lead to intermittent failures when trying to break a lock since we were doing a = comparison. Altered the code so make sure the timeout is at least 0.001 seconds and then do a <= compare after adding 0.0001.

sun’s picture

Awesome!

Final issues: Can you squeeze some blank lines into the test (where logical "chapters" start/end)? Also, one of the last assertion messages contains the word/typo "befine".

pwolanin’s picture

FileSize
18.87 KB

code comment fixes per #156

sun’s picture

Status: Needs review » Reviewed & tested by the community

Thank you! Ready to go.

markus_petrux’s picture

Minor issue. There are 3 indentation spaces here:

+   return (bool) db_delete('semaphore')
+      ->condition('name', $name)
+      ->condition('value', $lock['value'])
+      ->condition('expire', 0.0001 + $expire, '<=')
+      ->execute();
pwolanin’s picture

FileSize
18.87 KB

fixed indent

Josh Waihi’s picture

go go gadget commit!

sun’s picture

+    'fields' => array(
+      'name' => array(
+        'description' => 'Primary Key: Unique name.',
+        'type' => 'varchar',
+        'length' => 255,
+        'not null' => TRUE,
+        'default' => ''),
+      'value' => array(
+        'description' => 'A value for the semaphore.',
+        'type' => 'varchar',
+        'length' => 255,
+        'not null' => TRUE,
+        'default' => ''),
+      'expire' => array(
+        'description' => 'A Unix timestamp with microseconds indicating when the semaphore should expire.',
+        'type' => 'float',
+        'size' => 'big',
+        'not null' => TRUE),
+      ),

Now really the last one! The closing array braces are slightly misplaced here.

febbraro’s picture

FileSize
12.88 KB

Re-roll of D6 version to take the float/string conversion into account. Also incorporated all of the lock.inc doc updates.

pwolanin’s picture

FileSize
18.89 KB

ok, this should fix the spacing on the schema.

coltrane’s picture

FileSize
18.89 KB

Patch still applies with offset, minor code comments follow, so patch is a soft reroll of 164.

Line 21 and 22 of lock.inc, "obtaining" and "operation" is misspelled:

By obtaiing a lock, one request notifies any other requests that a specific opertation is in progress which must not be executed in parallel.

Line 50 of lock.inc: extra space in "page request to proceed on the assumption that a parallel request completed" before "assumption"

awry’s picture

subscribe

coltrane’s picture

Status: Reviewed & tested by the community » Needs review

Couple more minor double-spacing on new sentences:

+ * that the operation in question be complete.  After lock_wait() returns,
+ * by setting the 'lock_inc' variable to an alternate include filepath.  Since
+    // request acquired the lock and set a new expire time.  We add a small
+ * specified delay in seconds is reached.  This should not be used with locks
coltrane’s picture

Status: Needs review » Needs work

Oops, should be CNW even though it's minor.

TC44’s picture

Subscribing. Will test the latest D6 patch.

pwolanin’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
17.63 KB

Removed double spaces.

xtfer’s picture

I have this running successfully on a patched 6.13 production environment for two weeks without problems - and our menu router errors have disappeared too.

TC44’s picture

Just wanted to report that I had one instance of menu router errors today after patching with #163 D6 patch (happened while saving a computed field). This is still far less than prior to patching, where I was getting them frequently. The instance today may have been pilot error on my part though.

febbraro’s picture

@TC44 Are you using admin_menu by any chance?

For everyone else, there is a "problem" with this locking mechanism and admin_menu that makes this not bulletproof as best I can tell.

admin_menu, in _admin_menu_rebuild_links() calls menu_router_build() directly not menu_rebuild() since it needs the menu data structure. This bypasses the locking mechanism and any concurrent request to rebuild the menu will result in the duplicate key error on insertion into menu_router. This does not have anything to do with the locking mechanism per se, however this patch contains the lock implementation that goes into menu_rebuild()

Questions: Should the lock protection be moved to help prevent this scenario? Should there be a different lock around the DELETE/INSERT statements inside _menu_router_rebuild() themselves? A different change all together? Should everyone just ignore me?

I know we can't control how modules are implemented, but we probably should put protections in where possible.

TC44’s picture

@fabbraro,

I'm not. Is it possible that the errors would show if drupal choked on my computed field code? It only happened the one time when I was saving, so I'm suspecting that was the case, since I was creating a computed field at the time, and it happened on the save.

Hasn't happened today at all, so I think the patch is working as it should, and maybe my case was just becuase of the bad comp field code?

pwolanin’s picture

@febbraro - I think you are mistaken. Note that there is a param to menu_router_build() that defaults to FALSE. in this case, admin_menu will only get the already-built router array. admin_menu only calls this after a regular menu_rebuild() call completes, so logically the router will always be in memory already. Note - this is all by design.

http://api.drupal.org/api/function/menu_router_build/6

TC44’s picture

I just got menu router errors again while viewing admin/build/modules. (it happened after testing a patch for custom breadcrumbs, which was spitting some other errors - not sure if it's related, but I'm going to uninstall cb and check again).

febbraro’s picture

@pwolanin: I'm probably missing something here (it's been a long day)

Is it possible that:
* request 1 calls menu_rebuild()
* which calls menu_router_rebuild()
* which triggers admin_menu_menu_alter()
* and sets admin_menu_rebuild_links to true
* starts to rebuild the router
* just as request 2 comes in for a page which calls admin_menu_footer()
* which checks admin_menu_rebuild_links and determines that it is supposed to rebuild the menu
* (but the first request is not done yet so there is no already build router array)
* then you will have two requests in _menu_router_rebuild() battling it out for control of the universe.

Like I said, I could just have crap for brains right now, but I don't see how it is protected against in every case. (or maybe I'm just trying to rationalize a scenario that makes me see why I'm still getting the error)

Thanks for putting up with my theorizing.

pwolanin’s picture

@febbraro - in general, that's possible - but admin_menu in specific is coded so that it will not do this. In general, _menu_router_rebuild() is supposed to be private and should not be called. Essentially the scenario you outline is what's often happening now fairly often without the lock - so you will get a pile of SQL errors.

webchick’s picture

Wow. I have to say, this is a really excellent patch. Even though it's 3am I was able to follow along with the code well and understand everything it's doing. It's extremely well-documented! I am hard-pressed to find anything wrong with it at all (I mentioned a few minor things below).

However, Dries in his replies here has seemed non-plussed about committing it. I'm not sure I feel comfortable doing so when he's not around. :\ Anyone have any other sources that would show that Dries changed his mind since #24 and #75?

Here are my comments, at any rate. They're all pretty minor, so I'll leave at RTBC. I haven't read the entire issue yet, so apologies if this repeats anything.

+++ includes/lock.inc
@@ -0,0 +1,252 @@
+function lock_init() {

I guess this could potentially be interpreted as a hook_init() if you had a "lock" module installed. Maybe rename to lock_initiate() or lock_start() or something. Probably not a big deal though.

+++ includes/lock.inc
@@ -0,0 +1,252 @@
+  $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();

Query should use single quotes, not double quotes.

+++ modules/simpletest/tests/lock.test
@@ -0,0 +1,58 @@
+    // Let another request acquire the lock.
+    $this->drupalGet('system-test/lock-acquire');
+    $this->assertText($lock_acquired, t('Lock acquired by the other request.'), t('Lock'));
+    // The other request finished and should have released its lock.
+    $this->assertTrue(lock_acquire('system_test_lock_acquire'), t('Lock acquired by this request.'), t('Lock'));
+    // This request holds the lock, the other request cannot get it.
+    $this->drupalGet('system-test/lock-acquire');
+    $this->assertText($lock_not_acquired, t('Lock not acquired by the other request.'), t('Lock'));
+    lock_release('system_test_lock_acquire');

Could you clarify this just a bit more? I don't quite follow how the first time system-test/lock-acquire is requested it completes the lock and the second time it doesn't.

Also, menu rebuild is definitely a good use case for this. Is cron another?

This review is powered by Dreditor.

pwolanin’s picture

@webchick - thanks for the feedback. Did we switch the standard for quoting queries?

Cron, block rehash, and several other places in core would certainly be candidates for converting to this or uding this - So, I'd expect a few small follow-up patches. The two conversions here are proof-of-function and also two of the most serious problems where we need the locking.

The clarity of commenting is likely due to thorough vetting above by sun :-)

Berdir’s picture

Another problem where this would *really* help is #412808: Handling of missing files and functions inside the registry, that is currently a *huge* DX issue and it would be great if we could safely rebuild the registry in such a case.

donquixote’s picture

I think febbraro made a very valid point.

The protections built into admin_menu (the current version) usually prevent simultaneous menu rebuilds. Admin menu relies on menu_rebuild always being called before admin_menu_build, so the menu should be already in cache. But that's not what we call a robust solution!

menu_router_build is a public function, and as such it should have its own locking protection.

That said, things around menu_rebuild might see some big changes anyway:
- in D7, menu_router_build does no longer write anything to the DB.
- the new admin_menu does not use menu_router_build anymore (see http://drupal.org/node/334120).
- in D6, menu_router_build erases the entire menu_router table, then inserts the new contents. Usually that's totally overkill, and it leaves the system with an empty menu_router table for some seconds.
- in D7 this process happens in menu_router_save, which is called directly in menu_rebuild. It is faster, because the inserts are done in one big query.
- There is a D6 patch out there (see http://drupal.org/node/512962#comment-1870358) which replaces this with a diff-based table rebuilding (scan for differences, write the differences to the DB. In many cases, this means NO write operations). This patch is nice for itself, but there might be some conflicts with the things done in D7.
- The other part of menu_rebuild (_menu_navigation_links_rebuild) needs a speed tweak as well, it's far too slow.

For the future I could imagine a hook_menu_router_update($menu, $changes), that would be invoked with every menu_rebuild, and could be implemented by any module to "listen" to menu router changes. _menu_navigation_links_rebuild could then be moved into a hook implementation in menu.module. Anything about menu_links can then be moved out of core menu.inc.

pwolanin’s picture

http://drupal.org/node/2497 and all D6 code suggests that we use "" for SQL?

Damien Tournoud’s picture

@pwolanin: for D7, we are trying to stick with single quotes, now that there is no need for double quotes anymore. That should be documented.

pwolanin’s picture

FileSize
17.66 KB

re-roll with the very minor changes suggested by webchick

TC44’s picture

Getting massive amounts of duplicate entry/menu router errors again today, even with #163 D6 patch. Seem to get more frequent over time (and happen now when even just viewing a node).

webchick’s picture

Can anyone who worked on this patch respond to TC44's test results? Could they be related to this patch, or are they more likely caused by something else?

donquixote’s picture

@TC44 (#186)
If you have xdebug installed, could you check where and how often menu_router_build() is called during the offending request? If you find menu_router_build being called by admin_menu, can you check execution time for this function?

My guess is that it's the problem explained in #177. As said, menu_router_build in D6 needs special protection, in addition to the protection for menu_rebuild. In D7 that's less of a problem, because the harmful things happen in _menu_router_save which is called directly by menu_rebuild.

pwolanin’s picture

@TC44 - if you are seeing this just viewing a node, then you almost certainly have either hacked core or have a really bad contrib module that's doing something it shouldn't. Can you diff your core install to CVS or a clean download?

Perhaps some contrib module is calling into one of the internal functions directly? We could, perhaps, add code to acquire the a lock at each step, though this is generally unneede overhead for anyone using the APi right.

pwolanin’s picture

admin_menu 6.x-1.x only calls this in hook_footer(), though it uses at this point a variable for passing the state, so there are some possible issues in terms of variable caching. admin_menu 6.x-3.x is doing somehting odd in terms of examining the path - are you using that module?

As above - the D7 patch makes some API changes by splitting the build form the save that should make this a non-issue regardless for D7.

TC44’s picture

@donquixote, @webchick, @pwolanin,

Thanks for the responses.

I do find that it happens more when I'm updating a view, then viewing a node. I don't have xdebug installed, but I can install and do some checking on that. I can verify that core is not hacked on my installation. I think I've kept it as clean as possible with the exeption of an exposed form filters override and theme overrides, and the patch applied here.

I did read another post related to this where the user found it happened after reinstalling views, after doing testing by reinststalling modules one by one.. but I haven't done a step by step reinstall to test in the same manner. I do make use of views and cck quite heavily though. I also usually have both firefox and safari open, so I'm sometimes doing quick refreshes in both, which sometimes leads to the errors.

Just to be clear, I was getting probably the same frequency now as before patching. The patch did seem to stop the errors for the first day though. It may very well be related to a specific module, but I'm not sure which one. I do vigorous backups at almost every step, so I could always restore the db from various points and test, but that could be a long process.

Let me figure out how to install xdebug so I can provide some more info..

Thanks

pwolanin’s picture

A, so it likely whatever Views is doing - but in any cased that's D6 not D7.

TC44’s picture

I don't want to say that for sure, as it happens sometimes when saving regular nodes as well, so it views could be a part of it, but not necessarily the cause.

donquixote’s picture

I can imagine that views ajax calls generally cause more concurrency than the usual plain requests. On the other hand, a views ajax call should not call menu_router_build().

@TC44:
If you want to get some quick results without xdebug, you could add a drupal_set_message() into your menu_router_build(), and let it print the current $_GET['q'] and debug_backtrace(), and what else comes to your mind.
Unlike print_r or var_dump, drupal_set_message() will not kill your ajax calls. Instead, the message is buffered for the next chance to be printed as a regular drupal message.

Anyway, I think it's quite obvious that menu_router_build in D6 needs a lock.

----

Looking for menu_rebuild or menu_router_build in D6 views:

- views_ui_edit_view_form_submit() has a menu_rebuild()
- views_ui_enable_page() has a menu_rebuild()
- views_ui_disable_page() has a menu_rebuild()
- view::delete() has a menu_rebuild()

and all the rest of views does a lot of menu-related things.

if the patch does its job right, menu_rebuild() should be generally protected. But they can still be disrupted by unprotected calls to menu_router_build(), most likely caused by admin_menu.module, as pointed out above.

sun’s picture

Since there seems to be some confusion about admin_menu in this issue: Yes, 1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out). 3.x does not invoke menu_rebuild() anymore (aside from its built-in menu link to trigger a menu rebuild) and instead 100% relies on the regular menu system to output the menu links.

donquixote’s picture

sun (#195):
"1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out)."

It invoked menu_router_build, not menu_rebuild.
And yes, in a usual request, menu_router_build will be called by a regular menu_rebuild, before admin_menu calls menu_router_build for a second time.
But in #177 you can read how it can happen that menu_router_build() is called without a previous call to menu_router.

For those who don't believe it, have a look into the code.

<?php
function admin_menu_footer($main = 0) {
  if (!user_access('access administration menu') || admin_menu_suppress(FALSE)) {
    return;
  }

  // Check for the flag indicating that we need to rebuild.
  if (variable_get('admin_menu_rebuild_links', FALSE)) {
    module_load_include('inc', 'admin_menu');
    _admin_menu_rebuild_links();
    variable_del('admin_menu_rebuild_links');
  }
?>

Usually, 'admin_menu_rebuild_links' has been set to 'true' in the same request. But, in some cases it might be that another request set this variable to true. This is where we get into trouble.

<?php
function _admin_menu_rebuild_links() {
  // Get the newly rebuilt menu.
  $menu = menu_router_build();
?>

As we see, menu_router_build is called with no further if-clauses.

And yes, the new admin_menu works differently. But still, as menu_router_build is a public function, it should be lock protected, don't you think?

-----------

pwolanin (#178):
"In general, _menu_router_rebuild() is supposed to be private and should not be called."

Yes, but menu_router_build() is public.

-------

sun (#195):
You are right that the new version of admin_menu will solve this issue (it might cause other side effects, I don't know). If it turns out that no other D6 module relies on the current way menu_router_build() works in D6, we could backport the change from D7, which turns menu_router_build into a peaceful function. If we do this, we don't need a lock any more for menu_router_build, the lock for menu_rebuild will be enough.

TC44’s picture

Just a follow up.. I'm suspecting my issues were related to the "Content Type Administration by Organic Group" module.

I didn't get to xdebug yet, but I've just implemented #4 change as suggested here: http://drupal.org/node/374730

No further menu router errors so far, but it's only been 30min of testing. Still, they were happening approx every 5th node view, node edit or view edit prior.

EvanDonovan’s picture

Yes, there is a known issue with the Content Type Administration by Organic Group contributed module that will render your site virtually unusable. Please read #330135: menu_router_build call in hook_init is a serious performance hit. I would strongly recommend that until that issue is resolved, that you disable the module and test again.

TC44’s picture

@EvanDonovan,

Thanks for the reply and the link. I did read that thread previously. I've been running perfectly since patching without a single memery router error since. I've been working on my site and testing fairly heavily over the weekend, so I think this is a good sign. So for CTABOG, the D6 patch provided in this thread plus change #4 here http://drupal.org/node/374730 has so far solved the problem for me. And yes, in my case CTABOG was certainly the cause of my problems.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Alright, I spent another 30 minutes looking at this patch and decided to commit it as is. It is solid code and documentation, despite the fact that I might have done some things slightly differently. I don't think one would have been better than the other, so I didn't feel like holding this back. Thanks for all the work.

I'd like to see us add an entry to CHANGELOG.txt though.

moshe weitzman’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Needs work

Awesome work! Can we get a final D6 patch for Gabor? This is murdering D6 sites.

febbraro’s picture

@moshe

The D6 patch from #163 is still working for core
http://drupal.org/node/251792#comment-1831346

Or should it contain the extra lock as mentioned in #196 to protect against rouge modules? :p
http://drupal.org/node/251792#comment-1903500

Was there something else that needed to be included?

pwolanin’s picture

Status: Needs work » Needs review
moshe weitzman’s picture

Could someone start D7 patches for using the API in "Cron, block rehash, and several other places in core ...". Put links here so we can help with reviews.

pwolanin’s picture

c960657’s picture

mrfelton’s picture

subscribing. In need of the a final D6 port.

gpk’s picture

gpk’s picture

pixelpreview@gmail.com’s picture

subscribing for final patch to d6

azinck’s picture

subscribe -- looking for the D6 patch

Starminder’s picture

subscribe - please needs some help on this one

andypost’s picture

#163 changes D6 schema so need a Gabor decision about backporting.

gpk’s picture

Apart from minor code style issues, the patch at #163 (if it still applies) should still work for D6, as it did on August 17 (#202).

IMO the problem with rogue modules (#202) can be addressed if and when it happens. #163 is a big enough improvement to make this core-worthy.

What might help Gabor, apart from code style fixes, is people confirming that this works on a variety of live sites and doesn't cause any other problems. #202 suggests it works OK, problems *were* reported at #186 but did turn out to be a rogue module (#198, #199 - Content Type Administration by Organic Group, this has now been fixed #374730: Site down due to this module?). I will attempt to get this running on a live site shortly.

gpk’s picture

Adding tag per #200.

gpk’s picture

Version: 6.x-dev » 7.x-dev
Status: Needs review » Needs work

Unless I'm mistaken, 7.x system.install is missing the update routine to install the new {semaphore} table.

Berdir’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs work » Needs review

*If* this is added to D6, D7 does not need a update function.

gpk’s picture

I would say, *if* this is added to D6, and *if* we require everyone to upgrade to the final 6.x release before upgrading to 7.x (which we usually don't AFAIK, although it is often recommended), then D7 does not need an update function..

andypost’s picture

Version: 6.x-dev » 7.x-dev

Common practice is add D7 upgrade path, then try to backport it o D6, then change D7 upgrade path

Status: Needs review » Needs work

The last submitted patch failed testing.

EvanDonovan’s picture

@gpk: I have used the patch from #214 on a large (50,000+ nodes) Drupal site, but only on the test install, not the one that was actively being visited.

I have done various things to force menu_rebuild, such as saving the modules page or saving views. I have also done these things more than once at the same time to see if I could cause the database to get into an inconsistent state. I was not able to do so.

However, I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock. If it gets committed to Drupal 6 core, though, I will use it, simply to avoid having to keep applying my hack. I'll just have to see if the performance implications are really bad - especially in cases such as when people try to save multiple views at once.

nonsie’s picture

Subscribe

gpk’s picture

@221:
>I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock

I've still not used it in anger ... yes, surprised to hear that it seems slower, as you say possibly your table lock doesn't last for as long as the 'menu_rebuild' lock which does more than just lock {menu_router}. And again lock_wait() only checks the lock every second, perhaps it should do so more often.

>see if I could cause the database to get into an inconsistent state. I was not able to do so.
Looking at the code is there a theoretical opportunity for rows to be left out of {menu_router}?

Suppose I have a (disabled) module aaa that defines a menu path.

1. Some other operation takes place that causes a menu rebuild
2. I enable module aaa and the menu rebuild then invoked happens to clash with the one from 1.
3. I don't acquire the lock in menu_rebuild() and so I have to wait (for up to 30s) for the lock to become available, i.e. for the other menu_rebuild to finish
4. My menu_rebuild() returns, without having done anything (i.e. the assumption is that the other rebuild has done what is necessary).

But - will hook aaa_menu() have been invoked? The list of menu hooks to call is determined early in http://api.drupal.org/api/function/menu_router_build/6, which is itself called right at the top of http://api.drupal.org/api/function/menu_rebuild/6

Thoughts?

Viybel’s picture

FileSize
12.88 KB

Patch #163 updated for Drupal 6.14: with incremented function system_update_6054()

Don't forget to run update.php after rolling the patch.

Mike_Waters’s picture

FYI for those interested, a pressflow branch with this patch has been created. See https://answers.launchpad.net/pressflow/+question/85119
(Thanks go to David Strauss from fourkitchens)

Is this locking system going to be included in a future revision of D6, or will we have to wait until D7 to see it?
There is an item on the D6 issue queue, but it has been marked as a duplicate of this (which is not in the D6 issue queue)

Adam S’s picture

For the record, I started having this problem, found this issue queue yesterday and I don't have access to the command line on the server with the site. So I downloaded Drupal 6.14, put in my local directory with wamp. Then I downloaded Eclipse which made it easy to do the patch in #224.

I don't know anything about anything when it comes to web development. I so proud of myself right now. Didn't know I could do all that. Thanks guys you're all a big help.

BTW, my site, since I added the patch, is running so much faster!

Carlos Miranda Levy’s picture

Works like a charm.

Finally fixed with it what was driving me insane (although not sure what was more insane, the modules page slow speed or the dozens of open threads on how to fix it...)

Might want to mention here to run update.php after installing the patch :)

achilles085’s picture

hope others may find this thread easily

there are other thread regarding errors in menu_router but this is the only patch worked for me.

Thanks for the patch...it works great!

gaele’s picture

Re: patch in #224:

After the patch and before hitting 'Update' in update.php I get big red error messages like this:

Table 'drupal.semaphore' doesn't exist query: SELECT expire, value FROM semaphore WHERE name = 'locale_cache_nl' in
/includes/lock.inc at line 154.

I know where it's coming from, but this will scare most users.

JamesK’s picture

Did you run update.php as people have mentioned I don't even know how many times?

Carlos Miranda Levy’s picture

Table semaphore is created by the patch when you run update.php

Carlos Miranda Levy’s picture

Is it safe to use this patch on Drupal 6 with Cacherouter (cache_router - http://www.drupal.org/project/cache_router ) module enabled?

SamRose’s picture

tested patch at http://drupal.org/node/251792#comment-2125124 working on multiple drupal 6.14 sites

JamesK’s picture

Version: 7.x-dev » 6.14
Status: Needs work » Reviewed & tested by the community

I think ratbc is justified. It would be great to get this committed.

gaele’s picture

Re #230: this error message appears *while running* update.php, on the pages before the actual update happens.

pimousse98’s picture

@gaele: you might want to put your site in maintenance mode while updating modules/applying patches.

Everyone else: do we know if this patch will be included in a future release? just judging by the number of threads about menu_router, it would be worthwhile.

gpk’s picture

@230, 236: I think gaele's point is that when an administrator upgrades their site they will see the error message during the update process, and this will worry some of them.

andypost’s picture

Version: 6.14 » 6.x-dev

Upgrade path patch for D7 posted in #563106: Cannot upgrade from Drupal 6 to Drupal 7 - meta issue #42

CHANGELOG.txt entry already in under Task handling - this should be added for on-going D6 patch too.

chrism2671’s picture

I applied this patch under D6 but it doesn't seem to have helped.

Does this apply specifically to MyISAM tables as I altered the menu tables to InnoDB to try and reduce locking problems.

Status: Reviewed & tested by the community » Needs review

chrism2671 requested that failed test be re-tested.

pwolanin’s picture

For Drupal 6: usleep() doesn't work on windows for PHP < 5.0 Thus, sleep() is the only option and we must lock for increments of 1 sec. For D7, we could possibly switch to usleep() so that the (for example) we retry every 100 ms.

dalin’s picture

Status: Needs review » Reviewed & tested by the community

The patch in #224 combined with wrapping menu_rebuild() in a transaction made all my headaches go away =)

RTBC anyone?

rmjiv’s picture

This looks really interesting. One minor thing I noticed was a comment typo in lock.inc: "By obtaiing a lock, one request".

rfay’s picture

rfay

JamesK’s picture

Status: Reviewed & tested by the community » Needs work

And yet another Drupal release passes without this being committed.
Re-roll anyone?

pwolanin’s picture

does it need a re-roll?

JamesK’s picture

Status: Needs work » Reviewed & tested by the community

My bad. I just checked and no re-roll needed.

zapscribbles’s picture

Had this problem for ages, will try patch from comment #224 and see if it works. Thanks guys!

crea’s picture

Subscribing.

alanburke’s picture

Status: Reviewed & tested by the community » Needs review
gpk’s picture

Status: Needs review » Reviewed & tested by the community

@250: if the change of status was intentional, pls explain!! :)

miro_dietiker’s picture

Subscribing... This is a nightmare for complex sites...

pwolanin’s picture

@miro_dietiker - are you using this patch on any of those sites?

nonsie’s picture

Using patch from #224, works like a charm

TC44’s picture

Can anyone verify that the #224 patch works with Drupal 6.15?

I just patched and now get the following error on every page:

Parse error: syntax error, unexpected '<' in /home/mysite/public_html/includes/lock.inc on line 235

in the lock.inc file, line 235 is <?php

/**
* @} End of "defgroup locks".
*/
<?php
// $Id$

gpk’s picture

@255: looks like the new file lock.inc has not been properly created. Or perhaps the lock.inc file was already there and now the content is duplicated? The last line should be line 234 */

TC44’s picture

gpk, thanks for the reply. Thanks, yes it was already there. I got a bit confused and tried to patch over the previous patch.. I think I've got it sorted now.

miro_dietiker’s picture

pwolanin, we're using the referred patch from #224 now. It works perfectly as long as you "like" patching the core ;-)

However - upgrading to this mechanism will temporarily add an error message on every single page request due to missing semaphore table. Is there any improvement that makes sense?

EvanDonovan’s picture

I think that since the error message goes away after update.php is run, it's not a big deal. The release notes for the next Drupal 6 release could warn site admins of this, so it wouldn't freak them out.

David Strauss’s picture

A slightly modified version of this locking framework has been merged into Pressflow 6.

miro_dietiker’s picture

David Strauss, If there is any reason to "slightly modify" this patch, please explain asap. As we all think this functionality is clean and we hope ready enough to be part of the next Drupal 6 release.

David Strauss’s picture

@miro_dietiker

(1) The chosen time for running lock_init() is after bootstrap initializes variables. Pressflow uses locking for variables, so lock initialization had to move up.
(2) Pressflow cannot simply add hook_update() functions to core without breaking the ability to move back to Drupal. So, we factored the semaphore schema into a module.

As for integration into Drupal 6, Gabor told me on IRC (awhile back) that he would have limited interest in applying this major change to Drupal 6 if Pressflow 6 were offering equivalent functionality.

andypost’s picture

@David Strauss good news for pressflow but sad news for drupal 6

crea’s picture

@Gabor:
Unless Pressflow becomes official part of Drupal I don't understand how this problem being fixed in Pressflow can be excuse for not having this patch in Drupal. This is not some kind of performance improvement, this is a bugfix for race conditions.

domidc’s picture

subscribing

EvanDonovan’s picture

Just a note for anyone seeking this patch via Pressflow: it's not yet in the latest tarball on the Pressflow Launchpad page. Instead, you'll need to create a tarball yourself using Bazaar (bzr export pressflow.tar.gz lp:pressflow).

@Gabor: I also believe that this patch should be committed to 6.x, regardless of its availability in Pressflow, since it fixes a bug that occurs for many ordinary Drupal users.

rfay’s picture

Seconding @EvanDonovan: Many small sites see errors due to this repeatedly. It's quite ugly. It usually doesn't break anything, but it should definitely be fixed.

mrfelton’s picture

Status: Reviewed & tested by the community » Needs work

This patch does not add the lock to _menu_router_build which also sufferers from the problems that this patch aims to solve. (see. http://drupal.org/node/246653#comment-2507330).

q0rban’s picture

subscribing

David Strauss’s picture

Status: Needs work » Reviewed & tested by the community

@mrfelton _menu_router_build() is only called by menu_router_build(), which is only called by menu_rebuild(), which locks. I don't see a problem.

mrfelton’s picture

@David Strauss: The actual problem appears to belong in the admin_menu module, which calls menu_router_build directly, as pointed out in http://drupal.org/node/246653#comment-2507498. Sorry for the confusion.

bryancasler’s picture

I am also getting the following after applying patch #224. However it is showing up when I generate certain pages.

* user warning: Table '394720_beta3.semaphore' doesn't exist query: SELECT expire, value FROM semaphore WHERE name = 'menu_rebuild' in /mnt/stor2-wc1-dfw1/394720/www.ivawbeta.org/web/content/includes/lock.inc on line 154.
* user warning: Table '394720_beta3.semaphore' doesn't exist query: SELECT expire, value FROM semaphore WHERE name = 'menu_rebuild' in /mnt/stor2-wc1-dfw1/394720/www.ivawbeta.org/web/content/includes/lock.inc on line 154.

bryancasler’s picture

I just applied this patch (http://drupal.org/node/246653#comment-1482178) trying to resolve my problem in #275 and so far the error has gone away. I will follow up if the error shows up again.

gpk’s picture

@275: you need to run update.php after applying #224.

bryancasler’s picture

Thanks, I ran update.php and got...

--
The following queries were executed
system module
Update #6054

* CREATE TABLE {semaphore} ( `name` VARCHAR(255) NOT NULL DEFAULT '', `value` VARCHAR(255) NOT NULL DEFAULT '', `expire` DOUBLE NOT NULL, PRIMARY KEY (name), INDEX expire (expire) ) /*!40100 DEFAULT CHARACTER SET UTF8 */
--

Doesn't seem like any errors popped up. Thanks again!

kardave’s picture

Subscribing

locomo’s picture

subscribing

protoplasm’s picture

subscribing

bcobin’s picture

subscribing

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Committed the patch to Drupal 6. Thanks for all your hard work in fixing these bugs and introducing elegant solutions reusable elsewhere in Drupal.

q0rban’s picture

Great! Will this be in 6.16?

andypost’s picture

Version: 6.x-dev » 7.x-dev
Category: bug » task
Status: Fixed » Needs work

Let's back to 7.x and change upgrade path.

EDIT interlinking #131 [#563106]

andypost’s picture

Status: Needs work » Needs review
FileSize
636 bytes

Seems just need to track update numbering

pwolanin’s picture

There are a number of other places in D7 and D6 where locking should likely be applied - not sure if we need a meta issue to track them.

David Strauss’s picture

Good. Now I can roll back the Pressflow-specific locking API. However, I did have to reorder the lock initialization in the bootstrap process to support locking regeneration of the variable cache item, a common source of stampedes on large sites. Pressflow also does the menu rebuild transactionally, but I double D6 wants that change.

Witch’s picture

+1

Witch’s picture

Does the D6 Ppatch http://drupal.org/node/251792#comment-2125124 work with D6.15?

And does Drupal 6.16 already include the patch? Shal we wait until D6.16 is released?

Any persepctive would be nice, because i used the patch with D6.14 an it worked fine. After updating to D6.15 the errors appearing again.

iva2k’s picture

subscribe

AdrianB’s picture

Subscribing.

bryancasler’s picture

I also need to know if the D6 Patch http://drupal.org/node/251792#comment-2125124 will work with D6.15?

gpk’s picture

@293: it probably will work, or else just download 6.x-dev. Either way don't forget to run update.php.

bryancasler’s picture

I'm confused about your dev comment? Could you explain why it would be more likely to work with the dev than 6.15

dalin’s picture

this patch is already a part of 6.x-dev and will be a part of 6.16 whenever that is released. So you can either patch your existing 6.15 installation, switch to 6.x-dev, or wait till 6.16. Either way remember to run update.php.

Witch’s picture

I tried this patch with D6.15. It wont work.

gpk’s picture

@297: Not clear what you mean by "It wont work". If the patch won't apply (because some of the affected files have changed in the meantime) then just upgrade your site to 6.x-dev which already includes the patch.

Witch’s picture

I got an php error on my testsite. Maybe some files were incorrect patched.

I think it is not a good idea to use the dev version of drupal on a production site?! I just wait until D 6.16 got released and hope that the error wont appear often.

gpk’s picture

@299: did you run update.php? You will get errors until you do (and you will even get them when first going to update.php, per #229 above).

>not a good idea to use the dev version of drupal on a production site
As with any upgrade you need to test it first. However even an official Drupal release can introduce new bugs or conflicts with contrib modules, though these are usually fixed pretty quickly. For a mature Drupal version like 6.x there is a small risk that running dev will introduce a new bug, hence the need to test, but lots of people do run the dev version precisely to get certain bug fixes which were causing problems on their site.

Witch’s picture

Running update.php was not possible, because of this php error. I did not write the error down. :\

I just should take more time to analyse this.

JamesK’s picture

Patch on #224 works for 6.15...

kerberos’s picture

Subscribing...

srobert72’s picture

Subscribing

Vacilando’s picture

Subscribing.

tomceek’s picture

Subscribing

andypost’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs review » Fixed

My patch from #286 is not needed because there's no need

Dave Reid’s picture

Status: Fixed » Reviewed & tested by the community

We always need to make sure they stay in sync, so yes there is a need.

Dave Reid’s picture

Version: 6.x-dev » 7.x-dev
andypost’s picture

@Dave as I see - all empty updates are skipped...

freescholar’s picture

Version: 7.x-dev » 6.16
Issue tags: -drupal.org upgrade

Thank you so much for fixing this one!
I feel safe now ~~~

christefano’s picture

Version: 6.16 » 7.x-dev
Issue tags: +drupal.org upgrade
webchick’s picture

Help me out a little here; what exactly is RTBC for D7?

andypost’s picture

@webchick My patch from #286 do we leave empty hook_update_N with comments? Dave think we should - comments #307-#308

gpk’s picture

Do we now *require* (rather than just advise) people to update to the latest 6.x release (or 6.x-dev even) before upgrading to 7.x? If not then system_update_6054() needs to be included in full, no? The other post-6.0 updates are included: http://api.drupal.org/api/search/7/system_update_6.

andypost’s picture

@gpk For this purpose we have "updates-6.x-extra" section to mirror extra updates that happen after 6.0

mrfelton’s picture

I think this is related, and pretty bad. 35% of my slow queries caused by lookups to the semaphore table!

tom@db2:~$ ./bin/mysqlsla-2.03 --log-type=slow slow-log.1 
Report for slow logs: slow-log.1
28 queries total, 13 unique
Sorted by 't_sum'
Grand Totals: Time 202 s, Lock 0 s, Rows sent 244.52k, Rows Examined 246.98k


______________________________________________________________________ 001 ___
Count         : 10  (35.71%)
Time          : 106 s total, 10.6 s avg, 2 s to 15 s max  (52.48%)
  95% of Time : 91 s total, 10.111111 s avg, 2 s to 15 s max
Lock Time (s) : 0 total, 0 avg, 0 to 0 max  (0.00%)
  95% of Lock : 0 total, 0 avg, 0 to 0 max
Rows sent     : 0 avg, 0 to 0 max  (0.00%)
Rows examined : 0 avg, 0 to 0 max  (0.00%)
Database      : 
Users         : 
	concern@ 192.168.1.90 : 100.00% (10) of query, 92.86% (26) of all users

Query abstract:
SELECT expire, value FROM semaphore WHERE name = 'S';

Query sample:
SELECT expire, value FROM semaphore WHERE name = 'menu_rebuild';

EDIT: just to note that this is now the number 1 slow query on the site, averaging at 10 seconds to run it!!

gpk’s picture

Beats me how a simple SELECT expire, value FROM semaphore WHERE name = 'menu_rebuild'; can take 10s when there is probably only 1 entry in the table and the 'name' column has an index (http://api.drupal.org/api/function/system_update_6054/6) - it's the primary key!!

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Ok, committed #286 to HEAD. I think that makes this issue fixed.

Since this issue is already two pages, mrfelton, could you please open a separate issue to discuss your slow query findings, and cross-link it here? It'd be much easier to look into that way.

Gábor Hojtsy’s picture

arhak’s picture

I haven't upgraded any site to 6.16 because I have a bad feeling regarding this issue (which is not the only one I have seen for D6.16 so far)

and this comment #246653-122: Duplicate entry in menu_router: Different symptoms, same issue makes me wonder how accurate was to bet for a custom implementation of a locking framework when legacy DBMS have addressed built-in atomic transactions for so long

arhak’s picture

rolling back the table lock permission requirement would be easier than dealing with these:
* http://drupal.org/project/issues/drupal?text=Duplicated+entry&status=All...
* #732024: Updating from 6.1 user warning: Table 'semaphore' doesn't exist

moreover, it might be optional:
- either automatic test for table lock or a site-wide setting to declare "I have granted table lock permission over my database"
- database API would follow different approaches depending on table lock being availabe
- not having table lock would imply: deal with "Duplicated entry" errors and/or kindly wait until your site finishes an operation before triggering another heavy task
- and fix those spots where more than one menu_rebuild is triggered within the same request

dalin’s picture

I don't think I agree. Currently all of the reported issues with 6.16 appear to be one of:
- not running update.php.
- upgrading the codebase but forgetting includes/lock.inc.
- contrib modules invoking t() before Drupal is fully bootstrapped.
- contrib modules trying to do a full bootstrap during hook_boot().

arhak’s picture

it usually happens that I'm wrong about many thing
I hope this one to be one of them
but not quite sure
what would the following? an edge case?

function menu_rebuild() {
  if (!lock_acquire('menu_rebuild')) {
    // Wait for another request that is already doing this work.
    // We choose to block here since otherwise the router item may not 
    // be avaiable in menu_execute_active_handler() resulting in a 404.
    lock_wait('menu_rebuild');
    return FALSE;
  }

where is in core if (menu_rebuild()) { ... } ELSE { ... }?

what happens when the lock wasn't acquired at all?

  // Rebuild if we know it's needed, or if the menu masks are missing which
  // occurs rarely, likely due to a race condition of multiple rebuilds.
  if (variable_get('menu_rebuild_needed', FALSE) || !variable_get('menu_masks', array())) {
    menu_rebuild();
  }
  if ($router_item = menu_get_item($path)) {

how was the "menu_router gone" solved?
but leaving an inaccurate menu_router exists when it couldn't be rebuilt?
what would be the implications of that?

arhak’s picture

lets say the last http request failed to rebuild the menu_router, and from there no other request is made through http, but xmlrpc instead
would the menu_router be accurately rebuilt?

EDIT: typo

arhak’s picture

I just keep reading the code @#324 and ask myself
"race condition"? how come?
wasn't it addressed through a locking framework?

gpk’s picture

@326: This may be the same edge case I was thinking about in #251792-223: Implement a locking framework for long operations. I was planning to do a test to see if I could "break" it (i.e. get to a situation where {menu_router} is not 100% up-do-date but the menu_rebuild_needed flag is cleared) but have not found the time yet. I suggest you (or I) open a new issue to study this.

q0rban’s picture

I am very concerned about #317. @mrfelton, can you provide details about your setup?

mrfelton’s picture

q0rban: The day that I reported that entry from my slow query log, was the same day that I performed the upgrade to 6.15. It doesn't feature in the logs from the days since then, so I can only assume that this was something that happened during the update process. I just upgraded to 6.16 today, and there is an entry for the semaphore table in todays logs, but not as severe as what I reported last time. There is only one entry, and it took about 2 seconds:

Count         : 1  (12.50%)
Time          : 2 s total, 2 s avg, 2 s to 2 s max  (10.00%)
Lock Time (s) : 0 total, 0 avg, 0 to 0 max  (0.00%)
Rows sent     : 0 avg, 0 to 0 max  (0.00%)
Rows examined : 0 avg, 0 to 0 max  (0.00%)
Database      : 
Users         : 
	concern@ 192.168.1.90 : 100.00% (1) of query, 87.50% (7) of all users

Query abstract:
SELECT expire, value FROM semaphore WHERE name = 'S';

Query sample:
SELECT expire, value FROM semaphore WHERE name = 'menu_rebuild';

I'll will keep an eye on the logs over the next few days.

kbahey’s picture

@arhak

For what its worth: I upgraded a few very high traffic sites to Drupal 6.16, and there is no ill effect so far for this patch (been several days). Make sure you run update.php and that is it.

arhak’s picture

@#330 that's good news!

nevertheless, it deserves testing with crippled environments (slow servers), most of all, if they previously got duplicated entries when doing simultaneous heavy tasks, and most of all, if they got menu_router so broken than site was down
did you previously have any experience like that with your sites?
I don't think "very high traffic sites" would match "slow servers", right?

Damien Tournoud’s picture

@arhak: the locking framework implemented here seems very solid to me. It is designed to be both more flexible and more portable then a database lock. More flexible because depending on the use cases you might want to do something else then just waiting for the lock to become available (for menu_rebuild(), as you can see, we just wait for the process to be complete, and do not start a new rebuild if one was already in progress), and because we want locks that do not prevent the data from being read. More portable because, as you know, mandatory locks are not available are not available to all database engines and all environments.

I did some testing on the D6 implementation, and it seems that the this is working as intended. Here is the test strategy I set up:

  • disable page cache and hit the frontpage of your website with ab, with a reasonably high concurrency (10-20)
  • set the menu_rebuild_needed variable to TRUE

If you disabled the locking, you will enjoy a bunch of duplicate key errors (about 2500 in my test). If you enabled the locking, everything goes smoothly (absolutely no message appears in the watchdog).

If you have specific issues about this locking implementation, please open a separate issue, and let this one die.

arhak’s picture

first of all, I'm using this issue to avoid opening a new one having no good foundations for it
obviously this (300+) issue is near its death, so I'm not ruining anything
besides last minute comments are proper documentation so no body would ever ask "why isn't the locking framework in contrib"/"why such API change being code frozen" kind of question

I'm doing the last (lets say stupid) questions, and having a rationale by answer was all I was looking for

#332 (plus #327?) would be enough

you do intended a different locking approach than DBMS, and I should stop being knocked out by those approaches being different, since they are intentional (just would recommend a lot of tests simulating race conditions by making different requests to brake in the middle of their execution, like having a timeout)
also this locking strategy is uncommon in other systems, and proper documentation would be critical to avoid misuse from contrib mods

Damien Tournoud’s picture

@arhak: there are a thousand lines of comments per line of code in lock.inc, please open a new issue if you feel that a specific part of this documentation needs improvement.

The locking framework has been developed for Drupal 7, it has a full test case there (that you can help improve if you feel that some areas are not tested enough). It has then be backported to Drupal 6 in Pressflow (and deployed on theeconomist.com), before being included in mainstream core.

This type of advisory locking is not at all a new approach. I know that, for example Typo3, has the same kind of framework.

crea’s picture

@arhak

first of all, I'm using this issue to avoid opening a new one having no good foundations for it

Given that majority of people don't have problems with this patch and especially since it's already committed, you have very good foundation to open separate issue. Also there are lot of people subscribed to this issue and every time you post here this issue goes to the top of subscribed issues. Remember there's no way to unsubscribe currently and it annoys people.

rfay’s picture

I'm still seeing the menu_router duplicate issue in 6.16. I think we were supposed to have that taken care of by this commit (per #283). I may have lost track of which issue was supposed to fix it where, but it's not fixed for me in 6.16. If I'm in the wrong place, send me off to the right place, please:

Duplicate entry &#039;admin/content/node-type/simplenews/groups/%/remove&#039; for key 1 query: INSERT INTO menu_router (path, load_functions, to_arg_functions, access_callback, access_arguments, page_callback, page_arguments, fit, number_parts, tab_parent, tab_root, title, title_callback, title_arguments, type, block_callback, description, position, weight, file) VALUES (&#039;admin/content/node-type/simplenews/groups/%/remove&#039;, &#039;a:1:{i:5;N;}&#039;, &#039;&#039;, &#039;user_access&#039;, &#039;a:1:{i:0;s:24:\&quot;administer content types\&quot;;}&#039;, &#039;drupal_get_form&#039;, &#039;a:3:{i:0;s:23:\&quot;fieldgroup_remove_group\&quot;;i:1;s:10:\&quot;simplenews\&quot;;i:2;i:5;}&#039;, 125, 7, &#039;&#039;, &#039;admin/content/node-type/simplenews/groups/%/remove&#039;, &#039;Edit group&#039;, &#039;t&#039;, &#039;&#039;, 4, &#039;&#039;, &#039;&#039;, &#039;&#039;, 0, &#039;&#039;) in /home/randyfay/montclairdrive.org/includes/menu.inc on line 2457.
kevinquillen’s picture

I have still seen the issue with the menu not fully rebuilding in 6.16. The only thing I can think of is that we use the admin module. Any thoughts? Anyone else see this happen in 6.16?

EvanDonovan’s picture

@stripped your speech: Yes, admin module, at least in some versions, calls the private function inside menu_rebuild(), sidestepping the locking framework. Don't know of other potential causes at this time.

kevinquillen’s picture

Interesting. I will take a look. Know of a way around that for the time being?

yanyan729’s picture

Status: Fixed » Needs review

core-lock-patch-0.patch queued for re-testing.

q0rban’s picture

Status: Needs review » Fixed

@yanyan729, why?

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

kevinquillen’s picture

How can I adjust the beta1 of admin module to account for this? I can't get an answer there, and I'd like to fix this.

bcobin’s picture

This one bit me again today on a production site - and I'm not using the admin module. Happens rarely, but it does happen. Guess this isn't fixed for 6.16... :-(

gpk’s picture

@344, 343, 336: since a locking framework has been implemented per the title of this issue which is therefore rightly closed, and to avoid everyone getting deluged with emails for each follow-up comment, I'd suggest carrying on further discussion in a new topic or in #246653: Duplicate entry in menu_router: Different symptoms, same issue.

Would also be worth grepping your codebase to check if (probably a contrib module) something is invoking http://api.drupal.org/api/function/menu_router_build/6 or http://api.drupal.org/api/function/_menu_navigation_links_rebuild/6 directly, which could cause this problem. Although menu_router_build() doesn't look like a private function, it should now be treated as such.

@343: the problem is with admin_menu module invoking menu_router_build() directly, not admin module.

kenorb’s picture

kevinquillen’s picture

Is anyone still working on this issue? It happens a lot on a few sites with 6.16 still.

plach’s picture

I'm experiencing this too. Subscribing.

gpk’s picture

@347, 348: if you are using admin menu module 6.x-1.x series then I suggest heading over to #246653: Duplicate entry in menu_router: Different symptoms, same issue since the problem with admin menu (which causes the "Duplicate entry xxxxxxx for key 1 query: INSERT INTO menu_router ..." error message) 6.x-1.x has not been fixed.

This issue is closed.. please post any followups in a new issue where it can be given proper attention and to save on the emails :)

kevinquillen’s picture

?? I'm not using admin_menu module. Only the admin module.

gpk’s picture