Implement a locking framework for long operations
| Project: | Drupal |
| Version: | 6.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
| Issue tags: | drupal.org upgrade |
Drupal currently face several concurrency issues (see #249185 for variable caching, #238760 for menu rebuilding, ...). All relatively long operations (cache building, ...) are potentially affected by two different types of issues:
- database inconsistency: when the operation rebuilds a whole table, using the DELETE-then-INSERT pattern, requests served during the operation will see an inconsistent table (to demonstrate the issue, simply reload twice the module page in Drupal 6.x/7.x);
- race conditions: several operations can run at the same time, thus leading to unneeded duplication of effort.
The first issue can be solved by reducing the window of inconsistency (that's what is suggested in #238760 for session_write(), or in #230029 for node_save()), or by using transactions (that's one of the point of the new database layer).
The second issue cannot be solved by simply using transactions. You need some sort of locking (or semaphores), as discussed in #248739 and #238760. Without locking, these long operations can lead to performance hits (because several instances of the operation are run in parallel), and to space hits (because the result of the operation is saved several times in the database, thus in its binary log). About this last point, one of the busiest Drupal site in France (France 24) was forced to disabled locale caching because its invalidation lead to several hundred megabytes of binary log data.
The attached patch implements a pluggable soft locking framework for Drupal. Two points worth noting:
- Two implementations are available: an APC-based one, which is fast but implies that there is an unique web node; a slower database-based one, that should work across a cluster of nodes. High-performances implementations for a cluster could be made, for example based on Memcached or on a multicast bus.
- All long standing operations should be locked: the patch only demonstrates how to do this for
menu_rebuild()andlocale()cache refresh. Several locking strategies can be considered: acquire the lock and give up the operation if it fails (that's the implemented strategy for locale cache refresh), acquire the lock and wait for the other operation to complete (menu_rebuild()), or acquire the lock and retry acquiring until success.
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| core-lock-patch-0.patch | 11.02 KB | Idle | Unable to apply patch core-lock-patch-0.patch | View details | Re-test |

#1
As a side note: the APC-based implementation requires at least APC 3.0.17, because of a bug of
apc_add()(bug #12499).#2
We can't have a dependency in core on APC. Not all PHP installations have it, so we can't rely on that.
So, this leaves the db implementation as the candidate for core.
This uses the variable table as a lock. I don't like this idea. Creating a new table for that is better than overloading the variable table doubling as a lock mechanism. And if you make this table cache_lock, which inherits the schema from "cache", it would automatically be eligible for use in APC, memcached, and XCache if you have the cache_router module and assign a bin for cache_lock.
#3
Setting this to CNW.
#4
Patch still applied.
#5
subscribing
#6
BTW broken link to "the new database layer" which is actually linking this issue #251792
#7
@kbahey i think that various backends can be programmed including one that (as I understand) is already in there that don't depend on APC.
#8
Nice proposal!
Some remarks though: What happens if menu_rebuild() never finishes, for instance because Apache segfaults? When that happens the whole site will be down until the admin manually removes the lock. Could be solved by specifying a timeout with lock_acquire()
Is a global $locks variable necessary? Maybe use a class for this? Not sure what the policy on globals is in Drupal core.
What is $lock_id for?
Using cache_router or somesuch for this sounds like a great plan, using the DB for locking on a (very) busy site won't be fun :D
#9
#251792: Implement a locking framework for long operations was a duplicate. As identified in that bug, those cache rebuilding operation can simply kill any moderate traffic Drupal 6 site. This is probably one of the biggest scalability issue we have now.
#10
While I agree in principal there is one thing I don't like:
--- includes/menu.inc 23 Apr 2008 20:01:47 -0000 1.269+++ includes/menu.inc 27 Apr 2008 10:50:51 -0000
@@ -1636,6 +1636,11 @@ function menu_cache_clear_all() {
* is different and leaves stale data in the menu tables.
*/
function menu_rebuild() {
+ if (!lock_acquire('menu_rebuild')) {
+ // Wait for the other request to do our work
+ lock_wait('menu_rebuild');
+ return;
+ }
I don't think we should wait on a lock for menu_rebuild() just skip the rebuild since someone else is already handling it. The other thing is that I think doing a menu_cache_clear_all() before we even run the db operations is dumb. I think you should lock, rebuild from db, clear the cache and set it at the same time so at least other clients will get the stale menu and not try to rebuild every time.
#11
@slantview: the idea was that during the menu rebuild the whole menu table is in an inconsistent state: something calling menu_rebuild() is also likely to query the menu table right after that (especially in menu administration pages). Plus, a menu_rebuild() should be a fairly quick operation (supposing that we don't run hundreds of them in parallel... of course).
I like your suggestion to clear the cache *after* the operation, and I think it's quite complementary: if a request has called menu_rebuild() explicitly, it should wait for the other rebuild to complete, but if a request has not called menu_rebuild(), it should use the old cached data.
#12
As much as it might look like an APi change, I think we really need to backport some version of the to 6.x to deal with several painful race conditions.
#13
This looks like a poor choice:
$lock_id = md5(uniqid());Also, it seems a little wonky to have a bunch of extra global variables here - is there a way to avoid that?
#14
Very nice work. subscribe.
#15
I worked on the patch a little and implemented for www.divx.com. I also am including a D7.x-dev patch and a patch for D6.8 since that is what we are running on our site in case we want to back port this.
I didn't like having an APC specific technology lock.inc, so I renamed lock-db.inc to lock.inc, and changed the get'ers and set'ers to cache_get and cache_set. Now we have a replaceable lock_inc variable so we can replace the locks with whatever we want (including nothing) if we are so inclined. The cache_get and cache_set allow us to use any backend technology that we want for storage for the locking system.
I added a new table into system.install $schema called "cache_lock" that allows us to use database storage by default, but easily replace it with something else.
We use my module "Cache Router" for storage on divx.com and simply replace the cache_lock bin with a memcache backed bin and we added an additional bin to the memcached servers.
All in all, I really believe that this is helping our site. We can now clear cache, save menus and have them rebuilt, etc even at peak traffic periods and there is little to no impact on performance. Previously we would have the site crawling if we did any of that stuff.
There is something else I should note too so this patch doesn't get all the praise for fixing our specific problems. We also had previously converted all the tables over to INNODB, and we converted the menu_router table back to MyISAM as well as all of the search tables. Everything else is running INNODB.
Best,
Steve Rude
#16
Aww crap, forgot the -N flag on diff.
Here is a re-roll including the lock.inc. Kinda important. :)
#17
We need a fix for D6, and I'm not convinced it will be the same for D7.
Your usage of cache_get/cache_set simply will work; but doesn't offer enough guarantee:
<?php+ // try to acquire the lock
+ $lock = cache_get($name, 'cache_lock');
+ if ($lock === FALSE) {
+ $locks[$name] = TRUE;
+ cache_set($name, $locks[$name], 'cache_lock', CACHE_TEMPORARY);
+ return TRUE;
+ }
+ else {
+ return FALSE;
+ }
?>
^ This is asynchronous. Under a load of several hundred requests per second split on several web nodes, who knows how many process can acquire this lock?
Compare to my DB implementation in the original patch:
<?php+ // try to acquire the lock
+ if (@db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", 'lock:' . $name, $lock_id)) {
+ $locks[$name] = TRUE;
+ return TRUE;
+ }
+ else {
+ return FALSE;
+ }
?>
^ This is perfectly atomic.
I fully agree that we need to link cache and lock (it simply makes sense). But, we need to add new primitives to cache.inc for this, that alternative cache implementation may not implement: this would be an API change.
So, would other agree to go for this imperfect cache_get()/cache_set() solution for D6 and implement something better in D7?
#18
@pwolanin: as you have seen, this is a very old patch, before I got fully involved in core development. There are also lots of code style errors in there.
#19
Damien says this will most probably be needed for the d.o upgrade so we can help performance and avoid some issues.
#20
After a discussion with Gábor, we agreed that for D6 having a (pluggable) implementation defaulting to an (atomic) db lock would be better. I makes little sense to replace a race condition by another.
If you want to use memcache as your cache implementation, simply implement the (atomic) memcache lock implementation that match your needs.
#21
The implementations for D6 and D7 will differ much. I'm dedicating this issue to D6, and I opened a new issue with a proposal for D7: #358663: Implement a locking framework and make caching pluggable.
#22
By the way, this one is a bug.
#23
Subscribing.
#24
We can solve #238760: menu_router table truncated and site goes down without using locks or API changes. Instead of flushing the menu cache in request A, and rebuilding it in request B, we can rebuild the menu cache in request A and replace the old menu -- I think.
#25
@Dries - I don't think you are correct - we are deleting and rebuilding the menu in the same request in most cases. The problem is a race condition. Since we are using a variable whose value is cached for the duration of the request.
We also have a period of time where one process may be trying to build a new menu, and another process sets the flag that a rebuild is needed since the router is missing.
We could read/update the variables table directly to help eliminate the cached variable race condition, but I'm not sure we can solve it cleanly wihout APi changes.
I have been seeing this error even on a no-load localhost, so I think it's pretty critical to fix.
#26
per: http://drupal.org/node/369548#comment-1245494
I think we could also help the menu rebuild issue some by not trying to cache the whole router as a serialized blob.
#27
Here's a start which attempts to do away with caching the big router blob. That alone should help reduce the race conditions by speeding the process.
#28
whoops - that patch missed menu.inc
#29
And a full db lock implementation - building off Damien's original, but simplifying it as much as possible.
#30
ok, so to check the effect, I enabled devel module (plus minor fix to devel_menu_link_alter())
I allowed anonymous users to access this devel path which forces a menu rebuild:
ab -n100 -c5 http://127.0.0.1/drupal-6/devel/menu/reset
With the patch, I got just three 404 errors in the log, I assume from when a request came in and the router table was empty.
Without the patch, there are 87 pages of SQL errors in the dblog, from running the same ab. The errors have the form:
Duplicate entry 'taxonomy/autocomplete' for key 1 query: INSERT INTO menu_router (path, load_functions, to_arg_functions, access_callback, access_arguments, page_callback, page_arguments, fit, number_parts, tab_parent, tab_root, title, title_callback, title_arguments, type, block_callback, description, position, weight, file) VALUES ('taxonomy/autocomplete', '', '', 'user_access', 'a:1:{i:0;s:14:\"access content\";}', 'taxonomy_autocomplete', 'a:0:{}', 3, 2, '', 'taxonomy/autocomplete', 'Autocomplete taxonomy', 't', '', 4, '', '', '', 0, 'modules/taxonomy/taxonomy.pages.inc') in /www/drupal-6/includes/menu.inc on line 2388#31
whoops - patch missed bootstrap.inc. Also fixing code comments, typos, etc.
Also, the removal of the router blob caching can certainly be in a separate patch, but I think reading/writing that much data is contributing to the race condition in menu rebuilds, and hence related to the central issue of concern here.
#32
Looks great. I noticed a typo in a function name: local_break($name, $now);
#33
With that typo fixed.
#34
added more doxygen.
#35
Two nitpicks:
<?php+/**
+ * Helper function. Get this request's unique id.
+ */
?>
and
<?php+/**
+ * Helper function to store the menu router if we have it in memory..
+ */
?>
The lock part looks very good. I'm a little bit worried about changes to the menu rebuild mechanism, especially because some API changes are involved:
<?php- drupal_alter('menu_link', $item, $menu);
+ drupal_alter('menu_link', $item);
?>
#36
Yes - it's a minor API change, but I honestly have no idea why we are passing $menu into that hook.
The few contrib modules that implement it with both params do not use the $menu:
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/i18n/i18nme...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/minutes/min...
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/devel/devel...
So, we could avoid breaking them outright (see attached patch), but it's still a minor API change since $menu may now be NULL.
#37
neclimdul points out to me: it's a stretch but you could want to lock, release and then lock again. the second lock would fail.
So, I think using a global is the only sensible thing here (back to something closer to Damien's original).
#38
I'm really glad to see this happening. I've needed lock capability in Drupal in the past.
Comments upon first reading the patch; note that I have not read any of the history in this issue, as no one else will have when they read the code after it is committed.
Locking semantics are always subtle and need to be documented. lock.inc should include a @defgroup with an overview and theory of operation. Among other things, state clearly that this is a cooperative, advisory lock system.
The answer to all my comments below may be "the patch is correct." If so, my comment becomes, "please add a comment explaining why."
<?php+function lock_acquire($name, $timeout = 10.0) {
+ global $locks;
+
+ if (isset($locks[$name])) {
+ return (bool)db_result(db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id()));
?>
I think this says: If I previously acquired a lock, check the db anyway because someone else might have broken it. In that case, why wouldn't you also clear $locks[$name]?
<?php+ return (bool)@db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());
?>
Are we guaranteed that db_query("UPDATE") returns rows modified?
In the unlikely case that $expire equals the previously set time, this will return false even though the lock is held for the requested amount of time.
<?php+ if (!lock_acquire('menu_rebuild')) {
+ // Wait for another request to do our work
+ lock_wait('menu_rebuild');
+ return;
+ }
+
?>
This looks odd. We just waited up to 20 seconds for the lock to clear. It might be available, but we aren't going to try to grab it and do the work? Doesn't our caller deserve to know that we failed? I see something in the phpdoc for menu_rebuild() that it may leave the operation for the next page request, but it says that happens only during update.php or install.php.
#39
First pass at addressing Barry's comments.
#40
I reviewed this patch and it looks mostly good to me. I would be willing to test this out on divx.com before it goes into core if we can get it to patch (reviewed and tested by community) state.
The only thing I saw was that 1 second seems like a very long amount of time to wait. I was thinking 100ms might be a little more realistic. Typically a 200ms page load time is what I consider "good" for most pages, however a 2-4 second page load time would be what I consider "unacceptable". I know these are preferences, but I would like to hear others input on this. Is it too much to ask for a usleep and microseconds instead of sleep(1). One second is quite a long time in web page render time. I know that this is a sensitive topic since too many checks can end up with many, many checks and be heavy on a processor, but 1 second could cause many, many 404s while this is rebuilding and that is almost worse in my mind. If you are doing 400 requests per second or more, this could end up with hundreds or thousands of 404s while rebuilding. This can be especially damaging if the request is done by a web page crawler (e.g. google) and now you have a valid page that is showing as 404 in a search engine.
The other thing is that I don't believe we should accept this patch without a test set that includes verifying that locking actually works for acquire, release, wait and timeout. Also that the lock cannot be taken by two systems simultaneously (obviously).
Thoughts?
#41
@slantview - note that this sleep() whoud not affect most page views - only when there is a blocking lock.
usleep() only works on Widows as of PHP 5.0.0, so we could considering using it for D7, but not D6 (which is the target of this patch).
http://www.php.net/manual/en/function.usleep.php
#42
Subscribe
#43
@pwolanin - I do understand it would not affect most page views, but if that blocking lock happens regularly (e.g. menu rebuilding) than it could affect a lot more page views than you would want to have to sleep for 1 second while we wait for the rebuild. I don't think it's a show stopper, but could be filed under "nice to have".
#44
@slantview - the alternative (current broekn situation) is that all those page views ALSO try to execute a menu rebuild which is slow and costly and probably takes more than the 1 sec.
#45
@pwolanin - I understand the current situation, that is the one that broke our site for launch. I don't want to be a PITA, so I'll drop the issue, it may be something we want to test and adjust accordingly. I'll try to put in some time testing it and see if it really makes a difference.
As far as the windows issue, we could add something like if (function_exists('usleep')) { if we really thought there would be a gain on systems that supported it.
I'll do some testing and see if it actually matters.
#46
Subscribing.
#47
subscribing
#48
subscribing
#49
We should separate out the code of eliminate the caching of the menu router - that should be faster bug fix to get in and will be helpful regardless for performance.
#50
<?php+function lock_renew($name, $timeout = 20.0) {
+ if (lock_acquire($name)) {
+ list($usec, $sec) = explode(' ', microtime());
+ $expire = (float)$usec + (float)$sec + $timeout;
+ db_query("UPDATE {semaphore} SET expire = %f WHERE name = '%s' AND value = '%s'", $expire, $name, _lock_id());
+ return (bool)db_affected_rows();
+ }
+ return FALSE;
+}
?>
The lock_acquire() doesn't seem to be necessary here, because the UPDATE query is atomic.
<?php+function lock_wait($name, $delay = 30) {
[...]
+ lock_break($name, $now);
[...]
+}
+
+function lock_break($name, $now) {
+ db_query("DELETE FROM {semaphore} WHERE name = '%s' AND expire < %f", $name, $now);
+}
?>
This is probably prone to race conditions. Why not simply get the value of the lock in lock_wait() and do a lock_break($name, $value)?
#51
subscribe
#52
#53
subscribe
#54
subscribe
#55
subscribe, I'll try to review.
#56
needs a re-roll to remove the menu router stuff, plus to take Damien's comments into consideration.
#57
subscribing
#58
Here's a straight re-roll removing the menu router elements (which are at http://drupal.org/node/317775#comment-1438040) but not yet taking Damien's comments into account.
#59
subscribing
#60
subscribe
#61
This takes account of Damien's comments, but we should probably add the locking for the locale rebuild too.
#62
after more discussion with Damien about making the API very clear.
#63
adding
return (bool)db_affected_rows();to function lock_may_be_available()#64
I stress-tested that patch, and it appears to be rock solid.
I added back the locale cache rebuild locking that I originally implemented in the original patch, and tested that. Here is an extract of the test log... you see that nearly 80 requests failed to acquire a lock, and that only one of them did the actual rebuilding.
Seems ok to go in for me... I did the last round of review with Peter on the IRC, and the API appears very sane (we will build on that for D7 for sure).
#65
And with the lock.inc file ;)
#66
subscribing
#67
straight port to D7 - lock.inc not DBTNGified yet.
#68
I have been trying to figure out why my install is slow.
I found the following issues:
Caching the entire menu, http://drupal.org/node/317775#comment-1492104
Implement a locking framework for long operations: http://drupal.org/node/251792#comment-1503770
But the system.install system_update_6050 function gets redefined in each
I mention this because they both might get committed which might cause problems.
Are they incompatible approaches?
Chris
#69
@activelyOUT - don't worry, whichever is committed first (probably the router patch) we'll re-roll the other to bump the update number.
#70
On the assumption that this will get into D6 (hence it there will be a 6xxx update function to add the table) this patch omits an update function for D7, so you'll need a fresh install to test.
We might need to add locking to the modules_rebuild too - on D7 I can get a PDO error by reloading quickly the admin/build/modules page even with this patch:
SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'themes/garland/garland.info' for key 1 in system_theme_data() (line 1167 of /Users/Shared/www/drupal-7/modules/system/system.module).#71
Updated D6 patch a little to match some tweaks in the D7 patch.
#72
I tested the D7 patch and code reviewed. My one pause regarded the different approaches that locale and menu_rebuild take toward acquiring locks. More investigation showed that locale is implementing a 'i'll take a lock if available, otherwise proceed' approach. menu is taking a 'i need a lock and will wait for it' approach. so this patch nicely illustrates the flexibility of the API. RTBC.
#73
+ if (!db_query("SELECT 1 FROM {semaphore} WHERE name = '%s' AND value = '%s'", $name, _lock_id())->fetchField()) {That query is only 50% converted to DBTNG :)
#74
This patch fixes that query to use proper DBTNG placeholders.
#75
Dries requested infos on how the lack of a locking framework effects drupal.org.
The answer is: Currently not in a visible way. Last year we had a serious bug that triggered a race condition and took the site down several times. The actual problem was in memcached, but I think the problem had been mich easier to debug had the original user locked the cache insert.
I think that it is important that we can protect Drupal installs against race conditions.
#76
Well, the lock patch only kicks in when you enable a module or do some locale operation. enable/disable modules is much more frequent on a new site than a mature one like drupal.org. further, we know what modules to use whereas most sites try lots of modules out. thirdly, when we do try stuff out we do so on dev boxes so drupal.org would not feel those delays. drupal.org is one of the worst sites for judging the benefits of this patch.
#77
There are other things that cause the locks to kick in such as using the book module and updating a node, and any site that updates one of the menus at any point. This could be rare or frequent depending on the site. Unfortunately, you won't see it on d.o very much. We might feel the pain when we finally cutover to the new design however.
#78
@slantview - I expect there are indeed other places we will want to apply this sort of locking, but we at least need this code in place to move forward.
#79
Nice work.
+ $lock_id = uniqid(mt_rand());$_SERVER['REMOTE_ADDR'] . $_SERVER['REMOTE_PORT'] . $_SERVER['SERVER_ADDR']?+function lock_may_be_available($name) {+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+ if (!$lock) {
+ return TRUE;
+ }
+ list($usec, $sec) = explode(' ', microtime());
+ $now = (float)$usec + (float)$sec;
+ if ($now > $lock['expire']) {
+ return (bool)db_delete('semaphore')
+ ->condition('name', $name)
+ ->condition('value', $lock['value'])
+ ->execute();
+ }
+ return FALSE;
+}
->condition('value', $lock['value'])will fix that, I think.+ 'indexes' => array('expire' => array('expire')),AFAICT the few lines in lock_init() could just as well be done inside the if block in _lock_id(). This would prevent lock_release_all() from being called in every non-cached request.
+function lock_wait($name, $delay = 30) {+
+ while ($delay--) {
I am not sure I understand whether this is intended to go into D7 also, but here are some D7-only comments:
+ list($usec, $sec) = explode(' ', microtime());+ $expire = (float)$usec + (float)$sec + $timeout;
$expire = microtime(TRUE) + $timeout.+ static $lock_id;#80
#81
@c960657 - from those links seems we should just set more_entropy to TRUE. Since this is only called once per page load, I don't think we need to worry too much about speed.
re: the possible race for db_delete, I assume you meant
->condition('expire', $lock['expire'])rather than->condition('value', $lock['value'])?For D7, we likely should uses the new static cache API.
#82
I did :-)
#83
@c960657: indeed we need a condition on both the expire and the lock value.
#84
I think this addresses the concerns in #79.
#85
oops - somehow the diff missed the schema.
#86
If we can move code out of lock_init(), thats desirable. That way lock.inc will only get included and parsed when it is needed. or maybe thats gonna happen anyway.
Code looks good to me, and tests pass.
#87
Assume there is a lock that is very “popular”, i.e. it is often aquired as soon as it has been released. If the lock is held when lock_wait() is called, the function keeps waiting until no thread has the lock, i.e. other threads may acquire and release the lock several times before the function returns. Wouldn't it be more useful if it returned already when the lock_id that held the lock, when the function was entered, has been released, even if a new lock has been acquired by another thread (in other words, select 'value' at the top of the function and return as soon as that value has disappeared from the database)? This would be useful in events where you want to make sure that a given function has been called at least once by some thread before returning, e.g. use cases like menu_rebuild().
+ 'value' => array(+ 'description' => 'A value for the semaphore.',
function menu_rebuild() {- variable_del('menu_rebuild_needed');
+ if (!lock_acquire('menu_rebuild')) {
+ // Wait for another request that is already doing this work.
+ lock_wait('menu_rebuild');
+ return FALSE;
+ }
+ if (!lock_acquire('menu_rebuild')) {+ // Wait for another request that is already doing this work.
+ lock_wait('menu_rebuild');
+ if (!lock_acquire('menu_rebuild')) {
+ lock_wait('menu_rebuild');
+ return FALSE;
+ }
+ }
+function lock_acquire($name, $timeout = 20.0) {+ global $locks;
+
+ if (isset($locks[$name])) {
+ // We acquired a lock before, so check whether our lock is intact.
+ if (!db_query("SELECT 1 FROM {semaphore} WHERE name = :name AND value = :value", array(':name' => $name, ':value' => _lock_id()))->fetchField()) {
- _menu_clear_page_cache();+ if (lock_renew('menu_rebuild')) {
+ _menu_navigation_links_rebuild($menu);
+ ...
+ }
#88
I would be interested in (and would be glad to test) the D6 port of this feature, as stated in Duplicate entry in menu_router: ... - comment #49.
PS: Subscribe.
#89
subscribe
#90
I also would like to test the D6 port of this feature, as per #246653: Duplicate entry in menu_router: Different symptoms, same issue & #238760: menu_router table truncated and site goes down. Which patch on here is good to test for D6?
I'm currently using the patch in #246653-47: Duplicate entry in menu_router: Different symptoms, same issue, so that panels and views can work properly for me, but it is a non-optimal solution.
#92
The last submitted patch failed testing.
#93
subscribe
#94
To anyone contemplating #91, this script introduces huge security holes [and is thus unpublished now chx]: for instance, you don't want a robot to discover the page and begin hitting up the database with a bunch of menu rebuilds. The easiest way to accomplish this is to revisit the Modules page (at admin/build/modules). Or you could make a quick module and do something like the following, even adding a cron job to rebuild the menu periodically (making sure not to activate it until 3AM in the morning or other non-peak time). Then you can just disable this module when and if a patch fixing the original problem goes through.
I personally don't recommend using either this approach or the approach in #91. If you are unable to visit administrative pages in the first place, just log in as user 1. And then, again, just visit /admin/build/modules, which will rebuild your menu system.
<?php
function lock_bandaid_menu() {
$items['admin/build/lock-bandaid'] = array(
'page callback' => 'lock_bandaid_page',
'access callback' => 'user_access',
'access arguments' => array('administer menu'),
'title' => 'Menu Bandaid',
'description' => 'Rebuild the menu system.'
);
return $items;
}
function lock_bandaid_page() {
menu_rebuild();
return t('Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')));
}
function lock_bandaid_cron() {
if (variable_get('lock_bandaid_cron_timestamp', 0) <= time()) {
menu_rebuild();
watchdog('lock_bandaid', 'Menu configuration tables have been rebuilt. See status of !patch.', array('!patch' => l(t('the db/menu locking patch'), 'http://drupal.org/node/251792')), WATCHDOG_NOTICE, l(t('Menu Bandaid'), 'admin/build/lock-bandaid'));
variable_set('lock_bandaid_cron_timestamp', time() + 3600 * 24);
}
}
?>
I haven't actually tested this, but just offer it as a more secure alternative to the former.
#95
@aaron: I get what you mean. But if the {menu_rebuild} table is completely trashed, you won't be able to visit admin/build/modules, nor will you be able to visit the login page to log in as uid 1.
I think the best solution to the concern you raise is to delete the menu_rebuild.php from your server & re-upload it if needed.
Using .htpassword to protect the file would be a more advanced solution.
#96
I agree with #95, since it's an infallible and quick solution
always taking into account that the script shouldn't remain in the live site much longer that needed (and it will be only needed once on production sites, maybe on development site it might be more actively required)
I also used to use another solution, which consisted on having an SQL query at hand to fix the menu_router entry for admin/build/modules and then be able to navigate to that page which would rebuild the menu_router (analogous for devel's functions), but in long time terms I used to forgot to carry on with that query, but always remember how to recreate a quick php script invoking the menu rebuild.
#97
I have the same issue here with 6.12. I've tried to implement the most recent patch for D6 from #71 but I always get an error when applying the patch. If I disable Organic Groups the problem goes away since OG rebuilds the menu_router table very aggressively. Is there or will there be an updated D6 patch to fix this?
#98
#99
subscribe
#100
IMO the whole "renew" thing is unnatural, it's odd code.
OTOH why attempting to wait?
what wait for?
waiting for 20 secs when timeout probably is set to 30 secs?
IMHO it should be attempt to lock the semaphore down
if got it do the job
else don't do it (maybe write it down to make another attempt in subsequent requests)
I have implemented a locking module for Drupal 6.x, right now I'm testing it and tuning it, but it works by now, everything seems to be fine having 10-50 pages making AJAX requests every 10 seconds to the same semaphore for a (real use case) job
that seems to be a lot of requests
nevertheless, I don't think this would be the ideal solution to the problem. It is just a solution. And it might be it for a while.
there are some known issues to think about and discuss, but they won't show up in a long while (for example, integer auto-increments stands for 4 thousands millions of rows, which will be a long while until getting there)
would you like to review this module of mine?
should I upload it to this thread?
I think it might (at least) light you up (if not truly useful)
#101
@arhak:
So if I'm understanding you correctly, you're creating a 6.x module to deal with the locking on the menu_rebuild? Or is it just a generalized locking framework, that could possibly be applied to this specific issue?
It might be best if you posted it as a project on Drupal.org (if you have CVS access). That way, we could have a more extended discussion about it over there, rather than lengthening this thread even further. Also, having it as an actual project, rather than an attachment in a long issue, means that people like me who would appreciate a locking solution for Drupal right now would be better able to benefit from it. (Since patches could be applied to it, new versions released, etc.)
Anyway, it's great that you're working on this!
#102
I'm not sure this has any bearing on the surrounding issue, but I thought it would be better to post it than not.
What I have found is that, after a number of hours testing and general messing about with various modules that I have running on my development site, this problem goes away and the speed of my site returns to normal if I disable the "OG Content Type Admin" module (6.x-1.0.) - http://drupal.org/project/og_content_type_admin.
I have a bunch of modules running (Views, OG, Panels, CCK, etc.) and don't seem to have any issues unless I enable this module. Anyone else run this module and see the issue disapper if disabled?
#103
+1 to #10 (slantview)
#104
#103, #10:
And return false if the menu rebuilding was skipped. The info might be useful to whatever module triggered the menu rebuilding.
#105
I'm not sure if it has been discussed elsewhere, but there's another approach to minimize these lengthy operations.
Many places in core and contrib modules trigger a full menu rebuild just to update a few paths, or even just update, add or remove one single path. It looks like the menu system could be extended in some way, so that it is possible to just rebuild a part of the whole thing.
For example, I'm CCK and I need to rebuild the menu when the label of a field has been changed, just because the label is used for subtabs in the admin/content side. Well, it would be nice if there was a method to just rebuild admin/content/*, or even just the link it being updated.
And so on... I'm sure many places in core (as well as in contrib) could take advantage of *partial* menu rebuilds.
#106
@markus_petrux,
please check Rewrite the function menu_rebuild. The topic starter is a bit too simplified, better read the replies.
Things to consider:
- It has been said that menu_links can grow very big, and should not be loaded in memory all at once.
- I don't know how big menu_router can become, but the current menu_rebuild algorithm does use PHP arrays in the size of the menu_router table. So I guess it is not a problem.
There is a bit of a controversy how this fits in the roadmap for the menu system. chx has argued that other issues should be fixed first, to clean up the db structure:
- #484234: Big task cleanup: remove type and tasks from hook_menu
- #191360: UB Usability: menu parent choosers
- Look into your DB, table "menu_router", column "tab_root", which associates different menu_router rows with each other. The idea (as I understand) is to remove this column from menu_router, and store the association somewhere else (in menu_links, or in a separate table).
- On the long run, the menu system should get a CRUD interface. I don't know how long this will take, and I don't know the implications on existing modules.
My opinion:
I think an optimization of menu_rebuild is possible without a change in the DB structure. Later, when the above issues are fixed, we can still do whatever is in the roadmap.
#107
A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure.
#108
@c960657 re: #87. Since a semaphore need not be a lock, we choose the generic name 'value'
re: "It seems a bit odd trying to acquire a semaphore you already hold (or that you previously held but did not release). "
The point of this is that some other thread may have broken/stolen the lock, so you need to check.
#109
re: #87, putting the lock_renew code into lock_acquire makes sense to me - makes for even simpler code.
lock_wait() does have the problem described - that a process my wait a long time for a popular lock. So, maybe that needs to be documented? Adding new code comments and doxygen.
Inserting the navigation links is a somewhat separate operation, but yes, I see how an extreme situation where doing the menu rebuild took a long time and someone did something dumb like put that call in index.php you might fail to get the links rebuilt. In this re-roll I just extended the lock time a little and don't try to renew in menu_rebuild().
NOTE - I still think this is critical to backport to D6, so there is not an update function. You must do a fresh D7 install to create the table.
#110
Damien pointed out a minor typo, and also suggested that we should take an optimistic approach to acquiring the lock - that way we avoid an extra select query in most cases (since we expect the lock generally to be available).
#111
Logic is not quite right in the catch {} block. This should fix it.
#112
The word "lock" is used all over in the patch. The column is used to hold the request id returned by _lock_id(), so I think the column name should reflect that.
#114
@c960657 -yes, lock is all over the patch, but the 'semaphore' table is intended to be generic and NOT limited to use by this lock API.
#115
Ah, okay. If the table is intended to hold more than just semaphores but various variables in general (as suggested by the table description), I suggest using another name, perhaps “variable_volatile”. In computer science, the word “semaphore” has a very specific meaning that doesn't match the current description IMO.
+ $schema['semaphore'] = array(+ 'description' => 'Table for storing locks and other flags that should not be cached by the variable system.',
#116
Well, this is not necessarily volatile as much as it's non-cached. While I realize this is a bit of an abuse of the term semaphore, 'flag' is already used for other things. We could call it 'mutex' but that's still more specific than it's intended scope.
How about 'message' instead? I don't really care much.
#117
-1 to {semaphore}. Discussed on irc for {lock}. It doesn't conflict with any current module namespaces, isn't a SQL reserved word, and since I don't see this being generalized to warrant a general table name.
#118
webchick says 'semaphore' is fine - so here's a re-roll just trying to clarify the table description. Only that one line changed.
#119
re: #117 - eaton later pointed out that 'lock' is an Oracle reserved word.
#120
Also, although lock is not a SQL-99 reserved keyword, it is in certain DBMSes.
#121
D6 version of this patch?
#122
Subscribing.
#123
@pwolanin at #107: "A partial rebuild doesn't make any sense in the Drupal context, because we have no way of knowing (without doing the full rebuild) which if any modules add a new new path to any place in the router structure."
I know, but I was just trying to see this from a certain distance. Aside from this locking framework, which is a nice addition by itself, I think that maybe part of the menu system in Drupal could be re-worked to allow partial rebuilds. For example, when I launch a menu rebuild because I have added a link to admin/build/foo/bar, only that part of the menu could be updated. To allow other modules change that behavior, a new hook could be added to allow other say so. And well, this is just an idea. But I think it would worth thinking a bit more about it because even with a locking framework, the whole menu rebuild may take a lot of time, while other requests wait, and requests waiting may mean more Apache threads, more database connections, poor UX unless we can offer them cookies and tee or coffee, etc. :)
#124
@markus - we save new links as a convenience - they are fundamentally different then the main work in the menu rebuild, which is to rebuild the router. LINKS != ROUTER ITEMS
#125
Sure, I know the difference, but the concern about the increased number of threads waiting and consuming Apache/PHP/memory resources still stands.
The locking framework helps to prevent race conditions that cause duplicate errors in DB, but this does not affect on time time required to rebuild the whole menu router, and the stuff related to links.
Anyway, my aim was just to mention this. Even when locks are used, rebuilding the whole thing is overkill when only a portion of the menu will really change, so there may be a way to optimize this, I think.
#126
markus_petrux:
In #512962-25: Rewrite the function menu_rebuild (EDIT: esp. menu_router_build) chx suggests that after the way that menu local tasks are handled has been rewritten (see #484234: Big task cleanup: remove type and tasks from hook_menu), there might be the possibility of using CRUD operations for handling menu router items rather than the draconian system currently in place. I like this suggestion, too, but unfortunately lack the technical skills to work on it.
#127
Let's get this in!
#128
Here is a re-roll for D6 (based on 6.13) the system update hook number was in conflict with recent updates.
#129
Thanks Frank, though the logic changed a bit in the latest D7, so we should mimic that.
#130
Thanks for pointing that out Peter.
Re-rolled after bloodying up my eyes trying to tease apart the differences, but I think I got it all. The only part I'm unsure about is the optimistic locking part (and using error suppression for D6 instead of catching a PDOException) so maybe pay special attention there.
#131
Great! thanks. I will take a look and compare.
#132
the D6 looks pretty good, but needs to use the PHP 4.4 compatible calls to microtime:
+ list($usec, $sec) = explode(' ', microtime());+ $expire = (float)$usec + (float)$sec + $timeout;
#133
+ * requests should try to acquire a lock before proceeding. For example,No double spaces in comments please. It's the current standard, although grammatically correct.
+function lock_init() {+ global $locks;
Why not a drupal_static() instead here? The entire static system was designed to (also) replace most of our globals. -1 for introducing new ones.
Additionally, and after reading a bit more of this patch, this function should probably reset each and everything, because we are in a new request.
+function _lock_id() {+ static $lock_id;
+
+ if (!isset($lock_id)) {
+ // Assign a unique id.
+ $lock_id = uniqid(mt_rand(), TRUE);
+ // We only register a shutdown function if a lock is used.
+ register_shutdown_function('lock_release_all', $lock_id);
+ }
+ return $lock_id;
+}
This looks like there can be only one long running process in one request. 3D ?
+ * @return TRUE if the lock was acquired, FALSE if it failed.+ */
+function lock_acquire($name, $timeout = 30.0) {
(and elsewhere) Description of @return should be on the next line, just like @param.
+function lock_may_be_available($name) {I'm worried about the namespace of those functions. Having worked in many integration projects, it was really helpful that most of Drupal's functions are prefixed with 'drupal_' or some other, presumably unique namespace.
#134
subscribing
#135
@sun - are you reading the D6 or the D7 patch?
The D7 one uses:
+function _lock_id() {+ $lock_id = &drupal_static(__FUNCTION__);
#136
Re-roll of D6 patch. Cleaned up code comments per sun.
@sun - no, there can potentially be multiple locks held by the same process (each with a different name), though in general seems like something you'd avoid. A single lock (single name) can only be held by one process.
I don't understand your comment about resetting everything in function _lock_id()
#137
fixed code comments for D7 version.
@sun - re: namespace issues. menu.inc --> menu_X() functions, lock.inc --> lock_X() functions.
I don't really care a lot if Angie or Dries want to add 'drupal_' to these new functions, but we are essentially defining a pluggable interface (like cache_get/cache_set) so I'd prefer to stay with the more generic names. We so far I think we never make any drupal_X() functions pluggable.
#138
Thanks for re-rolling.
oh - ok, I think the introduction and explanation in @defgroup needs some improvement and clarification. I first understood that this was a locking system that works within one request, allowing to prevent duplicate invocations (and thereby resetting everything in the next request would have made sense). Reading a bit more through the actual functions, it seems there are two different types of locking, and developers have to invoke the wanted one (somehow). That should be clarified... or if I'm still not getting it right, then the entire explanation in @defgroup could use an overhaul. Possibly also providing example snippets using @code and @endcode.
Furthermore, I wondered why a locking system in lock.inc uses a table named 'semaphore'...?
We should additionally reserve the "lock" namespace in contrib. In case we keep the table name, reserving the "semaphore" namespace wouldn't hurt either. (just like we did for http://drupal.org/project/field)
#139
@sun - this entire patch is about INTER-process locking. there is nothing at all about INTRA-process locking. Can you point to the part of the doc you find confusing? There is only one kind of lock and only one function to acquire one in the patch.
The 'semaphore' table is meant to be somewhat generic and available for other sorts of locking implementations.
#140
Peter, Thanks for the PHP4 compatibility pointer, the patch in #136 applies cleanly and works for me against 6.13.
Thanks for the re-roll.
#141
shouldn't $retry be set to TRUE at the start?
+ // Optimistically try to acquire the lock, then+ // retry once if it fails.
+ $expire = microtime(TRUE) + $timeout;
+ $retry = FALSE;
+ do {
+ try {
+ db_insert('semaphore')
+ ->fields(array(
+ 'name' => $name,
+ 'value' => _lock_id(),
+ 'expire' => $expire,
+ ))
+ ->execute();
+ $locks[$name] = TRUE;
+ $retry = FALSE;
+ }
+ catch (PDOException $e) {
+ // Suppress the error, and if this is our first pass through the loop,
+ // try to break the lock in case it's expired.
+ $retry = $retry ? FALSE : lock_may_be_available($name);
+ }
+ } while ($retry);
And can we please not introduce db_query into core when we don't need to:
+function lock_may_be_available($name) {+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
+ if (!$lock) {
+ return TRUE;
+ }
I like the name 'lock' or 'locks' for a table name rather than 'semaphore' - its easier for people to understand.
#142
No.
Erm. What??
Semaphore is the perfect technical term for that. Semaphores can be used for other thing then purely locking.
#143
I thought db_select would have been the preferred function over db_query?
#144
No (or at least not yet). db_query() is recommended (but not mandatory) for all the simple static queries, while db_select() is mandatory for all dynamic queries.
There have been enough bikeshed on this issue. It has been marked RTBC two times already. Let's this third time be the one.
#145
Given that I see many Drupal rockstars on this issue supporting the patch, I'm not questioning or trying to understand the reasoning behind the implementation.
However, the @defgroup really needs quite some clarification. At the very least, please add example snippets that outline how to use this locking system. Furthermore, the concept of having to lock certain operations is a new territory for many developers, so we really need to make sure the documentation takes the audience into account. So let's explain it, like you explain something entirely new to someone else.
Aside from that, the code of the actual functions could use some more inline comments about reasoning. For example:
+ do {+ try {
+ db_insert('semaphore')
+ ->fields(array(
+ 'name' => $name,
+ 'value' => _lock_id(),
+ 'expire' => $expire,
+ ))
+ ->execute();
+ $locks[$name] = TRUE;
+ $retry = FALSE;
+ }
+ catch (PDOException $e) {
+ // Suppress the error, and if this is our first pass through the loop,
+ // try to break the lock in case it's expired.
+ $retry = $retry ? FALSE : lock_may_be_available($name);
+ }
+ } while ($retry);
Wow. That really looks like some really dirty code. Most probably, we simply don't have another way to do stuff like that currently. But if that is really the case, there should really be a comment explaining why we need to do it in an awkward way like that.
+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();Small nitpick: Why not using single quotes here?
+function lock_wait($name, $delay = 30) {+ while ($delay--) {
+ sleep(1);
+ if (lock_may_be_available($name)) {
+ return FALSE;
+ }
+ }
+ return TRUE;
+}
Every. Single. Line. Of. This. Function. Definitely. Needs. A. Comment.
Lastly, regarding menu_rebuild() we might have to turn the action into a "real" trigger/action, to solve issues like #306316: Regression: node_types_rebuild should rebuild menu... (separate issue though)
#146
+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();so PDO inserts the the quotes around :name?
#147
@sun - we use double quotes for all DB queries as a standard afaik (though that may be a < D7 carry-over).
substantially increased code comments in the attached.
#148
<?php
+ * this is an APi intested to support alternative implmentation, code using
spelling + capitalization
+ // We acquired a lock before, so try to renew it.
grammar
+ // This function shoudl only be called by a request that faled to get a
spelling
+ // be avaiable in menu_execute_active_handler() resulting in a 404.
spelling
?>
Schema code style is wrong too (end parens should be on a new line)
#149
With code comment fixes from sun and dimitrig01
#150
This is re: sun's current explanation (http://drupalbin.com/10524)
[7:46pm] dmitrig01: * then delete the current data in the database to insert the new afterwards. <-- new what[7:46pm] dmitrig01: This is a cooperative, advisory lock system. Any long-running operation, <-- what is?
[7:46pm] dmitrig01: which could potentially be attempted in parallel by multiple requests *NEEDS A COMMA* should
[7:47pm] dmitrig01: try to acquire a lock before proceeding. <-- what's that?
[7:47pm] dmitrig01: To use this API, pick a unique name for the lock. <-- that sounds like that's the *only* thing you need to do
[7:48pm] dmitrig01: the thing never explains the concept of locks, it just jumps straight from the problem to "thisis a lock api"
[7:48pm] dmitrig01: and, "This is a cooperative, advisory lock system" makes no sense to me
[7:49pm] dmitrig01: explain the concept of acquiring and releasing locks, then say that every lock also needs a unique ID, then show the function example
[7:49pm] dmitrig01: * A function that has acquired a lock may attempt to renew a lock (extend the <-- why would it want to do that
[7:50pm] dmitrig01: * Failure to renew a lock is indicative that another request has acquired <-- that just flew straight over my head. why/how would a different request acquire a lock if the first already has acquired it?
[7:50pm] dmitrig01: * it may call lock_wait() <-- and what would that do?
#151
Thanks for re-rolling. The docs + comments make much more sense to me now! :)
In addition to dmitrig01's review, a few more nitpicks:
+ $success = (bool)db_update('semaphore')(and elsewhere) There should be space between the casting statement and the value.
+ // Optimistically try to acquire the lock, then+ // retry once if it fails.
This comment doesn't seem to wrap at 80 chars.
+ // then $retry is FALSE. In this case, the insert must have failedPlease remove the double space in comment.
+ // Since this will break the lock in case it's expired.(possibly elsewhere) We try to avoid abbreviations in comments, i.e. "it is expired".
+ // This function should only be called by a request that faled to get a+ // lock, so we sleep first to given the parallel request a chance to finish
+ // and release the lock.
rather belongs into the PHPDoc description.
Typo in "faled" and "given".
+ if (lock_may_be_available($name)) {+ // No longer need to wait.
Given the updated docs, the comment in here can probably vanish.
Though: Wouldn't it make sense to rename that function to lock_is_released() ?
+function lock_release($name) {...
+ db_delete('semaphore')
+ ->condition('name', $name)
+ ->condition('value', _lock_id())
This cries for unit tests. (which should actually be very easy to implement)
+ if (!lock_acquire('menu_rebuild')) {+ // Wait for another request that is already doing this work.
+ // We choose to block here since otherwise the router item may not
+ // be available in menu_execute_active_handler() resulting in a 404.
+ lock_wait('menu_rebuild');
+ return FALSE;
+ }
Maybe that entire menu_rebuild() update should be deferred to another issue... anyway, I'd guess it would make more sense to inject a drupal_goto() in here, redirecting the user to the very same path he tried to access previously, continuously doing it all over again until the lock is released... Because, displaying a 403/404 to admins on well-known pages is a major WTF.
+ else {+ variable_del('menu_rebuild_needed');
+ }
+ lock_release('menu_rebuild');
Shouldn't the lock be released within the else ?
+ 'primary key' => array('name'),+ );
Wrong indentation.
#152
@sun -
we call lock_wait(), so we should not see the 404. We *don't* want a looping goto - that's extremely bad..
The function name 'lock_may_be_available' describes EXACTLY the purpose of the function - see above discussion. it should not change.
The lock_release() MUST NOT be in the else - otherwise we would possibly be blocking out the next page request which needs to rebuild again (though ideally our shutdown function would take care of that).
@dimitri - the unique ID is an internal implementation detail that the user of the API doesn't need to know anything about. I don't think we need to explain the concept of locks much more - I added one more sentence about it.
@dimitri - this is not hardware locking - lock expiration or even bad coding could cause the current request's lock to be acquired by another request and then unavailable to be renewed.
added a little more explanation in addition to WS/spelling corrections from sun.
#153
Thanks for incorporating the improvements!
If it is that special, then it would make much sense to put this reasoning in code then.
The only other remaining issue is to add some tests for this functionality then.
#154
The logic is simple - if you acquire a lock you should release it when you are done. I guess that's not in the top block anywhere - adding one more sentence.
#155
Ok, so I have to conceded that adding tests was useful - turns out that the float to string and back casting would lead to intermittent failures when trying to break a lock since we were doing a = comparison. Altered the code so make sure the timeout is at least 0.001 seconds and then do a <= compare after adding 0.0001.
#156
Awesome!
Final issues: Can you squeeze some blank lines into the test (where logical "chapters" start/end)? Also, one of the last assertion messages contains the word/typo "befine".
#157
code comment fixes per #156
#158
Thank you! Ready to go.
#159
Minor issue. There are 3 indentation spaces here:
<?php+ return (bool) db_delete('semaphore')
+ ->condition('name', $name)
+ ->condition('value', $lock['value'])
+ ->condition('expire', 0.0001 + $expire, '<=')
+ ->execute();
?>
#160
fixed indent
#161
go go gadget commit!
#162
+ 'fields' => array(+ 'name' => array(
+ 'description' => 'Primary Key: Unique name.',
+ 'type' => 'varchar',
+ 'length' => 255,
+ 'not null' => TRUE,
+ 'default' => ''),
+ 'value' => array(
+ 'description' => 'A value for the semaphore.',
+ 'type' => 'varchar',
+ 'length' => 255,
+ 'not null' => TRUE,
+ 'default' => ''),
+ 'expire' => array(
+ 'description' => 'A Unix timestamp with microseconds indicating when the semaphore should expire.',
+ 'type' => 'float',
+ 'size' => 'big',
+ 'not null' => TRUE),
+ ),
Now really the last one! The closing array braces are slightly misplaced here.
#163
Re-roll of D6 version to take the float/string conversion into account. Also incorporated all of the lock.inc doc updates.
#164
ok, this should fix the spacing on the schema.
#165
Patch still applies with offset, minor code comments follow, so patch is a soft reroll of 164.
Line 21 and 22 of lock.inc, "obtaining" and "operation" is misspelled:
Line 50 of lock.inc: extra space in "page request to proceed on the assumption that a parallel request completed" before "assumption"
#166
subscribe
#167
Couple more minor double-spacing on new sentences:
+ * that the operation in question be complete. After lock_wait() returns,+ * by setting the 'lock_inc' variable to an alternate include filepath. Since+ // request acquired the lock and set a new expire time. We add a small+ * specified delay in seconds is reached. This should not be used with locks#168
Oops, should be CNW even though it's minor.
#169
Subscribing. Will test the latest D6 patch.
#170
Removed double spaces.
#171
I have this running successfully on a patched 6.13 production environment for two weeks without problems - and our menu router errors have disappeared too.
#172
Just wanted to report that I had one instance of menu router errors today after patching with #163 D6 patch (happened while saving a computed field). This is still far less than prior to patching, where I was getting them frequently. The instance today may have been pilot error on my part though.
#173
@TC44 Are you using admin_menu by any chance?
For everyone else, there is a "problem" with this locking mechanism and admin_menu that makes this not bulletproof as best I can tell.
admin_menu, in _admin_menu_rebuild_links() calls menu_router_build() directly not menu_rebuild() since it needs the menu data structure. This bypasses the locking mechanism and any concurrent request to rebuild the menu will result in the duplicate key error on insertion into menu_router. This does not have anything to do with the locking mechanism per se, however this patch contains the lock implementation that goes into menu_rebuild()
Questions: Should the lock protection be moved to help prevent this scenario? Should there be a different lock around the DELETE/INSERT statements inside _menu_router_rebuild() themselves? A different change all together? Should everyone just ignore me?
I know we can't control how modules are implemented, but we probably should put protections in where possible.
#174
@fabbraro,
I'm not. Is it possible that the errors would show if drupal choked on my computed field code? It only happened the one time when I was saving, so I'm suspecting that was the case, since I was creating a computed field at the time, and it happened on the save.
Hasn't happened today at all, so I think the patch is working as it should, and maybe my case was just becuase of the bad comp field code?
#175
@febbraro - I think you are mistaken. Note that there is a param to menu_router_build() that defaults to FALSE. in this case, admin_menu will only get the already-built router array. admin_menu only calls this after a regular menu_rebuild() call completes, so logically the router will always be in memory already. Note - this is all by design.
http://api.drupal.org/api/function/menu_router_build/6
#176
I just got menu router errors again while viewing admin/build/modules. (it happened after testing a patch for custom breadcrumbs, which was spitting some other errors - not sure if it's related, but I'm going to uninstall cb and check again).
#177
@pwolanin: I'm probably missing something here (it's been a long day)
Is it possible that:
* request 1 calls menu_rebuild()
* which calls menu_router_rebuild()
* which triggers admin_menu_menu_alter()
* and sets admin_menu_rebuild_links to true
* starts to rebuild the router
* just as request 2 comes in for a page which calls admin_menu_footer()
* which checks admin_menu_rebuild_links and determines that it is supposed to rebuild the menu
* (but the first request is not done yet so there is no already build router array)
* then you will have two requests in _menu_router_rebuild() battling it out for control of the universe.
Like I said, I could just have crap for brains right now, but I don't see how it is protected against in every case. (or maybe I'm just trying to rationalize a scenario that makes me see why I'm still getting the error)
Thanks for putting up with my theorizing.
#178
@febbraro - in general, that's possible - but admin_menu in specific is coded so that it will not do this. In general, _menu_router_rebuild() is supposed to be private and should not be called. Essentially the scenario you outline is what's often happening now fairly often without the lock - so you will get a pile of SQL errors.
#179
Wow. I have to say, this is a really excellent patch. Even though it's 3am I was able to follow along with the code well and understand everything it's doing. It's extremely well-documented! I am hard-pressed to find anything wrong with it at all (I mentioned a few minor things below).
However, Dries in his replies here has seemed non-plussed about committing it. I'm not sure I feel comfortable doing so when he's not around. :\ Anyone have any other sources that would show that Dries changed his mind since #24 and #75?
Here are my comments, at any rate. They're all pretty minor, so I'll leave at RTBC. I haven't read the entire issue yet, so apologies if this repeats anything.
+++ includes/lock.inc@@ -0,0 +1,252 @@
+function lock_init() {
I guess this could potentially be interpreted as a hook_init() if you had a "lock" module installed. Maybe rename to lock_initiate() or lock_start() or something. Probably not a big deal though.
+++ includes/lock.inc@@ -0,0 +1,252 @@
+ $lock = db_query("SELECT expire, value FROM {semaphore} WHERE name = :name", array(':name' => $name))->fetchAssoc();
Query should use single quotes, not double quotes.
+++ modules/simpletest/tests/lock.test@@ -0,0 +1,58 @@
+ // Let another request acquire the lock.
+ $this->drupalGet('system-test/lock-acquire');
+ $this->assertText($lock_acquired, t('Lock acquired by the other request.'), t('Lock'));
+ // The other request finished and should have released its lock.
+ $this->assertTrue(lock_acquire('system_test_lock_acquire'), t('Lock acquired by this request.'), t('Lock'));
+ // This request holds the lock, the other request cannot get it.
+ $this->drupalGet('system-test/lock-acquire');
+ $this->assertText($lock_not_acquired, t('Lock not acquired by the other request.'), t('Lock'));
+ lock_release('system_test_lock_acquire');
Could you clarify this just a bit more? I don't quite follow how the first time system-test/lock-acquire is requested it completes the lock and the second time it doesn't.
Also, menu rebuild is definitely a good use case for this. Is cron another?
This review is powered by Dreditor.
#180
@webchick - thanks for the feedback. Did we switch the standard for quoting queries?
Cron, block rehash, and several other places in core would certainly be candidates for converting to this or uding this - So, I'd expect a few small follow-up patches. The two conversions here are proof-of-function and also two of the most serious problems where we need the locking.
The clarity of commenting is likely due to thorough vetting above by sun :-)
#181
Another problem where this would *really* help is #412808: Handling of missing files and functions inside the registry, that is currently a *huge* DX issue and it would be great if we could safely rebuild the registry in such a case.
#182
I think febbraro made a very valid point.
The protections built into admin_menu (the current version) usually prevent simultaneous menu rebuilds. Admin menu relies on menu_rebuild always being called before admin_menu_build, so the menu should be already in cache. But that's not what we call a robust solution!
menu_router_build is a public function, and as such it should have its own locking protection.
That said, things around menu_rebuild might see some big changes anyway:
- in D7, menu_router_build does no longer write anything to the DB.
- the new admin_menu does not use menu_router_build anymore (see http://drupal.org/node/334120).
- in D6, menu_router_build erases the entire menu_router table, then inserts the new contents. Usually that's totally overkill, and it leaves the system with an empty menu_router table for some seconds.
- in D7 this process happens in menu_router_save, which is called directly in menu_rebuild. It is faster, because the inserts are done in one big query.
- There is a D6 patch out there (see http://drupal.org/node/512962#comment-1870358) which replaces this with a diff-based table rebuilding (scan for differences, write the differences to the DB. In many cases, this means NO write operations). This patch is nice for itself, but there might be some conflicts with the things done in D7.
- The other part of menu_rebuild (_menu_navigation_links_rebuild) needs a speed tweak as well, it's far too slow.
For the future I could imagine a hook_menu_router_update($menu, $changes), that would be invoked with every menu_rebuild, and could be implemented by any module to "listen" to menu router changes. _menu_navigation_links_rebuild could then be moved into a hook implementation in menu.module. Anything about menu_links can then be moved out of core menu.inc.
#183
http://drupal.org/node/2497 and all D6 code suggests that we use "" for SQL?
#184
@pwolanin: for D7, we are trying to stick with single quotes, now that there is no need for double quotes anymore. That should be documented.
#185
re-roll with the very minor changes suggested by webchick
#186
Getting massive amounts of duplicate entry/menu router errors again today, even with #163 D6 patch. Seem to get more frequent over time (and happen now when even just viewing a node).
#187
Can anyone who worked on this patch respond to TC44's test results? Could they be related to this patch, or are they more likely caused by something else?
#188
@TC44 (#186)
If you have xdebug installed, could you check where and how often menu_router_build() is called during the offending request? If you find menu_router_build being called by admin_menu, can you check execution time for this function?
My guess is that it's the problem explained in #177. As said, menu_router_build in D6 needs special protection, in addition to the protection for menu_rebuild. In D7 that's less of a problem, because the harmful things happen in _menu_router_save which is called directly by menu_rebuild.
#189
@TC44 - if you are seeing this just viewing a node, then you almost certainly have either hacked core or have a really bad contrib module that's doing something it shouldn't. Can you diff your core install to CVS or a clean download?
Perhaps some contrib module is calling into one of the internal functions directly? We could, perhaps, add code to acquire the a lock at each step, though this is generally unneede overhead for anyone using the APi right.
#190
admin_menu 6.x-1.x only calls this in hook_footer(), though it uses at this point a variable for passing the state, so there are some possible issues in terms of variable caching. admin_menu 6.x-3.x is doing somehting odd in terms of examining the path - are you using that module?
As above - the D7 patch makes some API changes by splitting the build form the save that should make this a non-issue regardless for D7.
#191
@donquixote, @webchick, @pwolanin,
Thanks for the responses.
I do find that it happens more when I'm updating a view, then viewing a node. I don't have xdebug installed, but I can install and do some checking on that. I can verify that core is not hacked on my installation. I think I've kept it as clean as possible with the exeption of an exposed form filters override and theme overrides, and the patch applied here.
I did read another post related to this where the user found it happened after reinstalling views, after doing testing by reinststalling modules one by one.. but I haven't done a step by step reinstall to test in the same manner. I do make use of views and cck quite heavily though. I also usually have both firefox and safari open, so I'm sometimes doing quick refreshes in both, which sometimes leads to the errors.
Just to be clear, I was getting probably the same frequency now as before patching. The patch did seem to stop the errors for the first day though. It may very well be related to a specific module, but I'm not sure which one. I do vigorous backups at almost every step, so I could always restore the db from various points and test, but that could be a long process.
Let me figure out how to install xdebug so I can provide some more info..
Thanks
#192
A, so it likely whatever Views is doing - but in any cased that's D6 not D7.
#193
I don't want to say that for sure, as it happens sometimes when saving regular nodes as well, so it views could be a part of it, but not necessarily the cause.
#194
I can imagine that views ajax calls generally cause more concurrency than the usual plain requests. On the other hand, a views ajax call should not call menu_router_build().
@TC44:
If you want to get some quick results without xdebug, you could add a drupal_set_message() into your menu_router_build(), and let it print the current $_GET['q'] and debug_backtrace(), and what else comes to your mind.
Unlike print_r or var_dump, drupal_set_message() will not kill your ajax calls. Instead, the message is buffered for the next chance to be printed as a regular drupal message.
Anyway, I think it's quite obvious that menu_router_build in D6 needs a lock.
----
Looking for menu_rebuild or menu_router_build in D6 views:
- views_ui_edit_view_form_submit() has a menu_rebuild()
- views_ui_enable_page() has a menu_rebuild()
- views_ui_disable_page() has a menu_rebuild()
- view::delete() has a menu_rebuild()
and all the rest of views does a lot of menu-related things.
if the patch does its job right, menu_rebuild() should be generally protected. But they can still be disrupted by unprotected calls to menu_router_build(), most likely caused by admin_menu.module, as pointed out above.
#195
Since there seems to be some confusion about admin_menu in this issue: Yes, 1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out). 3.x does not invoke menu_rebuild() anymore (aside from its built-in menu link to trigger a menu rebuild) and instead 100% relies on the regular menu system to output the menu links.
#196
sun (#195):
"1.x invoked menu_rebuild() _after_ it has been rebuilt (as pwolanin already pointed out)."
It invoked menu_router_build, not menu_rebuild.
And yes, in a usual request, menu_router_build will be called by a regular menu_rebuild, before admin_menu calls menu_router_build for a second time.
But in #177 you can read how it can happen that menu_router_build() is called without a previous call to menu_router.
For those who don't believe it, have a look into the code.
<?php
function admin_menu_footer($main = 0) {
if (!user_access('access administration menu') || admin_menu_suppress(FALSE)) {
return;
}
// Check for the flag indicating that we need to rebuild.
if (variable_get('admin_menu_rebuild_links', FALSE)) {
module_load_include('inc', 'admin_menu');
_admin_menu_rebuild_links();
variable_del('admin_menu_rebuild_links');
}
?>
Usually, 'admin_menu_rebuild_links' has been set to 'true' in the same request. But, in some cases it might be that another request set this variable to true. This is where we get into trouble.
<?phpfunction _admin_menu_rebuild_links() {
// Get the newly rebuilt menu.
$menu = menu_router_build();
?>
As we see, menu_router_build is called with no further if-clauses.
And yes, the new admin_menu works differently. But still, as menu_router_build is a public function, it should be lock protected, don't you think?
-----------
pwolanin (#178):
"In general, _menu_router_rebuild() is supposed to be private and should not be called."
Yes, but menu_router_build() is public.
-------
sun (#195):
You are right that the new version of admin_menu will solve this issue (it might cause other side effects, I don't know). If it turns out that no other D6 module relies on the current way menu_router_build() works in D6, we could backport the change from D7, which turns menu_router_build into a peaceful function. If we do this, we don't need a lock any more for menu_router_build, the lock for menu_rebuild will be enough.
#197
Just a follow up.. I'm suspecting my issues were related to the "Content Type Administration by Organic Group" module.
I didn't get to xdebug yet, but I've just implemented #4 change as suggested here: http://drupal.org/node/374730
No further menu router errors so far, but it's only been 30min of testing. Still, they were happening approx every 5th node view, node edit or view edit prior.
#198
Yes, there is a known issue with the Content Type Administration by Organic Group contributed module that will render your site virtually unusable. Please read #330135: menu_router_build call in hook_init is a serious performance hit. I would strongly recommend that until that issue is resolved, that you disable the module and test again.
#199
@EvanDonovan,
Thanks for the reply and the link. I did read that thread previously. I've been running perfectly since patching without a single memery router error since. I've been working on my site and testing fairly heavily over the weekend, so I think this is a good sign. So for CTABOG, the D6 patch provided in this thread plus change #4 here http://drupal.org/node/374730 has so far solved the problem for me. And yes, in my case CTABOG was certainly the cause of my problems.
#200
Alright, I spent another 30 minutes looking at this patch and decided to commit it as is. It is solid code and documentation, despite the fact that I might have done some things slightly differently. I don't think one would have been better than the other, so I didn't feel like holding this back. Thanks for all the work.
I'd like to see us add an entry to CHANGELOG.txt though.
#201
Awesome work! Can we get a final D6 patch for Gabor? This is murdering D6 sites.
#202
@moshe
The D6 patch from #163 is still working for core
http://drupal.org/node/251792#comment-1831346
Or should it contain the extra lock as mentioned in #196 to protect against rouge modules? :p
http://drupal.org/node/251792#comment-1903500
Was there something else that needed to be included?
#203
#204
Could someone start D7 patches for using the API in "Cron, block rehash, and several other places in core ...". Put links here so we can help with reviews.
#205
here are some follow-up issues:
#553600: Apply locking framework to cron runs
#553602: Apply locking framework to block rehashing
#553610: Apply locking framework to theme registry rebuilds
#553614: Apply locking framework to fetching/processing update status
Please file more if you see other place that might need this.
#206
Another follow-up issue:
#553952: Apply locking framework to image generation
#207
subscribing. In need of the a final D6 port.
#208
Another duplicate: #583806: Duplicate entry INSERT INTO menu_router in ../includes/menu.inc on line 2423..
#209
And #552670: Duplicate Entry INSERT INTO Menu_router.
#210
subscribing for final patch to d6
#211
subscribe -- looking for the D6 patch
#212
subscribe - please needs some help on this one
#213
#163 changes D6 schema so need a Gabor decision about backporting.
#214
Apart from minor code style issues, the patch at #163 (if it still applies) should still work for D6, as it did on August 17 (#202).
IMO the problem with rogue modules (#202) can be addressed if and when it happens. #163 is a big enough improvement to make this core-worthy.
What might help Gabor, apart from code style fixes, is people confirming that this works on a variety of live sites and doesn't cause any other problems. #202 suggests it works OK, problems *were* reported at #186 but did turn out to be a rogue module (#198, #199 - Content Type Administration by Organic Group, this has now been fixed #374730: Site down due to this module?). I will attempt to get this running on a live site shortly.
#215
Adding tag per #200.
#216
Unless I'm mistaken, 7.x system.install is missing the update routine to install the new {semaphore} table.
#217
*If* this is added to D6, D7 does not need a update function.
#218
I would say, *if* this is added to D6, and *if* we require everyone to upgrade to the final 6.x release before upgrading to 7.x (which we usually don't AFAIK, although it is often recommended), then D7 does not need an update function..
#219
Common practice is add D7 upgrade path, then try to backport it o D6, then change D7 upgrade path
#220
The last submitted patch failed testing.
#221
@gpk: I have used the patch from #214 on a large (50,000+ nodes) Drupal site, but only on the test install, not the one that was actively being visited.
I have done various things to force menu_rebuild, such as saving the modules page or saving views. I have also done these things more than once at the same time to see if I could cause the database to get into an inconsistent state. I was not able to do so.
However, I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock. If it gets committed to Drupal 6 core, though, I will use it, simply to avoid having to keep applying my hack. I'll just have to see if the performance implications are really bad - especially in cases such as when people try to save multiple views at once.
#222
Subscribe
#223
@221:
>I noticed that this patch is distinctly slower than using db_lock_tables/db_unlock_tables, probably because it does its work at a higher level and because of the way the locking framework waits to acquire a lock
I've still not used it in anger ... yes, surprised to hear that it seems slower, as you say possibly your table lock doesn't last for as long as the 'menu_rebuild' lock which does more than just lock {menu_router}. And again lock_wait() only checks the lock every second, perhaps it should do so more often.
>see if I could cause the database to get into an inconsistent state. I was not able to do so.
Looking at the code is there a theoretical opportunity for rows to be left out of {menu_router}?
Suppose I have a (disabled) module aaa that defines a menu path.
1. Some other operation takes place that causes a menu rebuild
2. I enable module aaa and the menu rebuild then invoked happens to clash with the one from 1.
3. I don't acquire the lock in menu_rebuild() and so I have to wait (for up to 30s) for the lock to become available, i.e. for the other menu_rebuild to finish
4. My menu_rebuild() returns, without having done anything (i.e. the assumption is that the other rebuild has done what is necessary).
But - will hook aaa_menu() have been invoked? The list of menu hooks to call is determined early in http://api.drupal.org/api/function/menu_router_build/6, which is itself called right at the top of http://api.drupal.org/api/function/menu_rebuild/6
Thoughts?
#224
Patch #163 updated for Drupal 6.14: with incremented function system_update_6054()
Don't forget to run update.php after rolling the patch.
#225
FYI for those interested, a pressflow branch with this patch has been created. See https://answers.launchpad.net/pressflow/+question/85119
(Thanks go to David Strauss from fourkitchens)
Is this locking system going to be included in a future revision of D6, or will we have to wait until D7 to see it?
There is an item on the D6 issue queue, but it has been marked as a duplicate of this (which is not in the D6 issue queue)
#226
For the record, I started having this problem, found this issue queue yesterday and I don't have access to the command line on the server with the site. So I downloaded Drupal 6.14, put in my local directory with wamp. Then I downloaded Eclipse which made it easy to do the patch in #224.
I don't know anything about anything when it comes to web development. I so proud of myself right now. Didn't know I could do all that. Thanks guys you're all a big help.
BTW, my site, since I added the patch, is running so much faster!
#227
Works like a charm.
Finally fixed with it what was driving me insane (although not sure what was more insane, the modules page slow speed or the dozens of open threads on how to fix it...)
Might want to mention here to run update.php after installing the patch :)
#228
hope others may find this thread easily
there are other thread regarding errors in menu_router but this is the only patch worked for me.
Thanks for the patch...it works great!
#229
Re: patch in #224:
After the patch and before hitting 'Update' in update.php I get big red error messages like this:
Table 'drupal.semaphore' doesn't exist query: SELECT expire, value FROM semaphore WHERE name = 'locale_cache_nl' in /includes/lock.inc at line 154.
I know where it's coming from, but this will scare most users.
#230
Did you run update.php as people have mentioned I don't even know how many times?
#231
Table semaphore is created by the patch when you run update.php
#232
Is it safe to use this patch on Drupal 6 with Cacherouter (cache_router - http://www.drupal.org/project/cache_router ) module enabled?
#233
tested patch at http://drupal.org/node/251792#comment-2125124 working on multiple drupal 6.14 sites
#234
I think ratbc is justified. It would be great to get this committed.
#235
Re #230: this error message appears *while running* update.php, on the pages before the actual update happens.
#236
@gaele: you might want to put your site in maintenance mode while updating modules/applying patches.
Everyone else: do we know if this patch will be included in a future release? just judging by the number of threads about menu_router, it would be worthwhile.
#237
@230, 236: I think gaele's point is that when an administrator upgrades their site they will see the error message during the update process, and this will worry some of them.
#238
Upgrade path patch for D7 posted in #563106: Cannot upgrade from Drupal 6 to Drupal 7 - meta issue #42
CHANGELOG.txt entry already in under Task handling - this should be added for on-going D6 patch too.
#239
I applied this patch under D6 but it doesn't seem to have helped.
Does this apply specifically to MyISAM tables as I altered the menu tables to InnoDB to try and reduce locking problems.
#240
chrism2671 requested that failed test be re-tested.
#241
For Drupal 6: usleep() doesn't work on windows for PHP < 5.0 Thus, sleep() is the only option and we must lock for increments of 1 sec. For D7, we could possibly switch to usleep() so that the (for example) we retry every 100 ms.