I have a scenario here: when hosting-cron is killed (by a reboot, for example), it doesn't have time to clean up after itself and leaves a semaphore in the variable table that keeps it from running properly after the reboot.
The symptom looks like this:
dispatching queues [0.56 sec, 29.16 MB] [notice]
queue cron already running [0.57 sec, 30.36 MB] [notice]
The workaround is to manually update the system table:
mysql> update variable set value="i:0;" where name = 'hosting_queue_cron_running';
I think this should be handled gracefully: when drush is killed, it should send rollback signals if possible. Short of that, we should at least make sure certain cleanup hooks can be run on interruptions such as this.
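The "cleanup hooks on interruption" idea can be sketched with a plain POSIX signal trap; this is an illustrative sketch, not the actual drush mechanism, and note that a hard kill cannot be intercepted:

```shell
#!/bin/sh
# Hypothetical sketch: release the semaphore via a signal trap, analogous to
# what drush could do when interrupted. NB: SIGKILL (kill -9) and a hard
# power loss cannot be trapped, so those cases still need the stale-semaphore
# check discussed later in this thread.
result=$(
  sh -c '
    cleanup() { echo "semaphore released"; }
    trap cleanup EXIT TERM INT
    echo "dispatching"
  '
)
echo "$result"
```

Running the inner shell to completion (or killing it with TERM/INT) fires the trap, so cleanup runs in every case except an untrappable kill.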
This could be documented in the FAQ in the meantime.
Comments
Comment #1
j0nathan commented: Subscribing.
Comment #2
omega8cc commented: Here is a tiny patch that could help prevent overloading the system with too many crons fired at once, which is one of the possible reasons for the semaphore not being released. It is not a solution, of course (you submitted the semaphore issue I should probably have submitted a few weeks ago). http://github.com/omega8cc/hostmaster/commit/fd4c5413b47c86b4b8cd3d32179...
Comment #3
omega8cc commented: In the meantime we could also add a simple how-to (FAQ), like:
When cron for your sites stops working, it is possible that for some reason (system overload, broken site, timeout, etc.) Aegir failed to release a site's cron semaphore. To release it, use this simple recipe:
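The recipe itself appears to have been elided here; from the original report it is presumably the manual variable reset. A hedged restatement as a one-liner (database name and credentials below are placeholders, not from the thread):

```shell
# Release Aegir's stuck cron semaphore by hand (same fix as in the original
# report). Replace "root" and "aegir_db" with your own credentials/database.
mysql -u root -p aegir_db \
  -e "UPDATE variable SET value='i:0;' WHERE name='hosting_queue_cron_running';"
```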
Comment #4
Anonymous (not verified) commented: It's been added to the FAQ at least.
Comment #5
omega8cc commented: The attached patch should fix this issue. It is a simple port from core's drupal_cron_run() function.
Comment #6
omega8cc commented: This patch also just helped with a locked cron for tasks, as expected, since it releases any old "running" semaphore. Please review and test it in the meantime to make sure it doesn't break anything.
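The logic ported from drupal_cron_run() can be illustrated with a small sketch; the variable names and the 3600-second threshold mirror what this thread describes, but the code below is a stand-alone demonstration, not the patch itself:

```shell
#!/bin/sh
# Hypothetical sketch of the stale-semaphore check ported from Drupal core's
# drupal_cron_run(): if the recorded cron start time is more than an hour
# old, treat the lock as stale and release it.
now=$(date +%s)
semaphore=$((now - 4000))   # pretend cron was flagged as running 4000s ago
limit=3600                  # the 3600s limit mentioned later in this thread
msg=""
if [ $((now - semaphore)) -gt "$limit" ]; then
  # in Aegir this would be roughly variable_del('hosting_queue_cron_running')
  msg="releasing stale semaphore"
fi
echo "$msg"
```

With a semaphore older than the limit, the check fires and the lock is cleared; a fresh semaphore would leave `msg` empty and the queue untouched.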
Comment #7
omega8cc commented: Tested on a few "locked" hostmasters and it works fine.
Marking as RTBC.
Comment #8
DanielJohnston commented: Subscribe. I've run into this a few times now; looking forward to it popping up in the next beta.
Comment #9
Anonymous (not verified) commented: I was just talking in IRC and mused whether this code should live directly in dispatch.hosting.inc, just before the check for 'already running' is performed.
If the code is in hosting_get_queues(), it will run many times in other places, even when the queue summary block loads in the frontend, and that seems like a lot of mechanism just to load a page, even if the variable_get/del calls are cheap.
If it's meant to allow the dispatch to run, it should be specific to dispatch. That's just my opinion; I'm open to others telling me it's not that costly an operation.
Comment #10
omega8cc commented: I agree. The corrected patch: http://gitorious.org/aegir/hostmaster/commit/ce39134436f2b1c2fbd3b3bde05...
Comment #11
omega8cc commented: A minor side effect of moving this check to dispatch.hosting.inc is that, once the locked semaphore passes the 3600s limit, it will take two cron runs before the sites cron starts again, because the check runs after
$queues = hosting_get_queues();
so $info['running'] is still true on the first attempt (which is probably obvious).
Comment #12
anarcat commented: I reviewed your patch, but I think we can improve it by checking the process table. To do this, we need to:
1. store the process ID when dispatching
2. check whether that process ID still exists
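The two steps above can be sketched in plain shell; this is an illustrative sketch of the suggested approach, not code from the patch, and the variable names are made up:

```shell
#!/bin/sh
# Hypothetical sketch of anarcat's suggestion: record the dispatcher's PID
# alongside the semaphore, then test whether that process still exists
# before declaring the queue "already running". `kill -0` sends no signal;
# it only reports whether the PID is valid and visible to us.
pid=$$   # step 1: "store" a PID; we use our own so the alive branch runs
if kill -0 "$pid" 2>/dev/null; then
  status="running"          # process exists: honor the semaphore
else
  status="stale"            # process gone: safe to release the semaphore
fi
echo "$status"
```

Checking the process table this way releases the lock immediately after a crash instead of waiting out the 3600s timeout, at the cost of storing one extra value per dispatch.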
Comment #13
DanielJohnston commented: Does this still need work? I thought it had been fixed in more recent releases.
Comment #14
juliangb commented: Subscribe
Comment #15
crea commented: Subscribing
Comment #16
playfulwolf commented: Any progress? Got the same problem :(
Comment #17
playfulwolf commented: Sorry, wrong window!
Comment #18
playfulwolf commented: Sorry again
Comment #19
steven jones commented: Looking at this issue as part of office hours.
Comment #20
steven jones commented: Looking at the code, this really should be fixed. If you still have issues with this on 1.7 or higher, please re-open and we'll take a look.
Comment #21
steven jones commented: Actually, 'fixed' is probably a better status here.