I have a scenario here: when hosting-cron is killed (by a reboot, for example), it doesn't have time to clean up after itself and leaves a semaphore in the variable table that keeps it from running properly after the reboot.

The symptom looks like this:

dispatching queues [0.56 sec, 29.16 MB]                                 [notice]
queue cron already running [0.57 sec, 30.36 MB]                         [notice]

The workaround is to manually update the variable table:

mysql> update variable set value="i:0;" where name = 'hosting_queue_cron_running';

I think this should be handled gracefully: when drush is killed, it should send rollback signals, if possible. Short of that, we should at least make sure certain cleanup hooks can be run on interruptions such as this.

This could be documented in the FAQ in the meantime.

Comments

j0nathan’s picture

Subscribing.

omega8cc’s picture

Here is a tiny patch that could help prevent overloading the system with too many crons fired at once, which is one possible cause of an unreleased semaphore. It is not a solution, of course (you submitted the semaphore issue that I probably should have submitted a few weeks ago). http://github.com/omega8cc/hostmaster/commit/fd4c5413b47c86b4b8cd3d32179...

omega8cc’s picture

In the meantime we could also add a simple how-to (FAQ), like:

If cron for your sites has stopped working, it is possible that for some reason (system overload, broken site, timeout, etc.) Aegir failed to release the sites cron semaphore. To release it, use this simple recipe:

$ su -s /bin/bash - aegir
$ cd /path/to/hostmaster/sites/domain
$ drush vdel hosting_queue_cron_running -y

Anonymous’s picture

It's been added to the FAQ at least.

omega8cc’s picture

Status: Active » Needs review
Attached file: status new, size 1.18 KB

Attached patch should fix this issue. It is a simple port from core function drupal_cron_run().
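
For anyone who has not read the core function, the idea is that a semaphore older than an hour is treated as stale and released. Below is a minimal Python sketch of that idea, not the patch itself. It assumes the semaphore stores the start timestamp, the dict stands in for the variable table, and `acquire_cron_semaphore` and `STALE_AFTER` are hypothetical names. Acquiring the semaphore in the same call that releases the stale one is a simplification of my own.

```python
import time

STALE_AFTER = 3600  # seconds; hypothetical constant for the one-hour limit


def acquire_cron_semaphore(variables, name="hosting_queue_cron_running", now=None):
    """Return True if the caller may run the queue, False if it is already running.

    `variables` is a plain dict standing in for the variable table (assumption).
    """
    now = time.time() if now is None else now
    started = variables.get(name, 0)
    if started and now - started > STALE_AFTER:
        # The semaphore is older than the limit: assume its owner was killed.
        del variables[name]
        started = 0
    if started:
        return False  # a recent run still holds the semaphore
    variables[name] = now  # take the semaphore for this run
    return True


variables = {"hosting_queue_cron_running": 1000}
fresh_blocked = acquire_cron_semaphore(variables, now=1000 + 60)
stale_released = acquire_cron_semaphore(variables, now=1000 + 3601)
```

In this sketch the first call is refused because the semaphore is only a minute old, and the second call succeeds because the semaphore has passed the limit.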

omega8cc’s picture

This patch also just helped with a locked cron for tasks, as expected, since it will release any old *running semaphore. Please review and test it to make sure it doesn't break anything.

omega8cc’s picture

Status: Needs review » Reviewed & tested by the community

Tested on a few "locked" hostmasters and works fine.

Marking as RTBC.

DanielJohnston’s picture

Subscribe. I've run into this a few times now, looking forward to it popping up in the next beta.

Anonymous’s picture

So I was just talking in IRC and mused whether this code should be directly in dispatch.hosting.inc just before the check for 'already running' is performed.

If the code is in hosting_get_queues(), it will run many times in other places, even when the queue summary block loads in the frontend. I think that's a lot of mechanism just to load a page, even though it is only a small amount of variable_get/del work.

If its purpose is to allow the dispatch to run, it should be specific to dispatch. That's just my opinion, and I welcome others telling me it's not that costly an operation.

omega8cc’s picture

A minor consequence of moving this check to dispatch.hosting.inc is that, once the locked semaphore reaches the 3600s limit, it will take two cron runs before the sites cron starts again: the check runs after $queues = hosting_get_queues(); so $info['running'] is still true on the first attempt (which is probably obvious).
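
To make the two-run effect concrete, here is a small Python model of the ordering. It is only a sketch: `get_queues` and `dispatch` are hypothetical stand-ins for hosting_get_queues() and the dispatch code, the dict stands in for the variable table, and it assumes the semaphore holds a start timestamp.

```python
STALE_AFTER = 3600  # seconds, the limit discussed in this comment


def get_queues(variables):
    # Stand-in for hosting_get_queues(): snapshots the running flag.
    running = bool(variables.get("hosting_queue_cron_running"))
    return {"cron": {"running": running}}


def dispatch(variables, now):
    # The snapshot is taken BEFORE the stale check, as in the ordering above.
    queues = get_queues(variables)
    started = variables.get("hosting_queue_cron_running", 0)
    if started and now - started > STALE_AFTER:
        del variables["hosting_queue_cron_running"]  # release the old semaphore
    if queues["cron"]["running"]:
        return "queue cron already running"
    return "queue cron dispatched"


variables = {"hosting_queue_cron_running": 1000}
first = dispatch(variables, now=5000)   # releases the semaphore, but still skips
second = dispatch(variables, now=5060)  # snapshot is now clean, so it runs
```

The first run releases the semaphore but still reports the queue as running, because its snapshot predates the release. Only the second run dispatches.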

anarcat’s picture

Status: Reviewed & tested by the community » Needs work

I reviewed your patch, but I think we can improve this so that we check the process table. To do this, we need:

1. store the process ID when dispatching
2. check the process ID for existence
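
The two steps above could look roughly like the Python sketch below. This is only an illustration of the approach, not Aegir code: the variable name `hosting_queue_cron_pid`, the helper names, and the dict standing in for the variable table are all hypothetical, and the signal-0 check is POSIX only.

```python
import os


def process_exists(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_dispatch(variables, name="hosting_queue_cron_pid"):
    """Step 2: refuse to run only while the recorded PID is still alive."""
    owner = variables.get(name)
    if owner is not None and process_exists(owner):
        return False
    variables[name] = os.getpid()  # step 1: store our PID when dispatching
    return True


variables = {}
first = try_dispatch(variables)   # nothing recorded yet, so we take the lock
second = try_dispatch(variables)  # our own PID is alive, so this is refused
```

One caveat with this design is that PIDs get reused, so a recorded PID can match an unrelated process after a reboot. Combining it with a timestamp check would be safer.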

DanielJohnston’s picture

Does this still need work? I thought it was meant to have been fixed in more recent releases.

juliangb’s picture

Subscribe

crea’s picture

Subscribing

playfulwolf’s picture

any progress? got the same problem :(

playfulwolf’s picture

Status: Needs work » Closed (fixed)

sorry, wrong window!!!!!!

playfulwolf’s picture

Status: Closed (fixed) » Needs work

sorry again

steven jones’s picture

Assigned: Unassigned » steven jones

Looking at this issue as part of office hours.

steven jones’s picture

Assigned: steven jones » Unassigned
Status: Needs work » Closed (cannot reproduce)

Looking at the code, this should already be fixed. If you still have issues with this on 1.7 or higher, please re-open and we'll take a look.

steven jones’s picture

Status: Closed (cannot reproduce) » Fixed

Actually 'fixed' is probably a better status here.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit b1e410c on 6.x-2.x, 7.x-3.x, dev-ssl-ip-allocation-refactor, dev-sni, dev-helmo-3.x by anarcat:
    #931550 - release old cron semaphore if it exists
    
