At Koumbit we have platforms with around 200 sites in them. When we batch-migrate those, the queue gets totally clogged, the server load often passes the critical threshold set in the backend, and the tasks start failing miserably.
This happens every Thursday. It seems the main problem is the tarring and untarring of backups, combined with having too many tasks run in parallel.
One workaround we are trying now is to run only one task at a time. Another I would like to try would be to run the task queue serially instead of in parallel, which would give the server a better chance of recovering from big task loads.
Comments
Comment #1
anarcat commented: I committed a simple fix in 2.x:
http://drupalcode.org/project/hostmaster.git/commitdiff/3cda87bc2984e4ea...
The idea here is to try to run the queue serially. We may still end up with concurrent tasks: if the N tasks of a run take longer than the delay X between cron runs, then M+1 tasks will be running in parallel after the next dispatch, where M is the number of tasks still running from previous runs. For example, say I have 5 tasks per run, every minute. If the first 3 of those tasks together take more than one minute, then at the next run 2 will have finished, 1 will be running and 2 will still be queued. The 1 running task will be skipped by the next run, but the 2 others will be taken up, and so 2 tasks will be running in parallel. On the next run, if those tasks are still not finished, that will be 3 tasks.
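The arithmetic above can be checked with a small simulation. This is only a sketch under stated assumptions, not the hostmaster code: the function name, the fixed per-task duration, and the one-second time step are all made up for illustration. Each cron run starts a worker that takes up to N queued tasks and works through them one at a time, skipping any task that another worker has already started.

```python
def simulate_serial_dispatch(num_tasks, tasks_per_run, interval, duration, runs):
    """Return how many tasks are running right after each cron dispatch.

    Hypothetical model: every task takes `duration` seconds, a dispatch
    happens every `interval` seconds, and each dispatch runs its batch serially.
    """
    status = ["queued"] * num_tasks   # queued -> running -> done
    finish_at = [0] * num_tasks
    workers = []                      # one entry per dispatch (cron run)
    running_after_dispatch = []

    for t in range(runs * interval):
        # Mark tasks whose run time has elapsed as done.
        for i in range(num_tasks):
            if status[i] == "running" and finish_at[i] <= t:
                status[i] = "done"

        is_dispatch = t % interval == 0
        if is_dispatch:
            # A new dispatch takes up to N tasks that are still queued.
            queued = [i for i in range(num_tasks) if status[i] == "queued"]
            workers.append({"todo": queued[:tasks_per_run], "current": None})

        for worker in workers:
            current = worker["current"]
            if current is not None and status[current] == "running":
                continue  # this worker is still busy with its current task
            worker["current"] = None
            # Take the next task from the batch, skipping non-queued ones.
            while worker["todo"]:
                candidate = worker["todo"].pop(0)
                if status[candidate] == "queued":
                    status[candidate] = "running"
                    finish_at[candidate] = t + duration
                    worker["current"] = candidate
                    break

        if is_dispatch:
            running_after_dispatch.append(status.count("running"))

    return running_after_dispatch


if __name__ == "__main__":
    # 5 tasks per run, one run per minute, 25-second tasks, three cron runs.
    print(simulate_serial_dispatch(20, 5, 60, 25, 3))
    # Same setup with slower, 50-second tasks.
    print(simulate_serial_dispatch(20, 5, 60, 50, 3))
```

With 25-second tasks this model reproduces the example: one task running after the first dispatch, two after the second. With 50-second tasks the count keeps climbing to three at the third dispatch.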
As a comparison, in the current scenario, all tasks are started in parallel. So in the first run we'll have 5 tasks running in parallel. If we're optimistic and say that they perform as well as they would serially, and 2 tasks finish, we'll still have 3 tasks running in parallel, and we'll start 2 more to get back to our full-blown 5 tasks in parallel.
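For contrast, here is an equally rough sketch of the "maximum of N tasks in parallel" behaviour. The function name and the fixed task duration are assumptions for illustration only; the point is just that every dispatch tops the pool back up to N for as long as the queue has work.

```python
def simulate_parallel_dispatch(num_tasks, max_parallel, interval, duration, runs):
    """Return how many tasks are running right after each cron dispatch.

    Hypothetical model: each dispatch starts as many queued tasks as needed
    to reach `max_parallel` running tasks, all at the same time.
    """
    queued = num_tasks
    running = []  # finish times of the tasks currently running
    counts = []
    for run in range(runs):
        now = run * interval
        # Drop tasks that have finished by now.
        running = [end for end in running if end > now]
        # Top up to the parallel limit, limited by what is left in the queue.
        to_start = min(max_parallel - len(running), queued)
        running.extend([now + duration] * to_start)
        queued -= to_start
        counts.append(len(running))
    return counts


if __name__ == "__main__":
    # 20 queued tasks, 5 in parallel, one dispatch per minute, 50-second tasks.
    print(simulate_parallel_dispatch(20, 5, 60, 50, 3))
```

As long as tasks remain queued, this model sits at the full 5 tasks in parallel after every dispatch, which is exactly what hurts during a batch of migrates.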
While that works well for verifies and so on, for migrates it is a catastrophe.
While writing this, I realized that the above patch by itself could make the dispatcher run tasks twice if fewer than N-1 out of N tasks (e.g. 4 out of 5) had finished by the time of the next cron run. The following patch ensures that non-queued tasks are not run by hosting-task, so they will be skipped by the previous dispatch if the next one picks them up:
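The guard described here amounts to "claim the task only if it is still queued". The sketch below is not the patch itself (which lives in hostmaster); the class, method and status names are hypothetical, and an in-memory dict with a lock stands in for the real task storage.

```python
import threading


class TaskQueue:
    """Hypothetical in-memory stand-in for the task table."""

    def __init__(self, task_ids):
        self._lock = threading.Lock()
        self._status = {task_id: "queued" for task_id in task_ids}

    def claim(self, task_id):
        """Atomically move a task from queued to processing.

        Returns False if the task is unknown or no longer queued, so a
        second dispatcher holding the same task id will skip it.
        """
        with self._lock:
            if self._status.get(task_id) != "queued":
                return False
            self._status[task_id] = "processing"
            return True


def run_task(queue, task_id, execute):
    """Run the task only if this caller managed to claim it."""
    if not queue.claim(task_id):
        return False
    execute(task_id)
    return True


if __name__ == "__main__":
    queue = TaskQueue([101, 102])
    executed = []
    print(run_task(queue, 101, executed.append))  # first dispatcher claims it
    print(run_task(queue, 101, executed.append))  # second dispatcher skips it
```

Doing the check and the status change under one lock is what prevents two dispatchers from both seeing the task as queued; a real implementation would need the equivalent guarantee from its storage layer.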
http://drupalcode.org/project/hostmaster.git/commitdiff/7dbf534a35eb4755...
This all seems fairly wrong though. Maybe I should be looking at Steven's dispatcher instead of messing around like this. One problem here is that the concept of "N tasks every X time" changes radically: it used to mean "run a maximum of N tasks in parallel", and now it *really* means "run N tasks every X time", or more precisely, "run a maximum of N tasks every X time".
Feedback very welcome. We'll test those two patches in production tomorrow.
Comment #2
omega8cc commented: I found it too dangerous a long time ago, and the only really safe way to force the tasks to run in sequence rather than in parallel is:
drush vset --always-set hosting_queue_tasks_items 1
This is also important because even the platform verify task (which should be fairly safe to run in parallel) can cause Nginx to freeze if you have the Nginx cache enabled and it receives many parallel config reload requests.
Of course, we should really get rid of that crazy tar + gzip method used on migrate/clone anyway (and use it only for remote stuff + the backup task).
Comment #3
anarcat commented: I reverted the first patch. I'll try the daemon from darthsteven instead; see https://drupal.org/project/hosting_queue_runner and related issues.
The second patch broke Jenkins, but that's because we're deprecating a feature; we need to fix Jenkins. Obviously, this can't go in 1.x.
Comment #4
Steven Jones commented: So we have a pretty cool queue runner over here:
http://drupal.org/project/hosting_queue_runner
which runs the queue serially, and executes tasks within a few seconds of them being created, basically solving this issue. So we should totally just have it in Aegir core for 2.x.
Note that the queue runner only executes the task queue and not the other queues, so you still need the pesky crontab entry and the dispatcher being started up every N minutes. Could we factor that out into a runner too? I suspect that we can't, and we might have to recommend that people install both.
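To make the "runs the queue serially, within a few seconds" idea concrete, here is a generic sketch of such a loop. It is not the hosting_queue_runner code: the function and parameter names are hypothetical, and `max_idle_polls` exists only so the example terminates (a real daemon would loop until stopped).

```python
import time
from collections import deque


def run_queue(fetch_next_task, execute, poll_seconds=1.0,
              max_idle_polls=None, sleep=time.sleep):
    """Execute tasks one at a time, polling for new ones when idle.

    fetch_next_task() returns the next queued task, or None if the queue
    is empty. Returns the number of tasks executed.
    """
    executed = 0
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        task = fetch_next_task()
        if task is None:
            # Nothing queued: wait a little, then look again.
            idle_polls += 1
            sleep(poll_seconds)
            continue
        idle_polls = 0
        execute(task)  # blocks until done, so tasks never overlap
        executed += 1
    return executed


if __name__ == "__main__":
    pending = deque(["verify site-a", "migrate site-b", "backup site-c"])
    done = []
    count = run_queue(
        lambda: pending.popleft() if pending else None,
        done.append,
        poll_seconds=0.1,
        max_idle_polls=1,
    )
    print(count, done)
```

Because `execute` blocks, at most one task runs at any time, and a new task waits at most roughly one poll interval plus whatever is already running, instead of waiting for the next cron dispatch.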
Comment #5
ergonlogic commented: +1, hosting_queue_runner makes a world of difference efficiency-wise. No more waiting or having to run "drush @hostmaster hosting-tasks".
Comment #6
anarcat commented
Comment #7
Steven Jones commented
Comment #8
Steven Jones commented
Comment #9
anarcat commented: This still needs to be merged in 2.x. It also needs to be ported to Drush 5:
See the patch in #1433406: Make hosting (specifically hosting-dispatch) work with Drush 5 for inspiration.
Comment #10
anarcat commented: I merged this as a submodule in 2.x.
I used the following howto to merge the code in, with history: https://wiki.koumbit.net/FormationGitAvanc%C3%A9e#Subtrees
Further changes can be merged in with a similar technique.