At Koumbit we have platforms with around 200 sites in them. When we batch migrate those, the queue gets totally clogged, the server load often passes the critical threshold set in the backend, and the tasks start failing miserably.

This happens every Thursday. The main problem seems to be the tarring and untarring of backups, combined with too many tasks running in parallel.

One workaround we are trying now is to run only one task at a time. Another I would like to try is to run the task queue serially instead of in parallel, which would give the server a better chance of recovering from big task loads.

Comments

anarcat’s picture

Status: Active » Needs review

I committed a simple fix in 2.x:

http://drupalcode.org/project/hostmaster.git/commitdiff/3cda87bc2984e4ea...

The idea here is to try to run the queue serially. We may still end up with concurrent tasks: if the N tasks per run take longer than the delay X between cron runs, then M+1 tasks will be running in parallel at the next dispatch, where M is the number of tasks still running from previous runs. For example, say I have 5 tasks per run every minute. If the first 3 of those tasks take more than one minute between them, then at the next run 2 will have finished, 1 will be running and 2 will still be queued. The 1 running task will be skipped by the next run, but the two others will be picked up, and so 2 tasks will be running in parallel. On the run after that, if those tasks are still not finished, that will be 3 tasks.

By comparison, in the current scenario all tasks are started in parallel, so the first run has 5 tasks running in parallel. If we're optimistic and say that they perform as well as they would serially, and 2 tasks succeed, we'll still have 3 tasks running in parallel, and the next run will start 2 more, bringing us back to our full-blown 5 tasks in parallel.
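To make the comparison concrete, here is a rough simulation sketch in Python. It is not Aegir code: the function names, the tick-based clock and the assumption that every runner pulls from one shared queue are all mine, and it optimistically assumes tasks take the same time whether or not others run alongside them.

```python
from collections import deque


def serial_peak(durations, per_run, interval):
    """Peak concurrency when each cron run works through its batch serially.

    Hypothetical model: every `interval` ticks a new runner starts, and each
    runner executes at most `per_run` tasks, one after another.
    """
    queue = deque(durations)
    runners = []  # each entry: [tasks left in budget, ticks left on current task]
    peak = tick = 0
    while queue or runners:
        if tick % interval == 0 and queue:
            runners.append([per_run, 0])  # cron fires: a new runner starts
        for runner in runners:
            if runner[1] == 0 and runner[0] > 0 and queue:
                runner[1] = queue.popleft()  # idle runner takes the next queued task
                runner[0] -= 1
        runners = [r for r in runners if r[1] > 0]  # drop runners with nothing to do
        peak = max(peak, len(runners))
        for runner in runners:
            runner[1] -= 1
        tick += 1
    return peak


def parallel_peak(durations, max_parallel, interval):
    """Peak concurrency when each cron run tops up to `max_parallel` tasks at once."""
    queue = deque(durations)
    running = []  # ticks left for each running task
    peak = tick = 0
    while queue or running:
        if tick % interval == 0:
            while queue and len(running) < max_parallel:
                running.append(queue.popleft())
        peak = max(peak, len(running))
        running = [t - 1 for t in running if t > 1]
        tick += 1
    return peak


# Five 25-second tasks, 5 tasks per run, cron every 60 seconds.
print(serial_peak([25] * 5, per_run=5, interval=60))         # 2
print(parallel_peak([25] * 5, max_parallel=5, interval=60))  # 5
```

With these made-up numbers the serial variant overlaps only when a second cron run starts before the first batch is done, while the parallel variant starts at the full 5 immediately.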

While that works well for verifies and so on, for migrates it is a catastrophe.

While writing this, I realized that the above patch by itself could make the dispatcher run tasks twice if fewer than N-1 out of N tasks (e.g. 4 out of 5) have finished by the time of the next cron run. The following patch ensures that non-queued tasks are not run by hosting-task, so the previous dispatch will skip them if the next one has picked them up:

http://drupalcode.org/project/hostmaster.git/commitdiff/7dbf534a35eb4755...
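The guard described above can be sketched roughly as follows. This is an illustration in Python, not the actual hostmaster patch: the `Task` class, the status names and the `run_task` function are hypothetical, and the `force` flag only mirrors the "unless we force it" wording of the commit message.

```python
from dataclasses import dataclass

# Hypothetical status names, not Aegir's real constants.
QUEUED, PROCESSING, DONE = "queued", "processing", "done"


@dataclass
class Task:
    name: str
    status: str = QUEUED


def run_task(task, force=False):
    """Run the task only if it is still queued, unless forced.

    Returns True if the task was run, False if it was skipped.
    """
    if task.status != QUEUED and not force:
        return False  # another dispatch already picked this task up
    task.status = PROCESSING
    # ... the real work (migrate, verify, ...) would happen here ...
    task.status = DONE
    return True


task = Task("migrate example.org")
print(run_task(task))              # True: the task was queued, so it runs
print(run_task(task))              # False: no longer queued, so it is skipped
print(run_task(task, force=True))  # True: forced re-run
```

Note that the check and the status change are two separate steps in this sketch, so two dispatchers hitting the same task at the same instant could still both run it; a real implementation would need to claim the task atomically.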

This all seems fairly wrong though. Maybe I should be looking at steven's dispatcher instead of messing around like this. One problem here is that the meaning of "N tasks every X time" changes radically: it used to mean "run a maximum of N tasks in parallel", whereas now it *really* means "run N tasks every X time", or more precisely, "run a maximum of N tasks every X time".

Feedback very welcome. We'll test those two patches in production tomorrow.

omega8cc’s picture

I found it too dangerous a long time ago, and the only really secure method to force the tasks to run in sequence and not in parallel is:

drush vset --always-set hosting_queue_tasks_items 1

This is also important because even a platform verify task (which should be rather secure to run in parallel) can cause Nginx to freeze if you have the Nginx cache enabled and it receives many parallel config reload requests.

Of course, we should really get rid of that crazy tar + gzip method used on migrate/clone anyway (and use it only for remote stuff + the backup task).

anarcat’s picture

Status: Needs review » Needs work

I reverted the first patch. I'll try the daemon from darthsteven instead, see https://drupal.org/project/hosting_queue_runner and related issues.

The second patch broke jenkins, but that's because we're deprecating a feature; we need to fix jenkins. Obviously, this can't go in 1.x.

Steven Jones’s picture

Title: batch migrates can easily DOS (denial of service) the aegir server on big platforms » Pull the queue runner into Aegir 2
Version: 6.x-0.4-alpha3 » 6.x-1.2
Category: bug » feature

So we have a pretty cool queue runner over here:

http://drupal.org/project/hosting_queue_runner

which runs the queue serially, and executes tasks within a few seconds of them being created, basically solving this issue. So we should totally just have it in Aegir core for 2.x.
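For readers unfamiliar with the approach, the general shape of such a runner is a long-lived loop that takes one task at a time and polls for new work when idle. The sketch below is only an illustration of that idea in Python, not the hosting_queue_runner implementation; `run_queue`, `poll_seconds` and `max_idle_polls` are names I made up, and a real daemon would loop forever rather than stop after a few idle polls.

```python
import time
from collections import deque


def run_queue(queue, execute, poll_seconds=1.0, max_idle_polls=3, sleep=time.sleep):
    """Process queued tasks strictly one at a time.

    The loop stops after `max_idle_polls` consecutive empty polls so that
    this sketch terminates; that limit is an assumption for the example.
    """
    results = []
    idle_polls = 0
    while idle_polls < max_idle_polls:
        if queue:
            task = queue.popleft()
            results.append(execute(task))  # the next task waits until this returns
            idle_polls = 0
        else:
            idle_polls += 1
            sleep(poll_seconds)  # short wait, so new tasks start within seconds
    return results


tasks = deque(["verify", "migrate"])
print(run_queue(tasks, str.upper, sleep=lambda seconds: None))  # ['VERIFY', 'MIGRATE']
```

Because only one task is ever in flight, the load stays bounded by the heaviest single task, and the short polling delay is what lets tasks start soon after they are created instead of waiting for the next cron run.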

Note that the queue runner only executes the task queue and not the other queues, so you still need the pesky crontab entry and the dispatcher being started up every N minutes. Could we factor that out into a runner too? I suspect that we can't, and we might have to recommend that people install both.

ergonlogic’s picture

+1, hosting_queue_runner makes a world of difference efficiency-wise. No more waiting or having to run "drush @hostmaster hosting-tasks"

anarcat’s picture

Issue tags: +aegir-2.0

Steven Jones’s picture

Version: 6.x-1.2 » 6.x-2.x-dev
Issue tags: -aegir-2.0

Steven Jones’s picture

Issue tags: +AUX Project

anarcat’s picture

This still needs to be merged in 2.x. It also needs to be ported to drush 5:

aegir@marcos:~/hostmaster-6.x-2.x/profiles/hostmaster$ drush @hostmaster hosting-queue-runner
PHP Fatal error:  Call to undefined function drush_backend_invoke() in /home/anarcat/src/hosting_queue_runner/hosting_queue_runner.drush.inc on line 95

Fatal error: Call to undefined function drush_backend_invoke() in /home/anarcat/src/hosting_queue_runner/hosting_queue_runner.drush.inc on line 95
Drush command terminated abnormally due to an unrecoverable error. [error]
Error: Call to undefined function drush_backend_invoke() in /home/anarcat/src/hosting_queue_runner/hosting_queue_runner.drush.inc, line 95

See the patch in #1433406: Make hosting (specifically hosting-dispatch) work with Drush 5 for inspiration.

anarcat’s picture

Status: Needs work » Fixed

I merged this as a submodule in 2.x.

I used the following howto to merge the code in, with history: https://wiki.koumbit.net/FormationGitAvanc%C3%A9e#Subtrees

Further changes can be merged in with a similar technique.

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit 3cda87b on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    try to run the queue serially to fix DOS problems, see #1189556
  • Commit 7dbf534 on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    do not run an unqueued task unless we force it, otherwise we may run...
  • Commit 3f99176 on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    Revert "try to run the queue serially to fix DOS problems, see #1189556...
