We just upgraded all managed Aegir instances to recent 1.2 release, which includes this commit: http://drupalcode.org/project/hostmaster.git/commit/9d16d8a and started receiving complaints about no longer running cron for sites with enabled Wget method. It worked perfectly stable with HEAD before this commit, and now, randomly, on different servers and Aegir instances, it no longer works.

I suspect there is something wrong with this change, but not sure how to debug this, because it happens on some servers only, without any visible reason.

Comments

anarcat’s picture

Do you see something in the watchdog?

Is it just the site or also aegir that believe the cron hasn't ran?

omega8cc’s picture

When running it manually, it always displays weird info for hosting-cron: found running tasks, starting 1 out of 1 items:

dispatching queues [0.31 sec, 22.44 MB]                                                                                                [notice]
found  running tasks, starting 1 out of 1 items [0.32 sec, 24.01 MB]                                                                   [notice]
Calling system((/usr/local/bin/php /var/aegir/drush/drush.php --php='/usr/local/bin/php'  --items='1' --quiet --root='/data/disk/o1/aegir/distro/001' --uri='o1.linode.us.host8.biz' hosting-cron --backend  2>&1 &) > /dev/null);
found 0 running tasks, starting 1 out of 1 items [0.34 sec, 24.01 MB]                                                                  [notice]
Calling system((/usr/local/bin/php /var/aegir/drush/drush.php --php='/usr/local/bin/php'  --items='1' --quiet --root='/data/disk/o1/aegir/distro/001' --uri='o1.linode.us.host8.biz' hosting-tasks --backend  2>&1 &) > /dev/null);
Command dispatch complete [0.36 sec, 23.97 MB]                                                                                         [notice]
 Timer  Cum (sec)  Count  Avg (msec) 
 page   0.196      1      195.8      

Peak memory usage was 24.18 MB [0.36 sec, 23.97 MB]                                                                                    [memory]
o1@linode:~$ 

And this on fresh install, with one site created.

omega8cc’s picture

This time it is also Aegir. It doesn't run the cron for sites for, say, 1 day, and it correctly displays on every site node that the cron was run 1 day ago - of course also for the hostmaster site.

anarcat’s picture

Can you try running hosting-cron --debug to see what happens there?

omega8cc’s picture

Hah, it shows more:

o1@server:~/aegir/distro/004/sites/aegir.domain.com$ drush hosting-cron --debug
Bootstrap to phase 0. [0.03 sec, 1.37 MB]                            [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drush() [0.03 sec, 1.51 MB] [bootstrap]
Bootstrap to phase 5. [0.12 sec, 5.72 MB]                                                  [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_root() [0.12 sec, 5.72 MB]                 [bootstrap]
Loading drushrc "/data/disk/o1/aegir/distro/004/drushrc.php" into "drupal" scope. [0.12    [bootstrap]
sec, 5.72 MB]
Initialized Drupal 6.22 root directory at /data/disk/o1/aegir/distro/004 [0.15 sec, 6.69      [notice]
MB]
Drush bootstrap phase : _drush_bootstrap_drupal_site() [0.15 sec, 6.69 MB]                 [bootstrap]
Initialized Drupal site aegir.domain.com at sites/aegir.domain.com [0.15 sec,     [notice]
6.69 MB]
Loading drushrc "/data/disk/o1/aegir/distro/004/sites/aegir.domain.com/drushrc.php"  [bootstrap]
into "site" scope. [0.15 sec, 6.7 MB]
Drush bootstrap phase : _drush_bootstrap_drupal_configuration() [0.17 sec, 7.03 MB]        [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_database() [0.18 sec, 7.06 MB]             [bootstrap]
Successfully connected to the Drupal database. [0.18 sec, 7.06 MB]                         [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_full() [0.18 sec, 7.32 MB]                 [bootstrap]
Bootstrap to phase 6. [0.31 sec, 14.03 MB]                                                 [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_login() [0.31 sec, 14.03 MB]               [bootstrap]
Successfully logged into Drupal as Anonymous (uid=0) [0.31 sec, 14.03 MB]                  [bootstrap]
Found command: hosting-cron (commandfile=hosting) [0.31 sec, 14.03 MB]                     [bootstrap]
Warning: The command callback function hosting_run_queue has a deprecated name.  It must      [notice]
begin with drush_hosting_.  Skipping hook functions. [0.31 sec, 14.04 MB]
running cron on URL <em>http://somedomain.com/cron.php</em> [0.35 sec, 15.43      [notice]
MB]
WD hosting_cron: cron failed on site somedomain.com with error Not Found [0.42    [notice]
sec, 15.5 MB]
Command dispatch complete [0.43 sec, 15.48 MB]                                                [notice]
 Timer  Cum (sec)  Count  Avg (msec) 
 page   0.263      1      262.62     

Peak memory usage was 15.55 MB [0.43 sec, 15.48 MB]                                           [memory]
o1@server:~/aegir/distro/004/sites/aegir.domain.com$ 
omega8cc’s picture

Now, after I run drush hosting-cron --debug on this server with over 110 sites hosted, the cron queue unlocked and runs after 2 days and 2 hours being locked. I will monitor the web server log to see when/if it stops again.

anarcat’s picture

Status: Active » Postponed (maintainer needs more info)

Okay, let us know how that goes.

For the record, ever since we implemented that patch at koumbit, cron has ran reliably everywhere and we at least had consistency between the frontend and the sites when cron wouldn't run for some reason.

omega8cc’s picture

OK, that was quick.

When the queue hits the first site which is switched to offline mode (so error 403), it got stuck again and tries to run the cron *only for this site*, on every new run, forever:

o1@server:~/aegir/distro/004/sites/aegir.domain.com$ drush hosting-cron --debug
Bootstrap to phase 0. [0.03 sec, 1.37 MB]                            [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drush() [0.48 sec, 1.51 MB] [bootstrap]
Bootstrap to phase 5. [0.57 sec, 5.72 MB]                                                                                        [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_root() [0.57 sec, 5.72 MB]                                                       [bootstrap]
Loading drushrc "/data/disk/o1/aegir/distro/004/drushrc.php" into "drupal" scope. [0.57 sec, 5.72 MB]                            [bootstrap]
Initialized Drupal 6.22 root directory at /data/disk/o1/aegir/distro/004 [0.6 sec, 6.69 MB]                                         [notice]
Drush bootstrap phase : _drush_bootstrap_drupal_site() [0.6 sec, 6.69 MB]                                                        [bootstrap]
Initialized Drupal site aegir.domain.com at sites/aegir.domain.com [0.6 sec, 6.69 MB]                                   [notice]
Loading drushrc "/data/disk/o1/aegir/distro/004/sites/aegir.domain.com/drushrc.php" into "site" scope. [0.6 sec, 6.7 MB]   [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_configuration() [0.62 sec, 7.03 MB]                                              [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_database() [0.63 sec, 7.06 MB]                                                   [bootstrap]
Successfully connected to the Drupal database. [0.63 sec, 7.06 MB]                                                               [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_full() [0.63 sec, 7.32 MB]                                                       [bootstrap]
Bootstrap to phase 6. [0.79 sec, 14.03 MB]                                                                                       [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drupal_login() [0.79 sec, 14.03 MB]                                                     [bootstrap]
Successfully logged into Drupal as Anonymous (uid=0) [0.79 sec, 14.03 MB]                                                        [bootstrap]
Found command: hosting-cron (commandfile=hosting) [0.79 sec, 14.03 MB]                                                           [bootstrap]
Warning: The command callback function hosting_run_queue has a deprecated name.  It must begin with drush_hosting_.  Skipping       [notice]
hook functions. [0.79 sec, 14.04 MB]
running cron on URL <em>http://protectedsite.cc/cron.php?cron_key=OCELkvSon_CxBedIRAnAyUztyb5jERWOg3zs-JuEvl8</em> [0.84 sec,    [notice]
15.43 MB]
WD hosting_cron: cron failed on site protectedsite.cc with error Forbidden [0.87 sec, 15.53 MB]                                  [notice]
Command dispatch complete [0.87 sec, 15.48 MB]                                                                                      [notice]
 Timer  Cum (sec)  Count  Avg (msec) 
 page   0.249      1      248.66     

Peak memory usage was 15.57 MB [0.9 sec, 15.48 MB]                                                                                  [memory]
o1@server:~/aegir/distro/004/sites/aegir.domain.com$ 
anarcat’s picture

Assigned: Unassigned » anarcat
Status: Postponed (maintainer needs more info) » Active

I see, this is indeed what the code does.

If you try to run more than one task at a time, a single broken site will not block the queue anymore, so that's a first workaround. Oddly enough, disabling the site will also fix that problem.

Note that this was a solution to another fairly annoying bug (but not as critical): #1197048: cron uses external command; cron only run if canonical hostname exists.

I'll work on a patch now.

anarcat’s picture

Status: Active » Needs review

Please try commit [f46c7ce63d761ee933d548dab462dfa850086789] I just pushed to head, it should fix the problem.

omega8cc’s picture

Status: Needs review » Needs work

This doesn't help at all. The queue is still locked and runs the same site's cron with the same error 403, every minute.

I don't think it is related to the number of crons set to run at once. There is no such setting for cron queue in Aegir anyway, only for tasks queue. In this case it normally runs cron for 20 sites at every run, when I set the frequency to 1 minute. Yet, it doesn't help.

I believe that we need to properly handle any non-200 response to fix this issue.

anarcat’s picture

Status: Needs work » Needs review

I committed a new fix for this.

omega8cc’s picture

Status: Needs review » Reviewed & tested by the community

This commit http://drupalcode.org/project/hostmaster.git/patch/cb0c2d2 fixes the issue.

Now it just runs further after hitting any site with 403 response.

Thanks!

anarcat’s picture

Status: Reviewed & tested by the community » Fixed

merged into 1.x.

anarcat’s picture

Issue tags: +hotfix

For those stumbling upon this problem in 1.2, you can apply this patch directly:

http://drupalcode.org/project/hostmaster.git/patch/734d49b07e4dc174bcd07...

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit f46c7ce on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    #1222208 fix cron to not block on broken sites
    
    we explode the listing...
  • Commit cb0c2d2 on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    #1222208 really fix cron
    
    that was a stupid mistake - break would just...

  • Commit f46c7ce on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    #1222208 fix cron to not block on broken sites
    
    we explode the listing...
  • Commit cb0c2d2 on 7.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-588728-views-integration, dev-1403208-new_roles, dev-helmo-3.x by anarcat:
    #1222208 really fix cron
    
    that was a stupid mistake - break would just...