When cron.php is called, the harvesting kicks in and redirects you to harvester/schedule/list when it is done. In my opinion, harvesting should start after cron has finished, or at least not hijack the process.

/admin/reports/status warns me that cron hasn't run in weeks :(

Comments

BarisW

Version: 6.x-2.x-dev »
pkiraly

Assigned: Unassigned » pkiraly
pkiraly

Hi Baris,

We know that this is a problem. We are looking for solutions, but we have not found a good one yet. Our experiments so far:
- supercron module - it did not work well
- MT (multithreaded) cron module - it works only on *nix, and requires patching Drupal core
- simply copying cron.php to cron-xc.php (or something like that) and changing it so that it launches another cron-like hook

If you have another idea, we are ready to try it.

BarisW

Hi Peter,

I'm no Drupal dev guru; there are other ninjas out here who could help you :)
Would it be an idea to let cron.php open the harvesting page in a new window, so that cron can continue doing its tasks and exit when it's done?

mtwesley

Baris,

The problem isn't exactly with cron, but with Drupal's Batch API. There seem to be two ways to do a batch:

  1. Single HTTP request on the page load
  2. Multiple HTTP requests with a special "batch.php" page

The first would time out unless we embarked on the questionable practice of extending the PHP script's execution time indefinitely. The second has one nasty problem: batch_process() calls drupal_goto(), which immediately redirects to the batch page (with the long progress bar). So, cron is interrupted and never fully runs.

I don't think opening a new window is possible: the script runs on the server side, PHP doesn't have real support for multiple threads, and calling another script with either fopen() or cURL would still hold up the current request while that script runs. However, there may be a similar solution using tricky "asynchronous PHP" methods, or really just trying to load a page in the background.

OPTION 1: As Peter mentioned, one solution is to distribute our own PHP script and have the admin configure crontab on UNIX/Linux, or a scheduled task on Windows, to launch it periodically alongside cron.php.

OPTION 2: Not sure if this works, but it might. From hook_cron(), we could try opening a socket to the host and port where the Drupal site is running, writing a simple GET request to the server by requesting the script that would start the batch in the background, refusing to wait for a response, and eventually closing the socket. The script that starts the batch could do "ignore_user_abort(true)" and "set_time_limit(0)", which should let it continue to run even when the socket is closed. Then it could sleep for 5 seconds and launch batch_process().
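To make the "write the request and don't wait" part of Option 2 concrete, here is a rough, language-neutral sketch in Python rather than Drupal/PHP code. The function names and the path passed in are hypothetical, and it only shows the client side: whether the server-side script keeps running after the socket closes still depends on ignore_user_abort() and the web server's behaviour, which this does not prove.

```python
import socket


def build_get_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request that asks the server to close the connection."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fire_and_forget(host: str, port: int, path: str, timeout: float = 2.0) -> None:
    """Send the GET request, then close the socket without reading any response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        # No recv() here on purpose: the caller (cron) carries on immediately.
```

The cron hook would call something like fire_and_forget("localhost", 80, "/some/batch-start-path") and return straight away; the path is a placeholder for whatever script ends up starting the batch.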

I'd vouch for Option 1, since it feels more natural, but Option 2 would be cool if it works.

BarisW

Hmm, so the problem is that the harvest can take a long time?

I'd say that it should only use the batch.php page on the first run. So after I set up a repository, I could kick off the harvesting and wait a night for all articles to be harvested. After that we should only get the new or changed articles, which shouldn't take as long as a full harvest.

Maybe you could use the same method as Simplenews, Pathauto or Drupal search do: let the admin choose how many articles are harvested per cron run (say, max 100). It's up to the admin to decide how many times a day cron is called. And by limiting the number of articles, maybe it would be possible not to use the batch page at all?
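Roughly what I mean, as a small Python sketch rather than real module code. The names harvest_on_cron, pending and harvest_one are made up for illustration, and I'm assuming the list of records still to be harvested can be kept between cron runs.

```python
from collections import deque
from typing import Callable, Deque


def harvest_on_cron(
    pending: Deque[str],
    harvest_one: Callable[[str], None],
    max_items: int = 100,
) -> int:
    """Harvest at most max_items records; whatever is left waits for the next cron run."""
    done = 0
    while pending and done < max_items:
        harvest_one(pending.popleft())
        done += 1
    return done
```

With 250 records waiting and a limit of 100, three cron runs would clear the queue, and no single run has to redirect to the batch page or run long enough to time out.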

I really need this to work, as our site is already live :0

dynamind

Status report stated that cron hadn't run for 4 weeks! I ran it manually, but after an hour or so the harvester exited with an out-of-memory error. I was not able to restart cron afterwards, and the status report still says cron hasn't run.

We are getting complaints because our archives are not up to date. Our only option right now seems to be to harvest manually.