I have a set of operations that are handled using the Batch API. It works like a charm when called manually.

Now I need my Batch API implementation to do the same thing during a cron run. I'm trying to do this with hook_cron, using this code:

<?php
function my_module_cron() {
  watchdog('Test Raf', "Cron hook from my_module...");
  my_module_batch_init_start();
  watchdog('Test Raf', "Finished batch...");
}
?>

my_module_batch_init_start() is the first of a set of 3 batches to run. That function declares the first batch, defines all operations with the right parameters, along with the finish function to call and the finish page to go to (which sets off the second batch job) and kicks it all off.

Now, when I run cron, I get this in the log (reversed, so the first entry appears at the top):
Cron hook from my_module...
Cron run exceeded the time limit and was aborted.

Ehhhhmmm... That's not what's supposed to happen when using the Batch API...

I did some research online.

Some guy had a similar problem and suggested using the Drupal_Queue module (see his post here: http://svendecabooter.be/blog/performing-batch-data-operations-on-cron-r...)

Another result showed that it worked just fine for another person (see here: http://drupal.org/node/616516 ). That guy fixed the problem by setting a finished call and function. I got that, though.

So I get conflicting results on whether or not this is possible...

What I'm doing during the batch operation, is syncing data between two sites using their API, the Drupal site with my module being the middle man passing everything on from one point to the other and back. There's no fancy-shmancy Javascript or other client-based stuff going on (except maybe the progress bar the Batch API sets by default, which didn't seem to be a problem for the guy in the second result). It's all just API calls and more API calls.

Does anybody here know if, in my case, the first guy's right and the second one's not, or if it's possible to get the thing working like the second guy did, and if so, how? Cause right now, I'm hoping I can get it working using hook_cron. Rewriting the entire batch into another API is going back half of my module's development.

Comments

Raf’s picture

Crap... I guess I'll have to rewrite a lot...

Found commentary of John VanDyk on this (here: http://tinyurl.com/39hsov3 -- URL shortened because it's ridiculously long). Basically, he makes it come down to this:

  • Batch API is for when you want to do bulk operations with an interface
  • Queue is for when you want to do bulk automatically

Of course, if VanDyk says so, it must be true (and I actually mean that. No reading sarcastic undertones or such!)

So in other words: If I want a progress bar, I use Batch API. If I want to run it during cron, I use Queue.

Since I need to do both, I'm afraid I'll have to use both. Two APIs that do the same thing... Time to salvage what code I can and see if I can separate the actual fetching / processing / sending of data in such a way that both APIs can use it.

Guess I'd better go learn Queue now.

Raf’s picture

Just as reference: The slides VanDyk uses in his presentation during DrupalCon SF 2010 can be found here: http://www.ent.iastate.edu/it/Batch_and_Queue.pdf

Now back to finding info on the Queue API.

Example module can be found here:
http://api.drupal.org/api/examples/queue_example--queue_example.module/7
(Again, mostly for reference, both for people needing this as well and for myself, cause I'll have forgotten the link by Monday)

Turns out that link has a link to the slides and a video of that DrupalCon presentation as well.

tmrhymer’s picture

I'd like to utilize queue to automate processing a couple of CSV files once every morning. I have a form set up that handles this logic perfectly using the Batch API. Using the context object I can process whatever number of records I want at a time, save my place in the file, and then reload that place on the next runs. Does anyone have any examples of converting something similar using the drupal_queue module? I understand the concept of a queue and have watched VanDyk's video, but I'm not sure how to utilize it for my needs. Any help would be greatly appreciated.
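
To make the "save my place in the file" idea concrete, here is a minimal sketch in plain PHP, independent of both Batch and Queue. The function name mymodule_csv_read_chunk() and the shape of its return value are assumptions made up for illustration; they don't come from any module. It reads a limited number of rows from a byte offset and reports where the next run should resume:

```php
<?php
/**
 * Reads up to $limit CSV rows, starting at byte position $offset.
 *
 * Hypothetical helper. Returns FALSE if the file can't be opened, otherwise
 * array('rows' => ..., 'offset' => next byte position, 'done' => bool).
 */
function mymodule_csv_read_chunk($path, $offset, $limit) {
  $handle = fopen($path, 'r');
  if ($handle === FALSE) {
    return FALSE;
  }

  // Jump to where the previous run stopped.
  fseek($handle, $offset);

  $rows = array();
  // Explicit delimiter, enclosure and escape; adjust them to your file.
  while (count($rows) < $limit && ($row = fgetcsv($handle, 0, ',', '"', '\\')) !== FALSE) {
    $rows[] = $row;
  }

  // Remember the position so the next call can resume from here.
  $next = ftell($handle);
  fclose($handle);

  // Fewer rows than requested means the end of the file was reached.
  return array('rows' => $rows, 'offset' => $next, 'done' => count($rows) < $limit);
}
```

The returned offset is the only state that has to survive between runs, so it can live wherever is convenient: in the batch's $context, in a queue item, or in a stored variable.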

Raf’s picture

I managed to do this a few weeks ago. It's pretty simple and straightforward once you know how to do it.

This is how I did it:
Attention: This is done using the drupal_queue module for Drupal 6. For Drupal 7, the queue part is similar, but with a slight difference when actually setting the queue up. Dunno if there's a difference for the Batch part.

The batch

Take the actual logic out of the batch and place it in separate functions. All the batch then does is wrap these functions. Your code will now look like this:

<?php
/**
 * BATCH
 */

// Batch definition
function mymodule_mybatch() {
  $lotsa_data = mymodule_logic_get_data();

  // Start with an empty list, so $operations is defined even without data.
  $operations = array();
  foreach ($lotsa_data as $one_data) {
    $operations[] = array('mymodule_mybatch_first_operation', array($one_data));
    $operations[] = array('mymodule_mybatch_second_operation', array());
  }

  $batch = array(
    'operations' => $operations,
    'title' => t('My batch'),
    'init_message' => t('Initializing'),
    'error_message' => t('Whoopseedaisy'),
    'finished' => 'mymodule_mybatch_finished',
  );

  batch_set($batch);
}



// Batch callbacks
function mymodule_mybatch_first_operation($data, &$context) {
  $my_variable = mymodule_logic_function_one($data);
  $context['message'] = t('First operation');
  // Use 'results' rather than 'sandbox': the sandbox is emptied whenever an
  // operation finishes, so it can't carry data over to the next operation.
  $context['results']['my_stuff'] = $my_variable;
}



function mymodule_mybatch_second_operation(&$context) {
  $my_variable = $context['results']['my_stuff'];
  mymodule_logic_function_two($my_variable);
  $context['message'] = t('Second operation');
}



// Batch finish callback
function mymodule_mybatch_finished($success, $results, $operations) {
  $message = t('All done!');
  drupal_set_message($message);
}



/**
 * ACTUAL LOGIC
 */
function mymodule_logic_get_data() {
  // Get some data
  // Return some data
}



function mymodule_logic_function_one($data) {
  // Do some stuff
  // Return something
}



function mymodule_logic_function_two($my_variable) {
  // Do some stuff
}
?>

The queue

  1. Make wrapper functions around the logic functions you separated, similar to the operation callbacks for the batch. For a queue, they're worker callbacks, and always take the parameter $item (which is the queue item that's being handled).
     
  2. Define your queues using hook_cron_queue_info().

    Queues can't exchange data the way the Batch API does (there's no $context equivalent), and you can't assign a worker callback's result to the $item of another queue, because all $items are defined before the queues run, not while they're running. So put all operations for a single item in one worker callback. You could probably get around this with a bit of experimenting, but this way's just simpler, and due to the Queue API's nature, a workaround doesn't guarantee stability.

    If you have several operations that don't depend on the results of the previous one, you can set up separate queues for them.
     

  3. Finally, actually set your queue up. You can do this in a separate function and call it in hook_cron. For the sake of not cluttering everything up, I'm doing this in hook_cron directly.

 
This code will look like this:

<?php
/**
 * QUEUE
 */

// Queue definition -- implementation of hook_cron_queue_info()
function mymodule_cron_queue_info() {
  $queue['myqueue'] = array('worker callback' => 'mymodule_myqueue_worker_one');
  return $queue;
}



// Worker callbacks
function mymodule_myqueue_worker_one($item) {
  $my_variable = mymodule_logic_function_one($item);
  mymodule_logic_function_two($my_variable);
}



// Implementation of hook_cron().
function mymodule_cron() {
  // Create the actual queue
  $myqueue = drupal_queue_get('myqueue');
  $myqueue->createQueue();

  // Get all data for your queue
  $lotsa_data = mymodule_logic_get_data();

  // Fill the queue
  foreach ($lotsa_data as $one_data) {
    $myqueue->createItem($one_data);
  }

  // Run the queue
  drupal_queue_cron_run();
}
?>

 
That's it! All done. Depending on what you need to do, I suggest also implementing a way that checks whether or not an item's been handled already. That way, you won't endlessly queue the same things up for no good reason. If you always need to handle the same single file completely from beginning to end, then you don't need to.

chrisshattuck’s picture

This was a thoughtful example, well done! One thing I noticed when I implemented this was that the hook (at least for D6, unless I'm missing something) is actually hook_cron_queue_info(), and not hook_queue_info(). Also, when you add this to the docs, it might help to clarify that this is code from two different modules, or remove the '_myqueue' from the function name to make it just mymodule_hook...

Thanks again for this, it was helpful today!

wojtha’s picture

Thanks RaF for the example, it is very helpful.

However, there is a problem with the queue cron processing: it will not work in some cases where a large dataset needs more than one cron run to get through the queue.

Calling drupal_queue_cron_run() in your own hook_cron() implementation isn't needed, since Drupal Queue launches it itself by default. If you don't disable the drupal_queue_on_cron setting, drupal_queue_cron_run() could be called twice during one cron run.

/**
 * Implementation of hook_cron().
 */
function drupal_queue_cron() {
  if (variable_get('drupal_queue_on_cron', TRUE)) {
    drupal_queue_cron_run();
  }

  // etc ...
}

When you need to process large datasets using Drupal Queue I see four basic possibilities:

A) Set the "drupal_queue_on_cron" setting to FALSE and use your hook_cron implementation only for creating the queue, not for processing it. For the queue processing, copy drupal_queue_cron.php from the Drupal Queue module directory to the Drupal root directory and define a separate crontab/job schedule call to that file. Call this file more frequently than cron.php (the number of drupal_queue_cron.php calls per cron.php run depends on how big the queued job is).

B) Define some threshold - time limit - for the queued job.

/**
 * Implementation of hook_cron()
 */
function mymodule_cron() {
  $threshold = variable_get('mymodule_cron_threshold', 3600);
  $last_run = variable_get('mymodule_job_last_run', 0);

  if ((time() - $last_run) > $threshold) {
    // Set start time
    variable_set('mymodule_job_last_run', time());

    // Create the actual queue 
    // Load operations definition
    // Fill the queue
  }

  // Run the queue - DISABLED - See drupal_queue_cron()
  // drupal_queue_cron_run();
}

C) Example B is a kind of time-based semaphore. We could take this further and implement a semaphore that doesn't allow a new queued job to be created until the last queued job is done.

/**
 * Implementation of hook_cron()
 */
function mymodule_cron() {
  // Default to TRUE, so the very first cron run is allowed to create a job.
  if (variable_get('mymodule_job_done', TRUE)) {
    // Create the actual queue
    // Load operations definition
    // Fill the queue

    // A new job is defined, so set the semaphore back to "job unprocessed".
    variable_set('mymodule_job_done', FALSE);
  }

  // Run the queue - DISABLED - See drupal_queue_cron()
  // drupal_queue_cron_run();
}

// Call this at the end of the last queued operation.
function mymodule_last_operation() {
  variable_set('mymodule_job_done', TRUE);
}

D) Combine B and C: time limit + semaphore for the queued job. This implementation creates a new queued job once a predefined time period has passed since the end of the last processed job. NOTE: the variable 'mymodule_job_last_run' is set to the current time in the last operation of the queue.

/**
 * Implementation of hook_cron()
 */
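function mymodule_cron() {
  $threshold = variable_get('mymodule_cron_threshold', 3600);
  $last_run = variable_get('mymodule_job_last_run', 0);
  // Semaphore as in C; defaults to TRUE so the first job can be created.
  $job_done = variable_get('mymodule_job_done', TRUE);

  if ($job_done && (time() - $last_run) > $threshold) {
    // Create the actual queue
    // Load operations definition
    // Fill the queue

    // Block new jobs until the last operation has run.
    variable_set('mymodule_job_done', FALSE);
  }

  // Run the queue - DISABLED - See drupal_queue_cron()
  // drupal_queue_cron_run();
}

function mymodule_last_operation() {
  variable_set('mymodule_job_done', TRUE);
  variable_set('mymodule_job_last_run', time());
}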

Tip for exporting data to a file: if you're exporting data using a queue, write the new data to a temporary file first, and replace the original file with the new one in the last step. That way parsers never get to read incomplete data, which could lead to 'fatal' parsing errors when the export file is a complex data structure, e.g. an XML file.
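
A minimal sketch of that tip in plain PHP. The two function names and the ".tmp" suffix are assumptions for illustration only, not part of Drupal Queue. Each queued operation appends to the temporary file, and the last operation swaps it into place:

```php
<?php
/**
 * Appends one chunk of export data to the temporary file.
 *
 * Hypothetical helper: the live export at $target is never touched here.
 */
function mymodule_export_append($target, $chunk) {
  return file_put_contents($target . '.tmp', $chunk, FILE_APPEND | LOCK_EX) !== FALSE;
}

/**
 * Replaces the live export with the finished temporary file.
 *
 * Hypothetical helper, meant to be called from the last queued operation.
 */
function mymodule_export_finish($target) {
  // Both paths are in the same directory, so this is a single rename.
  return rename($target . '.tmp', $target);
}
```

On POSIX systems a rename within one filesystem is atomic, so a parser sees either the old file or the complete new one. Keeping the temporary file next to the target is what keeps both on the same filesystem.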

Raf’s picture

Thanks for the clarifications! I didn't realize Drupal Queue already implements hook_cron. It's indeed best to leave that out then.

For your four basic possibilities: I considered them too when doing research for this.

A.
I dismissed A because I had to use shared hosting, and didn't want the module I was writing to be depending on whether or not your host allows you to set crontabs (which that host didn't allow). Alternative cron modules, like Poormanscron, didn't solve this problem, so I had to dismiss possibility A.

B.
I actually didn't consider this possibility, as I thought Drupal Queue takes care of this already. Your example code seems familiar, though. I might actually have seen it in Drupal Queue. That or I've seen it in core's cron implementation. Been a while, so I'm not sure which one of the two it was (looked deeply into both -- except in drupal_queue_cron_run)

C.
For some use cases, this one will be ideal, but as I said at the end of the tutorial:
Depending on what you need to do, I suggest also implementing a way that checks whether or not an item's been handled already. That way, you won't endlessly queue the same things up for no good reason. If you always need to handle the same single file completely from beginning to end, then you don't need to.
In some use cases, possibility C will definitely be the way to go.

In my own use case, I had to make sure new items could be stored in queue even if the queue isn't empty yet, but only if they haven't been in the queue before. That requires a different way of checking (basically, it came down to creating a new database table and tossing some primary keys in there to check up on -- which was done on queue creation and with in_array, so as not to constantly create needless database connections)
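
Roughly, that check can be sketched like this in plain PHP. The function name is made up for illustration, and it assumes the items are keyed by a unique ID and that the known keys were loaded from the table once, up front. It uses a flipped array with isset() instead of in_array(), which gives the same result with cheaper lookups:

```php
<?php
/**
 * Returns only the items whose key has not been queued before.
 *
 * Hypothetical helper. $items is keyed by a unique ID (an assumption);
 * $known_keys is the list of IDs fetched once from storage.
 */
function mymodule_filter_new_items(array $items, array $known_keys) {
  // Flip once, so every lookup is an isset() instead of a linear scan.
  $seen = array_flip($known_keys);

  $new = array();
  foreach ($items as $key => $item) {
    if (!isset($seen[$key])) {
      $new[$key] = $item;
    }
  }
  return $new;
}
```

Only what this returns would then go through createItem(), and the keys of those items would be added to the table afterwards.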

The case of a single file, as mentioned in that last paragraph of the tutorial, won't benefit from either possibility C or how I solved it, cause the queue'll always be empty. Of course, that also depends on the nature of the operations performed on the file and on its size, I guess. So with a single file, there will also be cases that benefit from possibility C.

D.
Due to its dependency on C, see C.

All four solutions are good ones. Their benefit, or the need for another solution outside these four, is more of a per-case decision. I'll include them all (except for B if the semaphore's in Drupal Queue, but will also include it if I saw it in cron instead) as examples to help people on their way for that last paragraph.

rjbrown99’s picture

I also like Elysia Cron for this purpose. It allows you to run each cron hook on its own defined time interval, and as a bonus it tracks the state of any hooks that are running. You can set a timeout/stuck setting as well, so it will reset itself if the hook gets stuck. It's an easy/fast way to get a lot of what you are after without writing custom code (aside from your cron hook.)

a.milkovsky’s picture

Batch API is for when you want to do bulk operations with an interface
Queue is for when you want to do bulk automatically

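You can run Batch in cron like it was described here (in Russian):

/**
 * Implements hook_cron()
 */
function mymodule_cron() {
  batch_set(array(...));
  // Turn off progressive mode, so the batch runs within this one request
  // instead of redirecting to the progress bar page.
  $batch = &batch_get();
  $batch['progressive'] = FALSE;
  batch_process('');
}

Two APIs that do the same thing...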

I'm not sure about that. In my opinion, Batch offers more possibilities.
For example, say I have lots of users on a site (let's say 100000...) and I want to do something with them in cron.
Using the Batch API, I can select 10 users in each operation and process them.
But I can't do that with the Queue API. I'd have to select all users somewhere in code before cron and create queue items with 10 users in each item...
Another example: parsing some huge XML file on cron, etc. I only know how to do that with the Batch API.
Maybe someone can give me a hint on how to do it using queues?
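
One possible approach, as a sketch only: let a single queue item carry a cursor instead of the data itself. The worker handles one slice, and a follow-up item with the updated cursor is queued until nothing is left. Everything below is hypothetical (the function name, the item layout with 'last_id' and 'limit', and the two callbacks), and it's plain PHP, so it makes no claims about the Queue API itself:

```php
<?php
/**
 * Processes one slice and returns the next queue item, or NULL when done.
 *
 * Hypothetical helper.
 * $item:    array('last_id' => int, 'limit' => int), an assumed item layout.
 * $fetch:   callback($last_id, $limit), returns IDs above $last_id, ascending.
 * $process: callback($id), does the actual work for one record.
 */
function mymodule_process_slice(array $item, $fetch, $process) {
  $ids = call_user_func($fetch, $item['last_id'], $item['limit']);
  if (empty($ids)) {
    // Nothing left, so no follow-up item is needed.
    return NULL;
  }

  foreach ($ids as $id) {
    call_user_func($process, $id);
  }

  // The follow-up item starts after the last ID handled in this slice.
  return array('last_id' => end($ids), 'limit' => $item['limit']);
}
```

In a worker callback, the idea would be to call this with the claimed $item and, if the result isn't NULL, hand it to createItem() on the same queue. Whether re-queueing from inside a worker behaves well in your setup is something to test. The same pattern should carry over to a huge file by storing a byte offset in the item instead of an ID.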