This patch adds a new hook_cron_worker and implements it for aggregator. Note that, unlike the normal cron run, we do not lock this hook; there is no need, and locking would defeat the point. After this patch you can use as many web frontends as you have to fetch feeds in parallel. Win. Needs system API documentation. I doubt it needs tests; the aggregator tests will cover it.

Files:
- #60 578676-60_queue_locking.patch (6.31 KB, ksenzee) - PASSED: [[SimpleTest]]: [MySQL] 17,964 pass(es).
- #60 578676-60_queue_locking.interdiff.txt (1.27 KB, ksenzee)
- #55 578676-55_queue_locking.patch (6.71 KB, alex_b) - PASSED: [[SimpleTest]]: [MySQL] 17,594 pass(es).
- #54 queue.patch (3.6 KB, catch) - PASSED: [[SimpleTest]]: [MySQL] 17,539 pass(es).
- #45 578676-45_queued_flag.patch (4.61 KB, alex_b) - FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 578676-45_queued_flag.patch.
- #39 drupal.cron-queue-changelog-cleanup.patch (6.9 KB, sun) - Failed to apply patch.
- #36 cron_worker.patch (5.91 KB, chx) - Failed to apply patch.
- #30 cron_worker.patch (5.69 KB, chx) - Failed to apply patch.
- #25 cron_worker.patch (5.38 KB, chx) - Failed to apply patch.
- #24 cron_worker.patch (5.38 KB, chx) - Failed to apply patch.
- #22 cron_worker.patch (7 KB, chx) - Failed to apply patch.
- #11 cron_worker.patch (6.43 KB, chx) - Failed to apply patch.
- #9 cron_worker.patch (8.59 KB, chx) - Failed to apply patch.
- #7 cron_worker.patch (5.89 KB, chx) - Failed: 12920 passes, 12 fails, 0 exceptions.
- #6 cron_worker.patch (0 bytes, chx) - Passed: 14698 passes, 0 fails, 0 exceptions.
- cron_worker.patch (2.39 KB, chx) - Failed to apply patch.

Comments

Interesting.

Something feels a bit awkward about this -- it doesn't map 100% onto the models/patterns I'm used to. The fact that we add to the worker queue, and empty the worker queue in the same run of hook_cron, makes it odd. Normally there would be some separation or decoupling, along with some kind of notification system, but given that PHP doesn't support multi-threading ... I suppose this is how you use a queue in a single-threaded application.

But, for example, it means that if you have five thousand small tasks to do (e.g. fetching an RSS feed from hook_cron_worker, as is the case in this patch), you also have to execute hook_cron() five thousand times. Executing hook_cron() isn't cheap. So if I'm reading this patch right, you'd execute the MySQL query that checks for new feeds to fetch far too often -- worst case, five thousand times instead of once.

As I said, an interesting approach. Not necessarily bad, but it might have to grow on me a bit. I want to tinker a bit with alternative solutions that might feel more 'common'. Curious to see other people's initial reactions. For example, given that we have poor man's cron in core, we have the opportunity to distribute the load of doing small tasks among many different requests. I'm not suggesting that is better, but it is something to consider ...

The fact that we add to the worker queue, and empty the worker queue in the same run of hook_cron, makes it odd.

We do not do that. hook_cron fills the queue and then, not necessarily in the same request, the queue items get dequeued by drupal_cron_run and handed to the worker. They are, in fact, decoupled. That's why we have a new hook which is fired even when the cron lock is active.

But, for example, it means that if you have a five thousand small tasks to do (e.g. fetch an RSS feed from hook_cron_worker as is the case in this patch), you also have to execute hook_cron() five thousand times

Not at all. You execute aggregator_cron_worker, not hook_cron. hook_cron should just queue things and return as soon as possible, because it's locked. You do not even need to fire cron.php five thousand times, because one cron run will dequeue as much as it can in the time allowed.
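The producer/consumer split chx describes can be sketched like this. This is a rough sketch of the pattern in the patch, not the committed code: the feed-selection query is simplified, and the hook names follow the version of the patch under discussion here.

```php
<?php
/**
 * Implements hook_cron(): the producer. Runs under the cron lock, so it
 * only queues work and returns as fast as possible.
 */
function aggregator_cron() {
  $queue = DrupalQueue::get('aggregator_cron_worker');
  // Simplified condition: queue every feed that is due for a refresh.
  $result = db_query('SELECT * FROM {aggregator_feed} WHERE checked + refresh < :time', array(':time' => REQUEST_TIME));
  foreach ($result as $feed) {
    $queue->createItem($feed);
  }
}

/**
 * Implements hook_cron_worker(): the consumer. drupal_cron_run() claims
 * items from the queue -- even while the cron lock is held -- and hands
 * each one here, so multiple web frontends can drain it in parallel.
 */
function aggregator_cron_worker($feed) {
  aggregator_refresh($feed);
}
?>
```

The key point is that createItem() and the worker callback can run in different requests, which is what makes the two sides decoupled despite PHP's single-threaded execution model.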

chx asked me to look here. I haven't read the patch (and I'm under a very tight deadline, so I don't have time now for a more in-depth look), but from the description this sounds like a very good move, and an important use case of the happy new queue API we've got in D7. This would help d.o itself, for example, where we regularly hit trouble with cron timing out while trying to populate the planet feed. We've jumped through all sorts of hoops on d.o to try to keep cron from timing out, and this would be the much cleaner way to solve it.

I think we'd like a use case like this in core so people know when and how to use the queue API. I'd love to see some more API docs and then resubmit this to Dries.

Attached: new patch (0 bytes) - Passed: 14698 passes, 0 fails, 0 exceptions.

Added documentation and changed aggregator tests to actually test our code. They passed locally.

Attached: new patch (5.89 KB) - Failed: 12920 passes, 12 fails, 0 exceptions.

Status:Needs review» Needs work

The last submitted patch failed testing.

Attached: new patch (8.59 KB) - Failed to apply patch.

OMG, that's stupid: the only problem was that it ran assertText to check that there are messages. I changed that to a database check -- the inverse of the same check it uses one line of code later to verify there are no items...

Status:Needs work» Needs review

Attached: new patch (6.43 KB) - Failed to apply patch.

With fewer merge errors in system.api.php.

subscribing... seems insanely simple.

Yes, it's very simple; that's what's great about queues -- it's not trivial to implement a good, fast, non-locking queue, but once you have one, life becomes a lot easier :)

In a followup I would like to add a fieldset to the performance settings so you can control how long each worker has. This patch has the variable but not the UI.

If the intent is for cron to be a two phase process, then should hook_cron() be renamed to hook_cron_prepare() or hook_cron_enqueue() or something like that, to make it clear that it's a setup routine for step 2, which is hook_cron_worker()?

Or is that something you'd only do in selected cases, like when in the normal admin UI you'd toss work off to the Batch API? I'm not clear how extensive this change is intended to be.

Is the function name really the best key to use for the queue? If it's tightly coupled to the module, (I generally dislike such tight coupling but I won't get into that here), then it should just be the module, or $module_cron_job, or something like that. Using the function name feels off to me.

Either way, +1 on the concept.

aah hard work paying off. Need to review this.

sub

@crell I think the idea at this point is to allow large tasks to have a chance to offload work to the queue system. That'd be why node_cron and others don't need to use hook_cron_worker and aren't in this patch.

It's possible we might end up wanting to make all cron tasks queue jobs, but I'm happy with this halfway step for the time being.

A couple of questions.
Do we need an update hook to add the queue we are adding in the install hook? I tried doing this real quick and ran into trouble with DrupalQueue not being defined (it locked up the Batch API).

Do we really need the "Attempting to re-run cron while it is already running." message? It seems less useful now, since multiple cron runs aren't necessarily a bad thing like they were.

An update hook sounds like a plan. Note, however, that whatever problem you ran into (a locked-up Batch API? hmm, where is the Batch API used in this patch?) can't be caused by that, because in MySQL creating a queue is a no-op.

How would a module implement two different types of hook_cron_worker tasks with different parameters?

Either you encode an array with type and value keys and call the right processing function based on the type, or you write two modules. I recommend the second solution. I would not like to go down the road of defining various queues and so on. For now this simple solution will do, IMHO; later we can certainly do more.
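The first option chx mentions, multiplexing two task types through one worker with a type key, might look like this. All the mymodule_* names here are illustrative, not from the patch.

```php
<?php
/**
 * Implements hook_cron(): queue two kinds of jobs into one queue by
 * wrapping each payload in an array with 'type' and 'value' keys.
 */
function mymodule_cron() {
  $queue = DrupalQueue::get('mymodule_cron_worker');
  foreach (mymodule_pending_mail_jobs() as $job) {
    $queue->createItem(array('type' => 'mail', 'value' => $job));
  }
  foreach (mymodule_pending_fetch_jobs() as $job) {
    $queue->createItem(array('type' => 'fetch', 'value' => $job));
  }
}

/**
 * Worker callback: dispatch on the 'type' key to the right processor.
 */
function mymodule_cron_worker($item) {
  switch ($item['type']) {
    case 'mail':
      mymodule_process_mail($item['value']);
      break;

    case 'fetch':
      mymodule_process_fetch($item['value']);
      break;
  }
}
?>
```

The trade-off chx alludes to: this works, but every task type now shares one time budget and one ordering, which is part of why he recommends separate modules (and thus separate queues) instead.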

Just seems like we'd be trapping contrib into inflexibility.

Attached: new patch (7 KB) - Failed to apply patch.

Well. We can do something like this.

+++ modules/aggregator/aggregator.install 2009-09-16 03:31:42 +0000
@@ -7,6 +7,13 @@
+function aggregator_install() {
+  DrupalQueue::get('aggregator_cron_worker')->createQueue();
+}

Why does aggregator need an install hook? The API docs of hook_cron() don't mention that case.

+++ modules/aggregator/aggregator.module 2009-09-20 00:28:23 +0000
@@ -306,19 +306,31 @@ function aggregator_permission() {
+  $queue = DrupalQueue::get('aggregator_feeds');
...
+function aggregator_cron_queues() {
+  $queues['aggregator_feeds'] = array(
+++ modules/system/system.api.php 2009-09-20 00:33:44 +0000
@@ -124,20 +124,48 @@ function hook_entity_info_alter(&$entity
+  // Don't forget to replace hook with the module name in the queue name.
+  $queue = DrupalQueue::get('hook_cron_worker');

Contrary to that inline comment, it seems like the queue name refers to the 'worker callback' name?

+++ modules/aggregator/aggregator.module 2009-09-20 00:28:23 +0000
@@ -306,19 +306,31 @@ function aggregator_permission() {
+    'description' => t('Refresh aggregator feeds'),
+++ modules/system/system.api.php 2009-09-20 00:33:44 +0000
@@ -124,20 +124,48 @@ function hook_entity_info_alter(&$entity
+ *     'description'      A human readable description of the queue.
...
+    'description' => t('Update from the cloud'),

As long as we have no use-case/UI for exposing the given description, we shouldn't define one. Chances are very high that the defined descriptions won't make any sense in any UI that is fleshed out later on.

+++ modules/aggregator/aggregator.test 2009-09-18 02:56:25 +0000
@@ -94,7 +94,7 @@ class AggregatorTestCase extends DrupalW
     // Refresh the feed (simulated link click).
-    $this->drupalGet('admin/config/services/aggregator/update/' . $feed->fid);
+    drupal_cron_run();

This is a UI functionality we need to test... is it already tested elsewhere or why do we replace it here with a low-level function?

+++ modules/system/system.api.php 2009-09-20 00:33:44 +0000
@@ -124,20 +124,48 @@ function hook_entity_info_alter(&$entity
+function hook_cron_queues() {
+  $queues['cloud_update'] = array(
+    'worker callback' => 'cloud_update',
+    'description' => t('Update from the cloud'),
+  );

We should use a real-world implementation in core as example here.

+++ modules/system/system.api.php 2009-09-20 00:33:44 +0000
@@ -124,20 +124,48 @@ function hook_entity_info_alter(&$entity
+function hook_cron_queues() {

This should be singular, i.e. hook_cron_queue(). (and, yes, even if it can return more than one; comparison: hook_menu(), hook_library(), etc. :)

This review is powered by Dreditor.

Attached: new patch (5.38 KB) - Failed to apply patch.

Removed the install, description, copied aggregator stuff to system.api.php reverted some of the drupal_cron_runs but kept one so that this new patch is tested as well.

Attached: new patch (5.38 KB) - Failed to apply patch.

Fixed a few "queues" in comments.

Issue tags:+API clean-up

Sorry, no time for a review right now, but tagging to get into the critical list.

Ah, I like the actual new hook. I think it will end up working a lot better than the worker hook with all its magic values.

+    // Make sure every queue exists. There is no harm in trying to recreate an
+    // existing queue.
+    foreach ($queues as $queue_name => $info) {
+      DrupalQueue::get($queue_name)->createQueue();
+    }

Are we sure that the create queue function will always be cheap enough to make this assumption?

+ * Automatic mailings, retrieval of remote data are good examples of what
+ * should be done via hook_cron_queue) instead.

Maybe this could be "Examples of jobs that are good candidates for hook_cron_queue include automated mailing, retrieving remote data, and intensive file tasks."

Other then that, very simple and solid.

Are we sure that the create queue function will always be cheap enough to make this assumption?

Yes. For a number of queue implementations, like our MySQL one or beanstalkd, create is a no-op, and for Amazon SQS it costs the same (one HTTP request) as checking for the existence of a queue. If your queue implementation is so braindead that creation is expensive, then list your queues and only create them if necessary.

Brilliant. The reserving mechanism in DrupalQueue kicks the door open for simultaneous cron runs - something tried many times before (#87528: Multi-Threaded Cron Jobs, http://drupal.org/project/cron_mt, etc.). I can't think of any conceptual problems with aggregator. I have set up a D7 site with this patch on my dev server:

http://barth.devseed.com/d7agg/

1)

<?php
$end = time() + variable_get($queue_name, 15);
?>

- Shouldn't this be more explicit / keep the namespace open? E.g. $queue_name . '_worker_time'.
- 15 seconds is not a lot for aggregator; I would set this time for aggregator to at least 60 seconds, if not 120.

2)

<?php
foreach ($queues as $queue_name => $info) {
  DrupalQueue::get($queue_name)->createQueue();
}
?>

This foreach loop is only executed when no semaphore is present (and hook_cron() is about to be invoked) - can't there be missing queues when a semaphore exists and the worker callback is being invoked?

Attached: new patch (5.69 KB) - Failed to apply patch.

That variable does not need renaming; it needs to be moved into the hook. Now, as for whether there can be missing queues when a semaphore exists and the worker callback is being invoked -- hardly. If the queue does not exist, it surely is empty :) so the queue backend might send you an error, but that's not much different from getting nothing at all. We make sure the queue exists when you want to write to it.

Status:Needs review» Reviewed & tested by the community

I think this has been gone over with a fine-tooth comb long enough to RTBC it.

Just to clarify then, there's now hook_cron(), which is good for delayed-but-short tasks, and hook_cron_queue(), which is for setting up many-instance tasks. I like. Code looks good, too. My only question is whether or not we should be calling it hook_cron_queue_info() for consistency, since it's another "associative array with alter" type hook. (It doesn't get cached, though. Not sure if it makes sense to do so?)

On the scalability front, there's no way to control the order in which workers run, right, aside from module weight? I'm just wondering if we need to be concerned about the case where the first worker that gets called has a queue that grows just fast enough to never empty. Will the other workers ever get a chance to run? Or is that where the separate timer for each worker comes in?

Right. We have the per-queue timer to avoid "One Queue to rule them all, One Queue to find them, One Queue to bring them all and in the darkness bind them" (sorry). I dunno -- you do not call hook_forms hook_form_info, or hook_menu hook_menu_info, and so on.
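The per-queue timer works roughly like this: each queue gets its own time budget, so a busy queue cannot starve the others. This is a sketch of the dequeue loop in drupal_cron_run() as described in the thread, not necessarily the exact committed code; the hook name was still being debated at this point (it ends up as hook_cron_queue_info).

```php
<?php
// Collect all declared queues and make sure each exists; creating an
// existing queue is harmless (a no-op for the MySQL backend).
$queues = module_invoke_all('cron_queue_info');
foreach ($queues as $queue_name => $info) {
  DrupalQueue::get($queue_name)->createQueue();
}

// Give each queue its own time budget (default 15 seconds), then claim
// and process items until either the budget runs out or the queue is
// empty. Other queues still get their turn afterwards.
foreach ($queues as $queue_name => $info) {
  $function = $info['worker callback'];
  $end = time() + (isset($info['time']) ? $info['time'] : 15);
  $queue = DrupalQueue::get($queue_name);
  while (time() < $end && ($item = $queue->claimItem())) {
    $function($item->data);
    $queue->deleteItem($item);
  }
}
?>
```

claimItem() leases an item so that concurrent cron runs do not process the same item twice, which is the reserving mechanism alex_b praises above.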

I agree that cron_queue should probably be cron_queue_info.

It still feels like the documentation is a bit dense, but I don't have any concrete ideas to make it better -- maybe it really is that simple. What seems to be missing is some guidance on how it interacts with the crontab stuff. If you put something in a queue, you want to know how fast it is going to be taken out of the queue. It feels like we could use a high-level description of when the job is going to be processed and by whom.

Just in case I wasn't clear, I'm in nitpick mode. I am still very much +1 on the concept and the implementation. This is sweet sweet stuff, and should make long-running cron processes dramatically easier to code. (Frankly if we had time I'd want to convert batch API over to this model as well.)

Dries, for documentation what about a nice big docblock tutorial for the "cron" doc group? That's the same thing the DB system does for most of its "how you approach it" docs, but it should be much shorter here as there's simply less to do. (Really, there's now 2 parallel cron-based systems.)

(And arguably hook_menu should really be hook_router_info, but it's way too late to make that change for Drupal 7 so let's not go there. :-) )

Attached: new patch (5.91 KB) - Failed to apply patch.

Renamed to info. When will an item get processed? Who knows. For a low workload, in the same crontab run. For a high workload, who knows. Added a few words in this regard.

Status:Reviewed & tested by the community» Fixed

I committed this to CVS HEAD. It's a new feature so it is an exception to the exceptions in code slush. This is pretty rare but I felt it was important to let this one slip through. If we can all focus on the other exceptions now, that would be appreciated. Thanks!

Also, feel free to follow-up with a CHANGELOG.txt entry.

Status:Fixed» Needs review
Attached: new patch (6.9 KB) - Failed to apply patch.

A couple of clean-ups.

Status:Needs review» Fixed

Committed to CVS HEAD. Thanks!

re Crell #35 "Frankly if we had time I'd want to convert batch API over to this model as well."
I didn't really look into the queue API nor this patch, but +1 on principle. There is possibly interesting stuff behind merging/unifying the mechanisms.

Does this do multi-threading? Should we do multi-threading? I've done a lot of testing of my Boost crawler, which currently works with 8 threads; there really is no limit -- it could be 32 for all I care. Each cron/batch task could be spun up into its own thread. The nice thing is my code has been tested on virtually all kinds of hosts, and the majority of the bugs have been worked out now. Long story short: it calls a URL and cuts the connection while that new process is still running. It's a very simple "threading" system. It works best for long operations, due to the bootstrap overhead, and uses the database to take care of any concurrency issues.

I wouldn't want to do something like that directly in core. However, a contrib module that runs on cron and self-calls a dozen times over curl wouldn't be difficult to write, and the mechanism here should be able to handle that.

Status:Fixed» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Status:Reviewed & tested by the community» Needs review
Attached: new patch (4.61 KB) - FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 578676-45_queued_flag.patch.

Using the queue for the Feeds module, I noticed a particularity of recurring jobs: the scheduling party has to flag items that have already been queued, otherwise they may be queued multiple times.

In concrete terms for aggregator this means: if the feeds added to the queue aren't worked off before the next cron run, aggregator adds them again. The result is a waste of resources, as the same feeds are added to the queue multiple times.

This patch adds a "queued" flag [edit: it's a "queued" timestamp] to aggregator_feed table that is set to REQUEST_TIME when an aggregator_refresh job was added to the queue. The flag is set back to 0 when a job has been worked off OR is older than 6 hours.
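The "queued" timestamp pattern described here can be sketched as follows: only enqueue feeds that are not already waiting in the queue, and expire stale timestamps so a lost item is eventually retried. This is an illustrative sketch of the approach, not the patch itself; the column and table names follow the description above.

```php
<?php
/**
 * Implements hook_cron() for aggregator (sketch): skip feeds that are
 * already sitting in the queue, marked by a nonzero 'queued' timestamp.
 */
function aggregator_cron() {
  $queue = DrupalQueue::get('aggregator_feeds');

  // Reset 'queued' timestamps older than six hours, so that a job lost
  // by the queue backend does not block its feed forever.
  db_update('aggregator_feed')
    ->fields(array('queued' => 0))
    ->condition('queued', REQUEST_TIME - 3600 * 6, '<')
    ->condition('queued', 0, '<>')
    ->execute();

  // Only enqueue feeds that are due AND not already queued.
  $result = db_query('SELECT * FROM {aggregator_feed} WHERE queued = 0 AND checked + refresh < :time', array(':time' => REQUEST_TIME));
  foreach ($result as $feed) {
    if ($queue->createItem($feed)) {
      db_update('aggregator_feed')
        ->fields(array('queued' => REQUEST_TIME))
        ->condition('fid', $feed->fid)
        ->execute();
    }
  }
}
?>
```

The worker side then sets 'queued' back to 0 once the refresh has been worked off, completing the cycle.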

Status:Closed (fixed)» Needs review

Side note: Reviewers may notice that the update of aggregator_feed table after refresh is spread out over 3 queries in 2 functions. This is being addressed with #564686: Fetcher should not update aggregator_feed record .

Status:Needs review» Reviewed & tested by the community

The patch is a good one, but the description is very misleading: the patch does not add a queued flag, it adds a queued timestamp. And Alex is so right that it's needed.

Yeah, right. Thanks for pointing that out. I edited comment #54 to clarify that.

Priority:Normal» Critical

Bumping to critical.

Issue tags:-API clean-up

Re-test of 578676-45_queued_flag.patch from comment #45 was requested by moshe weitzman.

Status:Needs review» Needs work

The last submitted patch, 578676-45_queued_flag.patch, failed testing.

Status:Needs work» Needs review

#45: 578676-45_queued_flag.patch queued for re-testing.

Status:Needs review» Needs work
Issue tags:+API clean-up

The last submitted patch, 578676-45_queued_flag.patch, failed testing.

Status:Needs work» Needs review
Attached: new patch (3.6 KB) - PASSED: [[SimpleTest]]: [MySQL] 17,539 pass(es).

Untested re-roll after #564686: Fetcher should not update aggregator_feed record - the query had just been moved out of two functions into one central one.

Status:Needs review» Reviewed & tested by the community
Attached: new patch (6.71 KB) - PASSED: [[SimpleTest]]: [MySQL] 17,594 pass(es).

Two slight adjustments and tests:

- On cron, mark a feed as checked regardless of whether parsing was successful. Previously a feed was marked as checked if the import succeeded or fetching failed, but not if parsing failed. If we don't mark a feed as checked when it fails, the feed will be queued for updating on the very next cron run - many failing feeds would quickly occupy a lot of resources.
- Removed 'flag' verbiage per chx's observation in #47.
- Added tests for simple updating on cron and for verifying that a feed is not being queued and imported if it is marked 'queued'.

Status:Reviewed & tested by the community» Needs review

Let's get review of the most recent changes.

+++ modules/aggregator/aggregator.install 1 Feb 2010 17:06:28 -0000
@@ -275,3 +284,16 @@ function aggregator_update_7000() {
\ No newline at end of file
+++ modules/aggregator/aggregator.test 1 Feb 2010 17:06:28 -0000
@@ -638,3 +638,48 @@ class ImportOPMLTestCase extends Aggrega
\ No newline at end of file

Aside from those, the code looks quite nice.

However, this patch needs approval from dww - according to most recent MAINTAINERS.txt revamp discussions.

Powered by Dreditor.

Status:Needs review» Reviewed & tested by the community

The point of adding maintainers is to speed up review and commit. This one has been waiting for 2 weeks.

General comment: yes, something to keep track of what you put into the queue for processing is necessary when using our queue system at this time. I had to do the same in the update manager as part of #597484: Use the Queue API to fetch available update data. Due to the desire to allow lots of possible queue backends, the semantics of our queue API are not strong enough to support querying the queue for which tasks are already/still in there. So yes, +1 for a 'queued' column in aggregator like this.

My only (extremely minor, not really worth un-RTBC'ing this) complaint is that I think the test could be a bit more robust in one regard:

+    $this->drupalGet($base_url . '/cron.php', array('external' => TRUE, 'query' => array('cron_key' => $key)));
+    $this->assertEqual(5, db_query('SELECT COUNT(*) FROM {aggregator_item} WHERE fid = :fid', array(':fid' => $feed->fid))->fetchField(), 'Expected number of items in database.');
+    $this->removeFeedItems($feed);
+    $this->drupalGet($base_url . '/cron.php', array('external' => TRUE, 'query' => array('cron_key' => $key)));
+    $this->assertEqual(5, db_query('SELECT COUNT(*) FROM {aggregator_item} WHERE fid = :fid', array(':fid' => $feed->fid))->fetchField(), 'Expected number of items in database.');

I think it'd be nice to assert there are 0 entries in {aggregator_item} after the call to removeFeedItems() before we hit cron.php again. While it's likely that functionality is already tested on its own, it seems like the other assertions in this test would be more dependable if we're sure we go from 0 to 5 to 0 to 5 again. ;)
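With dww's extra assertion added, the test sequence would confirm the item count goes 5, then 0, then 5 across the two cron runs. This sketch is based on the quoted test code above; $base_url, $key, $feed, and removeFeedItems() come from the surrounding AggregatorTestCase.

```php
<?php
$count_query = 'SELECT COUNT(*) FROM {aggregator_item} WHERE fid = :fid';
$args = array(':fid' => $feed->fid);

// First cron run imports the feed.
$this->drupalGet($base_url . '/cron.php', array('external' => TRUE, 'query' => array('cron_key' => $key)));
$this->assertEqual(5, db_query($count_query, $args)->fetchField(), 'Expected number of items in database.');

// The suggested extra assertion: verify the count really drops to zero
// before the second run, so the later 5 is a genuine re-import.
$this->removeFeedItems($feed);
$this->assertEqual(0, db_query($count_query, $args)->fetchField(), 'Feed items removed before second cron run.');

// Second cron run re-imports the feed.
$this->drupalGet($base_url . '/cron.php', array('external' => TRUE, 'query' => array('cron_key' => $key)));
$this->assertEqual(5, db_query($count_query, $args)->fetchField(), 'Items are re-imported on the next cron run.');
?>
```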

And yeah, the lack of newlines is unfortunate. I'd re-roll but then my general agreement on the RTBCness of this patch wouldn't count. ;)

Status:Reviewed & tested by the community» Needs review
Attached: interdiff (1.27 KB) and patch (6.31 KB) - PASSED: [[SimpleTest]]: [MySQL] 17,964 pass(es).

I had a few minutes during a meeting and this tab was open, so here's a reroll with newlines and the assertion dww wanted. I'm also attaching an interdiff to show those are the only changes between this and #55.

Status:Needs review» Reviewed & tested by the community

Great, thanks -- looks good. This is once again RTBC unless the bot contradicts me. ;)

ksenzee - thanks for the reroll.

#60: 578676-60_queue_locking.patch queued for re-testing.

Status:Reviewed & tested by the community» Fixed

chx gave me the cliff's notes on here. Eek! Committed to HEAD. Thanks!

Status:Fixed» Closed (fixed)
Issue tags:-API clean-up

Automatically closed -- issue fixed for 2 weeks with no activity.