This patch adds a new hook_cron_worker and implements it for aggregator. Note that, unlike normal cron, we do not lock this one; there is no need, and it would defeat the purpose. After this patch, you can actually use as many web frontends as you have to fetch feeds in parallel. Win. Needs system.api.php documentation. I doubt it needs tests; the aggregator tests will cover it.
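For orientation, the worker half of the proposed hook might look roughly like this (a sketch only; the hook name follows the patch as first posted and is renamed later in this thread, and the callback body is simplified):

```php
/**
 * Implements hook_cron_worker() (name as in the original patch).
 *
 * The cron runner, not this hook, claims and deletes queue items, so
 * several web frontends can each run a worker in parallel without
 * stepping on each other.
 *
 * @param $feed
 *   One dequeued item: here, the feed record that aggregator_cron()
 *   put into the queue.
 */
function aggregator_cron_worker($feed) {
  aggregator_refresh($feed);
}
```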
Comment | File | Size | Author |
---|---|---|---|
#60 | 578676-60_queue_locking.patch | 6.31 KB | ksenzee |
#60 | 578676-60_queue_locking.interdiff.txt | 1.27 KB | ksenzee |
#55 | 578676-55_queue_locking.patch | 6.71 KB | alex_b |
#54 | queue.patch | 3.6 KB | catch |
#45 | 578676-45_queued_flag.patch | 4.61 KB | alex_b |
Comments
Comment #2
Dries CreditAttribution: Dries commented
Interesting.
Something feels a bit awkward about this -- it doesn't 100% map onto models/patterns I'm used to. The fact that we add to the worker queue, and empty the worker queue in the same run of hook_cron, makes it odd. Normally, there would be some separation or decoupling along with some kind of notification system but given that PHP doesn't support multi-threading ... it's how you use a queue in a single-threaded application.
But, for example, it means that if you have a five thousand small tasks to do (e.g. fetch an RSS feed from hook_cron_worker as is the case in this patch), you also have to execute hook_cron() five thousand times. Executing hook_cron() isn't cheap. So if I'm reading this patch right, you'd execute the MySQL query to see if there are new feeds to fetch way too often -- worst case, you'd execute that query five thousand times instead of once.
As said, interesting approach. Not necessarily bad, but it might have to grow on me a bit. I want to thinker a bit about alternative solutions that might feel a bit more 'common'. Curious to see what other people's initial reaction is. For example, given that we have poormans cron in core, we have the opportunity to distribute the load of doing small tasks among many different requests. I'm not suggesting that is better, but it is something to consider ...
Comment #3
chx CreditAttribution: chx commented
The fact that we add to the worker queue, and empty the worker queue in the same run of hook_cron, makes it odd.
We do not do that. hook_cron fills the queue and then, not necessarily in the same request, the queue items get dequeued by drupal_cron_run and handed to the worker. They are, in fact, decoupled. That's why we have a new hook which is fired even when the cron lock is active.
But, for example, it means that if you have a five thousand small tasks to do (e.g. fetch an RSS feed from hook_cron_worker as is the case in this patch), you also have to execute hook_cron() five thousand times
Not at all. You execute aggregator_cron_worker, not hook_cron. hook_cron should just queue things and leave ASAP because it's locked. You do not even need to fire cron.php five thousand times, because in one cron run it will dequeue as much as it can in the time allowed.
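The decoupling chx describes can be sketched like this (queue name and queries are illustrative, following the patch's aggregator example):

```php
/**
 * Implements hook_cron(): runs under the cron lock, queues work, exits fast.
 */
function aggregator_cron() {
  $queue = DrupalQueue::get('aggregator_feeds');
  $result = db_query('SELECT * FROM {aggregator_feed} WHERE checked + refresh < :time', array(':time' => REQUEST_TIME));
  foreach ($result as $feed) {
    $queue->createItem($feed);
  }
}

// Later -- possibly in a different request, and even while the cron
// lock is held by someone else -- the cron runner drains the queue:
$queue = DrupalQueue::get('aggregator_feeds');
$end = time() + 15; // per-worker time budget
while (time() < $end && ($item = $queue->claimItem())) {
  aggregator_refresh($item->data);
  $queue->deleteItem($item);
}
```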
Comment #4
dww
chx asked me to look here. I haven't read the patch (and I'm under a very tight deadline, so I don't have time now for a more in-depth look), but from the description, this sounds like an utterly good move, and an important use case of the happy new queue API we've got in D7. This would help d.o itself, for example, where we regularly hit trouble with cron timing out while trying to populate the planet feed. We've jumped through all sorts of hoops on d.o to try to keep cron from timing out, and this would be the much cleaner way to solve it.
Comment #5
moshe weitzman CreditAttribution: moshe weitzman commented
I think we'd like a use case like this in core so people know when and how to use the queue API. I'd love to see some more API docs and then resubmit this to Dries.
Comment #6
chx CreditAttribution: chx commented
Added documentation and changed aggregator tests to actually test our code. They passed locally.
Comment #7
chx CreditAttribution: chx commented
Comment #9
chx CreditAttribution: chx commented
OMG, that's stupid; the only problem is that it runs assertText to check there are messages. I changed that to a database check -- the inverse of the same check it uses a line of code later to check there are no items...
Comment #10
chx CreditAttribution: chx commented
Comment #11
chx CreditAttribution: chx commented
With fewer merge errors in system.api.php.
Comment #12
drewish CreditAttribution: drewish commented
subscribing... seems insanely simple.
Comment #13
chx CreditAttribution: chx commented
Yes, it is very simple; that's what's great about queues -- it's not trivial to implement a good, fast, non-locking queue, but once you have one, life becomes a lot easier :)
In a followup I would like to add a fieldset to the performance settings so you can control how long each worker has. This patch has the variable but not the UI.
Comment #14
Crell CreditAttribution: Crell commented
If the intent is for cron to be a two-phase process, then should hook_cron() be renamed to hook_cron_prepare() or hook_cron_enqueue() or something like that, to make it clear that it's a setup routine for step 2, which is hook_cron_worker()?
Or is that something you'd only do for selected cases, like when in the normal admin UI you'd toss stuff off to batch API? I'm not clear how extensive this change is intended to be.
Is the function name really the best key to use for the queue? If it's tightly coupled to the module, (I generally dislike such tight coupling but I won't get into that here), then it should just be the module, or $module_cron_job, or something like that. Using the function name feels off to me.
Either way, +1 on the concept.
Comment #15
neclimdul
Aah, hard work paying off. Need to review this.
Comment #16
mattyoung CreditAttribution: mattyoung commented
sub
Comment #17
neclimdul
@crell I think the idea at this point is to allow large tasks to have a chance to offload work to the queue system. That'd be why node_cron and others don't need to use hook_cron_worker and aren't in this patch.
It's possible we might end up wanting to make all cron tasks jobs, but I'm happy with this halfway step for the time being.
Couple questions.
Do we need an update hook to add the queue we are adding in the install hook? I tried doing this real quick and ran into trouble with DrupalQueue not being defined (it locked up Batch API).
Do we really need the "Attempting to re-run cron while it is already running." message? It seems less useful now, since multiple cron runs aren't necessarily a bad thing like they were.
Comment #18
chx CreditAttribution: chx commented
An update hook sounds like a plan. Note, however, that whatever problem you ran into (locked-up Batch API -- hmm, where is Batch API used in this patch?) can't be caused by that, 'cos in MySQL creating a queue is a no-op.
Comment #19
Dave Reid
How would a module implement two different types of hook_cron_worker tasks with different parameters?
Comment #20
chx CreditAttribution: chx commented
Either you encode an array with type and value keys and call the right processing function based on type, or you write two modules. I recommend the second solution. I would not like to go into a solution where you define various queues and so on. For now this simple solution will do, IMHO, and later we can certainly do more.
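chx's first option -- encode a type next to the payload and dispatch inside one worker -- might look like this (module and helper names are hypothetical):

```php
function mymodule_cron() {
  $queue = DrupalQueue::get('mymodule_tasks');
  // Two kinds of work share one queue, distinguished by 'type'.
  $queue->createItem(array('type' => 'mail', 'value' => 123));
  $queue->createItem(array('type' => 'fetch', 'value' => 'http://example.com/feed'));
}

function mymodule_cron_worker($task) {
  switch ($task['type']) {
    case 'mail':
      mymodule_send_mail($task['value']);
      break;

    case 'fetch':
      mymodule_fetch($task['value']);
      break;
  }
}
```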
Comment #21
Dave Reid
Just seems like we'd be trapping contrib into inflexibility.
Comment #22
chx CreditAttribution: chx commented
Well. We can do something like this.
Comment #23
sun
Why does aggregator need to create the queue on install? The API docs of hook_cron() don't mention that case.
Contrary to that inline comment, it seems like the queue name refers to the 'worker callback' name?
As long as we have no use-case/UI for exposing the given description, we shouldn't define one. Chances are very high that the defined descriptions won't make any sense in any UI that is fleshed out later on.
This is a UI functionality we need to test... is it already tested elsewhere or why do we replace it here with a low-level function?
We should use a real-world implementation in core as example here.
This should be singular, i.e. hook_cron_queue(). (and, yes, even if it can return more than one; comparison: hook_menu(), hook_library(), etc. :)
This review is powered by Dreditor.
Comment #24
chx CreditAttribution: chx commented
Removed the install and the description, copied the aggregator stuff to system.api.php, and reverted some of the drupal_cron_runs but kept one so that this new patch is tested as well.
Comment #25
chx CreditAttribution: chx commented
Fixed a few "queues" in comments.
Comment #26
sun
Sorry, no time for a review right now, but tagging to get into the critical list.
Comment #27
neclimdul
Ah, I like the actual new hook. I think it will end up working a lot better than the worker hook with all its magic values.
Are we sure that the create queue function will always be cheap enough to make this assumption?
Maybe this could be "Examples of jobs that are good candidates for hook_cron_queue include automated mailing, retrieving remote data, and intensive file tasks."
Other then that, very simple and solid.
Comment #28
chx CreditAttribution: chx commented
Are we sure that the create queue function will always be cheap enough to make this assumption?
Yes. For a number of queue implementations, like our MySQL one or beanstalkd, create is a no-op, and for Amazon SQS, it's the same cost (one HTTP request) as checking for the existence of a queue. If your queue implementation is so braindead that creation is expensive, then list your queues and only create if necessary.
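Concretely, in the D7 API this means cron can call createQueue() unconditionally on each run (SystemQueue, the MySQL default, implements it as an empty method):

```php
$queue = DrupalQueue::get('aggregator_feeds');
// No-op for SystemQueue; for a remote backend such as SQS it costs
// about the same as checking whether the queue already exists.
$queue->createQueue();
```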
Comment #29
alex_b CreditAttribution: alex_b commented
Brilliant. The reserving mechanism in DrupalQueue kicks the door open for simultaneous cron runs -- this has been tried many times before: #87528: Multi-Threaded Cron Jobs, http://drupal.org/project/cron_mt, etc. Can't think of any conceptual problems with aggregator. I have set up a D7 with this patch on my dev server:
http://barth.devseed.com/d7agg/
1)
- Shouldn't this be more explicit/keep the namespace open? E.g. $queue_name . '_worker_time'
- 15 seconds is not a lot for aggregator, I would set this time for aggregator to at least 60 seconds if not 120.
2)
This for-loop is only executed when no semaphore is present (and hook_cron() is about to be invoked) - can't there be missing queues when a semaphore exists and the worker callback is being invoked?
Comment #30
chx CreditAttribution: chx commented
That variable does not need renaming; it needs to be moved into the hook. Now, can there be missing queues when a semaphore exists and the worker callback is being invoked? Hardly. If the queue does not exist, it surely is empty :) so although the queue backend might send you an error, that's not much different from getting nothing at all. We make sure the queue exists when you want to write to it.
Comment #31
neclimdul
I think this has been gone over with a fine-tooth comb long enough to RTBC it.
Comment #32
Crell CreditAttribution: Crell commented
Just to clarify then, there's now hook_cron(), which is good for delayed-but-short tasks, and hook_cron_queue(), which is for setting up many-instance tasks. I like. Code looks good, too. My only question is whether or not we should be calling it hook_cron_queue_info() for consistency, since it's another "associative array with alter" type hook. (It doesn't get cached, though. Not sure if it makes sense to do so?)
On the scalability front, there's no way to control the order in which workers run, right, aside from module weight? I'm just wondering if we need to be concerned about the case where the first worker that gets called has a queue that grows just fast enough to never empty. Will the other workers ever get a chance to run? Or is that where the separate timer for each worker comes in?
Comment #33
chx CreditAttribution: chx commented
Right. We have the per-queue timer to avoid "One Queue to rule them all, One Queue to find them, One Queue to bring them all and in the darkness bind them" (sorry). I dunno; you do not call hook_forms hook_form_info, hook_menu hook_menu_info, and so on.
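The per-queue timer Crell asks about lives in the info hook's return value; the shape that ended up in core looks like this (the numbers are illustrative):

```php
/**
 * Implements hook_cron_queue_info().
 */
function aggregator_cron_queue_info() {
  $queues['aggregator_feeds'] = array(
    'worker callback' => 'aggregator_refresh',
    // Seconds this queue may run per cron invocation. Each queue gets
    // its own budget, so one busy queue cannot starve the others.
    'time' => 60,
  );
  return $queues;
}
```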
Comment #34
Dries CreditAttribution: Dries commented
I agree that cron_queue should probably be cron_queue_info.
It still feels like the documentation is a bit dense, but I don't have any concrete ideas to make it better -- maybe it really is that simple. What seems to be missing is some guidance as to how it interacts with the crontab stuff. If you put something in a queue, you want to know how fast it is going to be taken out of the queue. It feels like we could do with a high-level description of when the job is going to be processed and by whom.
Comment #35
Crell CreditAttribution: Crell commented
Just in case I wasn't clear, I'm in nitpick mode. I am still very much +1 on the concept and the implementation. This is sweet sweet stuff, and should make long-running cron processes dramatically easier to code. (Frankly, if we had time I'd want to convert batch API over to this model as well.)
Dries, for documentation what about a nice big docblock tutorial for the "cron" doc group? That's the same thing the DB system does for most of its "how you approach it" docs, but it should be much shorter here as there's simply less to do. (Really, there's now 2 parallel cron-based systems.)
(And arguably hook_menu should really be hook_router_info, but it's way too late to make that change for Drupal 7 so let's not go there. :-) )
Comment #36
chx CreditAttribution: chx commented
Renamed to info. When is the item getting processed? Who knows. For a low workload, in the same crontab run. For a high workload, who knows. Added a few words in this regard.
Comment #37
Dries CreditAttribution: Dries commented
I committed this to CVS HEAD. It's a new feature, so it is an exception to the exceptions in code slush. This is pretty rare, but I felt it was important to let this one slip through. If we can all focus on the other exceptions now, that would be appreciated. Thanks!
Comment #38
Dries CreditAttribution: Dries commented
Also, feel free to follow up with a CHANGELOG.txt entry.
Comment #39
sun
A couple of clean-ups.
Comment #40
Dries CreditAttribution: Dries commented
Committed to CVS HEAD. Thanks!
Comment #41
yched CreditAttribution: yched commented
re Crell #35 "Frankly if we had time I'd want to convert batch API over to this model as well."
I didn't really look into the queue API nor this patch, but +1 on principle. There are possibly interesting stuff behind merging / unifying the mechanisms.
Comment #42
mikeytown2 CreditAttribution: mikeytown2 commented
Does this do multi-threading? Should we do multi-threading? My Boost crawler, which currently works with 8 threads, has had a lot of testing; there really is no limit, it could be 32 for all I care. Each cron/batch task could be spun up into its own thread. The nice thing is my code has been tested on virtually all kinds of hosts, and the majority of the bugs have been worked out now. Long story short, it calls a URL and cuts the connection while that new process is still running. It's a very simple "threading" system. It works best for long operations due to the bootstrap overhead, and uses the database to take care of any concurrency issues.
Comment #43
Crell CreditAttribution: Crell commented
I wouldn't want to do something like that directly in core. However, a contrib module that runs on cron and self-calls a dozen times over curl wouldn't be difficult to write, and the mechanism here should be able to handle that.
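The fire-and-forget trick mikeytown2 describes -- call a URL, then cut the connection while the spawned request keeps running -- boils down to something like this (host and path are placeholders; the target script would need ignore_user_abort(TRUE) to survive the disconnect):

```php
// Open a socket, write the request, and close without reading the
// response; the server-side process continues on its own.
$fp = fsockopen('example.com', 80, $errno, $errstr, 1);
if ($fp) {
  fwrite($fp, "GET /cron-worker.php HTTP/1.0\r\nHost: example.com\r\nConnection: Close\r\n\r\n");
  fclose($fp);
}
```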
Comment #45
alex_b CreditAttribution: alex_b commented
Using the queue for Feeds module I noticed a particularity of recurring jobs: the scheduling party has to flag items that have already been queued, otherwise they may be queued multiple times.
In concrete terms for aggregator this means: If the feeds added to the queue aren't worked off until the next cron run, aggregator adds them again to the queue. The result is a waste of resources as the same feeds are added multiple times to the queue.
This patch adds a "queued" flag [edit: it's a "queued" timestamp] to aggregator_feed table that is set to REQUEST_TIME when an aggregator_refresh job was added to the queue. The flag is set back to 0 when a job has been worked off OR is older than 6 hours.
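The queued-timestamp pattern alex_b describes can be sketched as follows (queries are illustrative; the real patch clears stamps when a job is worked off or is older than six hours):

```php
// On cron: only queue feeds that are not already queued (or whose
// stamp has gone stale).
$result = db_query('SELECT * FROM {aggregator_feed} WHERE queued < :stale', array(':stale' => REQUEST_TIME - 3600 * 6));
$queue = DrupalQueue::get('aggregator_feeds');
foreach ($result as $feed) {
  if ($queue->createItem($feed)) {
    db_update('aggregator_feed')
      ->fields(array('queued' => REQUEST_TIME))
      ->condition('fid', $feed->fid)
      ->execute();
  }
}

// When a job has been worked off: reset the stamp so the feed can be
// queued again on a later cron run.
db_update('aggregator_feed')
  ->fields(array('queued' => 0))
  ->condition('fid', $feed->fid)
  ->execute();
```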
Comment #46
alex_b CreditAttribution: alex_b commented
Side note: Reviewers may notice that the update of the aggregator_feed table after refresh is spread out over 3 queries in 2 functions. This is being addressed with #564686: Fetcher should not update aggregator_feed record.
Comment #47
chx CreditAttribution: chx commented
The patch is a good one, but the description is very misleading. The patch does not add a queued flag; it adds a queued timestamp. And Alex is so right that it's needed.
Comment #48
alex_b CreditAttribution: alex_b commented
Yeah, right. Thanks for pointing that out. I edited comment #54 to clarify that.
Comment #49
chx CreditAttribution: chx commented
Bumping to critical.
Comment #52
tonyoconnell CreditAttribution: tonyoconnell commented
#45: 578676-45_queued_flag.patch queued for re-testing.
Comment #54
catch
Untested re-roll after #564686: Fetcher should not update aggregator_feed record -- the query had just been moved out of two functions into one central one.
Comment #55
alex_b CreditAttribution: alex_b commented
Two slight adjustments and tests:
- On cron, mark a feed as checked no matter whether parsing was successful or not. Previously a feed was marked checked if import was successful or fetching failed, but not if parsing failed. If we don't mark a feed as checked when it fails, the feed will be queued for updating the very next cron run - many failing feeds would quickly occupy a lot of resources.
- Removed 'flag' verbiage according to chx' observation in #47
- Added tests for simple updating on cron and for verifying that a feed is not being queued and imported if it is marked 'queued'.
Comment #56
webchick
Let's get a review of the most recent changes.
Comment #57
sun.core CreditAttribution: sun.core commented
Aside from those, the code looks quite nice.
However, this patch needs approval from dww - according to most recent MAINTAINERS.txt revamp discussions.
Powered by Dreditor.
Comment #58
moshe weitzman CreditAttribution: moshe weitzman commented
The point of adding maintainers is to speed up review and commit. This one has been waiting for 2 weeks.
Comment #59
dww
General comment: yes, with our queue system as it stands, something to keep track of what you put into the queue for processing is necessary. I had to do the same in the update manager as part of #597484: Use the Queue API to fetch available update data. Because we want to allow lots of possible queue backends, the semantics of our queue API are not strong enough to support querying the queue for what tasks are already/still in there. So yes, +1 for a 'queued' column in aggregator like this.
My only (extremely minor, not really worth un-RTBC'ing this) complaint is that I think the test could be a bit more robust in one regard:
I think it'd be nice to assert there are 0 entries in {aggregator_item} after the call to removeFeedItems() before we hit cron.php again. While it's likely that functionality is already tested on its own, it seems like the other assertions in this test would be more dependable if we're sure we go from 0 to 5 to 0 to 5 again. ;)
And yeah, the lack of newlines is unfortunate. I'd re-roll but then my general agreement on the RTBCness of this patch wouldn't count. ;)
Comment #60
ksenzee
I had a few minutes during a meeting and this tab was open, so here's a reroll with newlines and the assertion dww wanted. I'm also attaching an interdiff to show those are the only changes between this and #55.
Comment #61
dww
Great, thanks -- looks good. This is once again RTBC unless the bot contradicts me. ;)
Comment #62
alex_b CreditAttribution: alex_b commented
ksenzee - thanks for the reroll.
Comment #63
ksenzee
#60: 578676-60_queue_locking.patch queued for re-testing.
Comment #64
webchick
chx gave me the Cliff's Notes on here. Eek! Committed to HEAD. Thanks!