Updated: Comment #0

Problem/Motivation

Thanks for the great module! I'm using Feeds with Job Scheduler and have been running into problems with duplicate queue entries. I'm reporting this issue as a bug here because I think the rescheduling of 'stuck' periodic jobs plays a role and I'm wondering if we could alter this behaviour to be a bit more Feeds friendly.

The discussion in #1231332: periodic import imports only one file per cron also illustrates this issue and tries to address it by making Feeds work harder on the queue:

"...You'll end up with a lot of items stuck in the queue table (Mine had 70,000+ rows of duplicate entries. That's because every scheduled run, all your feed sources get added to the queue, but only one or two get processed..." comment #19

In addition, there are discussions on creating unique queues in #1548286: API for handling unique queue items, which may provide alternative options for addressing this issue.

So here is what I think is happening (exaggerated and approximate for illustration purposes):

Say I have ~ 200 nodes to import and I schedule a periodic import interval of 2 hours.
Say that I set cron to run once per hour.

At 0 hours (first cron run):

  • job scheduler will select all 200 jobs (nodes to import) and add them to the queue setting them to reserved - next is set to 2 hours from now.
  • queue workers will process a subset of the 200 jobs in the queue (say 50) leaving 150 jobs in the queue.

At 1 hour (second cron run):

  • job scheduler will 'reschedule stuck jobs' - that is, it will set scheduled to 0 for the 150 jobs that are still in the queue
  • job scheduler won't select any new jobs (nodes to import) because next is in the future.
  • queue workers will process a subset of the 150 remaining jobs in the queue (say 50) leaving 100 jobs in the queue.

At 2 hours (third cron run):

  • job scheduler will select all 200 jobs (nodes to import), because scheduled was reset to 0 by the previous reschedule and the next time has passed, and add them to the queue, setting them to reserved - next is set to 2 hours from now.
  • queue workers will process a subset (say 50) of the 300 jobs now in the queue (200 new plus 100 left over), leaving 250 jobs in the queue.
  • assuming the workers took the oldest items first, 50 of the left-over jobs are still waiting, so we now have 50 jobs that appear twice in the queue.

Over time, as noted in #1231332: periodic import imports only one file per cron, we can end up with 70,000+ duplicate jobs in the queue.
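The walkthrough above can be replayed as a small simulation. This is only a sketch of the simplified model described in this summary, not Job Scheduler's actual code: the function name, the `queued_at` list (standing in for the scheduled field) and the assumption that workers take the oldest items first are all illustrative, and next is only updated when a job is queued.

```python
from collections import Counter, deque


def simulate(num_jobs=200, period=2, batch=50, stuck_after=1, cron_runs=3):
    """Replay the walkthrough; return (queue length, duplicated jobs) per cron run."""
    next_due = [0] * num_jobs        # stands in for 'next' (hour the job is due)
    queued_at = [None] * num_jobs    # stands in for 'scheduled' (None = not queued)
    queue = deque()
    history = []

    for now in range(cron_runs):
        # 1. 'Reschedule stuck jobs': clear the marker once it is stuck_after hours old.
        for job in range(num_jobs):
            if queued_at[job] is not None and now - queued_at[job] >= stuck_after:
                queued_at[job] = None

        # 2. Select every due job that is not marked as queued and add it to the queue.
        for job in range(num_jobs):
            if queued_at[job] is None and next_due[job] <= now:
                queue.append(job)
                queued_at[job] = now
                next_due[job] = now + period

        # 3. Workers process a limited batch, oldest items first (assumption).
        for _ in range(min(batch, len(queue))):
            job = queue.popleft()
            queued_at[job] = None

        copies = Counter(queue)
        duplicated = sum(1 for count in copies.values() if count > 1)
        history.append((len(queue), duplicated))

    return history


print(simulate())  # [(150, 0), (100, 0), (250, 50)]
```

With the default values this reproduces the numbers used above: 150 and 100 jobs left after the first two cron runs, then 250 jobs with 50 of them queued twice after the third.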

If my interpretation is correct then my thinking is:

  1. users are not going to be able to accurately predict what the optimum cron schedule and Feeds periodic import schedule combination should be in order to avoid duplicate jobs - therefore ideally we want to avoid adding items to the queue if they are still waiting to be processed
  2. we already know if an item is still in the queue (scheduled is > 0) which prevents us from adding it again, however only for the first hour - after a fixed time of 1 hour we reset scheduled to 0, but it may legitimately take longer than 1 hour for items to be processed
  3. next is set when the job is added to the queue, not when it is processed, which also allows items to be added to the queue more than once. Correction: the FeedsSource::feeds_source_import worker does in fact reschedule the job when it picks a feed off the queue.

Proposed resolution

So, in conclusion, possible workarounds (or a combination of workarounds) might be:

  1. don't reschedule items after a fixed time of 1 hour?
  2. if a catch is needed for 'stuck' jobs, perhaps it can be adjustable or relative to the cron schedule, since it can take longer than one hour for Feeds jobs to be processed?
  3. if a catch is needed for 'stuck' jobs, then also remove the job from the queue before rescheduling it, to avoid duplicate entries?
  4. adjust the next value so that it is the time the queue processed the job, not the time it was added to the queue?
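The underlying goal of these options, not adding an item that is still waiting, can be sketched as a queue that refuses duplicates. This is only an illustration in the spirit of the unique queue discussion in #1548286; the `UniqueQueue` class and its methods are hypothetical and are not part of Drupal's queue API or of Job Scheduler.

```python
from collections import deque


class UniqueQueue:
    """FIFO queue that ignores an item whose key is already waiting (hypothetical)."""

    def __init__(self):
        self._items = deque()
        self._pending = set()   # keys currently waiting in the queue

    def add(self, key):
        """Queue the key unless it is already waiting; return True if it was added."""
        if key in self._pending:
            return False
        self._pending.add(key)
        self._items.append(key)
        return True

    def claim(self):
        """Take the oldest key off the queue, or return None if the queue is empty."""
        if not self._items:
            return None
        key = self._items.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._items)


queue = UniqueQueue()
for feed in range(200):          # first cron run queues all 200 feeds
    queue.add(feed)
for _ in range(50):              # workers only get through 50 of them
    queue.claim()
accepted = sum(queue.add(feed) for feed in range(200))  # a later run tries all 200 again
print(accepted, len(queue))      # 50 200
```

In this sketch the second attempt only accepts the 50 feeds that were already processed, so the queue never grows beyond one entry per feed regardless of how the cron and import intervals line up.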

Thanks for taking the time to read through this. Hopefully it makes sense. I'd appreciate any comments/thoughts you have on whether this is a legitimate bug/issue relevant to Job Scheduler or whether I'm barking up the wrong tree so to speak.

Comments

Assuming that my theories are correct, here is a patch that increases the fixed wait before 'stuck' periodic jobs are rescheduled from one hour to one week. This is a band-aid rather than a robust solution: it attempts to avoid duplicate queue items by giving workers more time to process scheduled queue items before they are deemed to be 'stuck' and rescheduled.

Version: 7.x-2.0-alpha3 » 7.x-2.x-dev

Rather than keeping track of how long a job has been scheduled, I think it would make more sense to keep track of the number of cron passes that a job has had. Say, reschedule after 5 passes. That would make it relative to the cron interval.
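A minimal sketch of that idea, assuming each job record carries a pass counter. The `release_stuck_jobs` function, the `passes` field and the dict layout are hypothetical and do not exist in Job Scheduler; the threshold of 5 is the example figure from the paragraph above.

```python
def release_stuck_jobs(jobs, max_passes=5):
    """Count one cron pass for every scheduled job and release those that hit the limit.

    jobs is a list of dicts with 'name', 'scheduled' (bool) and 'passes' (int).
    Returns the names of the jobs released on this pass.
    """
    released = []
    for job in jobs:
        if not job["scheduled"]:
            continue
        job["passes"] += 1
        if job["passes"] >= max_passes:
            # Treat the job as stuck: clear the marker so it can be selected again.
            job["scheduled"] = False
            job["passes"] = 0
            released.append(job["name"])
    return released


jobs = [
    {"name": "feed-1", "scheduled": True, "passes": 0},
    {"name": "feed-2", "scheduled": False, "passes": 0},
]
for cron_pass in range(1, 6):
    print(cron_pass, release_stuck_jobs(jobs))  # only pass 5 releases feed-1
```

Because the counter advances once per cron run, the effective timeout scales with whatever cron interval the site uses instead of being fixed at one hour.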

As far as I can tell, there's no way to remove specific queue items, so that won't work.

I wonder if we could get rid of the stuck job handling and make clients do it manually. That would be an API change, so it is not on the table now. Maybe a special exception could be thrown when a job is stuck.

Issue summary: View changes

Correction to claim that next is not updated when job is processed.