Thanks for the great module! I'm using Feeds with Job Scheduler and have been running into problems with duplicate queue entries. I'm reporting this issue as a bug here because I think the rescheduling of 'stuck' periodic jobs plays a role and I'm wondering if we could alter this behaviour to be a bit more Feeds friendly.
Discussions in a related issue also illustrate this problem and try to address it by making Feeds work harder on the queue:
"...You'll end up with a lot of items stuck in the queue table (Mine had 70,000+ rows of duplicate entries). That's because every scheduled run, all your feed sources get added to the queue, but only one or two get processed..." comment #19
In addition, there are discussions on creating unique queues, which may provide alternative options for addressing this issue.
So here is what I think is happening (exaggerated and approximate for illustration purposes):
Say I have ~ 200 nodes to import and I schedule a periodic import interval of 2 hours.
Say that I set cron to run once per hour.
At 0 hours (first cron run):
- job scheduler will select all 200 jobs (nodes to import), add them to the queue, and mark them as reserved; next is set to 2 hours from now.
- queue workers will process a subset of the 200 jobs in the queue (say 50), leaving 150 jobs in the queue.
At 1 hour (second cron run):
- job scheduler will 'reschedule stuck jobs'; that is, it will set scheduled back to 0 for the 150 jobs that are still in the queue.
- job scheduler won't select any new jobs (nodes to import) because next is still in the future.
- queue workers will process a subset of the 150 remaining jobs (say 50), leaving 100 jobs in the queue.
At 2 hours (third cron run):
- job scheduler will select all 200 jobs (nodes to import) again (because scheduled was reset to 0 by the previous reschedule and the next time has passed), add them to the queue, and mark them as reserved; next is set to 2 hours from now.
- queue workers will process a subset of the 200 new jobs plus the 100 previously queued jobs (say 50), leaving 250 jobs in the queue.
- we now have 50 jobs that appear twice in the queue.
Over time, as noted in the comment quoted above, we can end up with 70,000+ duplicate jobs in the queue.
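To double-check my arithmetic, the timeline above can be sketched as a toy simulation. This is illustrative Python, not Job Scheduler's actual code; the worker capacity (50 items per run), the 2-hour period, and the 1-hour stuck-job reset are the assumed values from the example:

```python
# Toy simulation of the timeline above. NOT Job Scheduler's code; it just
# models the interaction described: hourly cron, a 2-hour import period,
# 200 feed sources, workers that drain 50 queue items per run, and the
# 1-hour "stuck job" reset that clears the reserved flag.

PERIOD = 2            # hours between periodic imports
STUCK_RESET = 1       # hours after which a reserved job is reset
WORKER_CAPACITY = 50  # queue items processed per cron run
NUM_JOBS = 200        # feed sources

queue = []                                      # job ids; duplicates possible
scheduled = {j: None for j in range(NUM_JOBS)}  # hour reserved, None = free
next_run = {j: 0 for j in range(NUM_JOBS)}      # earliest hour to queue again

history = []
for hour in range(5):
    # 1. Reset "stuck" jobs reserved more than STUCK_RESET hours ago,
    #    WITHOUT removing them from the queue.
    for j in range(NUM_JOBS):
        if scheduled[j] is not None and hour - scheduled[j] >= STUCK_RESET:
            scheduled[j] = None
    # 2. Queue every job whose period has elapsed and that is not reserved.
    for j in range(NUM_JOBS):
        if scheduled[j] is None and hour >= next_run[j]:
            queue.append(j)
            scheduled[j] = hour
            next_run[j] = hour + PERIOD  # next set at add time, not process time
    # 3. Workers drain a fixed number of items per run.
    del queue[:WORKER_CAPACITY]
    duplicates = len(queue) - len(set(queue))
    history.append((hour, len(queue), duplicates))
    print(f"hour {hour}: queue={len(queue)} duplicates={duplicates}")
```

At hour 2 the simulated queue holds 250 items with 50 duplicates, matching the walkthrough, and the duplicates keep growing on every subsequent even-numbered run.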
If my interpretation is correct then my thinking is:
- users are not going to be able to accurately predict the optimum combination of cron schedule and Feeds periodic import schedule needed to avoid duplicate jobs; ideally, therefore, we want to avoid adding items to the queue while they are still waiting to be processed
- we already know when an item is still in the queue (scheduled > 0), which prevents us from adding it again, but only for the first hour: after a fixed time of 1 hour we reset scheduled to 0, yet it may legitimately take longer than 1 hour for items to be processed
- next is set to the time the job is added to the queue, not the time it is processed, which also allows items to be added to the queue more than once. Actually, the FeedsSource::feeds_source_import worker reschedules the job when it picks a feed off the queue.
So, in conclusion, possible workarounds (or combinations of workarounds) might be:
- don't reschedule items after a fixed time of 1 hour?
- if a catch is needed for 'stuck' jobs, perhaps it could be adjustable or relative to the cron schedule, since it can take longer than one hour for Feeds jobs to be processed?
- if a catch is needed for 'stuck' jobs, also remove the job from the queue before rescheduling it, to avoid duplicate entries?
- adjust the next value so that it reflects the time the queue processed the job, not the time it was added to the queue?
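One of the workarounds above, removing the job from the queue before rescheduling it, could look roughly like this. This is a hypothetical Python sketch of the idea, not Job Scheduler's actual API; the `requeue_stuck_job` helper and the list-based queue are my own illustration:

```python
# Hypothetical sketch of "remove before reschedule": when a stuck job is
# reset, purge any copy of it still sitting in the queue first, so a given
# job id can never appear twice. Not Job Scheduler's actual API.
from collections import deque

queue = deque()

def requeue_stuck_job(job_id):
    """Drop any queued copies of job_id, then queue it exactly once."""
    # The purge below is the step the current reschedule skips.
    remaining = [j for j in queue if j != job_id]
    queue.clear()
    queue.extend(remaining)
    queue.append(job_id)

queue.extend([1, 2, 3, 2])  # job 2 is sitting in the queue twice
requeue_stuck_job(2)
print(list(queue))          # exactly one copy of job 2 survives
```

The same guarantee could presumably be had in Drupal by deleting the job's queue rows (keyed by name/type/id) inside the reschedule step before re-inserting it.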
Thanks for taking the time to read through this. Hopefully it makes sense. I'd appreciate any comments/thoughts you have on whether this is a legitimate bug/issue relevant to Job Scheduler or whether I'm barking up the wrong tree so to speak.