This is a follow-up to
#774682: Convert to use Drupal Queue

Since converting to Drupal Queue, the number of jobs in my test site's queue keeps growing. (I only realized this because I implemented a fix for #1432050: Show parsing status on branch admin page, and saw the number growing instead of shrinking on each cron run.)

The reason is that when we were using Job Queue, we were calling job_queue_add() with a $no_duplicates = TRUE argument, which ensured that no duplicate jobs were added. Now that we are using Drupal Queue, duplicate jobs are added on each cron run: every cron run, the parser checks what needs updating (which files have been touched since they were last parsed) and adds all of them to the Drupal Queue.
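To illustrate the difference, here is a hedged sketch of the old call versus the new one. The exact arguments we pass are paraphrased; job_queue_add()'s final parameter is the $no_duplicates flag mentioned above, while DrupalQueue's createItem() has no such option.

```php
<?php
// Old Job Queue call: the TRUE argument told job_queue_add() to skip
// adding the job if an identical one was already queued.
job_queue_add('api_parse_file', t('Parse file'), array($file), '', TRUE);

// Drupal Queue equivalent: createItem() does no duplicate checking,
// so the same file is queued again on every cron run until it is parsed.
drupal_queue_include();
$queue = DrupalQueue::get('api_parse');
$queue->createItem(array($file));
?>
```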

There doesn't seem to be a way to tell Drupal Queue to avoid adding duplicates, unfortunately. The Drupal Queue that we are using is a backport of the Drupal 7 core queue, and it doesn't have this ability.

We might need to go back to Job Queue... The only other option I can think of is keeping track in the API module. This is actually feasible: we could record queued items in a table and then, when the parse function is called, remove them again. But it seems a little pointless to use Drupal Queue if we have to keep track ourselves.

Anyway, the current state of things is *not* OK. It would never finish parsing.

Comments

jhodgdon’s picture

I had a conversation with drumm about this in IRC:

(02:38:10 PM) drumm: jhodgdon: I'm not maintaining Job Queue at all, and don't really want to see it maintained since Drupal queue is supposed to replace it

(02:40:21 PM) drumm: jhodgdon: I think we probably want to add a "status" column to api_file for marking this. Not sure if status is the right name. Anyway, needs to store at least "queued, or up to date"

(02:40:56 PM) drumm: maybe a "not up to date & not queued" status, but that doesn't really happen I think
(02:40:57 PM) jhodgdon: drumm: I guess it would need to store a "when" in that column perhaps?
(02:41:09 PM) jhodgdon: drumm: yeah... ok I can work with that idea
(02:41:30 PM) drumm: jhodgdon: there is the timestamp column we are already using, but I'm wary of being too clever with it

I think this is probably the right approach to fixing this problem.
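The column drumm suggests could be added with a schema update along these lines. This is a hypothetical sketch: the update number, the column name 'queued', and the choice of a timestamp (rather than a status string) are my assumptions, not the committed code.

```php
<?php
/**
 * Add a 'queued' timestamp column to {api_file}.
 *
 * Hypothetical Drupal 6 hook_update_N() sketch; a value of 0 means
 * "not currently queued", anything else is the time the file was
 * added to the parse queue.
 */
function api_update_6xxx() {
  $ret = array();
  db_add_field($ret, 'api_file', 'queued', array(
    'description' => 'Timestamp when this file was added to the parse queue, or 0.',
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
  ));
  return $ret;
}
?>
```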

jhodgdon’s picture

More thoughts on this after more IRC conversation with drumm:

We also need to consider what to do for new files, which don't yet have entries in {api_file} or {api_documentation}. There are two possible approaches:

a) When encountering a new file, save it in both tables, with a stub entry. This means the file might show up in listings before it is parsed, but that is probably OK.

b) Taking a look at the columns in {api_file}, none of them are currently really being used except the 'modified' column, which stores the modified time of the file that was parsed. Other columns that were there at one time have not been used for a while (except the 'version' column, which tries to extract the CVS ID from a file; we could get rid of that, since we don't have CVS IDs any more anyway). So basically this table exists only so the parser can figure out whether a file needs parsing again.

So, we could totally redo this table. Right now it stores the file's documentation ID (did) and the time modified. It could instead store the branch ID, file name, time modified, and time queued. This would allow us to put new files in there. Then, when a file is parsed, the time modified would be updated and the time queued set back to 0.

I'm in favor of (b), because it would keep the {api_file} table self-sufficient for the parser, and it wouldn't be needed by the rest of the API module any more. We should get rid of those extraneous columns in either case, too.
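For concreteness, option (b) might look like this as a Schema API fragment. The field names and the composite primary key are assumptions on my part, just to show the shape of the reworked table.

```php
<?php
// Hypothetical hook_schema() fragment for the reworked {api_file}
// table described in (b). One row per file per branch; 'queued' is 0
// when the file is not waiting in the parse queue.
$schema['api_file'] = array(
  'description' => 'Parser bookkeeping: which files need (re)parsing.',
  'fields' => array(
    'branch_id' => array('type' => 'int', 'not null' => TRUE),
    'filename' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
    'modified' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
    'queued' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
  ),
  'primary key' => array('branch_id', 'filename'),
);
?>
```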

jhodgdon’s picture

And one more thing. In the update function that will go with this change, I think we need to do something like this to get rid of the existing jobs in the queue (for instance, on my test site right now I have 14000 of them):

// Delete all existing api_parse jobs in the queue.
drupal_queue_include();
$queue = DrupalQueue::get('api_parse');
$queue->deleteQueue(); 

The jobs that are needed will be re-created the next cron run anyway.

gordon’s picture

Yes, I found this problem as well. In the end, since I am using Beanstalkd and running the worker in a shell, I just slowed down the cron so the worker could keep up.

I did take a look into adding a last-queued time to the database, then checking whether a file had been queued within the last x amount of time and skipping it if it was too soon.

jhodgdon’s picture

I think the approach of just recording that a file has been queued, and then zeroing that queued time out when it's parsed, will work better than trying to guess whether it's been too long.
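The check during cron could be sketched roughly like this. All the variable and column names here are assumptions used for illustration, not the actual API module code.

```php
<?php
// Sketch of the per-file queuing decision during cron (assumed names).
// Only queue a file if it has changed since its last parse AND it is
// not already sitting in the queue (queued == 0).
if ($file_mtime > $record->modified && $record->queued == 0) {
  drupal_queue_include();
  $queue = DrupalQueue::get('api_parse');
  $queue->createItem(array('branch_id' => $branch_id, 'file' => $path));
  // Remember that this file is queued, so later cron runs skip it.
  db_query("UPDATE {api_file} SET queued = %d WHERE did = %d", time(), $record->did);
}
// The queue worker, after parsing, sets 'modified' to the file's mtime
// and resets 'queued' to 0, making the file eligible for queuing again.
?>
```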

jhodgdon’s picture

Two more thoughts:
@gordon: If you know of or find any other issues with the API module, please file issues about them, so that we can fix them. :)

I think I'll also add a "clear queue" button to the API admin page. This would:
a) Remove everything from the queue (see code in #3).
b) Set the queued time to zero for all the files.
After that, at next cron run, files would once again be checked to see if they need reparsing based on the usual criteria.
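A submit handler for that button might look like the following. This is a hypothetical sketch: the function name and the 'queued' column are assumptions tied to the approach discussed above.

```php
<?php
/**
 * Submit handler for a hypothetical "Clear queue" button on the API
 * admin form (function and column names are assumptions).
 */
function api_clear_queue_submit($form, &$form_state) {
  // (a) Throw away all pending api_parse jobs.
  drupal_queue_include();
  $queue = DrupalQueue::get('api_parse');
  $queue->deleteQueue();
  // (b) Mark every file as not queued, so the next cron run re-checks
  // each file against the usual criteria and re-queues as needed.
  db_query("UPDATE {api_file} SET queued = 0");
  drupal_set_message(t('The API parse queue has been cleared.'));
}
?>
```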

gordon’s picture

Firstly, I send through as much as I can, but for some issues I am not 100% sure whether they are API core issues or problems with my highly modified version, and sometimes I just don't have the time to work out exactly which side of the fence they are on.

a) I really think that clearing the queue is something that should be done by the queue management software. For example, Beanstalkd doesn't implement ::deleteQueue(), mainly because the Beanstalkd server doesn't support deleting queues, and having it delete every job on a queue could, depending on how many there are, kill the system. I can't remember if I have implemented deleting all jobs on a queue from drush, but it is something I will implement in the future if it is needed.

b) I started on an approach to this problem where the code would set the last-queued time, and any subsequent attempt to queue a file would be skipped if the queued time was greater than the modified time.

c) I think this should be a drush command, which would allow the admin to do this from the command line, or to call it from an administration system like webmin.

I am going to take another look at what I did and see if I can get it going. I need to sync up my version of api with the d.o version.

Gordon.

jhodgdon’s picture

RE #7

a) If Beanstalkd doesn't implement ::deleteQueue(), then it isn't implementing the full interface. We are using Drupal Queue, so I should be able to rely on the full interface being there; I can't help it if the queue backend someone decides to use is an incomplete implementation.

b) I think we'll be going for a slightly different approach (see above).

c) Having a Drush command as well as a button on the admin page is probably a good idea.
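A Drush command doing the same thing as the admin-page button could be sketched as follows. The command name 'api-reset-queue' and the 'queued' column are my assumptions, shown only to illustrate the Drush plumbing.

```php
<?php
/**
 * Implements hook_drush_command().
 *
 * Hypothetical sketch of a Drush command mirroring the admin-page
 * "clear queue" button; the command name is an assumption.
 */
function api_drush_command() {
  $items['api-reset-queue'] = array(
    'description' => 'Clear the api_parse queue and mark all files as not queued.',
  );
  return $items;
}

/**
 * Drush callback: clear the queue and reset the queued timestamps.
 */
function drush_api_reset_queue() {
  drupal_queue_include();
  DrupalQueue::get('api_parse')->deleteQueue();
  db_query("UPDATE {api_file} SET queued = 0");
  drush_log(dt('API parse queue cleared.'), 'ok');
}
?>
```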

jhodgdon’s picture

Status: Active » Fixed

I've fixed this. I decided to go with option (a) from #2 (stub entries), as it required the fewest code changes. This could be refactored later.

This commit includes:
- Fix for the bug.
- Ability to reset the queue from the admin interface or from drush.
- #1432050: Show parsing status on branch admin page (very related).
- Tests for the queue behavior.
- Remove the two old, unused fields from the {api_file} table.

http://drupalcode.org/project/api.git/commit/316080d

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.