This has two parts, which I'll probably do both at the same time – therefore, a single issue should be enough.

First off, the cron tasks should use a cron queue – see hook_cron_queue_info().

Secondly, when indexing directly via the "Index now" form, this should use the Batch API, at least for more than a few items. See Batch API.
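For reference, a minimal hook_cron_queue_info() implementation could look like the following sketch. The queue name, worker callback name, and time value are illustrative assumptions, not taken from an actual patch:

```php
/**
 * Implements hook_cron_queue_info().
 *
 * Hypothetical sketch: queue and callback names are assumptions.
 */
function search_api_cron_queue_info() {
  $queues['search_api_indexing_queue'] = array(
    // Called once per queued item, with the item's data as argument.
    'worker callback' => 'search_api_indexing_queue_process',
    // Maximum time (in seconds) to spend on this queue per cron run.
    'time' => 15,
  );
  return $queues;
}
```

Items would then be enqueued in hook_cron() via `DrupalQueue::get('search_api_indexing_queue')->createItem($data)` and processed by the worker callback during subsequent cron runs.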

Comments

drunken monkey’s picture

Title: Use a cron queue and the Batch API for indexing » Use a cron queue for indexing
Component: Code » Framework

Wow, the oldest open issue in the Search API, finally about to be resolved – well, at least half of it. Patch coming soon.
See #1225620: Use the Batch API for the "Index now" functionality for the other half, which might well take another nine months …

drunken monkey’s picture

Status: Active » Needs review
Attachment: new patch, 11.01 KB

Patch here, please test/review.

One thing I couldn't really pin down is: When do the items in the queue get processed? I thought this was controlled by the "time" setting, and items would get processed automatically on page requests? But after a bit of testing, this doesn't seem to be the case. When running cron, the items would be inserted in the queue. Then, immediately some of the items would be processed. And then the other items would just lie in the queue, unprocessed. Needless to say, this has to be cleared up before this can be committed.

Does anyone know more about this? Documentation for this is kind of useless, once again.

fago’s picture

>And then the other items would just lie in the queue, unprocessed.
They are processed on the next cron runs.

>+ 'time' => 10,
Why only 10 seconds? Any reason not to stay with the default?

Apart from that patch looks good to me, though I've not tested it yet.

drunken monkey’s picture

Status: Needs review » Needs work

Oh, OK, seems like I misunderstood the concept a bit. No wonder, without any real documentation that doesn't assume you already know what it's talking about …

Corrected patch coming soon (the current one will put items into the queue several times).

drunken monkey’s picture

Hm, still thinking about how to best correct this …

1) Store items which are already in the queue in a variable, or in the database.
2) Mark items as indexed when inserting them into the queue, and just mark them as dirty again if they couldn't be indexed.
3) Make sure the items still need to be indexed in the worker callback.
4) Don't include items in the task, but only the batch size, and retrieve the concrete items in the worker callback.

I first went with 2), but then realized that this would essentially make the progress display on the "Status" tab useless, as it would hit 100% as soon as the first cron run starts, even though it could still take weeks until the items are actually indexed. About 1) I'm really not sure, and 3) would necessitate a new API function for checking whether specific items still need to be indexed – something we didn't need before, which is especially bad due to #1064884: Add support for indexing non-entities.
At the moment, 4) sounds best to me – it doesn't seem to have any real disadvantages, other than that there might be several no-op tasks lying in the queue (which, I guess, wouldn't do any harm). It just seems to me to be a bit against the spirit of cron queues. Any opinions there?
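A rough sketch of what 4) could look like as a worker callback. The queued task carries only the index ID and a batch size, and the concrete items are looked up at processing time; all function and key names here are assumptions based on the discussion, not code from the patch:

```php
/**
 * Queue worker for approach 4) (hypothetical sketch).
 *
 * The task data contains only the index ID and a limit; which items
 * actually still need indexing is determined at processing time.
 */
function search_api_indexing_queue_process($task) {
  $index = search_api_index_load($task['index_id']);
  if (!$index) {
    return;
  }
  // Re-check which items still need indexing. If an earlier task already
  // handled them, this task simply becomes a cheap no-op.
  $ids = search_api_get_items_to_index($index, $task['limit']);
  if ($ids) {
    search_api_index_specific_items($index, $ids);
  }
}
```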

drunken monkey’s picture

Status: Needs work » Needs review
Attachment: new patch, 11.55 KB

OK, this one implements 4), and seems to work fine.

fago’s picture

Status: Needs review » Needs work

Uhm, no, 4) doesn't work for a queue, as two parallel workers could process the same items – ouch.

To keep the status information up-to-date, I'd propose adding a new "queued" flag to the table, then only removing entries once the queue worker was successful, and only adding entries if there is no entry at all. That should do it, no?
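The proposed "queued" flag could be sketched roughly like this; the table and column names are assumptions for illustration, not taken from the actual patch:

```php
// Hypothetical sketch of the proposed "queued" flag.

// When enqueueing: mark items as queued, and only enqueue items that
// aren't already queued, so nothing is added twice.
db_update('search_api_item')
  ->fields(array('queued' => REQUEST_TIME))
  ->condition('index_id', $index->id)
  ->condition('item_id', $ids, 'IN')
  ->condition('queued', 0)
  ->execute();

// When the worker succeeds: clear both flags, so the "Status" tab's
// progress display stays accurate until items are actually indexed.
db_update('search_api_item')
  ->fields(array('changed' => 0, 'queued' => 0))
  ->condition('index_id', $index->id)
  ->condition('item_id', $indexed_ids, 'IN')
  ->execute();
```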

drunken monkey’s picture

> uhm, no 4) doesn't work for a queue as two parallel workers would process the same items - ouch.

As long as cron runs don't overlap, this should never happen, I think. The workers aren't called in parallel (how would they be, in PHP?) by drupal_cron_run(). Also, even if this should happen in some rare instances, there's really little harm done, only a slight performance loss.

Your suggestion would be what I listed as 1). While possible, I guess this would also need changes to the data source controller in #1064884: Add support for indexing non-entities, which I'd like to avoid, if possible. And when we'd have to change it, I'd rather go with 3), I think (as it at least wouldn't need another field).

fago’s picture

>As long as cron runs don't overlap, this should never happen, I think.

That's the main plus of using the queue: items may safely be processed in parallel. Look at drupal_cron_run(), it releases the lock before it works through the queues.
So cron should run shortly, then long-running operations can happen in the queue. If in the meantime the next cron run starts, everything is fine...

drunken monkey’s picture

> So cron should run shortly, then long-running operations can happen in the queue. If in the meantime the next cron run starts, everything is fine...

It's not long-running, it only runs 15 seconds. I don't think parallel processing is the aim of cron queues, but rather that the operations can safely be interrupted and continued later, without the danger of a timeout.
I don't really see this happening in any way unless a) some module sometimes has very long-running items in the queue (and the PHP timeout is set to an absurdly high value), or b) the cron interval is less than a minute. Oh, or c) the user sets the cron batch size to a very large value (or -1), but that's a configuration error then – and the cron interval would still have to be less than the PHP timeout.

drunken monkey’s picture

Attachment: new patch, 11.48 KB

Patch adapted to latest changes in HEAD.

drunken monkey’s picture

Attachment: new patch, 11 KB

Oops, missed a merge conflict …

fago’s picture

Point is, a), b) and c) may well happen, and d) it's 15 seconds per queue, not in total.

And the docs of hook_cron_queue_info() say:

> While there can be only one hook_cron() process running at the same time, there can be any number of processes defined here running. Because of this, long running tasks are much better suited for this API. Items queued in hook_cron() might be processed in the same cron run if there are not many items in the queue, otherwise it might take several requests, which can be run in parallel.
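This is visible in how Drupal core works through cron queues. A simplified sketch of the relevant loop in drupal_cron_run() (paraphrased, not the exact core code): each worker claims items with a lease via claimItem(), so two parallel requests never process the same claimed item.

```php
// Simplified paraphrase of the queue-processing loop in drupal_cron_run().
foreach (module_invoke_all('cron_queue_info') as $queue_name => $info) {
  $function = $info['worker callback'];
  // Spend at most 'time' seconds (default 15) on this queue per run.
  $end = time() + (isset($info['time']) ? $info['time'] : 15);
  $queue = DrupalQueue::get($queue_name);
  // claimItem() leases the item, hiding it from other workers until
  // it is deleted or the lease expires.
  while (time() < $end && ($item = $queue->claimItem())) {
    $function($item->data);
    $queue->deleteItem($item);
  }
}
```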

drunken monkey’s picture

Status: Needs work » Needs review
Issue tags: +API change
Attachment: new patch, 18.5 KB
fago’s picture

Patch looks good, though I've not tested it yet.

Another remark: What happens if an entity is deleted while it is queued? Probably indexing fails, but is it going to be wrongly re-queued too?

drunken monkey’s picture

Deleted items are removed from the indexing table, and therefore can't be marked as queued/changed/indexed anymore. The attempt is still made in the worker callback, but it shouldn't have any effect.

drunken monkey’s picture

I've had another idea: a hidden variable for the runtime of the queue per cron run. Patch upcoming.
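Such a hidden setting could be wired into the queue declaration roughly like this; the variable name (and the surrounding hook sketch) are assumptions for illustration:

```php
/**
 * Implements hook_cron_queue_info().
 *
 * Hypothetical sketch: a hidden variable (no UI) overrides the maximum
 * processing time per cron run; names are assumptions.
 */
function search_api_cron_queue_info() {
  $queues['search_api_indexing_queue'] = array(
    'worker callback' => 'search_api_indexing_queue_process',
    // Site builders can tune this via variable_set() or settings.php
    // without any settings form.
    'time' => variable_get('search_api_index_worker_callback_runtime', 15),
  );
  return $queues;
}
```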

fago’s picture

Status: Needs review » Needs work

Oops, sorry, I forgot to give feedback! It works fine for me; a hidden setting (variable) for the maximum queue time would be good to have.

fago’s picture

:P

drunken monkey’s picture

Status: Needs work » Needs review
Attachment: new patch, 19.32 KB
fago’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me.

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

> Looks good to me.

You and your sloppy reviews …
But after I fixed the patch, I think it now works. Committed.

Status: Fixed » Closed (fixed)
Issue tags: -API change

Automatically closed -- issue fixed for 2 weeks with no activity.