Sometimes Solr is not accessible, and in the case of some hosted Solr services (like Acquia Search) the server is sometimes configured to be read-only.
This is a problem for the indexing flow, because when we call

public function indexItems(SearchApiIndex $index, array $items) {

the call to $this->solr->addDocuments($documents) inside it will fail and the error is caught, but the items are not re-added to the queue of items to be indexed. We need a way to reliably detect when Solr is not available and to guarantee that failed items will be retried. Input on how this retry should happen is welcome.
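
For reference, the Solr service's indexItems() roughly follows the pattern sketched below (simplified, not the exact module code): the exception is swallowed and an empty array is returned, so the caller only learns that "zero items were indexed".

public function indexItems(SearchApiIndex $index, array $items) {
  $documents = array();
  $ret = array();
  // ... build one Solr document per item, collecting the item IDs in $ret ...
  try {
    $this->solr->addDocuments($documents);
    $this->solr->commit();
    // On success, all item IDs are reported as indexed.
    return $ret;
  }
  catch (Exception $e) {
    watchdog_exception('search_api_solr', $e);
  }
  // On failure nothing is reported as indexed, but nothing is re-queued either.
  return array();
}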

In the case of Acquia Search the response code is a 403 Access Denied, but I'm sure this applies to other error codes as well. It's not a problem if the indexing job fails, throws an exception and writes a message to watchdog; the problem is that the item will not be indexed until it gets re-added to the Search API queue/system.

We would need a test that covers this behavior so it does not recur.


Comments

drunken monkey’s picture

Project: Search API Solr » Search API
Component: Code » Framework

I've now analyzed what happens, and the Solr module isn't really involved, as far as I can tell. As you write, the error (server is read-only) is correctly detected like any other error (completely unreachable server, wrong configuration, etc.) and an empty array (= "none of the items could be indexed") is returned to the Search API.

There the problem starts, though. While we correctly check for this case in the Batch API integration ("Index now") and abort¹ (assuming that there's a general problem with indexing), this doesn't happen during cron runs or when calling search_api_index_items() (which uses the queue as well, to index items in the queue before others). For the latter this would be easy to fix, of course, but I don't think it's possible during cron runs (without manually deleting all tasks in the queue).

The second problem is that, even when no search items² could be indexed for a certain queue item, the item is deleted – again both during cron runs and in the function. This leads to the behavior you report, of the search items not being indexed until all others have been indexed successfully – which, while not really violating any contracts, is of course not optimal.

Also not optimal is that our cron task will run for the full 15 seconds, burning up queue items, even when it becomes apparent in the first second that indexing is currently not possible.

I could imagine two solutions here:

  1. The smaller change would be to re-implement drupal_cron_run()'s queue logic in search_api_cron(), processing our queue items there (instead of registering the queue with hook_cron_queue_info()) and allowing us to also release an item again, or to abort indexing (for this index) completely (see the sketch after this list). We'd then just have to fix search_api_index_items(), too, and this issue should be fixed.
  2. However, as you pointed out in the Prague Solr BoF, a queue is conceptually inadequate for dealing with indexing, since indexing of items should follow an order different from FIFO (i.e., first index new items, then changed items, both in the order of when they were changed). Therefore, re-vamping the internals of how we index items to get rid of cron queues again might be the better solution here. Some of the datasource controller code would then be useless, but that can't be helped. We could also still run 15 seconds in the cron hook and index as many batches of items as possible, to maintain the meaning of the "Cron batch size" setting and the search_api_index_worker_callback_runtime variable. (Otherwise, indexing might be slowed down, especially when initially indexing items, if only cron is used. In the worst case, indexing might become too slow to keep up with new items permanently on some sites.)
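
A rough sketch of what option 1 could look like (treat the queue name, the structure of the queue item data and the helper function as assumptions, not as the final implementation):

function search_api_cron() {
  $queue = DrupalQueue::get('search_api_indexing_queue');
  $end = time() + 15;
  while (time() < $end && ($task = $queue->claimItem())) {
    // Assume each queue item holds the index ID and the IDs of the items to index.
    list($index_id, $item_ids) = $task->data;
    $index = search_api_index_load($index_id);
    try {
      if ($index && search_api_index_specific_items($index, $item_ids)) {
        $queue->deleteItem($task);
        continue;
      }
    }
    catch (SearchApiException $e) {
      watchdog_exception('search_api', $e);
    }
    // Nothing could be indexed (or an error occurred): release the task back
    // into the queue and stop indexing for this cron run.
    $queue->releaseItem($task);
    break;
  }
}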

For reference, the indexing via cron queue was implemented in #946624: Use a cron queue for indexing. The issue sadly doesn't contain the reasons why this was done (and I don't really remember them), but parallel indexing of items is cited as one of the advantages. However, after some short testing I'm not convinced this really works so smoothly.
Also, in the time since then we've added the Batch API integration, so indexing large numbers of items doesn't require running cron repeatedly anymore.

For these reasons, I think I'd actually prefer 2. (it would also make the indexing/cron code a lot easier to read and understand), but I'm open to other arguments or suggestions.

¹ Since the Batch API integration internally uses search_api_index_items(), though, the problem is the same there too as long as there are items in the cron queue.
² Confusing, both "search items" and "queue items" …

drunken monkey’s picture

And yes, having a test for this, even though it didn't turn out to be Solr-related, would of course also make sense.

drunken monkey’s picture

Status: Active » Needs review
Issue tags: +API change
FileSize
14.97 KB

Attached is a first implementation of the latter idea, just getting rid of the cron queue for indexing. Unless I've overlooked something major, it went very smoothly.

I have left the search_api_index_worker_callback_runtime variable and associated functionality in there, so we continue to index more batches in cron runs, depending on the time. We can now go about it in a much fairer way, though, by indexing a batch for each index in turn until we run out of time, with at least one batch for each index. (I think that's the best approach: otherwise, indexes ordered to the front could starve others if the time to run is too short.)
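
Roughly, the round-robin logic looks like the sketch below (a simplified approximation of the patch, not its exact code; the loading conditions and option names are assumptions):

function search_api_cron() {
  $end = time() + variable_get('search_api_index_worker_callback_runtime', 15);
  $indexes = search_api_index_load_multiple(FALSE, array('enabled' => TRUE, 'read_only' => 0));
  $first_pass = TRUE;
  while ($indexes && ($first_pass || time() < $end)) {
    // Index one batch ("Cron batch size" items) for each index in turn, so a
    // single index cannot starve the others.
    foreach ($indexes as $id => $index) {
      try {
        if (!search_api_index_items($index, $index->options['cron_limit'])) {
          // Nothing left to index (or nothing could be indexed): drop the index.
          unset($indexes[$id]);
        }
      }
      catch (SearchApiException $e) {
        watchdog_exception('search_api', $e);
        unset($indexes[$id]);
      }
    }
    // After the first pass, every index has gotten at least one batch; keep
    // going only while there is time left.
    $first_pass = FALSE;
  }
}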

I've also left the logic for tracking queued items in the datasource controller (and the module), so as not to change the API more than necessary. I tagged the associated method, function and parameter as deprecated, though, so that's kind of an API change. Mostly, only people implementing their own datasource controllers (though even then they probably won't have to change anything), or their own low-level indexing functions/mechanisms, should be aware of this change.

An update function is included as well, which removes the cron queue, deletes a now unneeded variable and sets all items that were in the cron queue to the "new" status, so they will get indexed first. It would of course be even better to give them back their old changed value, but that information is gone.
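
In outline, the update function does something like the following (the queue and variable names here are placeholders, not necessarily the ones used in the patch):

function search_api_update_7XXX() {
  // The cron queue is no longer used for indexing.
  DrupalQueue::get('search_api_indexing_queue')->deleteQueue();
  // Remove the now unneeded variable.
  variable_del('search_api_batch_per_cron');
  // Items that were sitting in the queue were tracked with changed = -1;
  // mark them as "new" (changed = 1) so they get indexed first.
  db_update('search_api_item')
    ->fields(array('changed' => 1))
    ->condition('changed', -1)
    ->execute();
}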

Anyways, there it is. Please test, review, discuss!

Tests aren't included in this patch yet, though, I'll probably add those tomorrow.

Status: Needs review » Needs work

The last submitted patch, 2115127-3--remove_cron_queue.patch, failed testing.

Nick_vh’s picture

@@ -571,21 +578,7 @@ abstract class SearchApiAbstractDataSourceController implements SearchApiDataSou
+   * [@inheritdoc}

[ should be {

Looks ok otherwise. I did not verify if this already solves the problem completely so I'll see if I find some time to test it and perhaps write a test.

drunken monkey’s picture

Status: Needs work » Needs review
FileSize
15.05 KB

Oh, another change is that indexing will now be stopped for all indexes on the search server if an exception occurs. So we might want to change the Solr module, too, and throw an exception instead of returning an empty array, if an exception occurred on the Solr side.
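
The changed catch block in the Solr service's indexItems() could then look roughly like this (only the relevant fragment, as a sketch):

  try {
    $this->solr->addDocuments($documents);
    // ...
  }
  catch (Exception $e) {
    // Re-throw as a SearchApiException instead of logging and returning array(),
    // so the Search API stops indexing for every index on this server until the
    // problem is resolved.
    throw new SearchApiException($e->getMessage(), $e->getCode(), $e);
  }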

Regarding the test fail: it helps, of course, if you don't go into an infinite loop if no (enabled, not read-only) indexes are available … Fixed patch attached. Adding new test now.

@ Nick: Thanks, good catch!

Nick_vh’s picture

Would it make sense to also add a read-only switch to the Search API environments? Perhaps as a separate patch, but it would allow people to move to a dev or staging environment and make use of this revised functionality, where the Search API correctly detects that something is not indexing as designed and waits. This might also be useful if engineers are working on Solr or any other backend and expect updates not to work.

Having this addition could also enable easier tests, as you could then cover a couple of scenarios (a rough sketch of such a stub service follows the list):

1) Try to index against a stub solr/search service that always replies 200, and see if the tracking table progresses.
2) Try to index against a stub solr/search service that replies with something other than 200, perhaps 502 or 404, and see if the tracking table stays put.
3) Set the index to read-only against a stub solr/search service that always replies 200, make sure we catch the read-only setting, and see if the tracking table stays put.
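
A minimal sketch of such a stub service (the class and variable names are illustrative, not taken from the patch; the existing search_api_test module plays a similar role):

class SearchApiStubService extends SearchApiAbstractService {

  public function indexItems(SearchApiIndex $index, array $items) {
    // Simulate a backend that rejects updates (e.g., a read-only Solr core).
    if (variable_get('search_api_stub_fail_indexing', FALSE)) {
      throw new SearchApiException('Stub backend is read-only.');
    }
    // Otherwise pretend everything was indexed successfully.
    return array_keys($items);
  }

  // deleteItems(), search(), etc. would need stub implementations as well.
}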

Nick_vh’s picture

In terms of solr exceptions, I'd like you to look at http://drupalcode.org/project/apachesolr.git/commitdiff/98003725104e90b8....

It throws more useful error responses, so you can actually see what went wrong in Solr. But that's something for a separate issue.

drunken monkey’s picture

There is already a read-only mode for indexes, see the index's "Edit" tab. There isn't one for the whole server, though. Also, you couldn't use our current read-only mode for quickly disabling updates when migrating the server or the like, as it's planned as a more permanent setting: while in read-only mode, tracking of new/changed items is disabled, and tracking (and, therefore, indexing) will start anew once you leave read-only mode. It's more intended for reading from a Solr server which gets filled from elsewhere.

Your suggested tests would make sense in any case, though, for the Solr module. We could of course add those.

The code you linked to also looks very interesting, thanks! Might really be a nice addition to the Solr module.

I'm currently working on the test, it's just a few fails from passing …

Nick_vh’s picture

Hmm, so you are saying that an index in read-only mode will not track updated items, and that as soon as read-only mode is disabled it will not index the items that were updated before the switch? That seems like bad behaviour, as you would expect it to start indexing from where it left off (timestamp-wise).

I would have to take a closer look at the tracking table to see how it is handled.

drunken monkey’s picture

No, I don't think you mean the right thing. After you disable read-only mode, all items will be indexed again, no matter whether they were updated while the index was in read-only mode or not. That's because, as said, that mode is intended as a permanent setting, and in that case we of course don't want to waste space by tracking items that will never get indexed anyways.

Attached is the patch with tests, as well as a patch which contains only the tests to verify that they currently fail. (The latter also serves as an interdiff to #6, as I didn't have to change any of the non-test code.)

Status: Needs review » Needs work

The last submitted patch, 2115127-11--remove_cron_queue--tests_only.patch, failed testing.

Nick_vh’s picture

I do think we need to keep this in mind when thinking about D8, so I'll leave it like this for this discussion, but I do have some remarks on the statement.

After you disable read-only mode, all items will be indexed again, no matter whether they were updated while the index was in read-only mode or not.

Can you imagine this happening on a site with thousands of nodes, where perhaps only 10 items were updated? The read-only mode should not prevent the Search API from tracking entities, as tracking can be done in a single tracking table where each index keeps its own pointer to the entity it last indexed. But this is a much broader topic than just this issue, so let's focus on the issue at hand. Thanks for the work so far!

drunken monkey’s picture

Status: Needs work » Needs review
FileSize
11.16 KB
26.78 KB

Yes, I've had my doubts about this decision as well, but it's hard to change now, as it would be hard, if not impossible, to implement an update function for it. But we can definitely discuss it for D8.

Regarding the failing tests, I'm not entirely sure what happened. My best guess would be that the new items are getting touched/inserted too fast and therefore share the same changed timestamp – messing up the order. Patch attached that would fix this, otherwise more debugging would be necessary.

Status: Needs review » Needs work

The last submitted patch, 2115127-14--remove_cron_queue--tests_only.patch, failed testing.

drunken monkey’s picture

Status: Needs work » Needs review

I probably should've added the patch files the other way round …

Nick_vh’s picture

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+  protected function checkIndexingOrder() {

Needs some documentation blocks

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+    $this->assertText('result count = 0', 'Indexed items were successfully deleted from the server.');

Couldn't we check this with a function instead of checking the UI?

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+    $this->assertText('results = ()', 'Indexed items were successfully deleted from the server.');

ditto

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+    sleep(1);

sleep slows down the test. I wonder if we can avoid that and insert fake timestamps if that is what is needed?

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+    $this->drupalGet("admin/config/search/search_api/index/{$this->index_id}/status");

I like UI tests, but again I wonder if they really belong here? These values can change if UI work is done, and then your tests break. At most, this should be a function getting the data and comparing it. If you want to do UI tests, we should do so separately?

@@ -344,6 +324,95 @@ class SearchApiWebTest extends DrupalWebTestCase {
+    variable_set('search_api_test_indexing_break', 8);

Using a variable to fake-stop the indexing seems a little dodgy, but I don't see a different, more efficient way. I'm just wondering whether this emulates the non-200 error code in the same way, as I have a hard time understanding the $mod = variable logic. Would you be able to add some more documentation there, in a way that makes the reader understand that this is needed to test for breakages, etc.?

@@ -250,25 +263,23 @@ class SearchApiTestService extends SearchApiAbstractService {
+    asort($this->options['indexes'][$index->machine_name]);

Reasoning of why this changed?

Nick_vh’s picture

Also, with the sleep(1) thing, the query getting the items shouldn't care about the timestamps all being equal. Most likely you want to order by entity ID and by bundle, and then keep track of your last indexed item so you can continue from there. If you only keep track of the last indexed timestamp you'll lose quite a few items.

So probably the question is: is there an indexing flaw? Do things break horribly if items all have the same timestamp?

drunken monkey’s picture

Regarding the UI tests: You are right, that is kind of a relic from way back when I started the module and didn't know you could do that (i.e., access the test database directly) from within the test case. I therefore implemented these helper functions in the test module and then tested via the UI. I recently overhauled this for the database backend, but the code here remained.
I now changed it to do the searches and mark the items as indexed directly in the test cases. However, some parts should in my opinion still be tested via the UI, as otherwise that part of the functionality remains untested.
Of course, if the UI gets updated the tests will have to be adapted, but that's the case anyway as long as you test the UI, and no problem at all in my opinion. However, I guess we should encapsulate checking the index status in its own method, so it won't have to be changed all over the place. (Especially important since I'll have to do exactly that for #2100199: Consider merging view, edit, settings.)

sleep slows down the test. I wonder if we can avoid that and insert fake timestamps if that is what is needed?

As with circumventing the UI, inserting the correct timestamps directly would essentially only test half of the system, so I think we have to do it this way. The only other way would be to switch to using microtime() instead of REQUEST_TIME for the timestamps, and I don't think that's warranted just to make tests easier.
Also, I don't think those seven seconds will be a significant portion of the runtime of the test cases. However, since the timestamp of the inserts shouldn't matter anyways (and it's probably safe to assume as much about the system), we can at least reduce this to three seconds.

Using a variable to fake-stop the indexing seems a little dodgy, but I don't see a different, more efficient way. I'm just wondering whether this emulates the non-200 error code in the same way, as I have a hard time understanding the $mod = variable logic. Would you be able to add some more documentation there, in a way that makes the reader understand that this is needed to test for breakages, etc.?

I now expanded the comments there to hopefully be easier to understand. And I agree, it's not particularly clean or beautiful, but it serves its purpose and I can't readily think of a better option.

The non-200 error code is of course something different, as it's completely specific to Solr. We cannot check here whether this is handled correctly by the Solr module. However, that part was never the problem; we have handled non-200 error codes in the Solr module correctly from day one. So I don't think testing that is necessary, or at least certainly not as part of this issue.
Using the variable exactly emulates the behavior the Solr service class displays when a non-200 error code is received from Solr, as far as the Search API is concerned – so with this code, we are testing exactly what you wanted to test.

Reasoning of why this changed?

It was simply wrong before, but only the recent test changes made that apparent.

Also, with the sleep(1) thing, the query getting the items shouldn't care about the timestamps all being equal. Most likely you want to order by entity ID and by bundle, and then keep track of your last indexed item so you can continue from there. If you only keep track of the last indexed timestamp you'll lose quite a few items.

So probably the question is: is there an indexing flaw? Do things break horribly if items all have the same timestamp?

Not at all, no, you misunderstand the concept of our tracking system. We save the timestamp of the last change, or 1 for new items, in the tracking table's changed column, then we sort by that when retrieving the items to index. For items that have been successfully indexed we set the column to 0 (and of course filter by changed > 0 when retrieving the items to index). Previously we also had -1 for items in the cron queue, but that's of course gone with this patch.
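
In code, retrieving the next items to index boils down to a query along the following lines (a simplified sketch of the 7.x tracking query; treat the exact table and column names as my best recollection):

// Get the IDs of the next $limit items to index for $index: "new" items
// (changed = 1) first, then older changes before newer ones.
$ids = db_select('search_api_item', 'i')
  ->fields('i', array('item_id'))
  ->condition('index_id', $index->id)
  ->condition('changed', 0, '>')
  ->orderBy('changed', 'ASC')
  ->range(0, $limit)
  ->execute()
  ->fetchCol();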

The only effect of items having the same timestamp, therefore, is that the order they (and only those with that timestamp) are indexed in isn't guaranteed. Which usually won't matter at all (if something was changed half a second earlier than another thing, no-one will notice whether it gets indexed later or not), of course, but since we are expressly testing the indexing order here, we have to take that into account.

Revised patch attached. The tests pass for me, I hope this is the same for the test bot.

drunken monkey’s picture

Issue summary: View changes
Status: Needs review » Fixed

Committed.
Thanks again for your help and reviews, Nick!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

fago’s picture

Hm, I'm not sure exactly why indexing has been moved into the cron run itself? I see that as potentially problematic, as long cron runs could result in cron locking issues, whereas multiple queue items can be processed in parallel.