Download & Extend

Mass re-indexation can miss (a lot of) content

Project:Apache Solr Search Integration
Version:7.x-1.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs review

Issue Summary

The query used in apachesolr_index_get_entities_to_index is ordering only on aie.entity_id, and filters the result set based on the changed timestamp of the last item that was indexed. This works fine during normal operations, but it can break horribly when someone click on the "Queue all content for reindexing" button, if there's not a strong correlation between changed times and entity_ids.

Simplified example where this would break (the real production environment where this broke is using Apache Solr Commerce, and Migrate to regularly update all the commerce products on the site): let's say your only indexing nodes, you reduced the number of items indexed to 5 per cron, your nodes 1 to 5 are basic pages that get semi-regularly updated, and you are also indexing article nodes that have higher entity_ids and never get any update. The first run will index the 5 pages nodes, and put the changed timestamp of node 5 as in the $last_changed timestamp… Which means that on the next cron run, all the article nodes that were written before the last change of node 5 are now considered indexed, even though they've never been sent to Solr.

(from remimikalsen)

Symptom:

If there are more *updated* items in the apache solr indexing queue (sharing the same timetamp) than will be indexed by one cron run, the "index all queued content" option will only index the number of items that will be indexed by one cron run.

How to reproduce:

- You have indexed 100 documents from Drupal.
- Your Apachesolr settings say you should index 50 items per cron run.
- Force an update of the apache solr index by setting the apachesolr_index_entities_node.changed column beyond the last update for 90 of your already indexed items.
- The Apachesolr status page will now say 90 items remain for indexing.
- If you attempt to "index all queued content", only 50 items are actually sent to Solr.
- The other 40 items will never be reindexed.

Why does this happen?

The update algorithm for selecting items only uses last entity_id or last changed date (apachesolr_index_get_entities_to_index). After the equivalent of one cron run, the last changed date is changed to the date of the last indexed item (apachesolr_index_entities). If several entries share the same timestamp (not at all unthinkable in big custom bulk operations) you risk not getting your data indexed and you have no error messages to tell the story.

Proposed solution

I propose adding a dirty-bit column to the apachesolr_index_entities_node table, named "pending". If set to 1, update, if set to 0, leave alone. The function apachesolr_index_entities could then run a bulk db_update on all rows that were successfully indexed, setting the dirty bit to 0.
You lose efficiency with the db_update, but gain efficiency in the function that selects items to reindex.

Comments

#1

Status:active» needs review

Attaching a patch that should solve this.

AttachmentSizeStatusTest resultOperations
0001-Bug-1828014-by-DeFr-apachesolr_index_get_entities_to.patch844 bytesIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#2

I should test it out, will do that now

#3

I'm a little confused as how an orderBy would help here.

I tried to replicate this but I was not able to get the missing nodes logic.

Could you get us a to z guide on how to replicate?

Also, in any case, a sort on the changed date would be welcome, just to get the oldest ones first.

#4

I'm currently checking that the problem still occurs with the latest dev release, but from visual inspection, the orderBy seems to still be needed.

The problem is the combination of apachesolr_index_entities / apachesiolr_index_get_entities_to_index. After a reset, the index position is at 0, so the condition based on changed is always TRUE, and apachesolr_index_get_entities_to_index always return the first X entities. apachesolr_index_entities then tries to find the maximum "changed" value of those entities, and set that as the limit for the next run.

edited to clarify: With the additional orderBy, the results are sorted by changed first, and then entity_id ; that means that $max_changed is initialized to a saner value than when picking the changed value of the first X entities sorted by entity_id.

#5

Status:needs review» reviewed & tested by the community

Well, it does make sense. I'm going to commit this

#6

Removing the sort in the count query, no need there

AttachmentSizeStatusTest resultOperations
1828014-6.patch833 bytesIdleFAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 1828014-6.patch. Unable to apply patch. See the log in the details link for more information.View details | Re-test

#7

Version:7.x-1.x-dev» 6.x-3.x-dev
Status:reviewed & tested by the community» patch (to be ported)

needs backport to 6.x-3.x

#8

Priority:major» normal

Just found out that another fix went in #1476228: Items to index determined not correct., changing the '=' to '<=', which explains why the couldn't be reproduced with the latest dev. Changing the ordering still makes sense, but isn't major at all.Sorry for the noise :-/

#9

Version:6.x-3.x-dev» 7.x-1.x-dev
Status:patch (to be ported)» needs work

While this will help reduce the amount of content that might be missed, it can still miss content that is updated at the exactly the same time.

For example, lets take this database and assume that we index three documents at a time:

id | timestamp
11 | 100 (in first batch)
12 | 100 (in first batch)
13 | 250 (in first batch)
14 | 250
15 | 250
16 | 250
17 | 250
18 | 250
19 | 100
20 | 100

after first batch
$last_changed = 100
$last_entity_id = 13

You can create something like this by creating ten entities at the same time (timestamp 100), running cron once (which indexes documents 11-13) and then updating six documents (13-18, new timestamp 250). Note that these operations all happened at exactly the same second, which might be due to a busy site or a batch import.

You now have a Solr index that includes the latest version of 11-12, an outdated version of 13 and no version of 14-20. Ideally we should be indexing 14-20 first, followed by 13. This patch gets most of the way there, although it will index 13 earlier than is ideal.

id | timestamp
19 | 100 (in second batch)
20 | 100 (in second batch)
13 | 250 (in second batch)
14 | 250
15 | 250
16 | 250
17 | 250
18 | 250

after second batch
$last_changed = 250
$last_entity_id = 20

The bigger problem is that after the second batch, both $last_changed and $last_entity_id are set to values that will miss content.

$last_changed will be set to 250, so will miss any content that was updated at exactly the same time as the last indexed item (13).

14-18 are now scheduled to be indexed after 19 and 20, so will be missed by the $last_entity_id check, and were also updated at the same time as 13 so will be missed by the $last_changed check.

I can't see any way of keeping track of what has and hasn't been indexed using only the $last_changed and $last_entity_id variables - there will always be ways that entities can end up hidden behind other entities.

I see two potential solutions:
1) Add a 'last_indexed' column to the table. We then choose entities to index where last_indexed is null or last_indexed < last_updated (and choose last_indexed is null entities first). This means changing a database row for every entity we index, which could have performance problems.
2) Replace the table with an "entities to index" queue table with an auto incrementing id. This id can then be tracked as a Drupal variable and the indexed entities periodically deleted from the table. This avoids regular database writes, but the queue id could easily overflow as 32 bit integer.

#10

After some conversation in IRC we found out that this is indeed a potential problem. I'd like to see if we can add the milliseconds to the table as an update to the module. Or even a contrib module perhaps? I'm not sure how easy that would be given that the timestamp is an integer. Perhaps we could add a new column to the index that saves the milliseconds and you have a third sort on that number?

Or... We change the timestamp to a float, and have the regular unix timestamp + the milliseconds behind the comma...

#11

Ignoring timestamps that are exactly the same for now, I think the more complex problem can be summed up as us trying to maintain two queues (based on last updated and entity id) with a single table, and either queue might contain items from the other queue. If the first queue picks up an item that is also in the second queue, then everything behind the item in the second queue gets discarded.

There are two potential solutions to this problem:
1) Separate out the two queues. I think you might be able to do this by simply adding an entity_id check to the first OR. So:
db_or()->condition('aie.changed', $last_changed, '>')
->condition(db_and()
->condition('aie.changed', $last_changed, '<=')
->condition('aie.entity_id', $last_entity_id, '>')
)

Becomes:
// Entities that we have already indexed but have since been updated
db_or()->condition(db_and()
->condition('aie.changed', $last_changed, '>')
->condition('aie.entity_id', $last_entity_id, '<='))
// Entities that we have never indexed
->condition(db_and()
->condition('aie.changed', $last_changed, '<=')
->condition('aie.entity_id', $last_entity_id, '>')
)

I'm wary of this solution though, as I can easily imagine there being another scenario where the indexing remains incorrect.

2) Switch to using $last_changed only. This is much simpler logic, and only has two small disadvantages that I can see:
a) if you edit an unindexed entity then it will be put to the back of the indexing queue. This is only a problem when you've got a lot waiting to be indexed - if that is the case I'd be more concerned about speeding up indexing.
b) it is an api change (if not the function calls, then at least the values that come back). This could actually be seen as an advantage, as any custom code that relies on the current behavior is likely to be broken too.

If we switch to using $last_changed only, then we need a way of differentiating items that have exactly the same timestamp. Increasing to microtime precision would help there, but it would still be possible for two entities to have exactly the same time, either due to very fast updates or, more likely, deliberately setting "changed" on a block of entities that you want to reindex using a single SQL query. Something that is guaranteed to be unique between two entities with the same changed timestamp is the entity_id, so I propose repurposing $last_entity_id as a tiebreaker for multiple entities that have the same changed timestamp.

db_or()
// Entities that have been changed more recently than the most up to date indexed entity
->condition('aie.changed', $last_changed, '>')
// Tie breaker for entities that was changed at exactly the same time as the last indexed entity
->condition(db_and()
->condition('aie.changed', $last_changed, '=')
->condition('aie.entity_id', $last_entity_id, '>'))
)
// It is important that everything is indexed by changed date then entity id
// for the conditions above to work correctly
->orderBy('aie.changed', 'ASC')
->orderBy('aie.entity_id', 'ASC');;

#12

Status:needs work» needs review

This should implement the change described in #11 option 2

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-12.patch6.16 KBIdleFAILED: [[SimpleTest]]: [MySQL] 115 pass(es), 6 fail(s), and 0 exception(s).View details | Re-test

#13

Status:needs review» needs work

The last submitted patch, apachesolr-18281014-indexing_order-12.patch, failed testing.

#14

Status:needs work» needs review

Fix typo in function name. Hopefully that was causing the other failures too, as otherwise I'm not sure how to debug those.

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-14.patch6.16 KBIdleFAILED: [[SimpleTest]]: [MySQL] 513 pass(es), 0 fail(s), and 13 exception(s).View details | Re-test

#15

Status:needs review» needs work

The last submitted patch, apachesolr-18281014-indexing_order-14.patch, failed testing.

#16

Status:needs work» needs review

Forgot to make $bundles available in my shared function.

I've decided to get it again inside the function, as I think the code that returns early in the functions if there are no bundles is premature optimisation and would be better removed (although I've left it for now to avoid complicating the patch).

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-16.patch6.04 KBIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#17

The patch on #16 has worked in my testing

#18

I didn't realise apachesolr_index_send_to_solr() returned true when there was nothing to index, which resulted in last_index_position being cleared.

This version only updates progress if something has been indexed, and logs an error if either of the varialbes are missing.

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-18.patch6.53 KBIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#19

OK, that patch got stuck if an entire batch of documents were inspected and excluded by apachesolr_{entity_type_}exclude, as it would treat them as if there had been an error with the solr server.

This version skips the call to apachesolr_set_last_index_position (and other code) based on whether the module has any documents to index.

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-19.patch6.9 KBIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#20

We have been using this in production for about 10 days now and it seems to be working well

#21

Let's try to drive this home, because yes I still that what's in the git repository now is still kinda broken.

First of all, I think I must apologize a bit for the patch in #1: it was developed on a production site that was running a rather old beta, pre-#1476228: Items to index determined not correct., and I didn't test it as much as I should have when forward porting it to HEAD. This is going to matter latter, because this patch is effectively a backout of 65f3ce, 25d53bc and a few other commits)

From a general point of view, I think option #11.2 makes total sense. The entities needs to be retrieved in a definitive order if we want to make sure that all of them are indexed only once. That almost happened in beta16, which was only lacking the orderBy on entity_id to get the order straight, but the current dev need a bit more love.

Moving on to the patch review:

- The change to apachesolr_index_entities makes sense. We're indexing entities in a well defined order, which means that our new bondaries are necessary at the end of the newly indexed documents. The case catched by the watchdog call seems a bit nasty: the last index position isn't updated, so on the next cron run, the same entities will be indexed again, probably leading to the same error condition, and a loop. I don't see anything better than the watchdog call though, and at least the site administrator now has a way to found out about the problem.

- The refactoring makes it slightly harder to track the query changes, but makes the whole thing easier to maintain on the long run. Globally, it's a back out 25d53b + a few comments, which seems fine. Technically, the count query doesn't need the orderBy calls, so they could be left out of the query wrapper, and only added in apachesolr_index_get_entities_to_index. Given how crucial they are to the correct execution of the whole process though, putting them in the query wrapper and letting it handle all the logic of the correct retrieving seems fine.

#22

I've come across the same problem, and pinned it pretty much down to exactly what goes wrong and an easy way of replicating the problem.

Symptom:

If there are more *updated* items in the apache solr indexing queue (sharing the same timetamp) than will be indexed by one cron run, the "index all queued content" option will only index the number of items that will be indexed by one cron run.

How to reproduce:

- You have indexed 100 documents from Drupal.
- Your Apachesolr settings say you should index 50 items per cron run.
- Force an update of the apache solr index by setting the apachesolr_index_entities_node.changed column beyond the last update for 90 of your already indexed items.
- The Apachesolr status page will now say 90 items remain for indexing.
- If you attempt to "index all queued content", only 50 items are actually sent to Solr.
- The other 40 items will never be reindexed.

Why does this happen?

The update algorithm for selecting items only uses last entity_id or last changed date (apachesolr_index_get_entities_to_index). After the equivalent of one cron run, the last changed date is changed to the date of the last indexed item (apachesolr_index_entities). If several entries share the same timestamp (not at all unthinkable in big custom bulk operations) you risk not getting your data indexed and you have no error messages to tell the story.

Proposted solution

I propose adding a dirty-bit column to the apachesolr_index_entities_node table, named "pending". If set to 1, update, if set to 0, leave alone. The function apachesolr_index_entities could then run a bulk db_update on all rows that were successfully indexed, setting the dirty bit to 0.

You lose efficiency with the db_update, but gain efficiency in the function that selects items to reindex.

I have myself implemented a custom solution (1 changed and 3 added lines of code) that doesn't demand changes in the database structure or API. It's much less efficient than the proposed solution and not worthy of publication unless explicitly asked for. Simply put, the hack subtracts 1ms from the last changed timestamp for every indexed document, updates the apache_solr_index_entities_node table for each indexed document, and sets that timestamp as the "last index update" timestamp. Because of the subtracted millisecond, all documents sharing timestamp with the recently indexed documents will still be indexed on next run, and the already indexed documen'ts won't be indexed again. Hacky, dirty, slow - but it works for us until there is a permanent solution.

Remi

#23

@remimikalsen, does the patch posted above fix your problem?

+++ b/apachesolr.index.incundefined
@@ -440,29 +423,7 @@ function apachesolr_index_get_entities_to_index($env_id, $entity_type, $limit) {
+  $query = apachesolr_index_get_entities_to_index_get_query($env_id, $entity_type);

I don't like this function name, can we somehow make this way shorter?
apachesolr_index_get_next_set_query() or so?

Can Defr and Thomas mark this as RTBC once they feel comfortable?

#24

Status:needs review» needs work

#25

Applying the patch fails on 7.x-1.x-dev and 7.x-1.1. Assuming you mean #19.

patch -p1 < apachesolr-18281014-indexing_order-19.patch
patching file apachesolr.index.inc
Hunk #4 FAILED at 128.
Hunk #5 FAILED at 423.
Hunk #6 succeeded at 443 (offset -2 lines).
2 out of 6 hunks FAILED -- saving rejects to file apachesolr.index.inc.rej

Remi

#26

Status:needs work» needs review

This patch should apply to 7.x-1.x-dev (#19 had bit rotted) and renames the function to apachesolr_index_get_next_set_query(). I'm happy with this patch, but obviously can't RTBC.

@remimikalsen I did consider fixing this with a dirty-bit, but discarded that option at the time. I can't remember exactly why though (and why I didn't mention it) - possibly there was a scenario where new content got missed because it never had the dirty bit set? Or maybe it just required bigger changes that this patch. Happy to look into that again though if there is a problem with the solution in the attached patch.

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-26.patch6.88 KBIdleFAILED: [[SimpleTest]]: [MySQL] Unable to apply patch apachesolr-18281014-indexing_order-26.patch. Unable to apply patch. See the log in the details link for more information.View details | Re-test

#27

Status:needs review» needs work

+++ b/apachesolr.index.incundefined
@@ -515,6 +469,46 @@ function apachesolr_index_get_entities_to_index($env_id, $entity_type, $limit) {
+      ->condition('aie.changed', $last_index_position['last_changed'], '>')

Maybe to make it a bit more robust than it was before, should we check if the value $last_index_position['last_changed'] was actually set? Same with last_entity_id

+++ b/apachesolr.index.incundefined
@@ -38,6 +38,9 @@ function apachesolr_index_entities($env_id, $limit) {
+    if (!count($rows)) {

Is count quicker than trying to see if $rows is empty?

+++ b/apachesolr.index.incundefined
@@ -47,20 +50,18 @@ function apachesolr_index_entities($env_id, $limit) {
+          'Failure recording indexing progress. Last entitiy id processed: %entity_id with timestamp %last_changed',

Typo

I'm also wondering how we now cope with entities in the index table that have timestamp 0 or empty and have greater entity id's. This means they were never indexed and the query will now also not select them if I reviewed it properly?

#28

Maybe to make it a bit more robust than it was before, should we check if the value $last_index_position['last_changed'] was actually set? Same with last_entity_id

Maybe, but that probably deserves it's own issue to avoid delaying this one. If that happens at the moment then the most likely scenario is it will do a full reindex (because it will set these after the first batch). That's quite a good failure, and I'm not sure what I'd do if these values weren't set.

Is count quicker than trying to see if $rows is empty?

The performance of both is negligible, but using count gives a hint to the reader that $rows is an array.

I'm also wondering how we now cope with entities in the index table that have timestamp 0 or empty and have greater entity id's. This means they were never indexed and the query will now also not select them if I reviewed it properly?

Good spot - I'll add an extra or.

#29

I applied the new patch, but unfortunately it doesn't seem to work as I'd expected.

When I mark N-items for re-indexing it correctly does the re-indexing, but when the re-indexing is done, the N items are still queued for re-indexing.

R

#30

Sounds like your set_last_index wasn't triggered. Can you put the output of apachesolr_get_last_index_position($env_id, $entity_type) here before indexing and after indexing?

+++ b/apachesolr.index.incundefined
@@ -47,20 +50,18 @@ function apachesolr_index_entities($env_id, $limit) {
+        apachesolr_set_last_index_position($env_id, $entity_type, $last_row->changed, $last_row->entity_id);

I'm also dubious about this, do we want to set the last index position if the current value in the variable is higher than the timestamp of the last entity update?

#31

It seems the problem I encountered wasn't in the patch, but in the dev-version of the Apachesolr module. The patch doesn't work on 7.x-1.1. If someone can provide a ported patch I'd be happy to test it.

#32

Could you please move this to #1900036: node reference with multiple values and remove this information here?

#33

After solving the other bug, this patch apparently worked quite well for me :)

#34

Status:needs work» needs review

I'm also wondering how we now cope with entities in the index table that have timestamp 0 or empty and have greater entity id's. This means they were never indexed and the query will now also not select them if I reviewed it properly?

Changed should never be 0 or empty - it will either be the timestamp that the entity was last updated, or the time when it was last queued for reindexing, whichever is more recent.

#30 I'm also dubious about this, do we want to set the last index position if the current value in the variable is higher than the timestamp of the last entity update?

I'm not sure I understand. $last_row->changed and $last_row->entity_id are effectively a compound key identify the last element indexed, so we'll always need both of them to identify the next element to index.

Other than the typo, I think the patch in #26 is good

#35

#36

Status:needs review» needs work

The last submitted patch, apachesolr-18281014-indexing_order-26.patch, failed testing.

#37

Status:needs work» needs review

Re-roll with the typo fixed

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-37.patch6.75 KBIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#38

This change seems unnecessary:

+    if (!count($rows)) {
+      continue;
+    }

#39

I guess it is. I think I figured either apachesolr_send_to_index would be expensive (but it already optimises for an empty set of documents) or the code would error in some way (it won't).

Any other thoughts?

AttachmentSizeStatusTest resultOperations
apachesolr-18281014-indexing_order-39.patch6.27 KBIdlePASSED: [[SimpleTest]]: [MySQL] 513 pass(es).View details | Re-test

#40

We should have a test case or two to demonstrate the bug's manifestations and insure against regression

nobody click here