Project: Apache Solr Search Integration
The query used in apachesolr_index_get_entities_to_index orders only on aie.entity_id, and filters the result set on the changed timestamp of the last item that was indexed. This works fine during normal operation, but it can break horribly when someone clicks the "Queue all content for reindexing" button if there is no strong correlation between changed times and entity_ids.
Simplified example where this would break (the real production environment where this broke uses Apache Solr Commerce, with Migrate regularly updating all the commerce products on the site): let's say you're only indexing nodes, you reduced the number of items indexed to 5 per cron run, your nodes 1 to 5 are basic pages that get semi-regularly updated, and you are also indexing article nodes that have higher entity_ids and never get updated. The first run will index the 5 page nodes and store the changed timestamp of node 5 as the $last_changed timestamp… which means that on the next cron run, all the article nodes that were written before the last change of node 5 are considered indexed, even though they've never been sent to Solr.
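The example above can be replayed with a small simulation. This is only a sketch in Python, not the module's PHP code: it assumes the selection boils down to "changed > last_changed, ordered by entity_id, limited to the batch size", and the function name select_batch and the timestamps are made up for illustration.

```python
def select_batch(rows, last_changed, limit):
    """Simplified model of the selection: filter on changed, order by entity_id only."""
    candidates = [r for r in rows if r["changed"] > last_changed]
    candidates.sort(key=lambda r: r["entity_id"])
    return candidates[:limit]

# Pages 1-5 were updated recently (changed 100..104).
rows = [{"entity_id": i, "changed": 99 + i} for i in range(1, 6)]
# Articles 6-10 are old and never updated (changed 10..14).
rows += [{"entity_id": i, "changed": 4 + i} for i in range(6, 11)]

# "Queue all content for reindexing" resets the stored position.
last_changed = 0

# First cron run: the 5 page nodes are picked because they have the lowest ids.
first = select_batch(rows, last_changed, limit=5)
first_ids = [r["entity_id"] for r in first]
last_changed = first[-1]["changed"]  # changed timestamp of node 5

# Second cron run: every article is older than node 5, so nothing is selected.
second_ids = [r["entity_id"] for r in select_batch(rows, last_changed, limit=5)]

print(first_ids, second_ids)
```

Under these assumptions the second run selects nothing, so articles 6 to 10 are skipped without ever having been sent to Solr.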
If there are more *updated* items in the Apache Solr indexing queue (sharing the same timestamp) than one cron run will index, the "index all queued content" option will only index as many items as a single cron run would.
How to reproduce:
- You have indexed 100 documents from Drupal.
- Your Apachesolr settings say you should index 50 items per cron run.
- Force an update of the apache solr index by setting the apachesolr_index_entities_node.changed column beyond the last update for 90 of your already indexed items.
- The Apachesolr status page will now say 90 items remain for indexing.
- If you attempt to "index all queued content", only 50 items are actually sent to Solr.
- The other 40 items will never be reindexed.
Why does this happen?
The update algorithm for selecting items only uses the last entity_id or the last changed date (apachesolr_index_get_entities_to_index). After the equivalent of one cron run, the last changed date is set to the date of the last indexed item (apachesolr_index_entities). If several entries share the same timestamp (not at all unthinkable in big custom bulk operations), you risk not getting your data indexed, and there are no error messages to tell the story.
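The reproduction steps can be modelled the same way. Again a Python sketch under assumptions rather than the module's code: it assumes a strict "changed > last_changed" filter and that the stored position becomes the timestamp of the last item in the batch. The name run_cron and the timestamp values are hypothetical.

```python
def run_cron(rows, last_changed, limit):
    """One simulated cron run: returns the indexed ids and the new stored position."""
    candidates = sorted(
        (r for r in rows if r["changed"] > last_changed),
        key=lambda r: r["entity_id"],
    )
    batch = candidates[:limit]
    if batch:
        # The position moves to the timestamp of the last indexed item.
        last_changed = batch[-1]["changed"]
    return [r["entity_id"] for r in batch], last_changed

# 100 indexed documents; 90 of them are bumped to the same new timestamp.
rows = [{"entity_id": i, "changed": 2000} for i in range(1, 91)]
rows += [{"entity_id": i, "changed": 1000} for i in range(91, 101)]
last_changed = 1500  # position after the previous full index

run1, last_changed = run_cron(rows, last_changed, limit=50)
run2, last_changed = run_cron(rows, last_changed, limit=50)

# Items that were bumped but never made it into a batch.
stranded = [r["entity_id"] for r in rows if r["changed"] == 2000 and r["entity_id"] not in run1]

print(len(run1), len(run2), len(stranded))
```

In this model the first run indexes 50 items and moves the position to the shared timestamp, so the remaining 40 no longer satisfy the filter.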
I propose adding a dirty-bit column to the apachesolr_index_entities_node table, named "pending". If it is set to 1, the row needs to be reindexed; if it is set to 0, the row is left alone. The function apachesolr_index_entities could then run a bulk db_update on all rows that were successfully indexed, setting the dirty bit to 0.
You lose efficiency with the db_update, but gain efficiency in the function that selects items to reindex.
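To make the proposal concrete, here is a rough sketch of the dirty-bit idea using Python's sqlite3 module rather than Drupal's database API. The table name index_entities and the function index_batch are stand-ins I made up; the real table would be apachesolr_index_entities_node and the real work would happen in apachesolr_index_entities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_entities ("
    "entity_id INTEGER PRIMARY KEY, "
    "changed INTEGER, "
    "pending INTEGER NOT NULL DEFAULT 1)"
)
# 90 rows sharing one timestamp: the worst case for timestamp-based selection.
conn.executemany(
    "INSERT INTO index_entities (entity_id, changed) VALUES (?, ?)",
    [(i, 2000) for i in range(1, 91)],
)

def index_batch(conn, limit):
    """Select pending rows, pretend to index them, then clear their flag in bulk."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT entity_id FROM index_entities "
            "WHERE pending = 1 ORDER BY entity_id LIMIT ?",
            (limit,),
        )
    ]
    if ids:
        # Sending to Solr would happen here; only successfully indexed ids
        # should have their flag cleared.
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE index_entities SET pending = 0 WHERE entity_id IN ({placeholders})",
            ids,
        )
    return ids

batch_sizes = []
while True:
    batch = index_batch(conn, 50)
    if not batch:
        break
    batch_sizes.append(len(batch))

remaining = conn.execute(
    "SELECT COUNT(*) FROM index_entities WHERE pending = 1"
).fetchone()[0]
print(batch_sizes, remaining)
```

Because selection depends only on the flag, shared timestamps and the ordering of entity_ids no longer matter in this sketch: every row stays selectable until it has actually been indexed.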