Hi, I ran into a situation where I had manually corrupted the {node} and {apachesolr_search_node} tables. Quite a lot of nodes had been deleted, but Solr had no way of knowing which ones anymore.
Normally you would "Delete & rebuild index". However, in my extreme case, this would take 6-10 hours of indexing time, my site would be down all day, and I would have no assurance that the problem would be fixed. So rebuilding the index is not an option anymore. I've come this far and can't afford to go back and waste days re-indexing Solr.
So I needed a better solution. What I came up with is this:
- a Batch API script, "Reverse Integrity Check"
- it queries all nids in Solr, in batches
- it matches these nids (from Solr) against {apachesolr_search_node} (from MySQL)
- whichever nid is missing from the table is deleted from Solr with apachesolr_delete_node_from_index($node);
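The steps above boil down to a batched set difference. Here is a rough sketch of that logic in Python rather than the module's PHP, with the Solr query and the {apachesolr_search_node} lookup replaced by plain in-memory values. The function names `batched` and `find_orphaned_nids` are made up for illustration and are not part of the attached module.

```python
from typing import Iterator, List, Set


def batched(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_orphaned_nids(
    solr_nids: List[int],
    tracked_nids: Set[int],
    batch_size: int = 500,
) -> List[int]:
    """Return nids that are indexed in Solr but unknown to the database.

    solr_nids    -- stand-in for the nids returned by querying Solr
    tracked_nids -- stand-in for the nids in {apachesolr_search_node}
    """
    orphans: List[int] = []
    for batch in batched(solr_nids, batch_size):
        # Only nids missing on the database side are collected; in the
        # real module these would then be deleted from the Solr index.
        orphans.extend(nid for nid in batch if nid not in tracked_nids)
    return orphans
```

For example, `find_orphaned_nids([1, 2, 3, 4, 5], {1, 3, 5}, batch_size=2)` walks three batches and collects nids 2 and 4 as orphans. Nothing else in the index is touched, which is why this is so much cheaper than a full rebuild.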
I haven't written a full patch for Apache Solr, but I'm attaching the code as a small custom module. The module has no interface yet. You should call the function apachesolr_integrity_check(); manually to start the batching process.
Results:
- on a huge Solr index, I was able to track down the 10K or so "orphaned" nodes
- the speed is very good: we only query Solr for nids and we only delete orphaned nodes
Future:
We might integrate this with the ApacheSolr project? As a fourth option on admin/settings/apachesolr/index? I would be honoured. The module works as intended (I ran it again today), but the Drupal messages aren't perfect yet. It still reports "results=0" on finish, which is wrong.
Module / code attached.
| Comment | File | Size | Author |
|---|---|---|---|
| #7 | apachesolr_integrity-D7.tgz | 2.09 KB | j0rd |
| | apachesolr_integrity.zip | 2.35 KB | Anonymous (not verified) |
Comments
Comment #1
jpmckinney commented
When you delete nodes the "Drupal way", they are marked for removal from Solr. Keeping this around as a support request in case someone else has deleted nodes (maybe through direct SQL queries, or with the apachesolr module disabled?) and needs a way to repair the index.
Comment #3
crash-land commented
I realize that not deleting things the Drupal way caused the problem. I've got thousands of such orphaned nodes after using SQL to clean up another mess. This looks like it would be a very nice help, but I don't understand precisely what is meant by "run the function manually"... I tried putting some PHP that calls it in a node and then viewing the node, and it sort of worked: it displayed the batch screen and did a single batch before exiting. Maybe I just need some coffee. Can anybody provide any guidance?
Comment #4
j0rd commented
@jpmckinney "When you delete nodes the "Drupal way", they are marked for removal from Solr."
And if you don't, your site is left in an inconsistent state forever.
I've run into this problem myself. I went through and deleted 8 or so nodes and, for whatever reason, they didn't get deleted from Solr. I personally have no idea why. I just pressed the standard "node delete" button on the edit page. Now my Solr directory is busted, and without something like the feature suggested / written by the author, your site is broken forever unless you delete the index. Deleting and rebuilding the Solr index is not a solution for users who run sites with traffic, which I assume is probably a lot of them, since they've gone through the trouble of setting up Solr.
If you mess up Solr indexing on a saved node, it's pretty easy to fix. Re-save the node and it'll be fixed.
If you do this on a large portion of nodes, it's not the worst, since you can rebuild the index without deleting it.
The problem is, when you delete a node and for whatever reason it's not removed from Solr... you're up the creek without a paddle. There is currently no easy fix without assassinating your Solr index, which means downtime or a reduced feature set for quite a while.
When my site goes live, there's no way I'm going to be able to delete & rebuild the entire index. Instead I need something like the feature requested.
IMHO, since this fixes a fairly common fsck up, I think it should be added to the mainline branch, listed next to the standard "Delete & Re-index" and "Rebuild Index" settings as "Remove Orphaned Nodes from Index" or something.
+1 for this.
Comment #5
heacu commented
I agree that something like this is a critical feature. However, we do need to tread carefully and consider the multitude of ways that a Solr index may back a Drupal site. For example, if your index includes documents that are not generated by Drupal (i.e. additional documents that interact with Drupal, don't require the overhead of Drupal, or whatever), then you have to be careful not to delete those. In fact, more generally, one wonders whether "Delete index" should really delete the whole index, or if there should be another option which deletes only the Drupal portion of the index.
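One way to stay careful here, sketched in Python with in-memory data: only treat a document as a deletion candidate if it carries markers saying this Drupal site produced it. The field names `site_hash`, `entity_type` and `nid`, and the function `deletion_candidates`, are assumptions for illustration only and may not match the module's actual schema.

```python
from typing import Dict, List, Set, Union

Doc = Dict[str, Union[str, int]]


def deletion_candidates(
    docs: List[Doc],
    site_hash: str,
    known_nids: Set[int],
) -> List[str]:
    """Return ids of orphaned documents that this site itself indexed.

    Documents from other sources (no matching `site_hash`, or not a
    node) are never returned, so they cannot be deleted by mistake.
    All field names here are hypothetical.
    """
    candidates: List[str] = []
    for doc in docs:
        if doc.get("site_hash") != site_hash:
            continue  # indexed by something other than this site
        if doc.get("entity_type") != "node":
            continue  # only nodes are checked against the node table
        if doc.get("nid") not in known_nids:
            candidates.append(str(doc["id"]))
    return candidates
```

With a guard like this, a document pushed into the same index by an external tool would be skipped even if its id happened to look like an orphaned node.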
Comment #6
j0rd commented
@heacu could you not create another index for non-Drupal documents? I assume this would be the way to go for something like that.
If deleting the index in the admin currently deletes the entire index, then I think your suggestion covers a rather rare edge case that isn't currently supported anyway.
Comment #7
j0rd commented
Here's my D7 port. It works fine for me. It will only prune nodes from the default index; that is currently hard-coded. Use at your own risk.
I've also added a button to the admin/config/search/apachesolr page where you can run this batch script. The button is called "Delete orphaned nodes".
I'm also using Solr4, which isn't currently deleting any nodes from the index due to a bug, so if you're using Solr4 you'll also have to follow along with this issue:
#1874420: Solr4 Entites Not Being Removed with deleteByQuery