Add Batch API support for rebuilding indexes
anarchivist - May 7, 2009 - 15:29
| Project: | Apache Solr Search Integration |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
Description
I'd like to see apachesolr have an index rebuilding process tied to Drupal's Batch API. There are times when it'd be preferable to rebuild the index quickly rather than waiting for cron to process several thousand nodes.

#1
One of my coworkers and I have put together a really, really rough module to handle reindexing using the Batch API. The code definitely needs work, better comments, etc. - please give feedback if you'd like!
#2
#3
Any feedback on this would be greatly appreciated. Ideally, I think it would make sense to add this functionality to apachesolr, but if the developers think otherwise, I'd like to know.
#4
This is not really a priority for inclusion since it would only be relevant for large sites with developers in a big hurry :)
In those cases, they could change settings.php to increase the number of updates per cron run, and run cron once per minute or some such.
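For reference, the settings.php approach suggested here might look like the fragment below. This is a sketch only: the variable name apachesolr_cron_limit is an assumption about what the module reads, and the value 200 is arbitrary, so check both against the installed version before relying on it.

```php
<?php
// In settings.php. Assumption: apachesolr reads its per-cron batch size
// from the 'apachesolr_cron_limit' variable; verify against your version.
$conf['apachesolr_cron_limit'] = 200;
```

Values set through $conf in settings.php override whatever is stored in the variable table, so the admin UI setting would no longer take effect.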
#5
OK, thanks. I understand your point of view but we have had this use case...
I guess this is something that I should have filed as another issue (and will certainly do if it comes up again), but I was running into the problem of having stale information in apachesolr_search_node which prevented a complete reindexing, which is why this code clears out that table before it starts - have you seen this issue before?
#6
If there is stale info there, that's certainly a bug, though it's possible that not all node updates are being caught. In particular, any updates that bypass nodeapi (i.e. don't use node_save()) will be missed.
#7
I've been using this module a bit in a test install. I would have found it useful when migrating our D5 site to D6.
@pwolanin: I think Apache Solr itself is geared to site builders, and they *do* have to run cron.php a zillion times to get nodes in... I'm thinking this has an ample audience (potentially, *everyone* who's installing the module for the first time, or perhaps recently activated node access modules, etc?)
Perhaps it's not 1.x material, but 2.x as a contrib/ module? It could go in the current Reindex admin page; I'd recommend it go through a confirmation screen first.
I wonder what would be the impact on the server of doing this kind of batch processing...?
Thoughts?
PS: A code suggestion: change apachesolr_batch_reindex_settings_form() like so:
function apachesolr_batch_reindex_settings_form() {
  $form = array();
  $form['help'] = array(
    '#type' => 'item',
    '#value' => t('Re-indexing will add all content to the index again (overwriting the index), but existing content in the index will remain searchable. This reindexing process uses the Batch API.'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Batch reindex'),
    '#submit' => array('apachesolr_batch_reindex_submit'),
  );
  return $form;
}
#8
I love this feature and want it to be an option on the main admin page.
#9
As I have time, I'll work on polishing this code.
#10
Here's a start at a patch for integrating it directly into apachesolr. The actual batch processing is broken but the error is probably something minor. It'd be great if someone could take it and finish it up.
#11
It was a headache to figure out, but I solved it: the problem was a missing 'file' element in the $batch array, which is needed when the processing/finishing functions are in a different file than the .module itself.
New patch.
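For anyone who hits the same problem, here is a minimal sketch of a batch definition that includes the 'file' element. It is not taken from the patch: the function names, the example.admin.inc file name, and the module path are all hypothetical, and real code would normally wrap the title in t().

```php
<?php
/**
 * Builds a Batch API definition for a reindex run.
 *
 * Hypothetical sketch: the callback names and include file name are
 * placeholders, not the ones used in the actual patch.
 */
function example_reindex_batch_definition($module_path) {
  return array(
    // Real code would usually pass this string through t().
    'title' => 'Reindexing content',
    'operations' => array(
      // Each entry is array(callback name, array of arguments).
      array('example_reindex_batch_operation', array()),
    ),
    'finished' => 'example_reindex_batch_finished',
    // Needed when the callbacks above live outside the .module file,
    // so the Batch API can load them on each batch request.
    'file' => $module_path . '/example.admin.inc',
  );
}
```

In a form submit handler this would typically be passed to batch_set(), for example batch_set(example_reindex_batch_definition(drupal_get_path('module', 'example'))).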
#12
I just installed the patch from #11 and it's running smoothly.
I'm (re)developing a site that has 67k photos/nodes that have to be indexed. It would have taken forever to run the cron version of the update. Not only that, by using the cron method, it was causing really bad apache loads on our system. By running this version, it has reduced the load slightly.
Thanks for the patch.
I do have one question, and I'm not sure how feasible this is: right now the bulk index always starts from a full "re-index". It would be nice if it could pick up from a partially indexed site. I say this because if our browser cuts out, the bulk index stops.
Or maybe there is a way to run this from cli/drush. If I could run it as a background process on the server:
drush updatesolr &
something like that.
But again, thanks for the updates. If there is anything I can provide, please let me know.
#13
Great stuff! Committed. Thanks anarchivist, janusman.
Open new issues for the suggestions to do batch indexing on the remainder and for adding Drush support.
#14
Great! Glad to see janusman turned this around so quickly. :)
#15
I've applied the patch to the latest tarball for 6.x-1.x-dev, and it seems to be working fine. Would someone be willing to add this to 6.x-1.x-dev?
#16
Here's a patch based on the application of the patch from comment #11 to apachesolr 6.x-1.x-dev in CVS.
#17
I would also very much like to see this patch in 6.x-1.x-dev.
#18
I have been testing this, and I love it.
Whenever I have an issue with Solr (most recently with capitalization), I might need to test schema changes and re-index the existing content, so this feature is very useful for me.
The only minor gripe is that at the end of the process, the message "0 items successfully processed." is displayed as an error. Even when cron is run, "200 items successfully processed." is shown in red. When the number is 200, the wording outweighs the suggestion of an error, but when it is 0, the color and the number together may alarm an administrator.
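One way to avoid the alarming message, sketched under assumptions: I have not traced why the patch reports the result as an error, so this only shows how a finished callback could pick the message text and type from the batch outcome. The function name is hypothetical; 'status' and 'error' are the standard drupal_set_message() types.

```php
<?php
/**
 * Chooses the final message and its type for a finished reindex batch.
 *
 * Hypothetical helper: returns array(message, type) so that a Batch API
 * 'finished' callback can pass both to drupal_set_message().
 */
function example_reindex_finished_message($success, array $results) {
  if (!$success) {
    // Only a batch that really failed is reported as an error.
    return array('The reindexing batch did not complete.', 'error');
  }
  $count = count($results);
  if ($count == 0) {
    // Nothing to do is not a failure, so avoid a red "0 items" message.
    return array('No items needed indexing.', 'status');
  }
  return array($count . ' items successfully processed.', 'status');
}
```

A finished callback with the usual ($success, $results, $operations) signature could call this helper and hand the two returned values to drupal_set_message().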
#19
Comment above is relevant - a "catch up indexing" option instead of a full reindex.
If we are re-indexing all content, quite possibly this should only be directly available in combination with deleting the index?
Maybe the "catch up" should be the only thing added - you can trigger a delete or reindex and then catch up to get the content up right away. The same feature would then work to quickly get recent content in the index, or pick up from a failed batch.
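A rough sketch of what a "catch up" batch operation could look like, assuming the list of node IDs still waiting to be indexed has already been worked out elsewhere (how that list is built, for example from apachesolr_search_node, is left out here). The function name and parameters are hypothetical; the use of $context['sandbox'], $context['results'] and $context['finished'] follows the usual Batch API pattern.

```php
<?php
/**
 * Batch operation that indexes only the pending nodes, a chunk at a time.
 *
 * Hypothetical sketch: $pending_nids is assumed to be the list of node IDs
 * not yet in the index; sending each node to Solr is left as a comment.
 */
function example_catch_up_operation(array $pending_nids, $limit, &$context) {
  if (!isset($context['sandbox']['max'])) {
    // First call: remember how much work there is.
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = count($pending_nids);
  }
  $chunk = array_slice($pending_nids, $context['sandbox']['progress'], $limit);
  foreach ($chunk as $nid) {
    // A real implementation would load the node and send it to Solr here.
    $context['results'][] = $nid;
  }
  $context['sandbox']['progress'] += count($chunk);
  // The Batch API calls the operation again until 'finished' reaches 1.
  if ($context['sandbox']['max'] > 0) {
    $context['finished'] = $context['sandbox']['progress'] / $context['sandbox']['max'];
  }
  else {
    $context['finished'] = 1;
  }
}
```

Because only the pending nodes are processed, the same operation would cover both cases mentioned above: getting recent content into the index quickly and resuming after a failed batch. Passing the whole ID list as an argument keeps the sketch simple, but on a large site it would be better to query for the next chunk inside the operation.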
#20
@pwolanin:
This might make sense and wouldn't necessarily be too difficult to add as part of this patch. If I have a chance over the next week I'll see if I can start hacking on this.
#21
I'd say we should sync DRUPAL-6--1 with the code already available (committed to DRUPAL-6--2) and then work on fixing the mentioned warning and making the "catch up" feature.
#22
I second @robertDouglass's idea.
#23
OK, maybe someone can look at the error and the incorrect reporting that happen at the end of batch cycles.
#24
Here's a rerolled version of the patch in #11 and #16 against DRUPAL-6--1. This hasn't been committed to that tag yet.