Add Batch API support for rebuilding indexes

anarchivist - May 7, 2009 - 15:29
Project: Apache Solr Search Integration
Version: 6.x-2.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: needs work
Description

I'd like to see apachesolr have an index rebuilding process tied to Drupal's Batch API. There are times when it'd be preferable to rebuild the index quickly rather than waiting for cron to process several thousand nodes.

#1

anarchivist - May 14, 2009 - 05:23

One of my coworkers and I have put together a really, really rough module to handle reindexing using the Batch API. The code definitely needs work, better comments, etc. - please give feedback if you'd like!

Attachment                                  Size
apachesolr_batch_reindex-6.x-1.1.tar_.gz    3.84 KB

#2

anarchivist - May 14, 2009 - 05:24
Status: active » needs review

#3

anarchivist - June 10, 2009 - 04:01

Any feedback on this would be greatly appreciated. Ideally, I think it would make sense to add this functionality to apachesolr, but if the developers think otherwise, I'd like to know.

#4

pwolanin - June 10, 2009 - 14:13

This is not really a priority for inclusion since it would only be relevant for large sites with developers in a big hurry :)

In those cases, they could change settings.php to increase the number of updates per cron run, and run cron once per minute or some such.
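As a rough sketch of the settings.php override suggested above: the variable name 'apachesolr_cron_limit' and the value 200 are assumptions here, so check the variable_get() calls in your copy of the module before relying on them.

```php
// Fragment for settings.php (sketch, not taken from the module's documentation).
// ASSUMPTION: the module reads its per-cron-run limit from a variable named
// 'apachesolr_cron_limit'. Verify the name in your version of apachesolr.
$conf['apachesolr_cron_limit'] = 200;
```

Cron itself would then need to be scheduled more often from the server's crontab, which is outside Drupal's control.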

#5

anarchivist - June 10, 2009 - 15:00

OK, thanks. I understand your point of view but we have had this use case...

I guess this is something that I should have filed as another issue (and will certainly do if it comes up again), but I was running into the problem of having stale information in apachesolr_search_node which prevented a complete reindexing, which is why this code clears out that table before it starts - have you seen this issue before?

#6

pwolanin - June 10, 2009 - 16:14

If there is stale info there, that's certainly a bug, though it's possible that not all node updates are being caught. In particular, any updates that bypass nodeapi (i.e. don't use node_save()) will be missed.

#7

janusman - July 22, 2009 - 22:37
Status: needs review » needs work

I've been using this module a bit in a test install. I would have found it useful when migrating our D5 site to D6.

@pwolanin: I think Apache Solr itself is geared to site builders, and they *do* have to run cron.php a zillion times to get nodes in... I'm thinking this has an ample audience (potentially, *everyone* who's installing the module for the first time, or perhaps recently activated node access modules, etc?)

Perhaps it's not 1.x material, but 2.x as a contrib/ module? It could go in the current Reindex admin page; I'd recommend it go through a confirmation screen first.

I wonder what would be the impact on the server of doing this kind of batch processing...?

Thoughts?

PS: A code suggestion: change apachesolr_batch_reindex_settings_form() like so:

function apachesolr_batch_reindex_settings_form() {
  $form = array();
  $form['help'] = array(
    '#type' => 'item',
    '#value' => t('Re-indexing will add all content to the index again (overwriting the index), but existing content in the index will remain searchable. This reindexing process uses the Batch API.'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Batch reindex'),
    '#submit' => array('apachesolr_batch_reindex_submit'),
  );
  return $form;
}

#8

robertDouglass - July 23, 2009 - 09:28
Version: 6.x-1.x-dev » 6.x-2.x-dev

I love this feature and want it to be an option on the main admin page.

#9

anarchivist - July 23, 2009 - 20:03

As I have time, I'll work on polishing this code.

#10

robertDouglass - August 13, 2009 - 13:23

Here's a start at a patch for integrating it directly into apachesolr. The actual batch processing is broken but the error is probably something minor. It'd be great if someone could take it and finish it up.

Attachment     Size
batch.patch    6.96 KB

#11

janusman - August 28, 2009 - 20:23
Status: needs work » needs review

It was a headache to figure out, but I solved it: the problem was a missing 'file' element in the $batch array, which is needed when the processing/finishing functions are in a different file than the .module itself.
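For anyone hitting the same problem, here is a minimal sketch of a batch definition with the 'file' element set. The function, callback, and file names are hypothetical and are not taken from the patch. In a real module the title would be wrapped in t() and the path would come from drupal_get_path().

```php
<?php
/**
 * Builds a Batch API definition whose callbacks live outside the .module file.
 *
 * All names below (example_*, example.admin.inc) are hypothetical.
 */
function example_reindex_batch_definition($module_path) {
  return array(
    'title' => 'Reindexing content',
    // Each operation is array(callback name, array of arguments).
    'operations' => array(
      array('example_reindex_batch_process', array()),
    ),
    'finished' => 'example_reindex_batch_finished',
    // Tells Batch API which file to include on each batch request so the
    // callbacks above are defined when they are called.
    'file' => $module_path . '/example.admin.inc',
  );
}

// Inside Drupal 6 this would be passed to batch_set(), for example:
// batch_set(example_reindex_batch_definition(drupal_get_path('module', 'example')));
$batch = example_reindex_batch_definition('sites/all/modules/example');
```

Batch requests are handled on separate page loads, so only the .module file is loaded automatically. That is why callbacks kept in an include file go missing without this key.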

New patch.

Attachment                    Size
apachesolr-456420-11.patch    6.83 KB

#12

Jason Ruyle - August 29, 2009 - 21:48

I just installed the patch from #11 and it's running smoothly.
I'm (re)developing a site that has 67k photos/nodes that have to be indexed. It would have taken forever to index them through cron. On top of that, the cron method was causing really bad Apache loads on our system. Running this version has reduced the load slightly.

Thanks for the patch.

I do have one question, and I'm not sure how feasible this is: we're doing a bulk index from a "re-index". It would be nice if this could pick up from a partially indexed site. I say this because if our browser cuts out, the bulk index stops.

Or maybe there is a way to run this from cli/drush. If I could run it as a background process on the server:
drush updatesolr &

something like that.

But again, thanks for the updates. If there is anything I can provide, please let me know.

#13

robertDouglass - September 1, 2009 - 09:47
Status: needs review » fixed

Great stuff! Committed. Thanks anarchivist, janusman.

Open new issues for the suggestions to do batch indexing on the remainder and for adding Drush support.

#14

anarchivist - September 3, 2009 - 12:49

Great! Glad to see janusman turned this around so quickly. :)

#15

anarchivist - September 14, 2009 - 21:17
Version: 6.x-2.x-dev » 6.x-1.x-dev
Status: fixed » needs review

I've applied the patch to the latest tarball for 6.x-1.x-dev, and it seems to be working fine. Would someone be willing to add this to 6.x-1.x-dev?

#16

anarchivist - September 28, 2009 - 20:53

Here's a patch based on the application of the patch from comment #11 to apachesolr 6.x-1.x-dev in CVS.

Attachment                         Size
apachesolr-456420-6.x-1.x.patch    6.82 KB

#17

tituomin - October 6, 2009 - 10:53

I also would very much like to see this patch in 6.x-1.x-dev.

#18

aufumy - October 12, 2009 - 14:52

I have been testing this, and I love it.

Whenever I have an issue with Solr (most recently, for example, with capitalization), I might need to test out schema changes and re-index the existing content, so this feature is very useful for me.

The only minor grief is that at the end of the process, the message "0 items successfully processed." is displayed as an error. Even when cron is run, it says "200 items successfully processed." in red. With 200 the number outweighs the suggestion of an error, but with 0 the color and the number together may alarm an administrator.
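One possible way to handle this, as a sketch only: I have not traced why the patch shows the message in red, so this assumes the finished callback passes 'error' as the message type. The helper name and the message wording are hypothetical.

```php
<?php
/**
 * Picks the text and message type to show when a batch ends.
 *
 * Hypothetical helper, not part of the patch. Returns array(message, type),
 * where type is one of the values drupal_set_message() accepts.
 */
function example_batch_finished_message($success, array $results) {
  if (!$success) {
    // Only a genuinely failed batch is reported as an error.
    return array('The reindexing batch did not complete.', 'error');
  }
  $count = count($results);
  if ($count === 0) {
    // Nothing to do is not a failure, so say so in plain terms.
    return array('No items needed indexing.', 'status');
  }
  return array($count . ' items successfully processed.', 'status');
}

// In a Drupal 6 finished callback this could be used as:
// list($message, $type) = example_batch_finished_message($success, $results);
// drupal_set_message($message, $type);
$outcome = example_batch_finished_message(TRUE, array());
```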

#19

pwolanin - October 15, 2009 - 20:57

Comment above is relevant - a "catch up indexing" option instead of a full reindex.

If we are re-indexing all content, quite possibly this should only be directly available in combination with deleting the index?

Maybe the "catch up" should be the only thing added - you can trigger a delete or reindex and then catch up to get the content up right away. The same feature would then work to quickly get recent content in the index, or pick up from a failed batch.
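To make the "catch up" idea concrete, here is a minimal sketch of a batch operation that works through whatever is still pending, one slice per request, using the usual Batch API sandbox pattern. The function name and the $pending_nids argument are hypothetical. In the module the pending list would presumably come from its own tracking table, and the actual indexing call is left as a comment.

```php
<?php
/**
 * Batch operation: processes the next slice of pending node IDs.
 *
 * Hypothetical sketch. $pending_nids stands in for "nodes not yet indexed".
 */
function example_catch_up_batch_process(array $pending_nids, $limit, &$context) {
  if (!isset($context['sandbox']['progress'])) {
    // First call: record where we are and how much work there is.
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = count($pending_nids);
  }
  $slice = array_slice($pending_nids, $context['sandbox']['progress'], $limit);
  foreach ($slice as $nid) {
    // The node would be sent to Solr here (omitted in this sketch).
    $context['results'][] = $nid;
  }
  $context['sandbox']['progress'] += count($slice);
  // Batch API calls the operation again until 'finished' reaches 1.
  $context['finished'] = $context['sandbox']['max'] > 0
    ? $context['sandbox']['progress'] / $context['sandbox']['max']
    : 1;
}

// Stand-alone simulation of how Batch API would call the operation repeatedly.
$context = array('sandbox' => array(), 'results' => array(), 'finished' => 0);
$passes = 0;
do {
  example_catch_up_batch_process(array(1, 2, 3, 4, 5), 2, $context);
  $passes++;
} while ($context['finished'] < 1);
```

Because the operation only looks at what is still pending, the same code would serve after a delete, after a reindex, or after a batch that stopped partway.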

#20

anarchivist - October 15, 2009 - 21:37

@pwolanin:

If we are re-indexing all content, quite possibly this should only be directly available in combination with deleting the index? ... Maybe the "catch up" should be the only thing added ...

This might make sense and wouldn't necessarily be too difficult to add as part of this patch. If I have a chance over the next week I'll see if I can start hacking on this.

#21

robertDouglass - November 6, 2009 - 11:02

I'd say we should sync DRUPAL-6--1 with the code already available (committed to DRUPAL-6--2) and then work on fixing the mentioned warning and making the "catch up" feature.

#22

anarchivist - November 6, 2009 - 21:58

I second @robertDouglass's idea.

#23

robertDouglass - November 25, 2009 - 16:51
Version: 6.x-1.x-dev » 6.x-2.x-dev
Status: needs review » needs work

OK, maybe someone can look at the error and the incorrect reporting that happen at the end of batch cycles.

#24

anarchivist - December 31, 2009 - 03:59

Here's a rerolled version of the patch in #11 and #16 against DRUPAL-6--1. This still hasn't been committed to that tag yet.

Attachment                    Size
apachesolr-456420-24.patch    6.82 KB
 
 
