Add Batch API support for rebuilding indexes [#456420]

Comment	File	Size	Author
#25	apachesolr-456420-25.patch	6.8 KB	tituomin
#24	apachesolr-456420-24.patch	6.82 KB	anarchivist
#16	apachesolr-456420-6.x-1.x.patch	6.82 KB	anarchivist
#11	apachesolr-456420-11.patch	6.83 KB	janusman
#10	batch.patch	6.96 KB	robertdouglass
#1	apachesolr_batch_reindex-6.x-1.1.tar_.gz	3.84 KB	anarchivist

Comment #1

anarchivist commented 14 May 2009 at 05:23

Status	File	Size
new	apachesolr_batch_reindex-6.x-1.1.tar_.gz	3.84 KB

One of my coworkers and I have put together a really, really rough module to handle reindexing using the Batch API. The code definitely needs work, better comments, etc. - please give feedback if you'd like!

Comment #2

anarchivist commented 14 May 2009 at 05:24

Status:	Active	» Needs review

Comment #3

anarchivist commented 10 June 2009 at 04:01

Any feedback on this would be greatly appreciated. Ideally, I think it would make sense to add this functionality to apachesolr, but if the developers think otherwise, I'd like to know.

Comment #4

pwolanin commented 10 June 2009 at 14:13

This is not really a priority for inclusion since it would only be relevant for large sites with developers in a big hurry :)

In those cases, they could change settings.php to increase the number of updates per cron run, and run cron once per minute or some such.
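
A minimal sketch of such a settings.php override. It assumes the per-cron-run limit is read from a variable named apachesolr_cron_limit (an assumption, not confirmed in this thread; check the module source for the actual name), and 200 is only an illustrative value:

```php
<?php
// In settings.php: entries in $conf override stored variables at runtime.
// ASSUMPTION: 'apachesolr_cron_limit' is the variable the module reads for
// the number of nodes to index per cron run.
$conf['apachesolr_cron_limit'] = 200;
```

Cron itself would still need to be scheduled more often on the server side for this to speed indexing up.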

Comment #5

anarchivist commented 10 June 2009 at 15:00

OK, thanks. I understand your point of view but we have had this use case...

I guess this is something that I should have filed as another issue (and will certainly do if it comes up again), but I was running into the problem of having stale information in apachesolr_search_node which prevented a complete reindexing, which is why this code clears out that table before it starts - have you seen this issue before?

Comment #6

pwolanin commented 10 June 2009 at 16:14

If there is stale info there, that's certainly a bug, though it's possible that not all node updates are being caught. In particular, any that bypass nodeapi (i.e. don't use node_save()) will be missed.

Comment #7

janusman commented 22 July 2009 at 22:37

Status:	Needs review	» Needs work

I've been using this module a bit in a test install. I would have found it useful when migrating our D5 site to D6.

@pwolanin: I think Apache Solr itself is geared to site builders, and they *do* have to run cron.php a zillion times to get nodes in... I'm thinking this has an ample audience (potentially, *everyone* who's installing the module for the first time, or perhaps recently activated node access modules, etc?)

Perhaps it's not 1.x material, but 2.x as a contrib/ module? It could go in the current Reindex admin page; I'd recommend it go through a confirmation screen first.

I wonder what would be the impact on the server of doing this kind of batch processing...?

Thoughts?

PS: A code suggestion: change apachesolr_batch_reindex_settings_form() like so:

function apachesolr_batch_reindex_settings_form() {
  $form = array();
  $form['help'] = array(
    '#type' => 'item',
    '#value' => t('Re-indexing will add all content to the index again (overwriting the index), but existing content in the index will remain searchable. This reindexing process uses the Batch API.'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Batch reindex'),
    '#submit' => array('apachesolr_batch_reindex_submit'),
  );
  return $form;
}

Comment #8

robertdouglass commented 23 July 2009 at 09:28

Version:	6.x-1.x-dev	» 6.x-2.x-dev

I love this feature and want it to be an option on the main admin page.

Comment #9

anarchivist commented 23 July 2009 at 20:03

As I have time, I'll work on polishing this code.

Comment #10

robertdouglass commented 13 August 2009 at 13:23

Status	File	Size
new	batch.patch	6.96 KB

Here's a start at a patch for integrating it directly into apachesolr. The actual batch processing is broken but the error is probably something minor. It'd be great if someone could take it and finish it up.

Comment #11

janusman commented 28 August 2009 at 20:23

Status:	Needs work	» Needs review

Status	File	Size
new	apachesolr-456420-11.patch	6.83 KB

It was a headache to figure out, but I solved it: the problem was a missing 'file' element in the $batch array, which is needed when the processing/finishing functions are in a different file than the .module itself.

New patch.
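
For readers hitting the same problem, here is a rough sketch of a batch definition that carries the 'file' element. It is not the code from the patch: the function name, callback names, module path and include filename are all hypothetical, and only the array keys ('title', 'operations', 'finished', 'file') are Batch API keys.

```php
<?php
/**
 * Builds a Batch API definition whose callbacks live in an include file.
 * All names below are hypothetical placeholders for illustration.
 */
function example_reindex_batch_definition($module_path) {
  return array(
    'title' => 'Reindexing content',
    'operations' => array(
      // Each operation is array(callback name, array of arguments).
      array('example_reindex_batch_operation', array()),
    ),
    'finished' => 'example_reindex_batch_finished',
    // Without 'file', the batch page does not load the include, so
    // callbacks defined outside the .module file cannot be found.
    'file' => $module_path . '/example.admin.inc',
  );
}

$batch = example_reindex_batch_definition('sites/all/modules/example');
// Inside Drupal, the definition would then be handed to batch_set($batch).
```

In a real module the path would normally come from drupal_get_path('module', ...) rather than a hard-coded string.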

Comment #12

jason ruyle commented 29 August 2009 at 21:48

I just installed the patch from #11 and it's running smoothly.
I'm (re)developing a site that has 67k photos/nodes that have to be indexed. It would have taken forever to run the cron version of the update. Not only that, by using the cron method, it was causing really bad apache loads on our system. By running this version, it has reduced the load slightly.

Thanks for the patch.

I do have one question, not sure how possible this is, but we're doing a bulk index from a "re-index". It would be nice if this could pick up from a partially indexed site. I say this because if our browser cuts out, the bulk index stops.

Or maybe there is a way to run this from cli/drush. If I could run it as a background process on the server:
drush updatesolr &

something like that.

But again, thanks for the updates. If there is anything I can provide, please let me know.

Comment #13

robertdouglass commented 1 September 2009 at 09:47

Status:	Needs review	» Fixed

Great stuff! Committed. Thanks anarchivist, janusman.

Open new issues for the suggestions to do batch indexing on the remainder and for adding Drush support.

Comment #14

anarchivist commented 3 September 2009 at 12:49

Great! Glad to see janusman turned this around so quickly. :)

Comment #15

anarchivist commented 14 September 2009 at 21:17

Version:	6.x-2.x-dev	» 6.x-1.x-dev
Status:	Fixed	» Needs review

I've applied the patch to the latest tarball for 6.x-1.x-dev, and it seems to be working fine. Would someone be willing to add this to 6.x-1.x-dev?

Comment #16

anarchivist commented 28 September 2009 at 20:53

Status	File	Size
new	apachesolr-456420-6.x-1.x.patch	6.82 KB

Here's a patch based on the application of the patch from comment #11 to apachesolr 6.x-1.x-dev in CVS.

Comment #17

tituomin commented 6 October 2009 at 10:53

I also would very much like to see this patch in 6.x-1.x-dev.

Comment #18

aufumy commented 12 October 2009 at 14:52

I have been testing this, and I love it.

Whenever I have an issue with Solr (most recently with capitalization), I might need to test out schema changes and re-index the existing content, so this feature is very useful for me.

The only minor grief is that at the end of the process, the message is "0 items successfully processed." as an error. Even when cron is run it says "200 items successfully processed." in red. When the number is 200, the wording outweighs the suggestion of an error, but when it is 0, the color and the number together may alarm an administrator.

Comment #19

pwolanin commented 15 October 2009 at 20:57

Comment above is relevant - a "catch up indexing" option instead of a full reindex.

If we are re-indexing all content, quite possibly this should only be directly available in combination with deleting the index?

Maybe the "catch up" should be the only thing added - you can trigger a delete or reindex and then catch up to get the content up right away. The same feature would then work to quickly get recent content in the index, or pick up from a failed batch.
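
A rough sketch of what such a "catch up" operation could look like. It assumes the list of not-yet-indexed node IDs is already known and uses hypothetical names throughout. The $context keys ('sandbox', 'results', 'finished') follow the Batch API convention for multi-pass operations, and the loop at the bottom only stands in for the batch engine so the snippet runs on its own.

```php
<?php
/**
 * Hypothetical batch operation: handle up to $limit of the pending node IDs
 * per call, resuming from where the previous call stopped.
 */
function example_catch_up_operation(array $pending, $limit, &$context) {
  if (!isset($context['sandbox']['position'])) {
    // First call for this operation: record the starting point and total.
    $context['sandbox']['position'] = 0;
    $context['sandbox']['max'] = count($pending);
  }
  $position = $context['sandbox']['position'];
  $chunk = array_slice($pending, $position, $limit);
  foreach ($chunk as $nid) {
    // Real code would send the node to Solr here; this only records the ID.
    $context['results'][] = $nid;
  }
  $context['sandbox']['position'] = $position + count($chunk);
  // Anything below 1 asks the batch engine to call the operation again.
  $max = $context['sandbox']['max'];
  $context['finished'] = $max > 0 ? $context['sandbox']['position'] / $max : 1;
}

// Stand-in for the batch engine, which would normally drive the calls.
$context = array('sandbox' => array(), 'results' => array(), 'finished' => 1);
$passes = 0;
do {
  example_catch_up_operation(array(11, 12, 13, 14, 15), 2, $context);
  $passes++;
} while ($context['finished'] < 1);
```

Because progress is tracked per pass, the same operation would serve both for getting recent content in quickly and for continuing after a failed batch, provided the pending list is rebuilt from whatever is still unindexed.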

Comment #20

anarchivist commented 15 October 2009 at 21:37

@pwolanin:

If we are re-indexing all content, quite possibly this should only be directly available in combination with deleting the index? ... Maybe the "catch up" should be the only thing added ...

This might make sense and wouldn't necessarily be too difficult to add as part of this patch. If I have a chance over the next week I'll see if I can start hacking on this.

Comment #21

robertdouglass commented 6 November 2009 at 11:02

I'd say we should sync DRUPAL-6--1 with the code already available (committed to DRUPAL-6--2) and then work on fixing the mentioned warning and making the "catch up" feature.

Comment #22

anarchivist commented 6 November 2009 at 21:58

I second @robertDouglass's idea.

Comment #23

robertdouglass commented 25 November 2009 at 16:51

Version:	6.x-1.x-dev	» 6.x-2.x-dev
Status:	Needs review	» Needs work

Ok, maybe someone can look at the error and wrong reporting that happens at the end of batch cycles.

Comment #24

anarchivist commented 31 December 2009 at 03:59

Status	File	Size
new	apachesolr-456420-24.patch	6.82 KB

Here's a rerolled version of the patch in #11 and #16 against DRUPAL-6--1. This still hasn't been committed to that tag yet.

Comment #25

tituomin commented 18 February 2010 at 12:02

Version:	6.x-2.x-dev	» 6.x-1.x-dev

Status	File	Size
new	apachesolr-456420-25.patch	6.8 KB

I replaced the error message with a more accurate message. Also, now the operation will completely delete the index in the beginning. The user is warned about this.

I think this feature would be nice to have in the 6.x-1.0 version.

Comment #26

tituomin commented 5 March 2010 at 15:32

Status:	Needs work	» Needs review

Comment #27

janusman commented 8 March 2010 at 21:50

Status:	Needs review	» Reviewed & tested by the community

Works, code looks ok.

Comment #28

drewish commented 11 March 2010 at 17:31

Status:	Reviewed & tested by the community	» Needs review

Bumping the status back partially because it's bad form for the patch author to mark it RTBC and partially because I think we need to give this more thought.

I'd suggest that the 2.x code needs more work before we back port to 1.x. #573734: Index controls should be radio buttons with one form submission button has some good ideas. At the very least I think we should allow the admin to index all remaining content without starting from the beginning every time. I've got 60,000+ nodes and run into issues after importing content where cron won't be able to catch up. I don't want to re-index from the beginning, I just want to index the remaining content.

Comment #29

pwolanin commented 11 March 2010 at 19:38

Status:	Needs review	» Needs work

sounds like it needs work then.

Comment #30

janusman commented 11 March 2010 at 23:58

@drewish: to clarify, the original author was @anarchivist, not me =) See comment #1.

Comment #31

drewish commented 18 March 2010 at 16:49

I posted a patch to #573734: Index controls should be radio buttons with one form submission button that does a bunch of cleanup to the batch api code.

In the current 2.x code we combine too many operations. By splitting out batch indexing from the delete and reset operations, the interface becomes much easier to understand and use.

I'd love it if we could get this sorted out in the 2.x branch and then get a clean version backported.

Comment #32

jpmckinney commented 29 April 2010 at 22:21

#28 through #31: #573734: Index controls should be radio buttons with one form submission button can be committed after the original patches in this issue.

#19, #25: I don't always want to delete the index before re-indexing. Sometimes I just want to re-index.

#18, #21, #23:

The only minor grief is that at the end of the process, the message is "0 items successfully processed." as an error.

work on fixing the mentioned warning

Ok, maybe someone can look at the error and wrong reporting that happens at the end of batch cycles.

Seems to be fixed in #573734: Index controls should be radio buttons with one form submission button (with E_STRICT fixes from http://drupal.org/cvs?commit=358046)

Comment #33

jpmckinney commented 29 April 2010 at 22:21

Status:	Needs work	» Patch (to be ported)

Comment #34

drewish commented 29 April 2010 at 23:35

I'd suggest just copying and pasting the code from 2.x rather than trying to backport the patch.

Comment #35

jpmckinney commented 30 April 2010 at 02:34

Right, but to know what code to copy-paste, you will need the patches to guide you :)

Comment #36

pwolanin commented 20 March 2011 at 02:11

If this is ported to 6.x-1.x, it needs to take into account the fixes I proposed here: http://drupal.org/node/1062232#comment-4234262

I'm not seeing this as urgent.

Comment #37

alibama commented 15 August 2011 at 18:25

Working until I hit 4,800 nodes when updating 100-200 nodes at a time; it makes it up to 10,000 when updating 50 at a time, then it errors out. I'm running the latest dev releases of all pieces as of August 15. Any help greatly appreciated, I've got ~60K nodes.

Comment #38

alibama commented 19 August 2011 at 13:53

Just wanted to let you guys know that this is a great mod. I found that my theme was partly responsible for the timeouts: it was a Fusion theme with some JavaScript. I went to the batch page with ?op=nojs to turn the JavaScript off and shazaam, it was indexing 30k nodes with no timeouts. Otherwise it wasn't so bad: I would just close the browser and restart the batch process, and it would plow on slowly but surely. Y'all should consider rolling this into a project, it's really a great tool. Also, in case anyone else has this problem: apachesolr views was somehow really messing with the indexing... like destroying it :)

Thank you all for your help

Comment #39

m.stenta commented 9 September 2011 at 13:40

Thank you so much for this. I was getting ready to roll my own module for batch reindexing. You are awesome (speaking to the original module author, and all those involved in this issue thread). Can't wait to see this included in the official module. (long story short = subscribe)

Comment #40

michael121 commented 18 November 2011 at 01:28

Hi, is there a way in the 6.x-1.x-dev branch to run the batch indexing without deleting the index, so that it indexes only new or updated content like cron does? Or to catch up the batch?

Comment #41

nick_vh commented 18 November 2011 at 11:10

You can click "index queued content"?

Comment #42

nick_vh commented 18 November 2011 at 14:42

Status:	Patch (to be ported)	» Postponed

At this point this is going to be postponed until 7 has a stable release. Then we'll work towards a better 6 version.
However, if you'd like to have this you should probably take a stab at this yourself?

Comment #43

michael121 commented 23 November 2011 at 15:14

#41 - yes, in 6.x-2.x-dev but not in 6.x-1.x-dev. There you can only do a full re-index with cron; by applying the patch above you can use the Batch API to re-index, but even then only as a full re-index.

#42 - I think doing this would require rewriting a lot of code. I compared the 2.x and the 1.x code and, as I understand it, the 1.x code is written to delete the index before re-indexing. I'm not sure where to start...
Maybe you can give me a tip and I will try.

Comment #44

nick_vh commented 28 December 2011 at 22:05

Version:	6.x-1.x-dev	» 6.x-3.x-dev
Status:	Postponed	» Closed (won't fix)

And since feature requests are not valid for 6.x-1.x, we are moving this to 6.x-3.x.
Then again, I'm going to close this because 6.x-3.x will be a backport of 7.x-1.x, and this version already has better Drush support. Any new functionality should be requested in the 7.x branch and may or may not be backported.
