Don't reindex all of a node's comments if a comment is added/removed [#658740]

Right now, apachesolr_commentsearch does its indexing at the node level, causing all comments in a node to be reindexed if one has been added or changed. For large quantities of comments, this is problematic, since a single comment change can cause hundreds of documents to be indexed. If more than one node changes between cron runs, the cron process can end up exceeding its time/memory limits. And regardless, the redundant updates are significant resource hogs.

My proposal: create a marking table for comments, and implement hook_cron in apachesolr_commentsearch. Existing code would still run if apachesolr_index_comments_with_node was true, but if it was false, hook_comment would mark the comment rather than the node, and hook_cron would look at this table instead.
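To make the proposal concrete, here is a minimal, language-neutral sketch of the marking-table idea in Python with SQLite. It is not Drupal or apachesolr code: the `comment_queue` table and the `mark_comment`, `process_queue`, and `send_to_index` names are all hypothetical, standing in for what hook_comment and hook_cron would do.

```python
import sqlite3

def mark_comment(db, cid, nid, changed):
    """Record that one comment needs (re)indexing (the hook_comment role)."""
    db.execute(
        "INSERT OR REPLACE INTO comment_queue (cid, nid, changed) VALUES (?, ?, ?)",
        (cid, nid, changed),
    )

def process_queue(db, send_to_index, limit=50):
    """Cron step: index up to `limit` marked comments, oldest first, then unmark them."""
    rows = db.execute(
        "SELECT cid, nid FROM comment_queue ORDER BY changed, cid LIMIT ?",
        (limit,),
    ).fetchall()
    if rows:
        send_to_index(rows)  # hypothetical callback: one batched request per run
        db.executemany(
            "DELETE FROM comment_queue WHERE cid = ?",
            [(cid,) for cid, _nid in rows],
        )
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE comment_queue (cid INTEGER PRIMARY KEY, nid INTEGER, changed INTEGER)"
)

sent = []  # stands in for the documents sent to the search server
mark_comment(db, cid=101, nid=7, changed=1000)
mark_comment(db, cid=102, nid=7, changed=1001)
mark_comment(db, cid=101, nid=7, changed=1002)  # edited again: still a single row
processed = process_queue(db, sent.extend)
```

In this sketch, two changed comments on the same node produce two queued rows, regardless of how many other comments the node has. The `limit` argument bounds the work done in any one cron run.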

Changes to the indexing batch process would require an additional hook in apachesolr to keep things modular, but this may not be necessary as long as the minimal number of nodes to index can be set to 1. I've been able to index a node containing more than 1200 comments without exceeding batch time limits on my dev machine.

I'll of course do this work. Just want to get some feedback before I proceed.

Comment	File	Size	Author
#8	apachesolr_commentsearch.zip	17.81 KB	Andrey Zakharov

Comments

Comment #1

Scott Reynolds commented 13 December 2009 at 19:37

Interesting problem. The only comment I would have is to maybe use http://api.drupal.org/api/function/hook_update_index/6 instead of hook_cron.

The only real reason in my mind is consistency.

Comment #2

kcoop commented 14 December 2009 at 06:36

If it's consistency we're after, apachesolr_search_cron() probably should be changed too.

Comment #3

kcoop commented 15 December 2009 at 01:21

Looking into this a bit, it appears that apachesolr has a mechanism (namespaces) for managing multiple node indexes in the same update table. Considering the amount of potentially duplicated code, it is tempting to leverage this, but the mechanism assumes node ids rather than comment ids, and going this route would create less than elegant code, so I'm rejecting it for now.

But it raises the question, should I be considering namespaces in this implementation? I'm not clear on a concrete example of how they would be used. Also, any thoughts on use cases where other modules might want to implement hooks that index by comment?

On the duplicated code question, my impulse is to limit changes to apachesolr_commentsearch where possible, so I'm avoiding refactoring in apachesolr itself, even if it leads to duplication. Should I be thinking differently?

Comment #4

kcoop commented 16 December 2009 at 00:32

Thinking some more about this... if we're updating individual comments, why bother batching them with cron? Why not simply talk to solr as comments are created/updated/deleted? Simpler code, smoother resource demands. Don't even need to query the comment table, since we have the comment in hand.

EDIT: I didn't consider the overhead of connecting to solr. That's probably reason enough not to do it this way, especially if it's on a separate box. Will stick with the marking table.

Comment #5

kcoop commented 16 December 2009 at 06:22

Just chattering away here...

I see that the marking table in apachesolr maintains a record for every node, even after it's been indexed. I was thinking of it more as a list of dirty entries instead of a status table. Is there a reason for tracking indexed entries? When it comes to comments, the table may get considerably larger...
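The difference between the two designs can be sketched as follows, again in Python with SQLite rather than Drupal code. The `comment_status` table, the `pending` function, and the `last_indexed` high-water mark are hypothetical names, and nothing here is meant to describe how apachesolr actually tracks its position.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Status table: one row per comment, kept even after the comment is indexed.
db.execute("CREATE TABLE comment_status (cid INTEGER PRIMARY KEY, changed INTEGER)")
db.executemany(
    "INSERT INTO comment_status VALUES (?, ?)",
    [(1, 100), (2, 110), (3, 120)],
)

def pending(db, last_indexed):
    """Return the ids of comments changed after the high-water mark."""
    rows = db.execute(
        "SELECT cid FROM comment_status WHERE changed > ? ORDER BY changed, cid",
        (last_indexed,),
    ).fetchall()
    return [cid for (cid,) in rows]

last_indexed = 120  # assume everything up to this timestamp is already indexed
db.execute("UPDATE comment_status SET changed = 130 WHERE cid = 2")  # comment 2 edited

dirty_now = pending(db, last_indexed)  # only the edited comment
full_rebuild = pending(db, 0)          # resetting the mark re-queues every row
table_size = db.execute("SELECT COUNT(*) FROM comment_status").fetchone()[0]
```

In this sketch the status table always holds one row per comment, so it grows with the site, whereas a dirty list holds only the pending entries. What the status table offers in exchange is that a full reindex needs only a reset of the mark, with no need to repopulate a queue.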

Comment #6

Andrey Zakharov commented 13 December 2010 at 13:58

subscribing

Comment #7

BetaTheta commented 2 January 2011 at 08:55

subscribing

Comment #8

Andrey Zakharov commented 9 February 2011 at 15:07

Status	File	Size
new	apachesolr_commentsearch.zip	17.81 KB

I'm using this module with partial comment indexing support.
It uses a separate indexing table, and I do not like how it relates to the apachesolr module. It could be better with http://drupal.org/node/832118

Comment #9

jpmckinney commented 20 March 2011 at 04:19

Title:	Comment Granularity for Indexing	» Don't reindex all of a node's comments if a comment is added/removed
Version:	6.x-2.x-dev	» 7.x-1.x-dev

Comment #10

nick_vh commented 28 December 2011 at 10:47

Status:	Active	» Closed (works as designed)

Solved by the multiple indexer patch. You can easily add your own entity callbacks for comments and modify the ones for node.
