Right now, apachesolr_commentsearch does its indexing at the node level, causing all comments in a node to be reindexed if one has been added or changed. For large quantities of comments, this is problematic, since a single comment change can cause hundreds of documents to be indexed. If more than one node changes between cron runs, the cron process can end up exceeding its time/memory limits. And regardless, the redundant updates are significant resource hogs.

My proposal: create a marking table for comments, and implement hook_cron in apachesolr_commentsearch. Existing code would still run if apachesolr_index_comments_with_node was true, but if it was false, hook_comment would mark the comment rather than the node, and hook_cron would look at this table instead.
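
To make the idea concrete, here is a rough sketch of the marking-table flow, written in Python with SQLite rather than Drupal PHP so it runs on its own. Everything in it is an assumption for illustration: the `comment_dirty` table, its columns, and the `mark_comment` / `process_marked` helpers are hypothetical names, not existing apachesolr code. `mark_comment` stands in for what hook_comment would do, and `process_marked` for what hook_cron would do.

```python
import sqlite3


def create_marking_table(conn):
    # Hypothetical schema: one row per comment that still needs indexing.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comment_dirty ("
        " cid INTEGER PRIMARY KEY,"
        " nid INTEGER NOT NULL,"
        " changed INTEGER NOT NULL)"
    )


def mark_comment(conn, cid, nid, changed):
    # Stand-in for hook_comment: mark the single comment, not its node.
    # Re-marking an already marked comment only refreshes its timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO comment_dirty (cid, nid, changed)"
        " VALUES (?, ?, ?)",
        (cid, nid, changed),
    )


def process_marked(conn, index_comment, limit=50):
    # Stand-in for hook_cron: index at most `limit` marked comments,
    # oldest first, and clear each mark once the comment has been sent.
    rows = conn.execute(
        "SELECT cid, nid FROM comment_dirty ORDER BY changed, cid LIMIT ?",
        (limit,),
    ).fetchall()
    for cid, nid in rows:
        index_comment(cid, nid)  # caller-supplied function that talks to Solr
        conn.execute("DELETE FROM comment_dirty WHERE cid = ?", (cid,))
    conn.commit()
    return len(rows)
```

With this shape, a single changed comment costs one row and one document per cron run, however many comments the node has. The `limit` argument plays the role of the per-run batch size.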

Changes to the indexing batch process would require an additional hook in apachesolr to keep things modular, but this may not be necessary as long as the minimal number of nodes to index can be set to 1. I've been able to index a node containing more than 1200 comments without exceeding batch time limits on my dev machine.

I'll of course do this work. Just want to get some feedback before I proceed.

Attachments:
Comment #8: apachesolr_commentsearch.zip (17.81 KB), Andrey Zakharov

Comments

Scott Reynolds

Interesting problem. The only comment I would have is to maybe use http://api.drupal.org/api/function/hook_update_index/6 instead of hook_cron.

The only real reason in my mind is consistency?

kcoop

If it's consistency we're after, apachesolr_search_cron() probably should be changed too.

kcoop

Looking into this a bit, it appears that apachesolr has a mechanism (namespaces) for managing multiple node indexes in the same update table. Considering the amount of potentially duplicated code, it is tempting to leverage this, but the code assumes node ids rather than comment ids, and going this route would produce less-than-elegant code, so I'm rejecting it for now.

But it raises the question, should I be considering namespaces in this implementation? I'm not clear on a concrete example of how they would be used. Also, any thoughts on use cases where other modules might want to implement hooks that index by comment?

On the duplicated code question, my impulse is to limit changes to apachesolr_commentsearch where possible, so I'm avoiding refactoring in apachesolr itself, even if it leads to duplication. Should I be thinking differently?

kcoop

Thinking some more about this... if we're updating individual comments, why bother batching them with cron? Why not simply talk to solr as comments are created/updated/deleted? Simpler code, smoother resource demands. Don't even need to query the comment table, since we have the comment in hand.

EDIT: I didn't consider the overhead of connecting to solr. That's probably reason enough not to do it this way, especially if it's on a separate box. Will stick with the marking table.

kcoop

Just chattering away here...

I see that the marking table in apachesolr maintains a record for every node, even after it's been indexed. I was thinking of it more as a list of dirty entries instead of a status table. Is there a reason for tracking indexed entries? When it comes to comments, the table may get considerably larger...

Andrey Zakharov

subscribing

BetaTheta

subscribing

Andrey Zakharov

Attachment: apachesolr_commentsearch.zip (new, 17.81 KB)

I'm using this module, which has partial support for per-comment indexing.
It uses a separate indexing table, and I do not like how it relates to the apachesolr module. It could be better with http://drupal.org/node/832118

jpmckinney

Title: Comment Granularity for Indexing » Don't reindex all of a node's comments if a comment is added/removed
Version: 6.x-2.x-dev » 7.x-1.x-dev

nick_vh

Status: Active » Closed (works as designed)

Solved by the multiple indexer patch. You can easily add your own entity callbacks for comments and modify the ones for node.