Right now, apachesolr_commentsearch does its indexing at the node level, causing all comments in a node to be reindexed if one has been added or changed. For large quantities of comments, this is problematic, since a single comment change can cause hundreds of documents to be indexed. If more than one node changes between cron runs, the cron process can end up exceeding its time/memory limits. And regardless, the redundant updates are significant resource hogs.
My proposal: create a marking table for comments, and implement hook_cron in apachesolr_commentsearch. Existing code would still run if apachesolr_index_comments_with_node was true, but if it was false, hook_comment would mark the comment rather than the node, and hook_cron would look at this table instead.
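A minimal sketch of the marking-table idea, written in Python with sqlite3 rather than Drupal PHP so it can be run standalone. The table name `comment_marks`, the functions `mark_comment` and `run_cron`, and the `send_to_index` callback are all hypothetical stand-ins for illustration; none of them are part of apachesolr or apachesolr_commentsearch.

```python
import sqlite3
import time


def create_marking_table(conn):
    # Hypothetical schema: one row per comment that still needs indexing.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comment_marks ("
        " cid INTEGER PRIMARY KEY,"
        " nid INTEGER NOT NULL,"
        " changed REAL NOT NULL)"
    )


def mark_comment(conn, cid, nid, now=None):
    """Stand-in for the comment hook: flag one comment, not its whole node."""
    changed = time.time() if now is None else now
    # Re-marking an already marked comment just refreshes its timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO comment_marks (cid, nid, changed)"
        " VALUES (?, ?, ?)",
        (cid, nid, changed),
    )


def run_cron(conn, send_to_index, limit=50):
    """Stand-in for the cron hook: index at most `limit` marked comments."""
    rows = conn.execute(
        "SELECT cid, nid FROM comment_marks ORDER BY changed, cid LIMIT ?",
        (limit,),
    ).fetchall()
    for cid, nid in rows:
        send_to_index(cid, nid)  # assumed to push one document to the index
        conn.execute("DELETE FROM comment_marks WHERE cid = ?", (cid,))
    conn.commit()
    return len(rows)
```

With this shape, the work done per cron run is bounded by `limit` and by the number of comments that actually changed, not by the size of the nodes they belong to.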
Changes to the indexing batch process would require an additional hook in apachesolr to keep things modular, but this may not be necessary as long as the minimal number of nodes to index can be set to 1. I've been able to index a node containing more than 1200 comments without exceeding batch time limits on my dev machine.
I'll of course do this work. Just want to get some feedback before I proceed.
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | apachesolr_commentsearch.zip | 17.81 KB | Andrey Zakharov |
Comments
Comment #1
Scott Reynolds commented
Interesting problem. The only comment I would have is to maybe use http://api.drupal.org/api/function/hook_update_index/6 instead of hook_cron.
The only real reason in my mind is consistency?
Comment #2
kcoop commented
If it's consistency we're after, apachesolr_search_cron() probably should be changed too.
Comment #3
kcoop commented
Looking into this a bit, it appears that apachesolr has a mechanism (namespaces) for managing multiple node indexes in the same update table. Considering the amount of potentially duplicated code, it is tempting to leverage this, but the mechanism assumes node ids rather than comment ids, and going this route would create less than elegant code, so I'm rejecting it for now.
But it raises the question, should I be considering namespaces in this implementation? I'm not clear on a concrete example of how they would be used. Also, any thoughts on use cases where other modules might want to implement hooks that index by comment?
On the duplicated code question, my impulse is to limit changes to apachesolr_commentsearch where possible, so I'm avoiding refactoring in apachesolr itself, even if it leads to duplication. Should I be thinking differently?
Comment #4
kcoop commented
Thinking some more about this... if we're updating individual comments, why bother batching them with cron? Why not simply talk to solr as comments are created/updated/deleted? Simpler code, smoother resource demands. Don't even need to query the comment table, since we have the comment in hand.
EDIT: I didn't consider the overhead of connecting to solr. That's probably reason enough not to do it this way, especially if it's on a separate box. Will stick with the marking table.
Comment #5
kcoop commented
Just chattering away here...
I see that the marking table in apachesolr maintains a record for every node, even after it's been indexed. I was thinking of it more as a list of dirty entries instead of a status table. Is there a reason for tracking indexed entries? When it comes to comments, the table may get considerably larger...
Comment #6
Andrey Zakharov commented
subscribing
Comment #7
BetaTheta commented
subscribing
Comment #8
Andrey Zakharov commented
I'm using this module with partial comment indexing support.
It uses a separate indexing table, and I do not like how it relates to the apachesolr module. It could be better with http://drupal.org/node/832118
Comment #9
jpmckinney commented
Comment #10
nick_vh commented
Solved by the multiple indexer patch. You can easily add your own entity callbacks for comments and modify the ones for node.