Give modules the opportunity to make several documents out of a node

Damien Tournoud - August 3, 2009 - 11:29
Project: Apache Solr Search Integration
Version: 6.x-2.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: closed
Description

The current hook_apachesolr_update_index() doesn't give the opportunity for modules to generate several documents out of the same node. Could we add this feature, pretty please?

#1

Damien Tournoud - August 3, 2009 - 11:31

Two possible solutions:

  • apachesolr_node_to_document() becomes apachesolr_node_to_documents(), and hook_apachesolr_update_index() changes accordingly
  • add a hook_apachesolr_documents_alter() in apachesolr_index_nodes()
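
To make the second option concrete, here is a minimal standalone sketch (plain PHP, no Drupal bootstrap) of how an alter-style hook could let a module turn one node into several documents. Everything in it is hypothetical: sketch_node_to_document(), sketch_index_node(), the mymodule_ implementation, and the array-based documents are stand-ins for illustration, not the module's actual API.

```php
<?php
// Hypothetical stand-in for apachesolr_node_to_document(): one node, one document.
function sketch_node_to_document(array $node) {
  return array(
    'id' => 'node/' . $node['nid'],
    'title' => $node['title'],
    'body' => $node['body'],
  );
}

// Hypothetical implementation of the proposed hook_apachesolr_documents_alter():
// appends one extra document per attached file to the list built for a node.
function mymodule_sketch_documents_alter(array &$documents, array $node) {
  foreach ($node['files'] as $delta => $filename) {
    $documents[] = array(
      'id' => 'node/' . $node['nid'] . '/file/' . $delta,
      'title' => $filename,
      'body' => '',
    );
  }
}

// Stand-in for the indexing loop: build the default document, then let every
// registered alter callback add to, change, or remove documents.
function sketch_index_node(array $node, array $alter_callbacks) {
  $documents = array(sketch_node_to_document($node));
  foreach ($alter_callbacks as $callback) {
    // Calling through a variable function keeps $documents passed by reference.
    $callback($documents, $node);
  }
  return $documents;
}

$node = array(
  'nid' => 42,
  'title' => 'Report',
  'body' => 'Text',
  'files' => array('a.pdf', 'b.pdf'),
);
$documents = sketch_index_node($node, array('mymodule_sketch_documents_alter'));
```

Because the callbacks receive the whole list by reference, several modules can each contribute documents for the same node, which is the use case described in this issue.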

#2

pwolanin - August 4, 2009 - 19:26
Status: active » needs review

I would prefer allowing developers to specify a custom callback to generate documents from nodes.

Attachment                          Size
node-doc-callback-538636-2.patch    974 bytes

#3

anarchivist - August 6, 2009 - 15:23

Seconding pwolanin's approach.

#4

Damien Tournoud - August 7, 2009 - 13:24

The issue with this approach is that only one module can take over the apachesolr_search_custom_indexing_callback variable. There is no way for several modules to alter documents (or make several documents out of one, as in my use case).
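
The limitation can be illustrated with a small standalone sketch (plain PHP, no Drupal bootstrap). The $variables array stands in for Drupal's variable storage, and the module_a_/module_b_ functions are hypothetical; only the variable name apachesolr_search_custom_indexing_callback comes from the patch under discussion.

```php
<?php
// Hypothetical document builders provided by two different modules.
function module_a_sketch_index(array $node) {
  return array(array('id' => 'a/' . $node['nid']));
}
function module_b_sketch_index(array $node) {
  return array(array('id' => 'b/' . $node['nid']));
}

// Stand-in for Drupal's variable storage: one value per variable name.
$variables = array();

// Both modules try to claim the single callback variable; the second write
// silently replaces the first.
$variables['apachesolr_search_custom_indexing_callback'] = 'module_a_sketch_index';
$variables['apachesolr_search_custom_indexing_callback'] = 'module_b_sketch_index';

// Variable-based dispatch: only the last writer's callback runs.
$callback = $variables['apachesolr_search_custom_indexing_callback'];
$from_variable = $callback(array('nid' => 7));

// Hook-style dispatch: every implementation runs and the results are merged.
$from_hooks = array();
foreach (array('module_a_sketch_index', 'module_b_sketch_index') as $implementation) {
  $from_hooks = array_merge($from_hooks, $implementation(array('nid' => 7)));
}
```

With the variable, module A's documents are lost; with a hook-style loop, both modules contribute.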

#5

pwolanin - August 9, 2009 - 02:03

Per discussion earlier with Damien: any module could also maintain its own queue of nodes to be indexed (like apachesolr_attachments), and any module can implement hook_apachesolr_node_exclude(), so especially with this addition, there ought to be sufficient flexibility for almost any application.

#6

dgarciad - August 13, 2009 - 09:13

Hi

I am especially interested in the implementation of this functionality for apachesolr_attachments because right now, if there are many documents attached to a single node (as happens in my system), I very often get cron timeouts as the system tries to index all of them.

I would appreciate it very much if someone would take steps in this direction.

Regards

#7

robertDouglass - August 13, 2009 - 11:19
Version: 6.x-1.x-dev » 6.x-2.x-dev

Let's keep 6.1 more or less frozen.

#8

pwolanin - August 13, 2009 - 11:42

@robert - adding this to 6.x-1.x would be OK I think, since it doesn't break backward compatibility.

@dgarciad - this is the wrong issue queue then and not directly related to your problem. Look at the code in apachesolr_attachments and post an issue there. It does try to limit the total time taken, but if you have huge numbers of attachments per node there is no easy answer.

#9

robertDouglass - August 13, 2009 - 12:05

@pwolanin - ok - but let's fix it in 6.2 and backport, then.

#10

pwolanin - August 13, 2009 - 12:32

sure - go for it.

#11

janusman - August 13, 2009 - 13:15

@dgarciad: You might also be interested in this issue: #456420: Add Batch API support for rebuilding indexes

#12

robertDouglass - September 16, 2009 - 12:41

wrt #2 from pwolanin: I think I'd be more in support of calling a hook. The variable seems clunky. Is there any reason not to do an invoke all on the rows?

<?php
- apachesolr_index_nodes($rows, 'apachesolr_search');
+ module_invoke_all('apachesolr_index', $rows);
?>

#13

robertDouglass - September 16, 2009 - 17:35

Comments from pwolanin from chat:

If we are doing this as node-based, maybe the approach should be to allow modules to add to or alter the doc(s) for one node?
When you get the docs later you likely have to call node_load() all over again, etc
So right now we have (line ~150 in apachesolr.index.inc?):
// Let modules add to the document - TODO convert to drupal_alter().
foreach (module_implements('apachesolr_update_index') as $module) {

apachesolr_node_to_document() could become apachesolr_node_to_documentS

I have to add that node_load all over again isn't so horrible ... but better if it can be avoided.

#14

robertDouglass - September 17, 2009 - 14:22

As a first step I'm addressing Peter's concern about running node_load multiple times. Especially since we invalidate the static cache when we do it, this would present a huge performance problem and slow down indexing even more. So I'm centralizing node_load at a higher level and passing $node instead of $nid to all the (currently one) functions that want to build documents. This patch does that, plus it removes apachesolr_add_node_document, which doesn't seem needed, and lets the document building functions return the $documents directly.
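
The idea of loading each node once and handing it to every builder can be sketched as follows. This is a standalone illustration under stated assumptions, not the committed patch: SketchNodeLoader stands in for node_load() and only counts calls, and the sketch_ builder functions are hypothetical.

```php
<?php
// Hypothetical stand-in for node_load(); counts calls to make the cost visible.
class SketchNodeLoader {
  public static $calls = 0;

  public static function load($nid) {
    self::$calls++;
    return array('nid' => $nid, 'title' => 'Node ' . $nid);
  }
}

// Document builders take the already loaded $node rather than a $nid and
// return their documents directly.
function sketch_build_main_document(array $node) {
  return array(array('id' => 'node/' . $node['nid'], 'title' => $node['title']));
}
function sketch_build_extra_document(array $node) {
  return array(array('id' => 'node/' . $node['nid'] . '/extra', 'title' => $node['title']));
}

// Load each node once at the top level, then pass it to every builder.
function sketch_index_nodes(array $nids, array $builders) {
  $documents = array();
  foreach ($nids as $nid) {
    $node = SketchNodeLoader::load($nid);
    foreach ($builders as $builder) {
      $documents = array_merge($documents, $builder($node));
    }
  }
  return $documents;
}

$documents = sketch_index_nodes(
  array(1, 2),
  array('sketch_build_main_document', 'sketch_build_extra_document')
);
```

Two nodes and two builders yield four documents, but the loader runs only once per node rather than once per builder.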

Since I'm in the mood to be a cowboy I'm committing all this as I go. Feel free to comment and tell me I'm full of it. Willing to roll things back if needed.

Attachment                               Size
rm_apachesolr_add_node_document.patch    2.55 KB

#15

robertDouglass - September 17, 2009 - 15:17

Here's a decent solution. There is now a hook_apachesolr_document_handlers that collects a list of function names. These functions are all capable of turning an entity into a document. The $type of the entity and the $namespace of the module triggering indexing are both passed along.
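
A rough standalone sketch of the handler-collection pattern described above follows. The thread does not show the hook's real signature, so the arguments, the return format, and every function name here (the _sketch_ suffixes, modulea, moduleb) are assumptions made for illustration only.

```php
<?php
// Hypothetical implementations of a document-handlers hook: each returns the
// names of functions able to turn an entity of $type into documents.
function modulea_sketch_document_handlers($type, $namespace) {
  return ($type == 'node') ? array('modulea_sketch_node_to_documents') : array();
}
function moduleb_sketch_document_handlers($type, $namespace) {
  if ($type == 'node' && $namespace == 'apachesolr_search') {
    return array('moduleb_sketch_node_to_documents');
  }
  return array();
}

// Hypothetical handlers: each returns a list of documents for the entity.
function modulea_sketch_node_to_documents(array $entity, $namespace) {
  return array(array('id' => 'node/' . $entity['nid']));
}
function moduleb_sketch_node_to_documents(array $entity, $namespace) {
  return array(array('id' => 'node/' . $entity['nid'] . '/summary'));
}

// Collect handler names from every module that implements the hook.
function sketch_collect_handlers(array $modules, $type, $namespace) {
  $handlers = array();
  foreach ($modules as $module) {
    $function = $module . '_sketch_document_handlers';
    if (function_exists($function)) {
      $handlers = array_merge($handlers, $function($type, $namespace));
    }
  }
  return $handlers;
}

// Run every collected handler on the entity and merge the documents.
function sketch_build_documents(array $entity, $type, $namespace, array $modules) {
  $documents = array();
  foreach (sketch_collect_handlers($modules, $type, $namespace) as $handler) {
    $documents = array_merge($documents, $handler($entity, $namespace));
  }
  return $documents;
}

$modules = array('modulea', 'moduleb');
$search_docs = sketch_build_documents(array('nid' => 5), 'node', 'apachesolr_search', $modules);
$other_docs = sketch_build_documents(array('nid' => 5), 'node', 'other_module', $modules);
```

Unlike a single callback variable, any number of modules can register handlers, and each can choose which entity types and namespaces it responds to.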

Committing this so I can move straight on to indexing comments. Review still welcome, of course.

Attachment                 Size
document_handlers.patch    3.75 KB

#16

robertDouglass - November 6, 2009 - 12:28
Version: 6.x-2.x-dev » 6.x-1.x-dev
Status: needs review » patch (to be ported)

Ok. If Peter is interested in backporting, go for it. Otherwise, please close.

#17

pwolanin - November 7, 2009 - 01:07

If we are changing the indexing API, I'm a little reluctant to backport.

#18

robertDouglass - November 7, 2009 - 14:20
Status: patch (to be ported) » fixed

Settled.

#19

pwolanin - November 7, 2009 - 15:05
Version: 6.x-1.x-dev » 6.x-2.x-dev

#20

System Message - November 21, 2009 - 15:10
Status: fixed » closed

Automatically closed -- issue fixed for 2 weeks with no activity.

 
 
