Having looked at the code and debugged some of my own modules, I have noticed that on every node update apachesolr triggers a reindex for that node.
In setups where we are programmatically adding a lot of short-lived information into additional CCK fields, it does not seem useful to have constant reindexation running, or even to have the reindexation request written to the DB.
My suggestion:
- Provide an option to enable/disable reindex-on-update based on the content type.
- Add options to periodically reindex content types in bulk operations, maybe even with a decreasing frequency based on age.
- Provide an API for other modules to explicitly trigger an update for a given node (like in fuzzysearch).
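As a rough illustration of the "decreasing frequency based on age" idea, here is a minimal sketch in Python. It only models the scheduling logic and is not Drupal code; the function names, the doubling rule and the 32-day cap are all assumptions made up for this example.

```python
DAY = 24 * 60 * 60  # seconds

def reindex_interval(age_seconds, base=DAY, cap=32 * DAY):
    """Return the wait between reindex passes for content of the given age.

    Hypothetical rule: start at `base` and keep doubling the interval
    while the doubled value is still <= the content age and <= `cap`.
    """
    interval = base
    while interval * 2 <= age_seconds and interval * 2 <= cap:
        interval *= 2
    return interval

def is_due(created, last_indexed, now):
    """True if the node should be picked up by the next bulk reindex run."""
    return now - last_indexed >= reindex_interval(now - created)
```

With these assumptions, content younger than two days is reindexed daily, while very old content is revisited at most once every 32 days.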
Comments
Comment #1
JacobSingh commented
Look at hook_apachesolr_update_index
You should be able to do all of this here.
Look at the setting per content type for indexing in the settings pages.
Re-open with a new subject for a specific feature if this doesn't solve your issue.
Best,
Jacob
Comment #2
digi24 commented
Hi Jacob, thanks for your comments. However, the solutions suggested do not match the problem, so I will try to clarify it:
The point I am trying to make is not which content types to index, re-index or what information to add, but the re-indexation strategy used by the apachesolr module.
Right now, apachesolr_nodeapi triggers an UPDATE of the {apachesolr_search_node} table on every node_save. This might not be desirable in certain setups where nodes get updated frequently and the changes are only of minor importance.
I would like to have more granular control over the reindexation strategy used by this module. In my case, for certain content types, I want new nodes to be added to the search index as fast as possible. Updated nodes can wait; in fact, I would prefer to reindex updated nodes just once a week. The approach taken by major search engines is not that much different.
The current apachesolr_nodeapi logic, however, is not compatible with the wishes expressed above. New nodes share the same queue as updated nodes. hook_apachesolr_update_index is AFAIK "too late" to decide what gets queued for an update and what does not.
The approach I am trying to choose and implement:
A: Re-Index settings based on content types
1. Create a form that lets the admin select which content types get reindexed on update.
2. In apachesolr_nodeapi, update section, compare $node->type and decide whether to update {apachesolr_search_node}
3. Create some cron-based functions and settings to reindex content types based on aspects other than the {apachesolr_search_node} updated field.
OR
B: Re-Index settings based on content field
1. On node_save, store re-index instructions in an additional field.
2. In apachesolr_nodeapi, check this field for instructions on whether or not the node should be reindexed.
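To make option A more concrete, here is a minimal sketch of the decision logic, written in Python as a model rather than as Drupal code. The dictionary `queue` stands in for the {apachesolr_search_node} table, and the policy table, function names and content type names are hypothetical, not part of the apachesolr module.

```python
WEEK = 7 * 24 * 60 * 60  # seconds

# Hypothetical admin setting (A.1): which content types reindex on update.
REINDEX_ON_UPDATE = {"article": True, "ticker": False}

def on_node_save(queue, node, is_new, now):
    """A.2: record a reindex request unless the type opts out on update.

    New nodes are always queued so they reach the index quickly.
    Unknown content types default to the current behaviour (queue).
    """
    if is_new or REINDEX_ON_UPDATE.get(node["type"], True):
        queue[node["nid"]] = now
        return True
    return False

def cron_bulk_reindex(queue, nodes, last_run, now, interval=WEEK):
    """A.3: once per interval, queue every node of the opted-out types.

    Returns the timestamp of the most recent bulk run.
    """
    if now - last_run < interval:
        return last_run
    for node in nodes:
        if not REINDEX_ON_UPDATE.get(node["type"], True):
            queue[node["nid"]] = now
    return now
```

In this model, an update to a "ticker" node leaves the queue untouched until the weekly cron pass, while new nodes and "article" updates are queued immediately.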
Comment #3
anarchivist commented
@digi24, have you considered building out some sort of reindexing settings based on whether a new revision of the node has been saved? This would of course rely on setting your content type to save revisions of content.
Comment #4
jpmckinney commented
We can possibly add some hooks to allow developers to intercept changes to apachesolr_search_node - for example, to prevent the module from scheduling the reindexing of those nodes.
Comment #5
cpliakas commented
I have often heard this requirement for Search Lucene API. I added an alter hook to the luceneapi_node_get_queue_query() function that allows people to modify the actual indexing query object. It may or may not be a good solution, but I thought I would throw it out there as a simple idea, as it seems to add a lot of flexibility.
Thanks,
Chris
Comment #6
nick_vh
Duplicate of #966796: Separate indexer for multiple entity types
These tables are being handled in the multi-entity issue, so it is probably best to take a look at that issue.
Comment #7
cpliakas commented
Taking a look at the patches in the multi-entity thread, I am not sure #966796: Separate indexer for multiple entity types accomplished the goal of this issue. More specifically, it seems that the function apachesolr_index_get_entities_to_index() builds a query that creates the list of queued content to be indexed. What if I wanted to implement a custom module that prevented a selected set of entities from being indexed? I don't see how you can add custom conditions to that queue query in any way. I haven't looked at the patch in too much detail, so I definitely could be missing something.
Comment #8
nick_vh
You can always use the indexer callback and, per document, allow or ignore that document for indexing.
However, if you feel a specific query to retrieve the content is needed, I would prefer to continue in the multi-entity thread and allow a new callback in the definition so that you can select your own content.
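The difference between the two options can be sketched as follows. This is a Python model only, not the apachesolr API; the function names and the shape of the callbacks are assumptions for illustration.

```python
def index_with_document_callback(queued, build_document, should_index):
    """Per-document filtering: every queued entity is visited, and the
    callback decides whether a document is built for it or it is ignored."""
    documents = []
    for entity in queued:
        if should_index(entity):
            documents.append(build_document(entity))
    return documents

def index_with_selection_callback(all_entities, select, build_document):
    """Selection callback: the callback itself chooses which entities are
    fetched, so excluded content never enters the indexing run at all."""
    return [build_document(entity) for entity in select(all_entities)]

# Example with made-up entities: skip everything of type "ticker".
entities = [{"id": 1, "type": "article"}, {"id": 2, "type": "ticker"}]
build = lambda e: {"doc_id": e["id"]}
not_ticker = lambda e: e["type"] != "ticker"

docs_a = index_with_document_callback(entities, build, not_ticker)
docs_b = index_with_selection_callback(
    entities, lambda es: [e for e in es if not_ticker(e)], build
)
```

Both variants produce the same documents here; the practical difference is where the filtering cost is paid, after the queue is read or while it is being built.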
Comment #9
cpliakas commented
That's the missing piece! A custom callback would work in this instance.
Thanks for the clarification,
Chris