Having looked at the code and debugged some of my own modules, I have noticed that on every node update apachesolr triggers a reindex for that node.

In setups where we programmatically add a lot of short-lived information to additional CCK fields, it does not seem useful to have reindexing run constantly, or even to have the reindex request written to the DB.

My suggestion:

  • Provide an option to enable/disable reindex-on-update based on the content type.
  • Add options to periodically reindex content types in bulk operations, maybe even with a decreasing frequency based on age.
  • Provide an API for other modules to explicitly trigger an update for a given node (like in fuzzysearch).
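
To make the "decreasing frequency based on age" idea concrete, here is a rough sketch of the scheduling logic. It is written in Python rather than Drupal PHP so that it does not pretend to be the module's API; the schedule values and the names `REINDEX_SCHEDULE`, `reindex_interval` and `is_due` are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical schedule: (maximum node age, how often to reindex).
# Older nodes fall into later rows and are reindexed less often.
REINDEX_SCHEDULE = [
    (timedelta(days=7), timedelta(hours=1)),
    (timedelta(days=30), timedelta(days=1)),
    (timedelta(days=365), timedelta(days=7)),
]
# Interval used for nodes older than every row above (also an assumption).
FALLBACK_INTERVAL = timedelta(days=30)


def reindex_interval(created: datetime, now: datetime) -> timedelta:
    """Return the reindex interval for a node created at `created`."""
    age = now - created
    for max_age, interval in REINDEX_SCHEDULE:
        if age <= max_age:
            return interval
    return FALLBACK_INTERVAL


def is_due(created: datetime, last_indexed: datetime, now: datetime) -> bool:
    """True when the node has waited at least its age-based interval."""
    return now - last_indexed >= reindex_interval(created, now)
```

A cron run would then only queue the nodes for which `is_due` is true, instead of every node that was touched since the last run.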

Comments

JacobSingh’s picture

Status: Active » Closed (fixed)

Look at hook_apachesolr_update_index.
You should be able to do all of this there.

Also look at the per-content-type indexing setting on the settings pages.

Re-open with a new subject for a specific feature if this doesn't solve your issue.

Best,
Jacob

digi24’s picture

Status: Closed (fixed) » Active

Hi Jacob, thanks for your comments. However, the suggested solutions do not match the problem, so I will try to clarify it:

The point I am trying to make is not which content types to index, re-index or what information to add, but the re-indexation strategy used by the apachesolr module.

Right now, apachesolr_nodeapi triggers an UPDATE of the search_node table on every node_save. This might not be desirable in certain setups where nodes get updated frequently and the changes are only of minor importance.

I would like more granular control over the reindexing strategy used by this module. In my case, for certain content types, I want new nodes to be added to the search index as fast as possible. Updated nodes can wait; in fact, I would prefer to reindex updated nodes just once a week. The approach taken by major search engines is not that much different.

The current apachesolr_nodeapi logic, however, is not compatible with the wishes expressed above. New nodes share the same queue as updated nodes. hook_apachesolr_update_index is AFAIK "too late" to address the question of what gets updated and what does not.

The approach I am trying to choose and implement:

A: Re-Index settings based on content types
1. Create a form that lets the admin select which content types get reindexed on update.
2. In apachesolr_nodeapi, update section, compare $node->type and decide whether to update {apachesolr_search_node}
3. Create some cron-based functions and settings to reindex content types based on aspects other than the {apachesolr_search_node} updated field.

OR

B: Re-Index settings based on content field
1. On node_save, store re-index instructions in an additional field.
2. In apachesolr_nodeapi, check this field for further instructions, whether or not the node should be reindexed.
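
The decision that both approaches boil down to can be sketched roughly as follows. This is Python pseudologic, not Drupal code: `ReindexPolicy`, `should_queue` and the `"force"` / `"skip"` instruction values are hypothetical names, and the sketch assumes that inserts are always queued while updates depend on the content type (approach A) unless a per-node instruction overrides it (approach B).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ReindexPolicy:
    # Approach A: content types the admin has selected for reindex-on-update
    # (hypothetical setting, e.g. filled from a settings form).
    reindex_on_update: Set[str] = field(default_factory=set)

    def should_queue(self, node_type: str, op: str,
                     node_instruction: Optional[str] = None) -> bool:
        """Decide whether a node operation should mark the node for reindexing."""
        if op == "insert":
            # New nodes should reach the index as fast as possible.
            return True
        if op != "update":
            return False
        # Approach B: an instruction stored on the node itself wins.
        if node_instruction == "force":
            return True
        if node_instruction == "skip":
            return False
        # Approach A: fall back to the per-content-type setting.
        return node_type in self.reindex_on_update
```

With `ReindexPolicy(reindex_on_update={"article"})`, an update to a node of a frequently changing type that is not in the set would not touch the queue at all, and a separate cron job could pick those nodes up on its own schedule.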

anarchivist’s picture

@drupal24, have you considered building out some sort of reindexing settings based on whether a new revision of the node has been saved? This of course would rely on setting your content type to save revisions of content.

jpmckinney’s picture

Title: Limit Re-Indexing on Node-Update / granular control » Better control over apachesolr_search_node queue
Version: 6.x-1.x-dev » 7.x-1.x-dev

We can possibly add some hooks to allow developers to intercept changes to apachesolr_search_node - for example, to prevent the module from scheduling the reindexing of those nodes.

cpliakas’s picture

I have often heard this requirement for Search Lucene API. I added an alter hook to the luceneapi_node_get_queue_query() function that allows people to modify the actual indexing query object. It may or may not be a good solution, but I thought I would throw it out there as a simple idea, as it seems to add a lot of flexibility.

Thanks,
Chris

nick_vh’s picture

Status: Active » Closed (duplicate)

Duplicate of #966796: Separate indexer for multiple entity types

These tables are being handled in the multi-entity work, so it is probably best to take a look at that issue.

cpliakas’s picture

Status: Closed (duplicate) » Active

Taking a look at the patches in the multi-entity thread, I am not sure #966796: Separate indexer for multiple entity types accomplished the goal of this issue. More specifically, it seems that the function apachesolr_index_get_entities_to_index() builds a query that creates the list of queued content to be indexed. What if I wanted to implement a custom module that prevented selected entities from being indexed? I don't see how you can add custom conditions to that queue query. I haven't looked at the patch in too much detail, so I could definitely be missing something.

nick_vh’s picture

You can always use the indexer callback and, per document, allow or ignore that document for indexing.
However, if you feel a specific query to retrieve the content is needed, I would prefer to continue in the multi-entity thread and allow a new callback in the definition so that you can select your own content.
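
As a rough illustration of the per-document option, the sketch below runs every queued document through a callback that either allows or ignores it. It is plain Python and not the module's API; `index_queue`, `skip_ticker_updates` and the `"ticker"` content type are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Document = Dict[str, object]


def index_queue(queue: Iterable[Document],
                allow: Callable[[Document], bool]) -> Tuple[List[object], List[object]]:
    """Split queued documents into those sent to the index and those ignored."""
    indexed, ignored = [], []
    for doc in queue:
        if allow(doc):
            indexed.append(doc["id"])
        else:
            ignored.append(doc["id"])
    return indexed, ignored


def skip_ticker_updates(doc: Document) -> bool:
    """Example callback: ignore documents of one frequently updated type."""
    return doc["type"] != "ticker"
```

Note that this only filters documents after they have been selected from the queue; it does not reduce the number of rows written to or read from the queue, which is what a custom selection callback would address.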

cpliakas’s picture

Status: Active » Closed (duplicate)

That's the missing piece! A custom callback would work in this instance.

Thanks for the clarification,
Chris