I would like to be able to make some search indexes read-only, so that an external service can index data, and then that data can be used in Drupal via Search API. I've written a patch that adds a 'read only' flag to Search API index entities. This was straightforward, but involved touching several places in the code.

This issue comes out of #1064884: Add support for indexing non-entities, starting with comment #23.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

drunken monkey’s picture

Status: Active » Needs work

As written in #1064884-24: Add support for indexing non-entities, I still see some issues with this patch that need to be fixed or discussed.
Subscribing, for now.

becw’s picture

Status: Needs work » Active
FileSize
13.96 KB

Here's an updated version of the "read-only" patch from #1064884: Add support for indexing non-entities, with the following changes (comments quoted from comment 24 on issue #1064884):

  • Switching between read-only states (enabled / disabled) isn't really handled, as far as I see. You should delete the corresponding items from the search_api_item table when marking an index as read-only, and then re-insert all items when unsetting read-only again.

    For this, it would probably be a good idea to split the "Remember items to index" logic from SearchApiIndex::postCreate() into its own method. (And also use that in search_api_enable() – what the heck did I drink while writing that?)
  • Why remove the search_api_mark_dirty() function, instead of adapting it? I also don't really understand the code you replace it with – why use a merge query with that IF condition instead of just filtering on changed = 0 like I do? And why execute one query for each index instead of a single one for all?

I've added two methods to SearchApiIndex, ::queueItems() and ::dequeueItems(), and called them in SearchApiIndex::postCreate(), SearchApiIndex::postDelete(), search_api_index_enable(), and search_api_index_disable(). Is that what you intended when you said "split the 'Remember items to index' logic"?

I also removed my modifications to search_api_mark_dirty(); I had removed it because it was only called in one place, and refactored its functionality because using a merge query would ensure that items would get added to an index on update if they weren't already present. This should now be handled by the ::queueItems() and dequeue... methods. I would consider moving all "mark dirty" functionality into the index class itself, perhaps by passing an array of ids into SearchApiIndex::queueItems() (with similar logic to the $ids parameter of entity_load()).

  • I also don't really understand how the relation to the server is handled. As I see, you don't call $server->removeIndex() for read-only indexes, but otherwise there doesn't seem to be any changes. You can't call addIndex() when you won't call removeIndex(), as you can't know what the server does in those methods, and what structures it might set up.

    I guess this is just so the server doesn't delete the items, which might have been added by another program? Please figure out another way to do this. If necessary we can even say that servers will have to check for the index's read-only state themselves when removing it.

    Behaviour when moving a read-only index to another server is also unclear – would this even make sense? Or does the feature even make sense for normal servers? In some sense, this is more of a server feature than something on the index side, the index is (at least in your case) just a front for the server getting its data from elsewhere.

    In other cases, where you really just set a regular index on a regular server to read-only, I'd expect the server to remove its indexed items when the index is removed. So maybe the index should just be removed normally, and you can special-case this in your Sarnia server?

I didn't call SearchApiIndex::removeIndex() so that the server wouldn't delete items. I don't think it would make much sense for a read-only index to switch servers, but then again the point of Search API is to abstract the service details, so it's possible that someone might want to change the type of server where data was being indexed, but not change searches and views based on the Search API index. However, this functionality is pretty far out since the only service I'm working on right now is Solr.

In an ideal world, I would avoid creating a Sarnia server entirely and just use features present in Search API Solr's server.

  • While forbidding access to the "Status" tab might make sense, you'll still have to let users define indexed fields and processors, as those are used at search time, too. (This is also the answer to your @TODO in the index class.)

I've removed any read-only-flag-related access restrictions to the fields and processors tabs. This intertwining of index and search functionality was party of what made it very hard to separate the indexer and searcher functionalities of the index class.

Done.

  • The new field should be included in search_api_entity_property_info().

Done.

becw’s picture

Oops, forgot to update a method call. Fresh patch attached.

drunken monkey’s picture

I've added two methods to SearchApiIndex, ::queueItems() and ::dequeueItems(), and called them in SearchApiIndex::postCreate(), SearchApiIndex::postDelete(), search_api_index_enable(), and search_api_index_disable(). Is that what you intended when you said "split the 'Remember items to index' logic"?

Yes, that's what I meant. However, I moved the whole logic for index updates from the to hook_search_api_index_update(), as such reactions to property changes are all executed there. This also ensure this is called when someone doesn't use the SearchApiIndex::update() method or the search_api_index_[en/dis]able() functions.
Also, we should use similar code in hook_enable() and hook_disable().

I also removed my modifications to search_api_mark_dirty(); I had removed it because it was only called in one place, and refactored its functionality because using a merge query would ensure that items would get added to an index on update if they weren't already present. This should now be handled by the ::queueItems() and dequeue... methods. I would consider moving all "mark dirty" functionality into the index class itself, perhaps by passing an array of ids into SearchApiIndex::queueItems() (with similar logic to the $ids parameter of entity_load()).

The search_api_mark_dirty() is an API function that could also be used by other modules, so we shouldn't remove it unless necessary. And regarding items being already present in the item list: a) this now shouldn't be the case anyways; and b), we can just dequeue all items before queueing them, to make sure. This removes both the join and the @TODO from the method.

I've removed any read-only-flag-related access restrictions to the fields and processors tabs. This intertwining of index and search functionality was party of what made it very hard to separate the indexer and searcher functionalities of the index class.

I think instead of restricting access to the "Status" tab we should just explain things there.

See the attached patch for my suggestions. Sorry that there's also a bit of unrelated clean-up, but when I stumbled upon those things, I just thought I should fix them right away.

Hm, and now that I think about it – why not track dirty items even when the index is disabled or in read-only mode? Would certainly help if we don't have to re-index all items when re-enabling the index (or unsetting the read-only flag).

becw’s picture

Your changes make sense, and work in my testing.

why not track dirty items even when the index is disabled or in read-only mode? Would certainly help if we don't have to re-index all items when re-enabling the index (or unsetting the read-only flag).

I think that part of the idea behind a read-only index is that Search API may or may not know what's in the index. I agree that it makes sense not to re-index content when switching an index out of read-only mode, if possible, but I think that while an index is in read-only mode, and especially when a read-only index is created, no attempt should be made to add indexed items to {search_api_items}.

I don't think that a join is a terrible thing to have (there is an index on the 'item_id' and 'index_id' columns, after all) so perhaps we could:

  • leave the join in SearchApiIndex::queueItems() for entities with base tables
  • use SearchApiIndex::dequeueItems() for entities that we have to queue using entity_load(), and document the disadvantage
  • let hook_entity_update() use search_api_mark_dirty() on existing items (ie, leave it alone)
  • don't add entities to the queue of read-only indexes in hook_entity_insert()

Well, that sounds less good now I've written it.

drunken monkey’s picture

Well, that sounds less good now I've written it.

So, other suggestions? Or is the patch therefore OK as it is?

As indexes aren't re-enabled that often (and probably even more rarely set to read-only and back), I don't think the performance matters that much, whether we use a join or a simple db_delete() (which also will be pretty quick). Deleting and re-adding the items is probably just my taste, because this is then handled consistently for both cases (direct db query and entity_load()).

becw’s picture

Status: Active » Reviewed & tested by the community

I think the patch is ok as it is. Tracking indexed items while an index is in read-only mode adds too much complexity.

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

OK, I now committed this. Thanks for the input!
One line in the test file had to be deleted, but otherwise the tests ran fine, too.

Automatically closed -- issue fixed for 2 weeks with no activity.