Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
I use search_api_db, but I guess this is a generic issue and not related to the db implementation.
I have some hierarchical vocabulary and fields linking to this vocabulary indexed using the processor that also index parents for each term.
If I change the parent of a term I would expect that the search engine would reindex every content referring to that term but this is not happening.
Should this be handled by the search_api module? Or am I expected to write my own hooks to intercept that changes and invalidate contents?
Comment | File | Size | Author |
---|---|---|---|
#48 | 2007692-48--react_to_changes_in_related_entities.patch | 74.33 KB | drunken monkey |
|
Comments
Comment #1
drunken monkeyRegrettably, it's very hard to well-nigh impossible to keep changes of such second-hand data changes and mark the appropriate items as dirty. You can try to provide a patch if you feel it is (or should be) possible, at least for core fields, but even for those it would, while probably being possible, be a rather complex process. (The Hierarchy data alteration makes this even more complex, and would probably make a exact-match type of query impossible without hard-coding exceptions for it.)
But, I don't know, maybe we should still try to catch this in as many cases as we can … I agree that it is irritating, and could be hard to spot for site builders/admins.
On your own site, apart from using custom code you could also use Rules to fix this.
Comment #2
drunken monkeyComment #3
bago CreditAttribution: bago commentedI don't know search_api internals, so I may need some guidance...
Here is what I would do:
1) Use hooks to detect when a taxonomy term update involve its relations.
2) In this hook ask search_api for a list of indexed fields of type "taxonomy term"
3) For each field detect if the hierarchical processor is used or not
4) if the hierarchical processor is used run a search to find every content that links my "tid" on one of the fields (using OR search or running one search for each field)
5) mark all of that nodes as changed in the search_api.
I don't think this is easy, but it should work most time: am I missing anything?
Have you any hint about what are the correct ways to ask search_api the informations we need in that steps?
If this issues takes long to be fixed, what about implementing only #1 and add a set_message to tell the user that relations changes in taxonomies are not propagated to the search engine? (my main issue was that I found out that I had out of sync data after a lot of time and it took a while to understand the cause).
Comment #4
drunken monkeyYes, the code you describes should work as a custom fix for your site. You can use
$fields = $index->getFields()
to get a list of all indexed fields on an index,$fields[$field_key]['entity_type']
will then contain the entity type of the field (if it is present).However, if you mean this as a patch for inclusion in the Search API project, it's much too specific. Changing the taxonomy term name, or a linked node's title, etc., would have to be caught as well. Special-casing for the Hierarchy data alteration would probably be too specific, as well.
For that, putting a note prominently in the documentation (I'm currently working on a "common pitfalls" in the Getting started section) should probably suffice, too.
Comment #5
bago CreditAttribution: bago commentedI'm not sure I follow you about the "generic" vs "specific" stuff.
Isn't search_api indexing only tids? Why should it care if the path or the name of a term is changed?
(is this behaviour depending on the search_api backend of choice?)
What are the other "generic" cases where a taxonomy term change would require to reindex related entities (excluding the taxonomy_term itself)?
Comment #6
drunken monkeyThe problem is that you cannot just index term IDs: you can index the related term's name, the related term's parent's parent's vocabulary's description, a related node's author's mail adress, … – you get the idea. There are an infinite number of possibilities here, so just adding a fix for this tiny portion of use cases (changing the parents when you have the Hierarchy data alteration enabled and are indexing a term reference directly on the item) would do more harm than good.
We'd either need a generic solution, building on generic Entity API functionality, or we should just document this and refer users to using Rules or custom code if this is a common occurrence for their site.
Comment #7
bago CreditAttribution: bago commentedI didn't even know it was possible, but I see there is a "Add related fields" option in the field tab. Is that feature you are talking about? In that case you should reindex the content whenever a referenced term is changed.
I was proposing a specific "taxonomy term" solution because I thought that taxonomy term was already a special entity type in search api as I was seeing it as field type in my index (I'm not talking about what kind of entities can be indexed but what kind of fields can be indexed). Are you saying that an indexed field can be of any entity type (e.g: a referenced node)? In this case search_api does a lot more than I thought.
Comment #8
drunken monkeyYes, exactly. You can index any field of any related entity, and even of ones indirectly related. This makes reacting properly here very hard.
Of course, it wouldn't be too hard to keep track of fields from which entity types are indexed. But it would be very hard to reliably figure out what indexed nodes are actually affected by a, say, changed term name. And having to re-index all nodes when you change a single term name is, of course, out of the question.
Yes, exactly that is the case. For example, you'll see that the Author field is of type "User". And the same is possible for all other entity types, if there is a reference to them. (You can, e.g., also index the File entity of an image field on the node.)
Taxonomy terms have a bit of special treatment in the Views integration, but other than that the Search API supports all entity types the same way. Which makes it very powerful, but also makes it very hard to solve such issues while keeping it this generic.
Comment #9
drunken monkeyChanging the title to reflect the broader issue.
Comment #10
gilsbert CreditAttribution: gilsbert commentedHi Drunken!
I'm facing the same challenge: how to reindex contents that had a referenced taxonomy term changed?
I understood all the complexity.
I have two questions.
1 - Are you planning to develop an automatic process for this?
It would be fantastic if we could work on this but I have no idea how to implement this. Perhaps it would be need:
a) a dozen of hooks just to be able to see all possible changes over entities that might be used in search_api indexes (or can we write just one generic hook for it?)
b) a sequence of tests to check: if the changed entity is used in an index and what are the "contents" affected by the change that should be marked as dirty and with that reindexed!
If you have a trial code I'm avaliable to help you with the task.
2 - How can I use "rules" to make it work?
In case there is no better solution how could I use "rules" to integrate changes in an entity with search_api? I would need something similar of what I wrote before right? I would need to trigger all "contents" related to a changed taxonomy term (just an example) as dirty. Am I right?
Finally, how could I make this in a taxonomy with hierarchy option turned on?
Regards,
Gilsberty
Comment #11
drunken monkeyAt same point maybe/probably, but not in the near future, no. It's sadly far from trivial to do this, otherwise we'd have implemented it already.
From the top of my head, it might work like this:
field_tags
andauthor
are indexed, savetaxonomy_term
anduser
for this index (in some data strutcture). Probably also save all of the fields directly or indirectly indexed.hook_entity_update()
, don't just check for indexes of that type, but also of indexes who have that type saved as an indexed related entity type.However, I guess at some point this would just be a heuristic (not least because of the possible discrepancy between fields on the objects and Entity API metadata properties) and we'd still be sure to miss some edge cases – the "Index hierarchy" data alteration would be a prime example for that.
So, a lot of work and thinking to do here before this has a chance to be fixed. And even then, we'd a) still not catch all cases, probably, and b) it would not work for things other than entities anyways.
To answer your other question, with Rules: I'm honestly not sure what's possible with that and what isn't. If you are using the "Index hierarchy" data alteration, a proper reindexing of affected items sounds very much like it might be in the "isn't" category, though. So if you don't find a way to do it, I guess only some custom code would help you. Implement
hook_taxonomy_term_update()
(and, probably, thehook_taxonomy_term_delete()
, too), check whether relevant properties changed, find the items that are affected by the change and mark them as "dirty" withsearch_api_track_item_change()
.Comment #12
gilsbert CreditAttribution: gilsbert commentedHi @drunken monkey!
I would like to start this week a code trying to automatically integrate taxonomy changes and reindexing process.
I'm thinking about to write a code using the two hooks you suggested (term_update and term_delete). At first the code will be very simple and will only test "ids" (vid, tid, etc). We might work on it later and give to the code more smartness to test if the index really use the property/field changed (only in case of term_update of course).
Well please take a look on the pseudo code below.
Now I have doubts about my approach.
a) Are there those functions?
1 - search_api_vocabulary_in_use ( $vid )
2 - search_api_list_items_by_tid( $tid )
3 - search_api_vocabulary_with_hierarchy( $vid )
b) In case there is not one or more of the mentioned functions can we substitute by a similar one or can we write one?
Special note about "search_api_list_items_by_tid( $tid )": I believe that indexes directly based over taxonomy terms are already integrated with taxonomy changes just like indexes directly based over nodes are already integrated with content changes. Am I right?
In case I'm right this function should not list "items_id" from those indexes because it would be unnecessary.
Regards,
Gilsberty
Comment #13
drunken monkeyNo, none of these functions exist. Do the following to emulate them:
entity_type
set totaxonomy_term
, try to find their Field API field name (for properties directly on the search item, that's the Search API field identifier itself). Then use the Field API functions to find out the field's vocabulary. Use that to build an index mapping the vocabulary IDs to the indexes and indexed fields using them.search_api_query()
). Use the information from the previous step to do this: i.e., search for all indexes and then filter for all fields that use this vocabulary.Yes, changes for the items themselves are already taken into account. However, if you index taxonomy terms along with their parents/ancestors, the same will not be true for changes in those related terms. Only a change on the term object itself will trigger it to be reindexed automatically.
Comment #14
gilsbert CreditAttribution: gilsbert commentedOkies.
Can you help me with (1) and (2)?
It can be a draft / pseudo code that I will test and make work.
The rest I'm able to handle alone.
Regards,
Gilsberty
Comment #15
drunken monkey1.
2.
Comment #16
gilsbert CreditAttribution: gilsbert commentedThanks.
I will write a beta code and post it here for your considerations.
It might take sometime but I will do it. :-)
Regards,
Gilsberty
Comment #17
jtbayly CreditAttribution: jtbayly commentedYikes!
I didn't realize that changes to related entities (that are part of the index) didn't make a node get re-indexed.
-Joseph
Comment #18
Poieo CreditAttribution: Poieo commentedThis also comes into play with Commerce products/displays. If a product entity is updated directly, the product display node is not updated in the index.
Specifically, if I change a product to disabled directly (via admin/commerce/products/[pid]), rather than editing a product display node, the product is not removed from the search results.
Comment #19
drunken monkeyAs mentioned already, you can use Rules for now to work around the problem.
But of course I realize this is less than ideal. It's just a rather hard problem, and I'm chronically short on time.
Comment #20
Poieo CreditAttribution: Poieo commentedI've tried Rules. I have a rule that reacts on the event 'After updating an existing commerce product'. It fires the action 'Index and entity'. I have 'commerce-product' as the data selector and 'index immediately' is checked.
Viewing the Rules debug, everything fires correctly. However, the product display is not added/removed from the index. Do I need to add a condition to pull in the product display and use that as the data selector? If so, any thoughts on how to do that?
Comment #21
drunken monkeyThe data selector of course has to match the entity that is indexed in your search index. So if you're indexing product displays, yes, you need to match that in the data selector. No idea how to do that, though, sorry.
Comment #22
Poieo CreditAttribution: Poieo commentedI was able to get this working for updating a commerce product. Rule export below.
Comment #23
tobiberlinThis issue is quite old but referenced in https://www.drupal.org/docs/8/modules/search-api/getting-started/common-... - as I run into such a problem now I would like to "ping" this issue. Had there been any progress?
I think of writing a module for this so I would like to know if anyone started to work on this already? And if the maintainer(s) of search API would assist me when it comes to offering settings for each index and so on?!
Comment #24
drunken monkeyIf you can get it working in a clean enough manner, I'd even be interested in adding this to the module itself.
What would have been your approach?
I'm not aware of any work in this direction, no.
Comment #25
tobiberlinThis is a good question and I still think around this. It seems that the differences between Drupal 7 and Drupal 8 are too big so would you recommend to focus on Drupal 8 for that?
My thoughts about the approach:
- it would be helpful if the settings for this would be done individually on every search index and on bundle/ field level
- On every search index admin page the admin could find a page/ tab "Entity reference updates" (or somethig like that):
-- fieldsets for all bundles of the entity indexed in the search index (or just those bundles indexed)
-- checkboxes for each reference field within this bundle. Referenced entities in the checked fields would cause the parent entity to be re-indexed when updated
I would store these settings in a table in a way that on entity update it is easy to find out if the current entity is referenced from a specific field/ bundle. I will dive into entity reference logic to find out how this should be done without doubled stored data and complex/ big database queries to keep performance in mind.
Comment #26
tobiberlinWhat came into my mind: I just thought of Entity Reference fields (Drupal 7) but this logic would need to respect other relations as well: author, organic groups, terms, entity reference revisions (for paragraphs). Maybe for this something like what is called a plugin in Drupal 8 would be helpful?!
My aim would be to build a logic which could be easily used in Drupal 7 and 8 as I need this for Drupal 7.
Comment #27
drunken monkeyOK, with such a complex settings and storage infrastructure, a separate module might be best after all. I'd have looked for ways to solve this automatically, but I guess letting the user set all this up themselves makes it a lot easier to implement. Basically, you're making a small-scale "Rules" module clone for just the task of re-indexing items when other entities change.
Supporting all the different reference fields shouldn't be too much of a problem, since they all have to integrate with Entity API anyways. So, you can just check if a property's data type is an entity type (or a list thereof).
Restricting this to just the indexed fields would also be an option. I.e., if you want to re-index based on term changes, you have to index the
field_tags
field (or whatever other term reference field(s) you have). Then, check it under "Entity reference updates", maybe select that it's of type "Taxonomy term" (although that should be possible to determine automatically) and then every time a taxonomy term is changed you just search the index withcondition('field_tags', TERM_ID)
and re-index all results.Otherwise, you have the problem of matching the changed taxonomy term to the items that actually contain it – an entity field query might work for that, too, though, haven't really used those in D7.
Comment #28
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedHello guys!
We also have plenty of 'foreign' entity values participating in the search index, i.e. exactly what this issue encompasses. I would like to give it a shot with implementing at least the most basic general solution to this problem.
We run the 8.x-1.x version, so I am going to switch the version of this issue. I'd like to begin with a failing test (see it attached). It exposes the issue itself in automated/testable way. As you can see, I am targeting only 1 level of nesting and I hope the system should be smart enough to mark for re-index only if the referenced entity did effectively change one of the indexed fields (and should not mark for re-index if only non-indexed fields got changed).
I'll see if I manage to write a patch now that makes this test pass :)
Comment #29
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedLet's see how it comes.. Here's the patch that includes the test from above but now it also amplifies the search_api.module so it actually fulfills the behavior the test above expects from the module.
I am not going to include the interdiff because it's essentially same test with/without the application code.
I'd like to say a few notes about the implementation. First of all, whoever is the author of FieldsHelper service -- I owe a biiiig thank you!! The task came so easy since I was able to outsource all the low-level analysis of data types and relationships/nesting into that service.
Also, I chose to add it into the
TrackerPluginBase
. From one hand, this kind of behavior is probably any tracker should be doing (in case some classes inherit from the Basic (or even the base class itself)), so at first I felt tempted to make it part of theTrackerInterface
. But on the other hand, I couldn't say it's actually a public interface of a tracker. It feels more like an implementation detail, not part of a public interface. So eventually I settled this dilemma by making it part of the base class (very likely all trackers will have this behavior while still not going as far as making it public interface).The code came out pretty clean. I really was afraid the outcome will be much uglier. It reads pretty logical. I believe introducing support for more than 1 level of nesting also can be achieved within a reasonable effort. But I stopped here with just 1 level of nesting because I was just afraid of getting myself into a rabbit hole. And I sure would love to get some feedback from the rest of the community before I take it even further.
The best part though -- It really works!! :) On our Drupal instance this piece of code actually even picked up some relationship (and thus necessity to re-index in a spot where I myself wasn't even aware.. I noticed it does pick it while debugging in xdebug).
I ran full set of tests on localhost (except for Javascript). There was only 1 failure due to a module my installation was missing. So I am willing to throw in this patch and I am full of hope it's gonna come out with clean green light.
Comment #30
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedOk :) This one easy.. I just forgot to add the @group to the test class.
Comment #31
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedAlright, so here comes the first 'real world' feedback from our instance of using this patch. I made a minor error - I should have accounted for the fact that a field that is part of the search api index belongs to a different bundle than the one that just got updated. Attaching a revision to my last patch where it's taken care of.
Comment #32
drunken monkeyWow, great job! Thanks a lot for your work! That would really be a great problem to finally fix, or at least alleviate, and your patch looks like a great step in that direction.
My main concern with the patch in its current form is performance: This new code is gonna take a significant amount of time to run, especially for entities unrelated to the index, and it’s going to run every single time any entity is updated or deleted. For most smaller sites, this will probably be a reasonable trade-off (unless they don’t index any related entities’ properties, in which they wouldn’t benefit at all), but for larger sites it would often be an unreasonable burden on performance. So, at the very least, this functionality should be optional. (Not sure what the default would be, though.)
As almost all of the computations can actually be made beforehand, though, a better solution would probably be to compile in advance (for instance, when the indexed fields change, or maybe also for Field API field config changes) a list detailing changes to which fields of which entity types will lead to the need for re-indexing. That way, at entity CRUD time the check would (hopefully) be much faster – especially for entities that won’t trigger any re-indexing.
Don’t know if you have the time/energy left to try your hand at that as well?
Some more high-level comments (with the sheer size of the patch, there’s of course also lots of coding standards problems, typos, etc.):
In theory, this doesn’t have to be the case, if the same entity is referenced via multiple different (Field API) fields.
However, I can’t really think of a common scenario where that would be the case, so it’s probably fine to leave it like that. (This will probably never work for all scenarios anyways, so no need to make it complicated. On the other hand, with the pre-computation as suggested above, this would probably be fixed as well, as a side-effect.)
All of this is specific to content entity datasources and won’t work for (most) others.
The ID generation (which is even specific to our default
entity:*
datasources) could probably be adapted to work for other datasource plugins as well (viaDatasourceInterface::getItemId()
) – however, I don’t really have a solution for the entity query. If the datasource doesn’t contain entities, we can’t really find out which items reference this entity (unless the basic reference field is also indexed, in which case we could use a search query, I guess).So, maybe this would better live in the datasource plugin? We’d move as much as possible to the
DatasourcePluginBase
class, and implement the part that depends on internals in our defaultContentEntity
plugin. Other datasource plugins would then be free to implement this, too. (If we commit it like this, without pre-computation and caching, we would just have to make sure we’d make it easy to add that later without breaking any added API.)Anyways, thanks a lot again for this!
Comment #33
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedHello! :)
Yep, your suggestions make perfect sense. I also felt like doing unnecessary expensive stuff upon each entity change (traversing the index fields looking for any indication on whether any search items need to be rebuilt). Pre-computing it is definitely way better.
I am also much in agreement with your suggestion to split this code between Tracker and the Datasource. So Tracker does most of the abstract logic while data source deals with entity-specific details and thus other data sources at least have some abstract ability to "hook" into this thing and provide their alternative implementation for some stuff related to the inner nature of data that is indexed.
Just a few technical/implementation level questions: where do you prefer to store the "pre-computed" analysis of the search index fields? In
\Drupal::state()
or somewhere within the index config entity? Also, to be able to determine whenever that 'pre-computed' stuff needs to be recomputed again.. Maybe it's possible to leverage Cache API (cache-tags) in some way? Like if your index is "my_index" and it includes "field_entity_reference" and "field_taxonomies" as part of its fields, then we could listen for invalidation of any ofsearch_api_index:my_index
,field_config:field_entity_reference
,field_config:field_taxonomies
cache-tags and re-compute our bits at that time.Hm... having that said... Maybe this thing could rather be stored into cache instead of State API? :) Then we'd get the support for smart cache invalidation right out of the box!
Also, thank you for giving such warm welcome to this patch. I understand it affects relatively deeply the tracking layer, so it's a joy to see overall you are in agreement with the direction that I took in the patch. During the work days I am pretty busy at the day job and hardly I can sell "polishing" this patch in the middle of severe battle we are having here with our Drupal 8 instance. But I could give it a shot on the weekend. I'll try to re-distribute this logic better among Tracker and Datasource. I'll also do the pre-computation and will store it as cache (unless you oppose to this idea and/or propose something else). And I'll make sure this code follows coding standards.
Comment #34
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedHello, drunken monkey
Here I am with a new iteration on this patch. Conceptually it is the exactly the same thing, it was just rather restructured/broken down to more logical & manageable pieces. It was pretty clumsy all stuffed into one single method. I also executed on your suggestions. Additionally, I tried to utilize as much of pre-existing search_api functionality/helper methods as I could find (in order not to re-invent the wheel and to maintain the principle of single responsibility).
I've placed a bunch of TODO items. Of course, by no means they all should be committed but I rather just wanted to highlight the pieces I was unsure about the most (and in case you have any feedback/opinion on those, please do let me know). The only TODO I really address to myself and hope to fix any day soon is this one:
Some specific items I took care of:
field_first_level_entity_reference:entity:field_second_level_entity_reference:entity:field_some_field_that_is_in_the_index
I hope you still like where this thing is moving :) I haven't cleaned up the coding standards yet since I still have plenty of TODO items but I sure can take care of that once we come to the idea the patch is alright as far as functionality/performance goes.
I am looking forward to your feedback!
Comment #35
drunken monkeyThanks a lot again for the great work!
The location of the code now makes more sense, thanks! However, I’m still not completely convinced. In my mind, determining changed items isn’t the responsibility of the tracker, so I think none of the code should live there. However, it is true that a lot of the code could be reused for other datasources, so putting it all into the
ContentEntity
datasource plugin also doesn’t make sense.However, one point to note here is that (at least currently) the
search_api_entity_*()
hook implementations are (as the doc comments also note) not part of the module’s framework/API, but specific to theContentEntity
datasources.If the additional code added here will work for other datasources as well, that’s great, but we’d have to note that somehow, and ideally keep those two code blocks separate.
Maybe moving the actual bodies of those hooks to a new service would make sense? Then we could also move the code that’s now in the tracker there, and maybe separate the parts that are specific to
ContentEntity
datasources into helper methods or something.Anyways, all this “where should the code live to make sense” is hard for you to guess, since it’s mostly just my personal taste. Therefore, probably better just leave it as-is for now and I’ll later do a refactoring of the patch before committing once you’re satisfied. So, please just tell me when the patch is “done” as far as you’re concerned (or you won’t have any more time to work on it in the near future) and then I’ll take over. (Time I did some work, too, instead of just nagging. ;))
Some more remarks on the patch:
Probably, yes. Since we’ll only have a handful of entries, a separate cache table would be overkill, I’d say.
If we use a general-purpose cache bin, we should prefix the key with
search_api:
, though.Please use string keys here, as they make the code easier to read.
I think you misunderstand what cache contexts are for. They are used in some parts of the system (Render API mostly, I think) to determine the cache key (
$cid
) to be used for an item. So, they are for cache variation, not invalidation. (Unless I’m very mistaken – which can always happen, when caching is concerned.)Since I’m pretty sure we need just one cache entry per index, I think we can safely discard the cache contexts.
However, I’m also actually not sure whether we need to include tags and max-age from the individual fields/properties. I don’t really think we use anything that could change. (I don’t think it’s possible to change the target entity type of a field. And while the target bundles might change, it seems to me the code would also work without them – we’d just run an additional entity query in some cases that could be otherwise avoided. But on the other hand we’d avoid invalidating our cache in a lot of cases – so that might be worth it.)
Yes, I was also surprised to find that this is just hard-coded. We do have a constant for the datasource separator in the item ID, so why not for this?
Maybe we should just add such a constant. I don’t think we glue property paths back together often enough, though, to warrant a helper method. (I couldn’t, with a quick search, find a single existing instance.)
Yes, I agree, that could indeed be useful.
However, we now deprecated all our hooks and are using events instead, so this should use an event, too. (See
\Drupal\search_api\Event\SearchApiEvents
and #3023704: Convert hooks to events.)If you are unsure on events, you can leave that to me, too.
In any case, thanks a lot once more! This is really great work!
Comment #36
drunken monkeyOops, almost forgot: I made some tiny changes already, please incorporate those, too!
Comment #37
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedHello!
:) I've addressed all of your input (on top of your #36 patch) and removed the TODO items that you clarified. As a last piece, I've addressed multiple coding standards things (just within the code from this patch).
To be honest, I think yes, it'd be more efficient if you arrange/split the code further per your architecture/vision on the module. Just one additional thing I could suggest is to refactor some of it into a Trait so other trackers/datasources selectively can use the trait instead of 'forcing' this code into the
TrackerPluginBase
class, for example :) But it's all up to you.I really enjoy the layout of Search API architecture, very flexible, very concise. Even the fact eventually this patch (that encompasses a pretty big chunk of functionality) took such reasonable (small) amount of code is a strong indicator of a successful design, in my opinion. Without the correct design it would be not flexible enough to easily accommodate this new feature or I wouldn't be able to reuse so much of the existing code. I am quite grateful for having Search API in Drupal, so thank you for maintaining it and bringing the module to such success! =)
Let me just answer a few points from your input to get my point across:
Yep, at the moment when I left that TODO item I still wasn't clear how to correctly take into consideration the cache contexts. Today I looked up how Renderer service does it (
RenderCache
class, to be specific), and eventually they "flat" cache contexts out as additional piece to the$cid
(thus your cache effectively varies by cache contexts). So I've implemented this same approach onto our case here. I do not insist on committing it though (I'll cover on it more in the next paragraph).Yes, especially from practical standpoint your vision makes a lot of sense. My motivation to include it was to "play fair" -- output of this method actively depends on the field config, so let's actively depend on the field config in our cacheability.
Also, about cache max-age and cache contexts... Giving a proper treat to max-age is pretty inoffensive to performance. Anyway in practice max-age is "permanent" for the cacheability of any field config. As for cache contexts, you make valid concern that it will decrease a bit the cache hit ratio. Yet, especially with the addition of the "mapping foreign entities" event, not propagating max-age or cache-contexts into the cache could be a two-sided sword: what if someone's custom logic does depend per language (or expires after certain time), they would report that into the cacheability object, but the search API module would then simply ignore that stuff. Of course, a tiny private patch would solve the problem, but talking theoretically, it's no longer entirely in the hands of Search API module since the appearance of event, so I'd propose to keep max-age and cache-contexts. But it's not a hard requirement, just me trying to think in abstract way :)
Thank you for reverting so quickly on these patches. Sometimes digging through someone else's code is not so much fun :) I guess at this point I do deem the patch as ready and basically invite you to take over from here.
Comment #38
bucefal91 CreditAttribution: bucefal91 at Ocelot commented.. putting the right issue status ...
Comment #39
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedFixing up the failing test.
Also, while fixing it, I noticed the process of casting cache contexts to cache keys carries some cacheability along with it (you can have a look at the
CacheContextsManager::convertTokensToKeys()
for extra info). So I made it part of our cacheability too.Attaching interdiff and the lumpsum patch. :)
Comment #40
s_leu CreditAttribution: s_leu commentedThe latest patch breaks the node form for me. I have an index on group_content that also indexes fields of the nodes related in the group_content entities. After saving a node via the form i get this:
Comment #41
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedUnfortunately the error message is not enough to be able to pinpoint the problem in your case. I've reviewed the code trying to find some error in logic comparing to the information contained in your error message, but I couldn't identify anything.
I think we need to collect more information about your set up to be able to help. It seems your layout of fields has something in particular that I did not account for in my patch.
First of all, could you show what fields are part of your search index?
Also, please try running the patch I attach to this comment. It's the only clue I can think of currently. I've commented out a small piece of code that might be causing issue in your set up. So I wanna see if without it it runs, then I have some indirect support that my guess was right.
Comment #42
bucefal91 CreditAttribution: bucefal91 at Ocelot commentedFinally this patch got some internal review here at our company, Ocelot. The only thing that was picked up is a minor misspelling in my English in a comment. Attaching a fix for it.
Since we haven't heard yet from s_leu and I have no further clues on what's wrong in that Drupal installation. I guess I could bump it into "needs review" once again. s_leu, if you do come back with additional insights on your Drupal, I'll gladly review them.
Comment #43
drunken monkeyOnce again, thank you so much for your great work!
And great to hear that you found the architecture nice to work with – that’s of course always the goal.
I finally got time again to work on this myself, and due to my specific ideas about what should go where, how things should be formatted, etc., I fear there is a pretty huge amount of changes to that patch and the interdiff probably almost unusable. Sorry about that!
Anyways, almost all of the code itself looked pretty solid to me, so very few functional changes. (Though I still ended up breaking things and having to debug. Well, serves me well for being so pedantic. ;))
I fear that I don’t fully understand the code for gathering the “entity relationships map”, which might proof tricky going forward (i.e., maintaining this code in the future), but as there is solid test coverage I think it should be fine. It’s also better than for some of the code I myself wrote, and I have managed there, too.
One thing I’m not sure about is whether this covers all the different cases for entity changes that could affect the index:
field_tags:entity:name
and changing a tag’s name: this is obviously covered already, as the test shows.uid:entity:field_tags:entity:name
and then you change the tags for a user: does this also lead to re-indexing? (So, are all “stations in-between” also covered in the property path?)I do know that this doesn’t happen in D7, though, so it’s probably the case for D8 as well and another Core bug we have to work around. Just want to make sure we’re not doing more work than necessary.
If all these cases are covered, this should probably also be reflected in the test, to make sure they stay working. Otherwise, we might want to amend that (and then, of course, also add tests) – but maybe in a follow-up, if this would be too much work otherwise.
Speaking of which, I’m not completely convinced of the terminology we’re using: “foreign entities”, “entity relations”, etc. Can we brainstorm different names? E.g., how about we rename
getForeignEntityRelationsMap()
togetIndirectlyIndexedPropertiesMap()
and change the rest of the terminology accordingly? It’s really about the “foreign” entities’ properties being indexed, not the entities themselves. (Or, sometimes the entities themselves, too, I guess.)Not at all sure, though, what’s best here. All the concepts are a bit tricky to convey, and if you only have the method name to describe it (so it can’t be, like, ten words) it gets really tough to keep it both correct and concise (enough).
Also, one bug that I did find was this:
OK, it took me a bit to realize this myself, but this can never work: you are always reading cache with a different
$cid
than writing it, so you will never get a cache hit. If you’d want to use cache contexts, you’d basically need to do a preliminary first pass (every time) just to collect the contexts, and then only generate the data (if necessary) in a second pass. I really don’t think this will be worthwhile here, so I fear we have to abandon cache contexts – even though your reasoning does make sense that this information might depend on other factors.However, I don’t think any of this is set in stone. If people complain that they have this use case, we can still think of something. (I do think there would be a couple of options.)
I totally agree regarding max-age, in any case.
Any other open questions that I forgot to address?
I agree, more information from s_leu would be needed in order to properly investigate the problem they’re experiencing. Simply catching the exception (which would, of course, be much easier if Core correctly documented the exceptions that can be thrown) will at least help prevent this change from breaking the whole site as described, but it would of course be preferrable to fix the problem at the route, to have this new functionality working as well as possible.
The actually executed entity query when the exception occurs would also be very helpful.
In any case, it’s great if as many people as possible actually test this! This is as complex change, so we want to make sure it’s working properly in as many cases as possible, and doesn‘t break anyone’s site.
Comment #44
drunken monkey(Oh, I guess this kinda depends on #3044782: Increase PHP version requirement to 7.0, as it already uses some PHP 7 syntax. But probably also no big deal if we jump the gun on that by a week or so.)
Comment #45
drunken monkeyDoes anyone want to give further feedback on this?
This is a much-requested but very complex and tricky feature. The more testers there are, the more solid the end result will be.
(Also, we might still want to change those method names?)
Comment #46
marassa CreditAttribution: marassa commentedJust stumbled upon a similar problem - adding a tag to a Picture node properly reindexes the Picture per se but not the Artefact the Picture refers to. The patch applied to 8.x-1.15 just fine but it did not solve the problem. Only then I realized that I am using Reverse references in this index - it's the updated Picture that refers to the Artefact that should be reindexed and not vice versa.
I tried the patch the other way around (changed the name of an Artefact) and it worked as advertised: both the Artefact and the Pictures pointing to it were reindexed.
So the patch seems to support "direct" references but not the reverse ones...
Also, after updating the Artefact I found the following bizarre message in my dblog:
Could not load the following items on index Search pictures: "entity:node/0", "entity:node/1", "entity:node/2", "entity:node/3", "entity:node/4", "entity:node/5", "entity:node/6", "entity:node/7", "entity:node/8", "entity:node/9", "entity:node/10", "entity:node/11", "entity:node/12", "entity:node/13", "entity:node/14", "entity:node/15", "entity:node/16", "entity:node/17", "entity:node/18", "entity:node/19", "entity:node/20", "entity:node/21", "entity:node/22", "entity:node/23", "entity:node/24", "entity:node/25", "entity:node/26", "entity:node/27", "entity:node/28", "entity:node/29", "entity:node/30", "entity:node/31", "entity:node/32", "entity:node/33", "entity:node/34", "entity:node/35", "entity:node/36", "entity:node/37", "entity:node/38", "entity:node/39", "entity:node/40", "entity:node/41", "entity:node/42", "entity:node/43", "entity:node/44", "entity:node/45".
As you can see it counts from 0 to 45. Why 46 items? It's the exact number of Pictures referencing the Artefact I updated. As I already said, the Pictures were reindexed properly with the new Artefact name, in spite of the error in the log.
Hope this helps...
Comment #47
drunken monkeyGreat, thanks a lot for testing! With this, I was able to spot a bug in the code: apparently we re-use
$ids_to_reindex
ingetAffectedItemsForEntityChange()
, which would have led to various other bugs as well.Should be fixed with the attached revision, please test again!
Also, again, other testers/reviewers would be very welcome! If this isn’t tested thoroughly, on various setups, this is bound to break something!
(Incidentally, this probably means we should have even more test coverage, too. For a
$foreign_entity_relationship_map
with multiple applicable entries, for example (as that would previously have not worked properly, I think, due to use overwriting$ids_to_reindex
) and maybe with a mocklogger.channel.search_api
logger set that will throw exceptions when used (for anything above INFO, at least, if this would otherwise lead to incorrect fails). That would also have caught the bug.I fear I’m totally swamped, though, and unable to add those myself, currently.
Comment #48
drunken monkeyRe-roll.
Comment #49
marassa CreditAttribution: marassa commentedJust tested, the errors in the log are gone but can't see any change in behavior related to [not] reindexing reverse references. Am I missing something?
Comment #50
drunken monkeyAh, no, sorry. That is expected, supporting that would be a different problem, I guess. Since we don’t have any (framework-level) data on those references, it would have to be code specifically geared to the
ReverseEntityReferences
processor. MaybegetForeignEntityRelationsMap()
could offer an extension point for enabled processors? Would be a rather complex setup, though, probably with a dedicated interface for processors adding further relationships, thatReverseEntityReferences
would then need to implement.Certainly doable, but maybe out of scope for a first version of this functionality.
Comment #51
marassa CreditAttribution: marassa commentedQuite frankly, I forgot that ReverseEntityReferences is a separate processor within the module - the fields supplied by it are so seamlessly managed in "Manage fields for search index" that I did not realize that they are not "in the framework" in a similar way to "normal" references (admittedly did not dig deep enough into the module internals to verify). I will try to dig into it when I have time and maybe come up with something. I believe that as long as the module supports both "direct" and "reverse" references, both should be (ultimately) covered by this auto-reindexing functionality.
Comment #52
lukasss CreditAttribution: lukasss as a volunteer commented#48 not working for me
Comment #53
drunken monkey@ lukasss: Any chance you could be a tad more specific?
@ everyone: Come on, 24 followers and no-one willing to give this a try? If this works properly, it could be a huge step towards making Search API Just Work™. But it’s really pretty complex, so thorough testing is vital before committing this!
Comment #55
drunken monkeyOK, then I’ll just commit this and hope for the best (or, rather, wait for people to complain).
Many, many thanks once again, bucefal91!
Comment #56
penyaskitoThis broke the tests for 9.x. See https://www.drupal.org/pift-ci-job/1785819
Comment #57
3CWebDev CreditAttribution: 3CWebDev as a volunteer and commentedTested patch #46 and it resolved the issue that was driving me crazy for a bit. Now the search index is updated when a commerce product's referenced term is updated. Thanks!
Comment #59
sprocketman CreditAttribution: sprocketman commentedUnfortunately, I think the patch breaks some media library functionality. I've tried both the patch directly on the stable version of search_api and the dev version and both result in the same behavior. This was tested on Drupa 8.9.6.
Scenario:
The error generated in the logs is as follows:
Adding existing media items still works and users can still add new media within /admin/content/media, so there are work-arounds. Restoring the stable version of search_api fixes the issue, but of course leaves us back at not seeing updates in the index when referenced content is updated.
Comment #60
carolpettirossi CreditAttribution: carolpettirossi at Prosple commentedHi guys,
Thanks for the hard work on this. It seems a great feature and it would save my day(weeks tbh) to get this working properly.
I'm testing the dev version of this module but it doesn't work 100% in my case.
I have a site using group/group content module and the different indexes for each group content bundle.
For example:
advertiser_name
=>entity_id:entity:field_advertiser_name
parent_employer_advertiser_name
=>entity_id:entity:field_parent_employer:entity:field_advertiser_name
field_parent_employer
can reference Employer nodes only.organisation_alternative_name
=>entity_id:entity:field_organisation_name:entity:field_advertiser_name
field_organisation_name
can reference Employers or Institutions. But only employer content type has field_advertiser_nameWhen I update Advertiser name field, I would expect that all references would be reindexed. However, only the Employer index gets the update.
Initially I thought it could be related to
field_advertiser_name
being reused by Employer, Campus and Institution content types. So I added a field to the Opportunities index that it's only available and used by Employer content type.parent_employer_org_statement
=>entity_id:entity:field_parent_employer:entity:field_org_statement
I edited an Employer node, adding text to the
field_org_statement
field and I get the same result. Opportunities documents are not updated/re-indexed.I'm wondering if the scenarios above were considered to be covered and they are failing or if this is additional scope somehow.
Any feedback would be great :)
Thanks,
Carol
Comment #61
borutpiletic CreditAttribution: borutpiletic as a volunteer commentedI have also experienced
Error: Call to a member function getColumns() on bool
issues described in the #59 comment, but with referenced Paragraphs after the core upgrade to 8.9.7.Issue came from the query in
ContentEntity::getAffectedItemsForEntityChange
where reference entity IDs for re-indexing are collected.Patch from this issue fixed it: https://www.drupal.org/project/drupal/issues/2893747
Comment #62
maximpodorov CreditAttribution: maximpodorov commentedA switch is needed to add the possibility to turn this feature off as it can break existing sites (imaging loading 1 million nodes which reference a single term).
Ah, it has been done in https://www.drupal.org/node/3199906