If a single item (node type) that contains only a few nodes is selected under Search index -> Workflow -> Bundle filter, all items (of all node types) still need to be indexed according to Search index -> Status.
Could this be limited to only the selected nodes, so that not all items need to be indexed?
| Comment | File | Size | Author |
|---|---|---|---|
| #78 | 1184610-78--bundle_setting--interdiff.txt | 1.28 KB | drunken monkey |
| #78 | 1184610-78--bundle_setting.patch | 40.77 KB | drunken monkey |
Comments
Comment #1
drunken monkey
No, this can't really be limited, for various reasons:
- This would make the generic framework dependent on a certain data alteration.
- The data of nodes that are rejected is deleted from the search server when trying to index them. Removing them from the "needs indexing" list would cause the data not to be deleted if they have already been indexed.
- We'd then also have to check in `hook_entity_update()` whether to set an entity to "needs indexing", which could have a drastic effect on performance.
If you have an idea on how to circumvent all of this, I'd be glad to hear it, otherwise this is "won't fix".
Comment #2
fangel commented
[stepping in with my two cents] I'm well aware of the limitations, etc. However, when dealing with large data sets (in my case, a total of 25.000 nodes), it gets annoying when you have a search index that covers only a very small fragment (say 1%) of the entire list of nodes. If I need to reindex the index, instead of 5 cron runs, it now takes 500 cron runs to index the few relevant nodes.
So some sort of "I know what I'm doing, please pre-filter based on bundle" flag? Or at the very least, consider that the current solution has issues for larger sites, so it's worth trying to look at.
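The arithmetic behind those figures can be sketched as follows. The batch size of 50 items per cron run is an assumption chosen so that the result matches the numbers quoted above; the function name is hypothetical and not part of the Search API.

```python
import math


def cron_runs_needed(item_count: int, batch_size: int) -> int:
    """Return how many cron runs it takes to process item_count items."""
    return math.ceil(item_count / batch_size)


TOTAL_NODES = 25_000
RELEVANT_PERCENT = 1   # roughly 1% of the nodes belong to the indexed bundle
BATCH_SIZE = 50        # assumed number of items handled per cron run

relevant_nodes = TOTAL_NODES * RELEVANT_PERCENT // 100

# Only the relevant nodes are queued for indexing.
runs_filtered = cron_runs_needed(relevant_nodes, BATCH_SIZE)
# Every node is queued, even though most are rejected by the bundle filter.
runs_unfiltered = cron_runs_needed(TOTAL_NODES, BATCH_SIZE)

print(runs_filtered, runs_unfiltered)
```

Under these assumptions the cost of a reindex scales with the total number of nodes rather than with the number of nodes that end up in the index.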
Comment #3
drunken monkey
It should be pretty easy to write a tiny module yourself, which automatically sets certain items to "indexed" when they are changed, and/or when the user tells it to. Hardcoding such a thing for a specific use case into the Search API is neither necessary nor an option.
You could even make this a general module, where you select the index, a DB table, the key field and some field/value pairs, and which then automatically sets all matching rows to "indexed", e.g. during cron runs.
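The idea of marking matching rows as "indexed" can be sketched outside Drupal with an in-memory SQLite database. This is an illustration only, not module code: the table layouts are simplified, the function name is hypothetical, and it assumes that `changed = 0` in `search_api_item` means "indexed", which is the convention the queries posted later in this thread rely on.

```python
import sqlite3


def mark_excluded_as_indexed(conn, index_id, allowed_bundles):
    """Set changed = 0 for tracked nodes whose type is not in allowed_bundles.

    Returns the number of tracking rows that were updated.
    """
    if not allowed_bundles:
        # No bundle restriction: nothing is excluded.
        return 0
    placeholders = ",".join("?" for _ in allowed_bundles)
    sql = (
        "UPDATE search_api_item SET changed = 0 "
        "WHERE index_id = ? AND changed <> 0 AND item_id IN ("
        f"SELECT nid FROM node WHERE type NOT IN ({placeholders}))"
    )
    cur = conn.execute(sql, [index_id, *allowed_bundles])
    conn.commit()
    return cur.rowcount


# Simplified stand-ins for the node and tracking tables (assumed layout).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE search_api_item (item_id INTEGER, index_id INTEGER, changed INTEGER);
    """
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, "article"), (2, "page"), (3, "page"), (4, "article")],
)
# All four nodes start out as "needs indexing" (changed <> 0) for index 1.
conn.executemany(
    "INSERT INTO search_api_item VALUES (?, 1, 1)", [(1,), (2,), (3,), (4,)]
)

updated = mark_excluded_as_indexed(conn, 1, ["article"])
pending = [
    row[0]
    for row in conn.execute(
        "SELECT item_id FROM search_api_item WHERE changed <> 0 ORDER BY item_id"
    )
]
print(updated, pending)
```

As noted in comment #1, a shortcut like this skips the step that deletes rejected items from the search server, so it is only safe for items that have not been indexed before.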
Comment #4
berdir
If someone is interested, the following is what I added to one of our custom modules:
It adds an additional button to the status form of each search index if there are bundle filters set and then marks them as indexed if clicked.
Imho, this could even be added to the module directly, as nothing happens automatically and the problem with it is explained...
Comment #5
osopolar
It seems the first query in mymodule_mark_indexed() from #4 is not complete. In my case $entity_info['entity keys']['bundle'] is empty. Does this mean the entity is malformed (Missing bundle property on entity)?
I moved the form submit part out of mymodule_mark_indexed() into mymodule_mark_indexed_submit(), so that mymodule_mark_indexed() can be used in a drush script, and rewrote mymodule_mark_indexed().
[Edit]: I'll attach the code in the next post.
Comment #6
osopolar
Currently attached as module.
This also includes an option for the drush command search-api-index to call:
search-api-index indexname --skip-excluded.
Is there any interest in including this code in search_api? If so, a patch could be provided.
Comment #7
osopolar
Before executing the queries in mymodule_mark_indexed (comments #4 and #6) we should check if any bundles or items should be excluded.
...
Comment #8
kumar_naga commented
Hi osopolar,
Does the above module work when the cron job runs? Thanks a lot.
Comment #9
osopolar
The cron job will not use the --skip-excluded feature, so it will index everything. But in my case the most important part is the initial indexing, and this can be done with drush. The items will be marked as indexed, so cron won't touch them until they get updated.
Comment #10
giorgio79 commented
I am just facing this issue. I don't really understand why indexing cannot take into account a bundle, for example a content / entity type bundle. I have 500 000 nodes that should be excluded from the index and indexing as well :)
Here are the concerns raised by drunkenmonkey
Um, that's what bundle filters are for no? :)
When creating an index, if I excluded a content type via a bundle why would they ever be indexed?
Setting a field value is a drastic drain on performance? Having to index thousands of excluded nodes isn't? :)
I think this would be a great feature inside this module instead of a random side module.
PS I poked around a bit and it looks like Search API uses the core Search module that is why it cannot limit to node types, but this is coming in D8
#111744: Add configuration to exclude node types from search indexing
Comment #11
giorgio79 commented
No patch here, just a zip :)
Comment #12
osopolar
@giorgio79:
Comment #13
giorgio79 commented
Thanks osopolar.
Try disabling the core search module and you will see what I'm talking about.
I solved the problem with a db query:
UPDATE `search_api_item` SET changed = 0 WHERE item_id ....
Comment #14
jaydub commented
I agree that there should be the ability to exclude items from having to be loaded when indexing. In my case I want to create several indexes tailored towards specific data needs (fields, bundles, etc.) that can be used as Views backends. We will have hundreds of thousands of nodes of various types on the site, but in my current case I only care about an index of about 18K nodes that will be used with Views and facets.
It seems to me that if you end up changing any bundle filter settings then the easiest approach would be to delete and recreate the index anyway.
I've taken the example form alter code in #4 and folded it into the main module as a patch.
Comment #16
jaydub commented
#14: search_api-exclude_items-1184610-14.patch queued for re-testing.
Comment #18
jaydub commented
Switching versions to get this patch tested against current code (currently using it with the 1.4 release).
Comment #19
jaydub commented
#14: search_api-exclude_items-1184610-14.patch queued for re-testing.
Comment #20
drunken monkey
Sorry, but this won't happen, due to the reasons already explained.
If you have a site this large, enabling/writing a small additional module to deal with this problem shouldn't be such an issue for you. Also, since the "Index now" functionality is now integrated with the Batch API, initial indexing should be a matter of a single button click in any case.
Comment #21
osopolar
On large sites with large/complex entities, indexing will take several days. The function is very useful, especially during development.
@jaydub: I recommend using the module in #6 due to its drush support.
I take drunken monkey's comment as "won't fix", am I right?
Comment #22
klausi
Let's leave this one open for discussion.
The entity bundle is a very important property, and in our case it determines whether the Search API has to go through 80.000 nodes or just a much smaller subset when indexing.
So I propose that we include a bundle selection when an index is created, so that the user does not only select "node" but also "article" if the index is only supposed to operate on articles. I think we could implement that in a backwards compatible way, so that existing indexes are just assumed to be configured for all bundles of an entity type.
This would also be a desperately needed improvement for the field configuration form, where currently all fields of all bundles are thrown in. If we have a bundle restriction from the start only relevant fields could be shown.
Comment #23
klausi
Comment #24
drunken monkey
OK, this is a much broader feature request, but it's true that it would a) be clean(er), b) fix other issues, too, and c) be a sensible special case, as bundles are (or can be) really quite fundamental in the way entities are defined/handled.
However, since we don't only allow entities, but all kinds of items now in the Search API, I don't think this would be that easy to implement. First off, we'd have a major API change for datasource controllers, which would now also have to be able to report what bundles (if any) a certain type has, and take bundles into account when keeping track of the items to be indexed.
(Or would you just hard-code this for entity types and not give other item types the choice to add sub-types? Would make things a bit easier, I guess, but also less flexible.)
Would you allow excluding bundles instead of including them, too? (The "Only the selected"/"All except the selected" option often provided in such cases.) In any case I'd say we shouldn't allow editing of this setting after index creation, so without this option we'd probably be a bit restricted when new bundles are created. But on the other hand, it would make things even more complicated.
In any case, I'm not totally against adding this, but we should give everyone a chance to chime in here before that. (Also, I don't know when I'd have time to work on this.)
Comment #25
klausi
I would hardcode the bundle filter for entities only for now. If it proves to be useful/essential for other non-entity items as well, we can always refactor that out later.
For the bundle selection I would go with the usual Drupal way as you described. None selected means all (this is also the backwards compatibility layer). And I still hope we can sell this as an API addition instead of an API change.
Yep, I agree that the bundle should be fixed and cannot be edited later (same as the entity type cannot be edited later). Could be a feature request follow-up.
Don't feel pressured about this, I just want to get on the same page with this idea.
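The "none selected means all" rule proposed above can be sketched in a few lines. This is a language-neutral illustration, not Search API code; both function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def bundle_is_indexed(bundle: str, selected_bundles: Iterable[str]) -> bool:
    """Return True if items of this bundle belong in the index."""
    selected = set(selected_bundles)
    # An empty selection means "all bundles". This is also how an existing
    # index without any bundle setting would keep behaving as before.
    return not selected or bundle in selected


def items_to_track(
    items: List[Tuple[int, str]], selected_bundles: Iterable[str]
) -> List[int]:
    """Return the IDs of the (id, bundle) pairs that the index should track."""
    selected = list(selected_bundles)
    return [item_id for item_id, bundle in items if bundle_is_indexed(bundle, selected)]


nodes = [(1, "article"), (2, "page"), (3, "article")]

only_articles = items_to_track(nodes, ["article"])  # index restricted to articles
legacy_index = items_to_track(nodes, [])            # index without a bundle setting

print(only_articles, legacy_index)
```

Applying the check when items are added to the tracking list, rather than when they are indexed, is what would keep excluded bundles out of the "needs indexing" count in the first place.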
Comment #26
fangel commented
I can +1 the decision to move the bundle criterion from a filter to an index criterion, as it would immensely speed up a lot of our indexing, because we have multiple indexes over nodes, all with only a subset of bundles.
Comment #26.0
fangel commented
oops
Comment #27
drunken monkey
I've wondered this before, I think, but does anyone know whether, by Field API standards, the bundle of an entity can change? Does anyone have a link for that? Would be easy to take into account here, but of course we should still skip it if it's unnecessary.
Comment #28
berdir
See https://api.drupal.org/api/drupal/modules%21field%21field.attach.inc/fun... and the corresponding hook.
Comment #29
drunken monkey
@ Berdir: Ah, thanks, good to know! I hope this time I don't forget it again …
So, code to check for this would have to be added in the `trackItemChange()` method of the datasource. Apart from that, attached is a patch that has all the basic functionality and seems to work quite well. For all entity types with bundles, an option will be presented when creating an index to select which bundles should be included. The setting cannot be changed afterwards.
The following are still missing:
- `getMetadataWrapper()` to only include properties from the excluded bundles. Would likely need some API change, as the "add properties from all bundles" hack is currently hardcoded into the index entity class, for whatever reason.
But since the basic functionality is there and the code for those pieces will most likely not change much, tests and reviews are already welcome!
Comment #30
drunken monkey
Oh, it seems I misread your comment, Berdir (or, rather, should have checked that function). While it's certainly good to know that there's a way to change the identifier of a whole bundle (we definitely have to react to that!), my question was whether a single entity could change its bundle? Entity type specifics aside, just purely from the Field API's standpoint, could an article node become a basic page, or a taxonomy term change its vocabulary?
(It turned out that this isn't so simple to incorporate in the patch, due to a design flaw in a different (OK, you got me, this) module, so I'd be glad if the answer was "no".)
And another question: the `hook_entity_info()` is sadly rather vague on which keys are required and which are optional. Does anyone know whether `['entity keys']['id']` always has to be present?
In related news, check out the latest addition to #2044311: Change workflow plugin system for how we could make such functionality more easy to add in Drupal 8.
Comment #31
berdir
There is no official API to change the bundle of an entity, but nothing prevents it. There are some issues in 8.x related to that but nothing happened so far AFAIK.
Yes, the id is required I'd say. Depends a bit on your storage controller but it's certainly required for fieldable entities.
Comment #32
drunken monkey
And another problem, this time more severe: I didn't realize until now that the datasource controller's `getMetadataWrapper()` method doesn't get the index as an argument.
This decision made perfect sense at the time (most methods there don't get the index, as they're supposed to be decoupled), but now that there's per-index configuration for them, this is of course a serious shortcoming.
To be honest, I don't really see any way around this, without breaking the API in a major way (which won't happen). It's really a pity, since now the interface won't be completely adapted to the selected bundles after all, but at the moment I just don't see any way to accomplish this (without crazy hacks or major API breaks).
We should definitely fully get on board regarding the per-index configuration for datasource controllers in D8, and don't share them between indexes. (Cf. #2044419: Make datasource controllers more powerful.)
On a scale of one to ten, how important would you rate the adaption of the Fields form to the bundle restrictions? If it's a nine or ten, we could also come up with a different method of defining per-bundle settings that could achieve better integration. For example, let users just create new item types for sets of bundles, so the type's datasource controller automatically has the bundle filter information. (That way, we wouldn't even need any API changes, as far as I can see, as we wouldn't have to implement datasource controller config forms in D7.)
The UX for setting the bundle restrictions would of course be a bit worse then, but at least this functionality would be in.
On the whole, though, I like the per-index configuration variant much better, it's just a lot cleaner. I can't really tell how important the different aspects of this issue are for others, though.
@ Berdir: Wow, thanks a lot for the quick reply!
If it's not explicitly forbidden, I guess we have to support it. Damn … (Though I can't really imagine such a change going well, with the attached fields, etc.)
Comment #33
drunken monkey
OK, attached is a patch resolving all of the previous TODOs, except for the "Fields" tab, as said. I even added a few tests, which pass fine for me locally.
The problem I hit with the entity bundle change is that the `trackItemChange()` method only receives the item IDs, not the actual entities and therefore doesn't get access to the `original` property. This property doesn't seem to be included in entities returned from `entity_load()` – even worse, it doesn't seem to be specified/consistent whether `entity_load()` returns the old or the new entity version! It's a bit hard to believe, but it appears to really be the case.
Anyways, therefore I had to move this part of the code to `search_api_entity_update()`, as ugly as it is. It works, but I'm not really happy about it. If anyone can think of a better/cleaner solution, it would be very welcome!
Anyways, please test/review! Let's make sure this works properly before committing it!
(I was too sloppy with that in the last months, it seems …)
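The decision that has to be made when an entity's bundle changes can be sketched as below. This is an illustration of the logic only and not the code from the patch: the function name and the returned action labels are hypothetical, and it assumes both the old and the new bundle are available, as they would be in an update hook that can see the original entity.

```python
from typing import Collection


def tracking_action(old_bundle: str, new_bundle: str, index_bundles: Collection[str]) -> str:
    """Decide what an index restricted to index_bundles does on an entity update."""

    def tracked(bundle: str) -> bool:
        # An empty bundle setting means the index covers all bundles.
        return not index_bundles or bundle in index_bundles

    was_tracked = tracked(old_bundle)
    is_tracked = tracked(new_bundle)

    if was_tracked and is_tracked:
        return "mark_changed"     # still in the index, needs re-indexing
    if is_tracked:
        return "start_tracking"   # moved into an indexed bundle
    if was_tracked:
        return "stop_tracking"    # moved out: remove from tracking and from the server
    return "ignore"               # never concerned this index


article_index = {"article"}
actions = [
    tracking_action("article", "article", article_index),
    tracking_action("page", "article", article_index),
    tracking_action("article", "page", article_index),
    tracking_action("page", "page", article_index),
]
print(actions)
```

The "stop_tracking" case is the one that cannot be decided from an item ID alone, which is why the old bundle has to come from the update hook.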
Comment #34
drunken monkey
And once again I forgot to filter by `enabled` and `read_only` when loading the affected indexes.
Comment #35
user654 commented.
Comment #36
user654 commented
Hi drunken monkey,
did you have time to look at this issue? I know that you are putting your efforts into the Drupal 8 port :)
thanks
Comment #37
drunken monkey
Did you clear the cache? And did you update the Search API module itself shortly before?
Your errors seem unrelated, more due to a recent Search API update than this.
Also, attached is a re-roll that should apply to latest dev.
Comment #38
user654 commented.
Comment #39
drunken monkey
As said in #32, this is how it currently works, there is sadly no (easy/clean) way to also let this setting influence the Fields tab.
Does everything else work fine for you?
The main improvement in comparison with the "Bundle filter" data alteration is now that this setting will also influence the index status on the index's "View" tab, and that indexing will be a lot quicker if only a small subset of nodes are actually indexed.
Comment #40
drunken monkey
37: 1184610-37--bundle_setting.patch queued for re-testing.
Comment #41
Exploratus commented
This is exactly what we are looking for. It seems ridiculous that we need to cycle through every node when we have a bundle filter. I have 1,000,000 nodes and want to index 5,000, and my index needs to run through all 1,000,000 to pick 5,000? Huh?
Will test the patch.
Comment #42
Exploratus commented
Tried the patch. Once I select "content", it only shows the first 6 node types in the bundle list. I have about 15 content types, and only the first ones show. Basically, I am not able to select all my node types.
Comment #43
drunken monkey
Are you sure you couldn't scroll or something? I don't see anything in the code that could cause this.
Or, maybe also try what `entity_get_info('node')` returns, especially in the `bundles` key. Are all your bundles present there?
Comment #44
muschpusch commented
#42 is just a scrolling issue, and it works well. A bigger problem is that the patch still processes all nodes of all bundles :/ We are indexing the full view mode of some nodes, but Search API goes through all 30000 items, which ends in an out-of-memory error. The hackish approach from #5 and #6 is a good workaround until a better solution is found. Why not limit the items written into search_api_item?
Comment #45
drunken monkey
No, it doesn't, or at least shouldn't. Are you sure this is the case, after setting the bundles in the index settings?
That's exactly what we are doing in this patch.
Comment #46
muschpusch commented
Ok so search_api_item needs to get truncated?!?
Comment #48
Exploratus commented
I've been using this for three months, works great! Please commit! It's such a huge usability improvement. I am able to form different indexes, without having to cycle through all content. This is a HUGE improvement in usability and performance for large databases!
Comment #49
Tomáš Fejfar commented
I've tested it and it works. We use it to index only one content type. It's usable as-is with the patch from #37. Please merge this.
Comment #50
drunken monkey
No need to use strong language! ;)
But yes, I guess we can be reasonably sure at this point that the patch is working as intended and won't break anyone's site. I'll just have to accept the remaining risk, if I don't want the patch to lie around indefinitely.
So, committed.
Thanks everyone for your support, testing and feedback!
Comment #52
Tomáš Fejfar commented
Great, thanks! :)
Although I have bad news. There IS an issue with this I noticed today after using it for almost a month. Steps to reproduce:
It looks like immediate indexing "skips" the filtering part.
Second problem is that there is no way to change the "filter". It does not show on the index edit screen. I am not sure how common such thing is, but I only needed it now when I wanted to check why other content types are indexed.
I was not sure if I should open new issue for this or continue here, but it made sense to keep stuff together here.
Comment #53
kopeboy
I was very surprised to find other node types indexed as well.
I am using "Index items immediately" as well, so I guess I should uncheck that and index on cron? Otherwise I would have to add a field to the index (node type = string) and add the filter to all relevant Views :(
Comment #54
drunken monkey
Oh. Darn.
Thanks a lot for reporting! I can see this issue, now, too – good thing it was noticed before I created a new release, even though only after the commit.
The bad thing is that I can't immediately see how to fix this problem. I'll take a closer look next week and will try to come up with a fix.
As a workaround, in any case, you can additionally enable the "Bundle filter" data alteration to also filter out those items there.
That's by design. Being able to edit the selected bundles would have been even more complicated. You'll probably have to wait for D8 to fix this. Although, in theory, I guess we can also always add this functionality later, if enough people need it and there is a good way to do it.
Comment #55
Exploratus commented
I don't think I am seeing this problem, but I am NOT using "Index items immediately".
Comment #56
drunken monkey
The problem only appears when using "Index items immediately", so that's to be expected.
Comment #57
drunken monkey
OK, after a bit of analysis, it seems the attached patch should be the cleanest way to deal with this problem. It couldn't quite be fixed without another API change, but this one is at least only a minor API addition, which shouldn't cause trouble for any existing code (I hope).
Please test and verify it solves the problem for you!
(Also, if you are a developer and a bit familiar with the Search API, I'd be glad about code reviews – and/or other suggestions for how to solve this. The problem is simply that `search_api_track_item_insert()` always indexes all the items it gets for all indexes with "Index items immediately", and there's no way for the datasource to intervene.)
Caveat: The API addition only fixes this for entities, since they are (internally) always inserted separately, one at a time. The problem will remain for any other datasources in contrib, which use datasource configuration forms to restrict the set of items and which sometimes call `search_api_track_item_insert()` for several items at once. But I guess that is unlikely enough to ever happen to make it an acceptable shortfall.
Comment #58
drunken monkey
Comment #59
drunken monkey
Could someone please test and verify this fixes the problem?
Comment #60
kopeboy
Thanks for your fast fix!
Unfortunately I am just a site builder and don't have time (this week and next) to test this. In the meantime, I invite others more experienced than me to do so, please! :)
Comment #61
drunken monkey
Some testers needed, please!
I want to create a new release, and if I can't get a review for this soon I'll be forced to revert the original patch until we have a working version.
Comment #63
drunken monkey
Well, that's a bit of a bummer. Reverted the previous commit – attached is a patch combining it with the fix for the issue with "Index items immediately". Please test so we can get this into the next release!
Comment #64
Exploratus commented
I tested with "Index items immediately", and it seems to work. I don't really use it that way, just did it for testing, so my testing was a bit narrow.
Comment #65
heddn
Not having a read-only display of which bundles are indexed is a little hard, though. I just created the index and I can't remember if I selected news articles or pages! And there's no way to tell without deleting/re-adding.
Comment #66
heddn
Functionality-wise, indexing by bundle seems to do what it says. I only have content indexed from my selected bundle(s). I'm testing this using the db backend.
Comment #67
joelpittet
Sorry for my ignorance, I may not have understood the expected change of this patch. I thought it was about the bundle filter, but that already exists, so then I thought it was an improvement on that.
I am testing with a Solr backend.
Before and after this patch: I don't see a difference on the view index page? I cleared cache, cleared the index and re-indexed all my nodes 500 at a time.

This is my bundle filter setup:

Just a shot in the dark but is this just complicating things or maybe this just needs an IS update to clarify Steps to test this patch?
Comment #68
joelpittet
Oh, you can't edit the index after it's created. That's why I wasn't seeing anything. Sorry for the confusion. I guess it would be nice to be able to change the bundles if the base type is the same. Though I'm still trying to figure out if I should be using a filtered index or the filter ON the index for the bundle type, and the advantages/disadvantages of either.
Comment #69
joelpittet
Also want to echo @heddn's note
Comment #70
drunken monkey
Oh, you're right, that is a rather bad situation. The attached patch should fix this, adding the information both when viewing and editing an index.
Please test/review again so we can finally commit this!
Comment #71
DeFr commented
@drunken monkey: No attachment in comment #70 ?
Comment #72
heddn
Echoing @DeFr's comment. Waiting for a patch to test...
Comment #73
drunken monkey
Sorry, sorry! This happens way too often to me …
Comment #74
heddn
I'm not going to mark as RTBC, since I'd really like more than one set of eyes. But the functionality is a lot better now that I can tell what bundles are included in the index.
Comment #75
heddn
OK, I spoke a little too soon. When changing my search backend from solr to db or back again, this was the resulting error:
Comment #76
jjozwik commented
I have a site with several hundred thousand nodes, so as a workaround, when I need to reindex by node type, I run the following commands. Example for node type storefront:
drush sql-query "UPDATE search_api_item a, node b SET a.changed=b.changed where a.item_id=b.nid AND b.type='storefront'"
drush search_api_index
Comment #77
Exploratus commented
I had the same problem as @heddn.
Comment #78
drunken monkey
Oops, seems like a silly copy/paste error. Thanks for catching that!
Revised patch attached.
Thanks for testing, everyone!
Comment #79
heddn
RTBC for my use case. Tested switching from db to solr backend and that is no longer a problem. It shows the affected bundles. Looking good.
Comment #80
Exploratus commented
Looks good to me as well.
Comment #81
heddn
Two is enough for me.
Comment #82
drunken monkey
OK, great to hear, thanks for testing!
Finally committed this – let's hope this time it works as intended.
Comment #85
drunken monkey
Re-opening this briefly to point people to #2520684: Bundle-specific indexes with "Index items immediately" will index other bundles. It seems the latest version was still buggy after all. Please help me test there so we can fix this as soon as possible!
(Please do not comment here, I'll re-close this issue in a few days.)
Comment #86
drunken monkey