The initial goal of this module is to make multilingual content managed by the entity_translation.module searchable via the search_api.module. This basically works, but in a very crude way: the module just offers a search field containing the renderings of all translations of an entity, concatenated into one fulltext field. That works, but it has a number of serious drawbacks: there is no way to search only in a specific language (so search results can contain confusing mismatches), only the fully rendered entity can be searched (no field-specific et-multilingual search), and there are probably other quirks.
The short-term goal of this module is to provide a better basic way of supporting et-multilingual entities in Search API, by at least making it possible to distinguish the different translations in search and thereby making it language-aware: users should be able to search through and find content in exactly those languages they wish to search in.
For this to work, the rather stupid concatenation of all translations has to be refined.
But how?
There are multiple ways to do this - and you're welcome to add good ways to this list!
Currently, I see two realistic ways to support language-aware search for field translation / entity translation content via search_api:
(based on the initial discussion at http://drupal.org/node/1323168#comment-5340212 - I already skipped the ways that have been ruled out there)
A) Somehow declaring and using an entity/search property that can carry multiple values, one for each translation's full entity rendering, something along the lines of:
langvalue[0] = 'search_api_language-en my english content'
langvalue[1] = 'search_api_language-de mein deutscher inhalt'
langvalue[2] = 'search_api_language-es mi contenido español'
Not sure if that would work.
This would mean a medium amount of work, all contained in a contrib module extending Search API, and very little config work if done correctly. It would be clean and would scale with changes in a site's language settings. But is that possible (see threaded comment for details)?
=> Possible way?
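A minimal sketch of approach A, written in Python purely for illustration (the module itself would be PHP). The function name and the shape of the translations mapping are assumptions; only the search_api_language-LANGCODE prefix is taken from the example above.

```python
def build_lang_values(translations):
    """Return one prefixed fulltext value per translation.

    `translations` maps a language code to the rendered entity text.
    This input shape is hypothetical, not a Search API structure.
    """
    return [
        f"search_api_language-{langcode} {text}"
        for langcode, text in translations.items()
    ]


langvalue = build_lang_values({
    "en": "my english content",
    "de": "mein deutscher inhalt",
    "es": "mi contenido español",
})
```

A language-restricted search would then have to require the matching prefix token in addition to the search keys. Whether the search backend treats that prefix as a single token is exactly the open question raised above.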
B) Adding additional language-aware item types for each entity type, by slightly customizing the Search API data source controller (based on SearchApiEntityDataSourceController) for the new item types. This would allow the data source to "know" about the translations and have their search item IDs carry the language code, for example.
=> Possible Way?
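To make approach B a bit more concrete, here is a small illustrative sketch in Python (not Drupal code). The ENTITY_ID/LANGCODE format and all function names are assumptions chosen for the example; the text above only says that item IDs could carry the language code.

```python
def make_item_id(entity_id, langcode):
    """Combine entity ID and language code into one search item ID."""
    return f"{entity_id}/{langcode}"


def parse_item_id(item_id):
    """Split a combined item ID back into (entity_id, langcode)."""
    entity_id, langcode = item_id.rsplit("/", 1)
    return int(entity_id), langcode


def item_ids_for(entity_id, langcodes):
    """One item ID per translation of the entity."""
    return [make_item_id(entity_id, langcode) for langcode in langcodes]


ids = item_ids_for(42, ["en", "de"])
```

With IDs like these, the data source can load the entity by the numeric part and render it in the language given by the second part.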
Let us discuss those possible ways towards a language-aware support of Entity Translations in Search API.
It would help a lot to know from you ...
* Do you need this language-awareness for this combination of modules at all?
* What's your preference? Why?
* Do you see other ways of achieving this goal?
* Could you help with the implementation?
Comment | File | Size | Author |
---|---|---|---|
#18 | mysearch_module.txt | 1.75 KB | ThomasH |
Comments
Comment #0.0
danielnolde commented:
corrected tag typo
Comment #1
Blackice2999 commented:
Hi DanielNolde,
sounds good. But what about Solr-based searches, if we want different stemming filters for different languages?
de =>
I think a solution could be to split the values out into separate fields. Is this possible?
regards
Dennis
Comment #2
plach commented:
Thanks Daniel, I appreciate your efforts a lot! I'll try to chime in ASAP. At the moment I'm pretty busy with #1282018: Improve UX of language-aware entity forms, which is going to revamp the entity translation UI.
Comment #3
danielnolde commented:
Hey Dennis, as far as I understand with my limited Solr knowledge, one Solr index can only have one configuration set (including stemming filters etc.), so one would have to create a separate search index with its own Solr config for each language ... not very scalable. But you should involve a real Apache Solr expert on this, since I'm not sure at all...
I'm personally looking for a way to have language-dependent Solr settings, too. But having one individual index for each site language is not practical, not flexible and not scalable. Thus, I think we should avoid this for Search API Entity Translation.
Does anyone have an idea how language-specific Solr config for multiple languages in parallel could be done within one Solr index?
Okay, this got quite Solr-specific, which is not our primary goal here, but good question anyway :)
Comment #4
Blackice2999 commented:
Hi Daniel,
you are right, multiple indexes are not an option. But independent of Solr, why can't we split every language out into its own "pseudo" field? In the search index, title becomes a language-suffixed field like title_de or title_en. Sure, we also need to respect this pseudo field on the query side, but I think this would be the best solution.
@see: http://drupal.org/node/1210810#comment-5402130
Solr specific:
If we use separate fields, we gain the ability to use a separate TokenFilter on every field. Short example:
Fieldtype Definition on schema.xml
Field Definition on schema.xml
The Magic DynamicField part
I think this will be the best solution for indexing, but it brings new problems on the query side: we always need to add the language code to the field names. In return, we gain the power of multiple TokenFilters.
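The per-language "pseudo" fields and the extra work on the query side can be sketched as follows. This is an illustrative Python sketch, not Search API or Solr code; the function names and the input shape are assumptions, and only the title_de / title_en naming comes from the comment.

```python
def index_field_name(field, langcode):
    """Name of the per-language pseudo field, e.g. title_de."""
    return f"{field}_{langcode}"


def build_document(translations):
    """Flatten {langcode: {field: value}} into one index document."""
    doc = {}
    for langcode, fields in translations.items():
        for field, value in fields.items():
            doc[index_field_name(field, langcode)] = value
    return doc


def rewrite_query_fields(fields, langcode):
    """Query side: point each searched field at its language variant."""
    return [index_field_name(field, langcode) for field in fields]


doc = build_document({
    "en": {"title": "Hello"},
    "de": {"title": "Hallo"},
})
```

On the Solr side, a dynamic field pattern per language would then decide which analysis chain each of these fields gets.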
A very good example for the solr part is the apachesolr_multilanguage module.
regards
Dennis
Comment #5
danielnolde commented:
Interesting third alternative, Dennis.
Sounds a little quirky on the Drupal side, meaning:
* the site builder has to add a field for every language :-(
* if the language settings change, the search index settings have to be changed, too :-(
* if the language settings change, the search index would have to be rebuilt :-(
* the search has to be altered to contain the langcode in the search field name
* practically works only with a fulltext field (doing this with more than one field is insane)
But the prospect of being able to use a different config for each language's field on the search server side (at least in Solr) may be worth considering.
Can you point us to some solr documentation of exactly what can be configured field-dependent, and how?
Comment #6
Blackice2999 commented:
Hi Daniel,
not exactly. On the Drupal field side we have the normal field handling (field translation);
we only need to split them into separate fields on the search index side:
Solr has no explicit language-specific settings. That is our problem. There are multiple concepts for realizing multilingual search:
* multiple indexes (every index has its own Solr schema.xml): hard to configure, hard to maintain the documents
* multiple fields: currently the best solution because of the power of dynamicFields, as described in #4
I have no idea how we could make this better than it is. But currently, multilingual search in Solr is a big problem if you use language-dependent TokenFilters.
Comment #7
danielnolde commented:
Ah, okay, so the multiplication of every field for every active language would occur only in the layers of Search API and Apache Solr, where they are realized as "dynamic fields" (Solr lingo for what, exactly?).
So I guess this suggestion is rather a sub-"branch" of way B): implementing a custom entity-based data source for translation/language-aware fields.
Sounds interesting.
Comment #8
danielnolde commented:
For an introduction to the module and more background, also see the blog post on our website: http://wunderkraut.com/en/blog/make-search-api-work-entity-translation
Comment #9
Nick_vh commented:
There is quite some work to do in the schema before all these languages can be supported.
Some options :
1. You can make some kind of generator that will make the schema.xml for you depending on your settings and languages. You surely don't want all fields to be multilingual.
With a generator you can even supply the user with a bunch of extra files making i18n support easier. (stopwords_LANGUAGE.txt, protwords_LANGUAGE.txt, ...)
2. Also interesting to add to this discussion is : http://wiki.apache.org/solr/LanguageDetection. If I understand this correctly you will hardly have to adjust your schema, the field will detect the language used. Interesting isn't it?
3. You could make a dynamic field content_* and, in Drupal, actually fill it as content_en, content_fr, content_es etc.
Schema example of specifying a language for a fieldType
Comment #10
danielnolde commented:
To proceed, we need the feedback of Search API maintainer Thomas Seidl, so we can decide how to build full-blown support for language-aware search on multilingual fields based on Entity Translation. Thomas, it would be great if you could have a look at this thread and provide some hints as to which of the ways discussed would be best in terms of search_api's internal architecture. Thanks!
Comment #11
beanluc commented:
According to http://wunderkraut.com/en/blog/make-search-api-work-entity-translation , as of May 1, Thomas Seidl *has* contributed thoughts:
I don't know enough about the Search API to have much idea what that implementation might look like, but, here's to progress! This module is critical to the success of Entity Translation. Three cheers!
Comment #12
Berdir commented:
I have built a custom version for a project I've been working on, that exposes each field as separate additional fields with a language-specific prefix. However, this only works for us because we a) have a manually built search form where we can automatically switch to the language-specific version of each field, b) do not need a language-aware fulltext search and c) only have two languages. I don't want to imagine the already huge field list when you have 5+ languages ;)
Comment #13
danielnolde commented:
To be honest, I don't feel knowledgeable enough about Search API itself to work this out and present a solution on my own. We should form a "task force" working on this together.
Who can contribute substantially knowledge-wise (Search API, Solr) and time-wise?
I'll also try to get Thomas Seidl working on this (he's been away from active Search API development for some time for health reasons, but this may be something he'd be interested in contributing to as well).
Comment #14
ThomasH commented:
How about the option of using a Solr core per language? What would speak against this?
Comment #15
cpliakas commented:
@ThomasH,
That is a very messy solution IMHO. In addition to having a lot of overhead to manage the connections to multiple indexes, it makes it much more difficult to do something like "search French AND language neutral content" since you would have to do a federated search across multiple indexes. You would also have to implement the logic to select the appropriate index(es) based on the appropriate language. In addition, let's say that only the body field is translated into 40 different languages. If the node references a taxonomy term, you would have to re-index that node 40 times across the various indexes. To me this introduces a lot of overhead that would make this solution unattractive.
Thanks,
Chris
Comment #16
beanluc commented:
Besides:
That sounds like "Solr Entity Translation", not "Search API Entity Translation". The goal here is to extend Search API.
Comment #17
ThomasH commented:
Ok, next question... How are we going to go ahead and "add" these extra fields if we cannot hook into "search_api_extract_fields"? Or am I missing something?
Comment #18
ThomasH commented:
Here is how it is currently implemented. It moves from having one field with all data in it to separate fields, which are defined in mysearch_module_translatables(). I am not overly excited about the solution, and it seems a bit of a mess, so input would be greatly appreciated.
Comment #19
Sylvain_G commented:
I did some Solr integration in both D6 and D7; here is my feedback.
To do correct indexing per language you must use a different index, to allow stemming to work as it should. I usually set up one index per core and index language-neutral content in all indexes to avoid multi-index aggregation. Pretty simple to do in D7, a pain in the a** to do in D6.
AFAIK entity_translation does not work well with search_api; Solr does not seem to see those translated nodes. The solution is quite tempting and elegant, but as long as it does not make it into search_api, it is not viable.
Comment #20
GaëlG commented:
For another way to handle i18n search, see http://drupal.org/sandbox/gaelg/1826028.
Comment #21
GaëlG commented:
What do you think about this alternative way? Should I publish a separate module, and if so, under which name? Or should it be included somehow in this module?
Comment #22
derhasi commented:
I just stumbled upon this issue, as I'm looking for a similar approach for dealing with taxonomy-dependent data in Search API.
In my opinion splitting content in different language specific documents would be a good and clean way. Especially for sites with a lot of languages this might be more feasible.
When it comes to different Solr configurations and so different indexes the Search API multi-index searches module might be a solution. But I haven't tried it out so far, so it's only a clue ;)
Comment #23
drunken monkey commented:
As written in #1323168-7: Add support for translated fields, I think writing a new datasource controller that provides multiple items for each single entity, one for each language, would be the best option here. You could then filter on the language, like it's already possible if you use Content Translation, or index only items with a certain language. In effect, everything would be like you'd have used Content Translation instead of Entity Translation, at least Search API-wise.
It wouldn't be particularly easy to implement, I fear, but it should be well possible with the Search API in its current form and bring considerable benefits.
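As a rough illustration of "multiple items for each single entity, one for each language", here is a small Python sketch. It is not a Search API datasource controller; the function names and the input shape are assumptions made for the example.

```python
def language_items(entity_id, translations):
    """Yield one search item per translation of a single entity.

    `translations` maps language codes to field values
    (a hypothetical input shape).
    """
    for langcode, fields in translations.items():
        item = dict(fields)
        item["entity_id"] = entity_id
        item["language"] = langcode
        yield item


def filter_by_language(items, langcode):
    """Keep only the items indexed for one language."""
    return [item for item in items if item["language"] == langcode]


items = list(language_items(7, {
    "en": {"title": "Hello"},
    "de": {"title": "Hallo"},
}))
german = filter_by_language(items, "de")
```

Because each item holds the values of exactly one language, a single item can never match conditions on two languages at once.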
However, it of course also has some drawbacks. For example, you wouldn't be able to filter on the values of two languages at once. I don't know if that's a reasonably common use case, though.
Also, it would have to be decided where to include LANGUAGE_NONE data: in a separate item or in all of the language-specific items?
As for Solr, that's an entirely different issue, and possibly an even harder one. Let me brainstorm here for a second (as I haven't found a Solr-specific issue?):
* With schema_extra_fields.xml it would be easy to add additional, language-aware dynamic fields to the schema, like tm-en_*. You could then set these up with different types with language-specific processing. (You'd probably need to do that for text and string fields.) For a contrib module, you could write a generator which automatically produces a suitable schema_extra_fields.xml file for a specific site.
* E.g., there'd be tm-en_title, tm-de_title and tm_title, and copyField directives to copy all content from the former two into the last. When filtering for the title, it would use the first (tm-en_title) if the query also filters for language = en, tm-de_title for German-only queries and the last one if no language filter is present.
Solr's language detection feature that Nick linked in #9 sounds very interesting, too. Haven't worked with it yet, though, so I can't really say how well it works in practice. But in case it does a good job, I guess it would solve at least some of the problems. (Filtering on only field values of a specific language wouldn't be possible, though, as far as I can see.)
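Going back to the tm-en_title / tm-de_title / tm_title idea from the brainstorm, the field-selection rule could look roughly like this. It is an illustrative Python sketch, not Search API Solr code; the function names are made up, and falling back to the catch-all field when a query filters for several languages at once is an assumption the comment does not cover.

```python
def solr_field(name, langcode=None, prefix="tm"):
    """Build tm-en_title style names, or tm_title without a language."""
    if langcode is None:
        return f"{prefix}_{name}"
    return f"{prefix}-{langcode}_{name}"


def field_for_query(name, languages=()):
    """Pick the Solr field to filter on for a given language filter.

    Exactly one language: use its language-specific field.
    No filter, or several languages (assumption): use the
    catch-all field that the copyField directives fill.
    """
    languages = list(languages)
    if len(languages) == 1:
        return solr_field(name, languages[0])
    return solr_field(name)
```

For example, field_for_query("title", ["de"]) selects the German field, while field_for_query("title") selects the combined one.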
Sorry this is coming so late, it's hard to keep track of all the countless issues in my queue. I hope it helps getting this on track again, though. Language-awareness is still one of the major pain points of the Search API.
Comment #24
das-peter commented:
This sounds familiar to me ;) Maybe we could base such a datasource controller on the one from Search API Denormalized Entity Index.
Comment #25
drunken monkey commented:
I've created a little sandbox with a prototype of how I'd think entity translation support for the Search API could work. The module provides multilingual versions of all entity types that support translation, allows per-index configuration and will provide separate versions for all available languages of each entity.
The module isn't done yet: entity CRUD hook implementations, the admin UI for the index settings and some magic to have fields be indexed and displayed in the right language (probably with hook_search_api_index_items_alter(), since we can't do it in an index-specific way in the datasource controller) are still missing. However, before proceeding I wanted to let people with more knowledge about D7 translation/language functionality (and maybe Field API) vet my approach to see if it makes sense.
Especially interesting is the datasource controller's getAllIndexItemIds() method (and, accordingly, the search_api_et_item_languages() function), which determines which language versions will be created for each entity.
Even more interesting would be the getMetadataWrapper() method, but that's highly Entity API-specific, which probably means others won't fare any better than me trying to make sense of it. However, it seems (at least for nodes) setting $node->language to a certain language will make the wrapper return all fields in that language. God (and maybe fago) knows what will work for other types …
(The current functionality of this module is missing from my sandbox for now, but could easily be changed into a data alteration (which would be the proper way to add that property) and then re-added.)
Reviews welcome!
PS: Does anyone have any test data for a multilingual site with entity translations?
I, for one, don't really know what is a typical setup, and it would also take some time since Devel Generate doesn't seem to support entity translations.
Comment #26
skipyT commented:
Hi,
We need a solution for search api to search in translations on our current project. Your sandbox looks ok, I will try with my coworkers to work on it next week.
I've cloned the sandbox project locally. We are lucky: we have translated content for more than 10 languages, so we can test the module on real data. I looked for you on #drupal-contribute to discuss the sandbox but didn't find you there. Once we have a working version, how shall we push it to your sandbox? As a patch file? I would propose another branch, perhaps, to keep the history of the commits, if you agree.
Comment #27
drunken monkey commented:
I've talked with this project's maintainer, Daniel Nolde, and he'll soon take a look at my code. If he approves, I'll finish a first version and would then suspect that work would get committed into this project. If you want to help, or later propose additions, you can probably just provide patches.
Only if you want to do that before this gets committed into the main project, then please warn me before you start to work on a certain area. Otherwise, we might both work on implementing the same thing.
Comment #28
maciej.zgadzaj commented:
Thomas,
As skipyT has mentioned, we're looking for functionality provided by this module (meaning your sandbox v2) to be used in our current project. I was wondering what is its current state, how far have you managed to progress compared to what is currently available in your sandbox, and how far is it from that first version you have mentioned?
I'm asking as we'd like to help and contribute, but assuming that you're not far away from that first version, or currently working on it, it might be better for us to wait for it being pushed, and then start from there?
Thoughts?
Comment #29
drunken monkey commented:
I'm still waiting for Daniel; my current work state can be seen in the repository. I haven't worked on it since.
Comment #30
maciej.zgadzaj commented:
Thanks Thomas.
Meanwhile I have done a quick variation of the original Search API Entity Translation module - providing new Search API fields for each translatable field on each translatable entity - which seems to work fine with solr's dynamic fields and language-based field types - available in my sandbox if anyone's interested.
Now let's have a look at that v2...
Comment #31
rosinegrean commented:
Hello,
While Maciej's variation of the module still does not solve the actual problem (because results will be returned if you search in any of the enabled languages), having the dynamic fields on a per-language basis helps with defining tokenizers for Asian languages, which in many cases can be very useful.
Comment #32
maciej.zgadzaj commented:
Ok, as promised I've started playing with drunken monkey's sandboxed Search API Entity Translation v2 and posted a few new issues with patches providing missing functionality, which are now waiting for merge/further discussion.
All those changes are also available in my forked sandbox Search API Entity Translation v2b, which already seems to work fine and return expected results.
Comment #33
klonos commented:
Thanx guys. It really sounds like you're nearly there! I can't wait till the progress is merged back to the actual project here.
edit/note-to-self: ...coming from #1335394: Search API integration
Comment #34
maciej.zgadzaj commented:
A quick update on this - last week I created my own sandbox (fork of drunken monkey's one) where I have been pushing all the latest changes/new features recently - Search API Entity Translation v2b - anyone feel like giving it a test ride?
Comment #35
danielnolde commented:
whoa, a lot of forked versions to inspect <:} - sorry for my late awakening in this issue, guys.
One word ahead: Ideally and with D8 in mind, this very Search API ET module will become obsolete, because Search API itself will become field-based translation ready somewhere along the road.
Comment #36
maciej.zgadzaj commented:
Thanks Daniel, you're obviously right, although "somewhere along the road" doesn't sound like it is going to happen very soon, so until then let's make it all somehow work nicely with D7 too.
Now, in terms of all those forked versions - my sandbox contains all the code from drunken monkey's sandbox, plus a lot of new things, which essentially make it work properly (at least so it would seem so far) - so if you're thinking about inspecting anything, I'd suggest you inspect Search API Entity Translation v2b.
Also, the work there is still on-going. Right now I'm working on an add-on search_api_et_solr module to be able to store language-specific content in Solr's dynamic fields, to give users the option to use different tokenizer/stemmer/etc. configs for different languages.
Comment #37
maciej.zgadzaj commented:
Ultimately I've put all Solr-related functionality into a completely separate module, as it shouldn't really be packaged together with the main search_api_et by default.
For the moment it is available in my sandbox - Search API Entity Translation Solr search - but to make it work it requires a patch applied to the Search API Solr search module (to avoid having to extend the SearchApiSolrService class and replicating both the indexItems() and extractResults() methods just to add 2 small pieces of code).
@drunken monkey, any chance of reviewing/merging that patch any time soon please?
Comment #38
danielnolde commented:
Okay, then I'll check your version too, or even first, Maciej.
search_api_et_solr sounds promising and interesting, too.
Comment #39
danielnolde commented:
I'm off now to the first part of my vacation - at the end of next week I'll try to inspect and comment on your work!
Comment #40
maciej.zgadzaj commented:
Well, exactly like me, a week away starting later today - so any comments would need to wait till I'm back anyway. :)
Comment #41
drunken monkey commented:
Ah, OK, I just returned from my vacation. Great to see such large progress here!
It seems you don't really need to review my sandbox anymore, if maciej.zgadzaj has worked off of that and implemented the missing pieces.
Which I'm very glad of, thanks again! The fewer projects I'm involved in, the better. ;)
Regarding your Solr module patch, I'll hopefully look at it soon. As said, I just returned from my vacation, so my inboxes and issue queues are of course all filled to the top.
Comment #42
danielnolde commented:
Can't get v2b to work (Bug)
On an existing D7 project, when adding a search_api_et_v2b index on "multilingual node" using either an existing or a newly created sapi Solr server, I get the following error:
(The $entity_type argument in the entity_get_controller() call throwing the error is "search_api_index ".)
(With freshly installed search_api_et (v2b), search_api_et_solr and the required search_api_solr patch.)
Of course I will try a clean Drupal install later, but I wonder whether anyone else got this error too, or knows what it is.
For further discussion see the separate v2b issue https://drupal.org/node/2070135
Comment #43
danielnolde commented:
Some general thoughts (until I get v2b running on my system later this week; time is running out today).
I'm amazed by how many approaches have been tested and implemented since we started the discussion (per-language-fields, per-language-index, and v2(b) with dynamic/prefixed fields).
I think from a practical architectural point of view, v2b might be the way to go _now_, and I think - if it's working fine - we should first mature it and then focus on including it in search_api itself (if that's OK with drunken monkey).
For something like search_api-8.x, in an ideal world, though, the language should probably be incorporated in the document ID, so that all these workarounds become obsolete. Just my 2 cents and wishful thinking.
So, back to v2b, here's some general feedback (without having been able to run the thing successfully on an existing project, see above):
* good: allows for different solr index filter configuration per language (such as stemming etc)
* Can it work search server independent (independent of solr)?
* Regarding required modules/patches:
** Search API Entity Translation Solr (https://drupal.org/sandbox/maciej.zgadzaj/2060211)
*** should be merged into search_api_solr (reducing the gazillions of search_api_* modules that have to be enabled a.t.m.)
*** required patch to search_api_solr should also be merged in there
*** "add all possible multilingual field variants for all translatable fields being searched" => really good, or shouldn't it only add the currently searched language's field names?
* Let's merge v2b into search_api once it's stable
** Merge {search_api_et_item} into {search_api_item} if possible
Comment #44
maciej.zgadzaj commented:
It's a generic implementation, so I don't really see any reason why it shouldn't - although I admit I haven't tested it with anything else except Solr.
Agreed, once it is confirmed it works fine and includes base set of required features. And yes, the same for merging search_api_et into search_api. If that's ok with Thomas obviously.
Will be, most probably very soon.
That's exactly what it does if query is defined to run on selected languages only. I've updated project page to mention it there too.
Comment #45
drunken monkey commented:
I'm not sure about including these modules into Search API and Search API Solr. They add a considerable amount of complexity, while probably not being useful to the majority of users. Maybe I will include it in Drupal 8, but I don't think I will in Drupal 7. (I'd rather be in favor of including the Solr-specific ET module right here, so at least that additional module disappears. It's rather small anyways, but its (apparent) dependency on this module makes it a poor fit for inclusion into Search API Solr.)
But first the module would have to work smoothly, anyways, so we can also discuss this later. I'm pretty swamped right now, so I probably won't have much time to contribute here until at least post-Prague.
Comment #46
maciej.zgadzaj commented:
The bug mentioned by Daniel in his comment #42 and in issue #2070135: Error when adding an multilingual index turned out to be caused by using a version of the Search API module that was too old. The Search API Entity Translation v2b module requires Search API version 1.6 or higher - this has now been added to the dependencies in its .info file.
Comment #47
danielnolde commented:
sorry for the delays, guys.
To speed things up, I added Maciej as co-maintainer to search_api_et, and kindly ask him to commit his search_api_et_v2b as search_api_et-7.x-2.x-dev.
Comment #48
maciej.zgadzaj commented:
Thanks Daniel. I have just added the 7.x-2.x branch to your repo and pushed it to d.o. The project page still has to be updated to reflect this, but it seems we're almost there!
Comment #49
maciej.zgadzaj commented:
I have finally managed to update the project page. Should we mark this as fixed then?
Comment #49.0
maciej.zgadzaj commented:
skipped impossible ways