The initial goal of this module is to make multilingual content managed by the entity_translation.module searchable via the search_api.module. This basically works, but in a very basic and crude way: the module just offers a search field containing the renderings of all translations of an entity, concatenated into one fulltext field. It works, but it is very crude, with a number of serious drawbacks (no possibility to search in only a specific language, and therefore potentially confusing mismatches in search results; fully rendered entity search only (no field-specific et-multilingual search); and probably other quirks and drawbacks).

The short-term goal of this module is to provide a better basic way of supporting et-multilingual entities in Search API, by at least providing some possibility to distinguish the different translations for search, and therefore making it language-aware: users should be able to search through and find content in exactly those languages they wish to search in.

For this to work, the rather stupid concatenation of all translations has to be refined.

But how?

There are multiple ways to do this - and you're welcome to add good ways to this list!

Currently, I see two realistic ways to support language-aware search for field translation / entity translation content via search_api:
(based on the initial discussion at http://drupal.org/node/1323168#comment-5340212 - I have already skipped the ways that were ruled out there)

A) Somehow declaring and using an entity/search property that can carry multiple values, one for each translation's full entity rendering, along the lines of:
langvalue[0] = 'search_api_language-en my english content'
langvalue[1] = 'search_api_language-de mein deutscher inhalt'
langvalue[2] = 'search_api_language-es mi contenido español'
Not sure if that would work.
This would mean medium work, all contained in a contrib module extending Search API, very little config work if done correctly, clean, and scaling with changes in a site's language settings. But is it possible (see the threaded comment for details)?
=> Possible way?

B) Adding additional language-aware item types for each entity type, by customizing the Search API data source controller (based on SearchApiEntityDataSourceController) for the new item types a bit. This would allow for the data source to "know" of the translations and have their search item IDs to carry the language code, for example.
=> Possible way?
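To illustrate way B, here is a rough, hedged sketch (the ID format "nid:langcode" and the function names are purely illustrative assumptions, not actual Search API behavior):

```python
# Illustrative sketch of way B: the datasource controller emits one search
# item per entity translation, with the language code embedded in the item
# ID. The "entity_id:langcode" format is an assumption for illustration.

def item_ids_for_entity(entity_id, translations):
    """Return one language-aware item ID per available translation."""
    return ["%s:%s" % (entity_id, langcode) for langcode in sorted(translations)]

node_translations = {"en": "my english content",
                     "de": "mein deutscher inhalt",
                     "es": "mi contenido español"}

print(item_ids_for_entity(42, node_translations))
```

Each such item would then be indexed with only that language's field values, which is what would make per-language filtering possible on the search side.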

Let us discuss those possible ways towards a language-aware support of Entity Translations in Search API.
It would help a lot to know from you ...

* Do you need this language-awareness for this combination of modules at all?
* What's your preference? Why?
* Do you see other ways of achieving this goal?
* Could you help with the implementation?

Files:
#18: mysearch_module.txt (1.75 KB, by ThomasH)

Comments

Issue summary: View changes (corrected tag typo)

Hi DanielNolde,

Sounds good. But what about Solr-based searches if we want different stemming filters for various languages?


I think a solution could be to split the values out into separate fields. Is this possible?

regards
Dennis

Thanks Daniel, I really appreciate your efforts! I'll try to chime in ASAP. At the moment I'm pretty busy with #1282018: Improve UX of language-aware entity forms, which is going to revamp the entity translation UI.

Hey Dennis, as far as I understand with my little Solr "knowledge", one Solr index can only have one configuration set (including stemming filters etc.), so one would have to create one search index with an individual Solr config for each language ... not very scalable. But you should involve an outspoken Apache Solr expert on this, since I'm not sure at all...

I'm personally seeking the possibility of language-dependent Solr settings, too. But having one individual index for each site language is not practical, not flexible and not scalable - thus, I think we should avoid this as regards Search API Entity Translation.

Does anyone have an idea how language-specific solr config for multi languages in parallel could be done within one solr index?

Okay, this got quite Solr-specific, which is not our primary goal here, but a good question anyway :)

Hi Daniel,

you are right. Multiple indexes are no option. But independent of Solr or not, why can't we split every language out into its own "pseudo" field? In the search index, title becomes title_*, like title_de or title_en. Sure, we need to respect this pseudo field on the query side too, but I think this would be the best solution.

@see: http://drupal.org/node/1210810#comment-5402130

Solr-specific:
If we use separate fields, we gain the ability to use a different TokenFilter on every field. Short example:

Field type definitions in schema.xml

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">...</fieldType>
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">...</fieldType>
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">...</fieldType>

Field definitions in schema.xml

<field name="text_en" type="text_en" indexed="true" stored="false" multiValued="true"/>
<field name="text_de" type="text_de" indexed="true" stored="false" multiValued="true"/>
<field name="text_es" type="text_es" indexed="true" stored="false" multiValued="true"/>

The Magic DynamicField part

<!-- Fulltext fields (are always multiValued) -->
<dynamicField name="t_en_*" type="text_en" termVectors="true"/>
<dynamicField name="t_de_*" type="text_de" termVectors="true"/>
<dynamicField name="t_es_*" type="text_es" termVectors="true"/>

I think this would be the best solution for indexing, but it brings new problems on the query side: we always need to add the language code to the field names. But it gives us the power of multiple TokenFilters.

A very good example for the solr part is the apachesolr_multilanguage module.

regards
Dennis

Interesting third alternative, Dennis.

Sounds a little quirk-ish on the Drupal side, meaning:
* the site builder has to add a field for every language :-(
* if the language settings change, the search index settings have to be changed, too :-(
* if the language settings change, the search index would have to be rebuilt :-(
* the search has to be altered to contain the langcode in the search field name
* it practically works only with a fulltext field (doing this with more than one field is insane)

But the outlook on the search server side of having different configs possible per field and language (at least in Solr) may be worth considering.
Can you point us to some Solr documentation of exactly what can be configured field-dependently, and how?

Hi Daniel,

Not exactly. On the Drupal field side we have the normal field handling (field translation):

> field_name[LANGUAGE_CODE]

we only need to split them into separate fields on the search index side:

> field_name_LANGUAGE_CODE
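The split described above can be sketched as a simple mapping (a hedged illustration only; the dict layout is a simplification, not the actual Field API data structure):

```python
# Hedged sketch of the proposed split: Drupal-side translated field values
# (field_name[langcode]) flattened into per-language index fields
# (field_name_langcode). The input layout is simplified for illustration.

def split_by_language(fields):
    """Turn {'title': {'en': ..., 'de': ...}} into {'title_en': ..., 'title_de': ...}."""
    index_fields = {}
    for name, translations in fields.items():
        for langcode, value in translations.items():
            index_fields["%s_%s" % (name, langcode)] = value
    return index_fields

print(split_by_language({"title": {"en": "Hello", "de": "Hallo"}}))
```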

Solr has no explicit language-specific settings; that is our problem. There are multiple concepts for realizing multilanguage search:

> multiple indexes (every index has its own Solr schema.xml) < hard to configure, hard to maintain the documents
> multiple fields < the best solution currently, because of the power of dynamicFields as described before in #4

I have no idea how we can make this better than it is, but currently multilanguage search on Solr is a big problem if you use language-dependent TokenFilters.

Ah, okay, so the multiplication of every field for every active language would occur only in the layers of Search API and Apache Solr, where they are realized as "dynamic fields" (Solr lingo for exactly what?).

So I guess this suggestion is rather a sub-"branch" of way B), implementing a custom entity-based data source for translation/language-aware fields.

Sounds interesting.

For introduction of the module and more background also see the blog post on our website: http://wunderkraut.com/en/blog/make-search-api-work-entity-translation

There is quite some work to do in the schema before supporting all these languages.
Some options:

1. You can make some kind of generator that will make the schema.xml for you depending on your settings and languages. You surely don't want all fields to be multilingual.
With a generator you can even supply the user with a bunch of extra files making i18n support easier. (stopwords_LANGUAGE.txt, protwords_LANGUAGE.txt, ...)

2. Also interesting to add to this discussion is : http://wiki.apache.org/solr/LanguageDetection. If I understand this correctly you will hardly have to adjust your schema, the field will detect the language used. Interesting isn't it?

3. You could make a dynamic field content_*, and in Drupal actually fill this field in as content_en, content_fr, content_es etc.

Schema example of specifying a language for a fieldType

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
...
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
...
      </analyzer>
    </fieldType>

To proceed, we need the feedback of Search API maintainer Thomas Seidl, so we can decide on how to build full-blown support for language-aware search on multilingual fields based on Entity Translation. Thomas, it would be great if you could have a look at this thread and provide some hints as to which of the ways discussed would be the best in terms of search_api's internal architecture – Thanks!

Thomas, it would be great if you could have a look

According to http://wunderkraut.com/en/blog/make-search-api-work-entity-translation , as of May 1, Thomas Seidl *has* contributed thoughts:

suggested by Search API maintainer Thomas Seidl: A data source plug-in that exposes different field-translated language version of an entity in some distinguishable form to Search API (e.g. as dedicated search items, or as language dependent dynamic fields).

I don't know enough about the Search API to have much idea what that implementation might look like, but, here's to progress! This module is critical to the success of Entity Translation. Three cheers!

I have built a custom version for a project I've been working on, that exposes each field as separate additional fields with a language-specific prefix. However, this only works for us because we have a) a manually built search form where we can automatically switch to the language-specific version of each field, b) do not need a language-aware fulltext search and c) only have two languages. I don't want to imagine the already huge field list when you have 5+ languages ;)

To be honest, I alone don't feel educated enough about Search API itself to work this out and present a solution. We should form a "task force" to work on this together.

Who can contribute substantial knowledge (Search API, Solr) and time?

I'll also try to get Thomas Seidl working on this (he's been away from active Search API development for some time for health reasons, but this may interest him enough to contribute, too).

How about the option of using a Solr core per language? What would speak against this?

@ThomasH,

That is a very messy solution IMHO. In addition to having a lot of overhead to manage the connections to multiple indexes, it makes it much more difficult to do something like "search French AND language neutral content" since you would have to do a federated search across multiple indexes. You would also have to implement the logic to select the appropriate index(es) based on the appropriate language. In addition, let's say that only the body field is translated into 40 different languages. If the node references a taxonomy term, you would have to re-index that node 40 times across the various indexes. To me this introduces a lot of overhead that would make this solution unattractive.

Thanks,
Chris

Besides:

That sounds like "Solr Entity Translation", not "Search API Entity Translation". The goal here is to extend Search API.

Ok, next question... How are we going to go ahead and "add" these extra fields if we cannot hook into "search_api_extract_fields"? Or am I missing something?


Here is how it is currently implemented. It moves from having one field with all the data in it to separate fields, which are defined in mysearch_module_translatables(). I am not overly excited about the solution, and it seems a bit like a mess, so input would be greatly appreciated.

I did some Solr integration on both D6 and D7; here is my feedback.

To do correct indexing per language you must use a different index, to allow stemming to work as it should. I usually set up one index per core and index language-neutral content in all indexes to avoid multi-index aggregation. Pretty simple to do in D7, a pain in the a** to do in D6.

AFAIK entity_translation does not work well with search_api; Solr seems not to see those translated nodes. The solution is quite tempting and elegant, but as long as it doesn't make it into search_api, it's not viable.

For another way to handle i18n search, see http://drupal.org/sandbox/gaelg/1826028.

What do you think about this alternative way? Should I publish a separate module, and then which name? Or should it be included somehow in this module?

I just stumbled upon this issue, as I'm searching for a similar approach for dealing with taxonomy-dependent data in Search API.

In my opinion, splitting content into different language-specific documents would be a good and clean way. Especially for sites with a lot of languages this might be more feasible.
When it comes to different Solr configurations and thus different indexes, the Search API multi-index searches module might be a solution. But I haven't tried it out so far, so it's only a clue ;)

As written in #1323168-7: Add support for translated fields, I think writing a new datasource controller that provides multiple items for each single entity, one for each language, would be the best option here. You could then filter on the language, like it's already possible if you use Content Translation, or index only items with a certain language. In effect, everything would be like you'd have used Content Translation instead of Entity Translation, at least Search API-wise.
It wouldn't be particularly easy to implement, I fear, but it should be well possible with the Search API in its current form and bring considerable benefits.

However, it of course also has some drawbacks. For example, you wouldn't be able to filter on the values of two languages at once. I don't know if that's a reasonably common use case, though.

Also, it would have to be decided where to include LANGUAGE_NONE data – in a separate item or in all of the language-specific items?

As for Solr, that's an entirely different issue, and possibly an even harder one. Let me brainstorm here for a second (as I haven't found a Solr-specific issue?):

  • Using schema_extra_fields.xml it would be easy to add additional, language-aware dynamic fields to the schema, like tm-en_*. You could then set these up with different types with language-specific processing. (You'd probably need to do that for text and string fields.) For a contrib module, you could write a generator which automatically produces a suitable schema_extra_fields.xml file for a specific site.
  • When indexing with the method proposed above, you could just then use the language set for the item to decide which set of dynamic fields to use for indexing.
  • For queries, you could look whether there is a filter on the language. If so, just use the same language for determining the field prefixes and things should work. If there isn't a language filter, though, things get tricky. The whole query-building code would probably have to be re-implemented for that case to consider all the possible Solr index fields a certain Search API field could be stored in. Or maybe we could use additional dynamic fields for that, into which we'd pour the content of all the language-specific fields*? Could work, I guess.

* E.g., there'd be tm-en_title, tm-de_title and tm_title, and copyField directives to copy all content from the former two into the last. When filtering on the title, a query would use tm-en_title if it also filters for language = en, tm-de_title for German-only queries, and tm_title if no language filter is present.
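The copyField idea described above could look roughly like this in the schema (a sketch only; the tm- prefixes follow the dynamic field convention mentioned earlier, and the exact attributes are assumptions, not taken from an actual search_api_solr schema):

```xml
<!-- Sketch: language-specific dynamic fields plus a language-neutral
     fallback that aggregates their content via copyField. -->
<dynamicField name="tm-en_*" type="text_en" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="tm-de_*" type="text_de" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="tm_*"    type="text"    indexed="true" stored="false" multiValued="true"/>

<copyField source="tm-en_*" dest="tm_*"/>
<copyField source="tm-de_*" dest="tm_*"/>
```

Queries with a language filter would then target the matching language-specific prefix directly, while language-agnostic queries would fall back to the aggregated tm_* fields.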

Solr's language detection feature that Nick linked in #9 sounds very interesting, too. Haven't worked with it yet, though, so I can't really say how well it works in practice. But in case it does a good job, I guess it would solve at least some of the problems. (Filtering on only field values of a specific language wouldn't be possible, though, as far as I can see.)

Sorry this is coming so late, it's hard to keep track of all the countless issues in my queue. I hope it helps getting this on track again, though. Language-awareness is still one of the major pain points of the Search API.

I think writing a new datasource controller that provides multiple items for each single entity, one for each language, would be the best option here

This sounds familiar to me ;) maybe we could base such a datasource controller on the one from Search API Denormalized Entity Index.

Status:Active» Needs review

I've created a little sandbox with a prototype of how I'd think entity translation support for the Search API could work. The module provides multilingual versions of all entity types that support translation, allows per-index configuration and will provide separate versions for all available languages of each entity.
The module isn't done yet: entity CRUD hook implementations, the admin UI for the index settings, and some magic to have fields be indexed and displayed in the right language (probably with hook_search_api_index_items_alter(), since we can't do it in an index-specific way in the datasource controller) are still missing. However, before proceeding I wanted to let people with more knowledge about D7 translation/language functionality (and maybe the Field API) vet my approach to see if it makes sense.

Especially interesting is the datasource controller's getAllIndexItemIds() method (and, accordingly, the search_api_et_item_languages() function), which determines which language versions will be created for each entity.
Even more interesting would be the getMetadataWrapper() method, but that's highly Entity API-specific, which probably means others won't fare any better than me in trying to make sense of it. However, it seems (at least for nodes) that setting $node->language to a certain language will make the wrapper return all fields in that language. God (and maybe fago) knows what will work for other types …

(The current functionality of this module is missing from my sandbox for now, but could easily be changed into a data alteration (which would be the proper way to add that property) and then re-added.)

Reviews welcome!

PS: Does anyone have any test data for a multilingual site with entity translations?
I, for one, don't really know what a typical setup is, and it would also take some time since Devel Generate doesn't seem to support entity translations.

Hi,

We need a solution for search api to search in translations on our current project. Your sandbox looks ok, I will try with my coworkers to work on it next week.

I've cloned the sandbox project locally. We are lucky: we have translated content in more than 10 languages and can test the module on real data. I looked for you on #drupal-contribute to discuss the sandbox but didn't find you there. If we have a working version, how shall we push it to your sandbox? A patch file? I would perhaps propose another branch, to keep the history of the commits, if you agree.

I've talked with this project's maintainer, Daniel Nolde, and he'll soon take a look at my code. If he approves, I'll finish a first version and would then suspect that work would get committed into this project. If you want to help, or later propose additions, you can probably just provide patches.
If you want to do that before this gets committed into the main project, though, please warn me before you start to work on a certain area. Otherwise, we might both end up implementing the same thing.

Thomas,

As skipyT has mentioned, we're looking for functionality provided by this module (meaning your sandbox v2) to be used in our current project. I was wondering what is its current state, how far have you managed to progress compared to what is currently available in your sandbox, and how far is it from that first version you have mentioned?

I'm asking as we'd like to help and contribute, but assuming that you're not far away from that first version, or currently working on it, it might be better for us to wait for it being pushed, and then start from there?

Thoughts?

I'm still waiting for Daniel, my current work state can be seen in the repository. I haven't worked on it since.

Thanks Thomas.

Meanwhile I have done a quick variation of the original Search API Entity Translation module - providing new Search API fields for each translatable field on each translatable entity - which seems to work fine with solr's dynamic fields and language-based field types - available in my sandbox if anyone's interested.

Now let's have a look at that v2...

Hello,

While Maciej's variation of the module still does not solve the actual problem (because results will be returned if you search in any of the enabled languages), having the dynamic fields on a per-language basis helps with defining tokenizers for the Asian languages, which in many cases can be very useful.

Ok, as promised I've started playing with drunken monkey's sandboxed Search API Entity Translation v2, posted few new issues with patches providing missing functionality, which now wait for merge/further discussion.

All those changes are also available in my forked sandbox Search API Entity Translation v2b, which already seems to work fine and return expected results.

Thanks guys. It really sounds like you're nearly there! I can't wait till the progress is merged back into the actual project here.

edit/note-to-self: ...coming from #1335394: Search API integration

A quick update on this - last week I created my own sandbox (fork of drunken monkey's one) where I have been pushing all the latest changes/new features recently - Search API Entity Translation v2b - anyone feels like giving it a test ride?

whoa, a lot of forked versions to inspect <:} - sorry for my late awakening in this issue, guys.
One word ahead: Ideally and with D8 in mind, this very Search API ET module will become obsolete, because Search API itself will become field-based translation ready somewhere along the road.

Thanks Daniel, you're obviously right, although "somewhere along the road" doesn't sound like it's going to happen very soon - so until then, let's make it all somehow work nicely with D7 too.

Now, in terms of all those forked versions - my sandbox contains all the code from drunken monkey's sandbox, plus a lot of new things, which essentially make it work properly (at least so it would seem so far) - so if you're thinking about inspecting anything, I'd suggest you inspect Search API Entity Translation v2b.

Also, the work there is still ongoing; right now I'm working on an add-on search_api_et_solr module to be able to store language-specific content in Solr's dynamic fields, to give users the option to use different tokenizer/stemmer/etc. configs for different languages.

Ultimately I've put all solr-related functionality into a completely separate module, as it shouldn't really be packaged together with main search_api_et by default.

For the moment it is available in my sandbox - Search API Entity Translation Solr search - but to make it work it requires a patch applied to Search API Solr search module (to avoid having to extend the SearchApiSolrService class and replicating both indexItems() and extractResults() methods just to add 2 small pieces of code).

@drunken monkey, any chance of reviewing/merging that patch any time soon please?

Okay, then I'll check your version too, or even first, Maciej.
search_api_et_solr sounds promising and interesting, too.

I'm off now to the first part of my vacation - at the end of next week I'll try to inspect and comment on your work!

Well, exactly like me, a week away starting later today - so any comments would need to wait till I'm back anyway. :)

Ah, OK, I just returned from my vacation. Great to see such large progress here!
It seems you don't really need to review my sandbox anymore, if maciej.zgadzaj has worked off of that and implemented the missing pieces.
Which I'm very glad of, thanks again! The fewer projects I'm involved in, the better. ;)

Regarding your Solr module patch, I'll hopefully look at it soon. As said, I just returned from my vacation, so my inboxes and issue queues are of course all filled to the top.

Can't get v2b to work (Bug)

On an existing D7 project, when adding a search_api_et_v2b index on "multilingual node" using either an existing or a newly created sapi solr server, I get the following error:

    Fatal error: Class name must be a valid object or a string in /Volumes/daten2/projekte/hbmintranet/code/www/includes/common.inc on line 7779
 

(The $entity_type argument in the entity_get_controller() call throwing the error is "search_api_index ".)

(Freshly installed: search_api_et (v2b), search_api_et_solr and the needed search_api_solr patch.)

Of course I will try a clean Drupal install later, but I wonder whether anyone else got this error too, or knows what it is.

For further discussion see the separate v2b issue https://drupal.org/node/2070135

Some general thoughts until I get v2b running on my system later this week (time is running out today).

I'm amazed by how many approaches have been tested and implemented since we started the discussion (per-language-fields, per-language-index, and v2(b) with dynamic/prefixed fields).

I think, from a practical architectural point of view, v2b might be the way to go _now_, and I think - if it's working fine - we should first mature it and then focus on including it in search_api itself (if that's OK with drunken monkey).

For something like search_api-8.x, in an ideal world, though, the language should probably be incorporated into the document ID, so that all these workarounds become obsolete. Just my 2 cents and wishful thinking.

So, back to v2b, here's some general feedback (without having been able to run the thing successfully on an existing project, see above):

* good: allows for different solr index filter configuration per language (such as stemming etc)
* Can it work search server independent (independent of solr)?
* Regarding required modules/patches:
** Search API Entity Translation Solr (https://drupal.org/sandbox/maciej.zgadzaj/2060211)
*** should be merged into search_api_solr (reducing the gazillions of search_api_* modules that have to be enabled a.t.m.)
*** required patch to search_api_solr should also be merged in there
*** "add all possible multilingual field variants for all translatable fields being searched" => really good, or shouldn't it only add the currently searched language's field names?
* Let's merge v2b into search_api once its stable
** Merge {search_api_et_item} into {search_api_item} if possible

Can it work search server independent (independent of solr)?

It's a generic implementation, so I don't really see any reason why it shouldn't - although I admit I haven't tested it with anything other than Solr.

Search API Entity Translation Solr (...) should be merged into search_api_solr

Agreed, once it is confirmed it works fine and includes base set of required features. And yes, the same for merging search_api_et into search_api. If that's ok with Thomas obviously.

required patch to search_api_solr should also be merged in there (search_api_solr)

Will be, most probably very soon.

"add all possible multilingual field variants for all translatable fields being searched" => really good, or shouldn't it only add the currently searched language's field names?

That's exactly what it does if the query is defined to run on selected languages only. I've updated the project page to mention this there too.

I'm not sure about including these modules into Search API and Search API Solr. They add a considerable amount of complexity, while probably not being useful to the majority of users. Maybe I will include it in Drupal 8, but I don't think I will in Drupal 7. (I'd rather be in favor of including the Solr-specific ET module right here, so at least that additional module disappears. It's rather small anyways, but its (apparent) dependency on this module makes it a poor fit for inclusion into Search API Solr.)

But first the module would have to work smoothly, anyways, so we can also discuss this later. I'm pretty swamped right now, so I probably won't have much time to contribute here until at least post-Prague.

The bug mentioned by Daniel in his comment #42 and in issue #2070135: Error when adding a multilingual index turned out to be caused by too old a version of the Search API module. The Search API Entity Translation v2b module requires Search API version 1.6 or higher - this has now been added to the dependencies in its .info file.

sorry for the delays, guys.
To speed things up, I added Maciej as co-maintainer to search_api_et, and kindly ask him to commit his search_api_et_v2b as search_api_et-7.x-2.x-dev.

Thanks Daniel. I have just added a 7.x-2.x branch to your repo and pushed it to d.o. The project page is still to be updated to reflect this, but it seems we're almost there!

Status:Needs review» Fixed

I have finally managed to update the project page. Should we mark this as fixed then?

Issue summary: View changes (skipped impossible ways)

Status:Fixed» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.