localize apachesolr
| Project: | Apache Solr Search Integration |
| Version: | 6.x-2.x-dev |
| Component: | Language |
| Category: | feature request |
| Priority: | normal |
| Assigned: | mkalkbrenner |
| Status: | needs work |
apachesolr is already a good search solution for English content, but it needs some work to offer the same power for other languages.
Besides the missing i18n integration, which is already being addressed in #436578: Add support for translated (localized) taxonomy facet blocks, there are some problems regarding the index. For example, the text fields in schema.xml use EnglishPorterFilterFactory, and the stopwords.txt shipped with the current Solr distribution contains only English stop words.
To start a discussion about how to localize apachesolr, I have attached a patch that adds an alternative schema.xml and a draft stop word list suitable for German. To that end, the patch introduces a new directory called languages. The idea is to offer adjusted configurations for different languages:
apachesolr
  languages
    de
      schema.xml
      stopwords.txt
    de_DE
      schema.xml
      stopwords.txt
    de_AT
      schema.xml
      stopwords.txt
    de_CH
      schema.xml
      stopwords.txt
    fr_CH
      schema.xml
      stopwords.txt

| Attachment | Size |
|---|---|
| apachesolr_languages.patch | 22.39 KB |
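
As a rough illustration of the kind of change described above, a German variant of the text field type might look something like the following. This is only a sketch and is not taken from the attached patch: the field type name, the filter order, and the attribute values are assumptions, and the real schema.xml contains more filters than shown here.

```xml
<!-- Hypothetical excerpt from languages/de/schema.xml (not the attached patch). -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- stopwords.txt is assumed to be the German list from the same directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Takes the place of EnglishPorterFilterFactory -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```

Only the stemmer and the stop word file differ from an English setup in this sketch, which is why maintaining a full copy of schema.xml per language is a heavy way to express a small difference.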

#1
I don't love this patch, although I think the idea is good.
The problem here is that schema.xml has to be maintained along with the Solr core AND with updates to the apachesolr module. This means a serious maintenance hassle. Since Solr has no concept of an include file, I recommend we do this as patch files against the base schema.xml. This will make it much easier to change fields, etc. without having to do it in 5, 10, or 20 language versions.
best,
J
#2
I'd agree with Jacob - for the config and schema, patches would be much more appropriate. Why not use /translations ?
#3
I know that my patch isn't a perfect solution. It's just a kick-off to eventually achieve localization.
A quick search for xi:include led me to this issue, which should be watched:
https://issues.apache.org/jira/browse/SOLR-1167?page=com.atlassian.jira....
I see the risk in maintaining a lot of schema files in parallel. So patch sets might be an approach.
Currently I have only exchanged the stemmer. But from my experience with Solr in non-English projects, I see other candidates for modification in the schema files:
- case-sensitive stop word lists
- no lower case filter for spelling
- different settings for WordDelimiterFilter and LengthFilter
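
To make the three candidates above more concrete, here is a hedged sketch of where they would show up in schema.xml. The field type names and every attribute value are placeholders chosen for illustration, not recommendations and not part of any attached patch.

```xml
<!-- Hypothetical fragments; names and values are assumptions. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- ignoreCase="false" makes stop word matching case-sensitive;
         it has to run before any lower case filter to have an effect -->
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
    <!-- Splitting and catenation behaviour is tuned per language here -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Token length bounds; languages with long compounds may need a larger max -->
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldType>

<!-- Spelling field type without LowerCaseFilterFactory, so suggestions keep their case -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```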
@pwolanin:
I chose languages as the directory name because this patch is not about translating the user interface but about changing apachesolr's behaviour. And stop word lists can't be translated, only localized.
Finally, more comments on this issue are welcome. Thoughts from people in non-English-speaking countries would be especially interesting.
#4
Next issue in localization context:
Date facets are created using hard-coded English date formatting. I wrote a patch that introduces a new tab in the apachesolr settings for localization issues. The first thing there is the possibility to customize the date formats for date facets. Additionally, the formatting itself now uses Drupal's format_date() instead of gmdate(), so that translations are applied.
BTW, I don't know if it's better to add more patches here or to split all localization issues into separate tickets.
#5
Adjusted schema language patch according to #529606: schema.xml: update CharStream and EnglishPorter stemmer and introduced localized protwords.txt.
#6
@mkalkbrenner #4 is at risk of getting lost in this issue, please open a new one. http://drupal.org/node/463886#comment-1664164
#7
#8
subscribing
#9
We have been using Apachesolr for a year now. I really like the progress that the Apachesolr project has made.
But over time I discovered that the search only works at its best in English. Your post is the first one I have found that gives a reason for this.
We have content in English, French, German and Italian. And the most criticized behavior of Apachesolr is the lack of a plural/singular stemming for the non-English languages.
How did you arrive at the solution of patching Apachesolr? You said you had some multilingual projects; did you use this patch there? And are there any how-tos (I have not found one yet)?
And the patch does not work as it is, right? It's only a proposal for how schema.xml and stopwords.txt might look, but there is no code that helps Solr select the right file, right?
I know there is a new book about Lucene: http://www.manning.com/hatcher3/
#10
Good news: xi:include will be available in the next builds of solr:
https://issues.apache.org/jira/browse/SOLR-1167
#11
Can you please advise me on this issue:
As far as I understand, xi:include allows me to include language specific stemmers, stop words etc. by altering the include file without otherwise altering the schema.xml.
But how would you ideally handle multilingual sites? Use one index and only one schema.xml, or use one index per language and modify the search functions to handle multiple indexes?
#12
@digi24:
There was a discussion about multilingual search at DrupalCon Paris 2009. I think there's no common solution right now. I suggest setting up a separate index per language.
I hope to find some time soon to come up with a patch that demonstrates the usage of xi:include ...
#13
Last week Robert and I discussed building up a set of schema files suitable for different languages. However, it would actually be smarter (and less work to keep in sync) if we could use XInclude directives to just substitute different analyzers for the text type.
We could possibly improve the solrconfig in a similar way. There isn't any explicit documentation about this for the schema, but looking at the Java source, the same syntax should work there as for solrconfig:
http://wiki.apache.org/solr/SolrConfigXml#XInclude
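
A minimal sketch of what such a substitution might look like, assuming XInclude is honored inside schema.xml by the Solr build in use (which, as noted, is not explicitly documented). The file name analyzer_de.xml is hypothetical; it would contain a single `<analyzer>` element as its root.

```xml
<!-- Hypothetical schema.xml excerpt: only the analyzer is swapped per language. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- Changing the href would be the only per-language difference in schema.xml -->
  <xi:include href="analyzer_de.xml"
              xmlns:xi="http://www.w3.org/2001/XInclude"/>
</fieldType>
```

With this layout the base schema.xml stays identical for all languages, and each language only ships its own analyzer file plus stop word and protected word lists.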