localize apachesolr date formats [#463886]

apachesolr is already a good search solution for English content. But it needs some work to offer the same power to other languages.

Besides the lack of i18n integration, which was already addressed in #436578: Translated (localized) taxonomy facet blocks, there are some problems regarding the index. For example, text fields in schema.xml use EnglishPorterFilterFactory, and the stopwords.txt shipped with the current Solr distribution contains English stop words only.

To start a discussion about how to localize apachesolr, I have attached a patch that adds an alternative schema.xml and a draft of a stop word list suitable for German. To that end, the patch introduces a new directory called languages. The idea is to offer adjusted configurations for different languages:

apachesolr
  languages
    de
      schema.xml
      stopwords.txt
    de_DE
      schema.xml
      stopwords.txt
    de_AT
      schema.xml
      stopwords.txt
    de_CH
      schema.xml
      stopwords.txt
    fr_CH
      schema.xml
      stopwords.txt
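
The layout above does not say how a configuration would be picked for a given site language. A minimal sketch, assuming a hypothetical fallback from a regional code such as de_AT to the base language de and then to the module's default files. The function name and the fallback order are illustrative assumptions and not part of the patch:

```python
import tempfile
from pathlib import Path


def resolve_language_dir(base: Path, langcode: str) -> Path:
    """Return the most specific directory that holds a schema.xml for langcode.

    Assumed lookup order (not defined by the patch):
    languages/<langcode>, then languages/<base language>, then base itself.
    """
    candidates = [langcode]
    if "_" in langcode:
        # "de_AT" -> also try the plain language code "de"
        candidates.append(langcode.split("_", 1)[0])
    for candidate in candidates:
        path = base / "languages" / candidate
        if (path / "schema.xml").is_file():
            return path
    # No localized configuration found: use the default files.
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        module_dir = Path(tmp)
        german = module_dir / "languages" / "de"
        german.mkdir(parents=True)
        (german / "schema.xml").write_text("<schema/>")
        print(resolve_language_dir(module_dir, "de_AT"))
        print(resolve_language_dir(module_dir, "fr_CH"))
```

With a fallback like this, regional directories such as de_AT would only be needed where they really differ from the base language.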

Comment	File	Size	Author
#5	apachesolr_languages.patch	22.63 KB	mkalkbrenner
#4	localize_date_facet.patch	5.19 KB	mkalkbrenner
	apachesolr_languages.patch	22.39 KB	mkalkbrenner

Comments

Comment #1

JacobSingh commented 16 May 2009 at 05:15

I don't love this patch, although I think the idea is good.

The problem here is that schema.xml has to be maintained along with the Solr core AND updates to the apachesolr module. This means a serious maintenance hassle. Since Solr has no concept of an include file, I recommend we do this as patch files against the base schema.xml. This will make it much easier to change fields, etc. without having to do it in 5-10 or 20 language versions.

best,
J

Comment #2

pwolanin commented 16 May 2009 at 14:03

I'd agree with Jacob - for the config and schema, patches would be much more appropriate. Why not use /translations ?

Comment #3

mkalkbrenner commented 17 May 2009 at 12:43

I know that my patch isn't a perfect solution. It's just a kickoff to eventually achieve localization.

A quick search for xi:include led me to this issue, which should be watched:
https://issues.apache.org/jira/browse/SOLR-1167?page=com.atlassian.jira....

I see the risk in maintaining a lot of schema files in parallel, so patch sets might be an approach.
Currently I have only exchanged the stemmer. But from my experience with Solr in non-English projects, I see other candidates for modification in the schema files:
- case-sensitive stop word lists
- no lower case filter for spelling
- different settings for WordDelimiterFilter and LengthFilter

@pwolanin:
I chose languages as the directory name because this patch is not about translating the user interface but about changing apachesolr's behaviour. Stop word lists can't be translated, only localized.

Finally, more comments on this issue are welcome. Thoughts from people in non-English-speaking countries would be especially interesting.

Comment #4

mkalkbrenner commented 4 June 2009 at 14:46

Status	File	Size
new	localize_date_facet.patch	5.19 KB

Next issue in localization context:

Date facets are created using hard-coded English date formatting. I wrote a patch that introduces a new tab in the apachesolr settings for localization issues. The first option there is the ability to customize the date formats used for date facets. Additionally, the formatting itself now uses Drupal's format_date() instead of gmdate() so that translations are applied.
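
To illustrate the idea only (the patch itself is PHP and relies on format_date()), here is a minimal Python sketch of date facet labels built from configurable formats and translated month names instead of one hard-coded English format. The format tables, the granularity keys and the function name are hypothetical stand-ins for the settings described above:

```python
from datetime import datetime, timezone

# Hypothetical per-granularity formats, standing in for the site settings.
ENGLISH_FORMATS = {"YEAR": "{year}", "MONTH": "{month} {year}", "DAY": "{month} {day}, {year}"}
GERMAN_FORMATS = {"YEAR": "{year}", "MONTH": "{month} {year}", "DAY": "{day}. {month} {year}"}

MONTHS_EN = ["January", "February", "March", "April", "May", "June", "July",
             "August", "September", "October", "November", "December"]
MONTHS_DE = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
             "August", "September", "Oktober", "November", "Dezember"]


def format_facet_date(timestamp, granularity, formats, month_names):
    """Render a facet label for a UTC timestamp using the given format table."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return formats[granularity].format(
        year=moment.year,
        month=month_names[moment.month - 1],  # translated month name
        day=moment.day,
    )


if __name__ == "__main__":
    ts = datetime(2009, 6, 4, 14, 46, tzinfo=timezone.utc).timestamp()
    print(format_facet_date(ts, "DAY", ENGLISH_FORMATS, MONTHS_EN))
    print(format_facet_date(ts, "DAY", GERMAN_FORMATS, MONTHS_DE))
```

Keeping the format per granularity separate from the month names mirrors the split between a configurable format string and translated date parts.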

BTW I don't know if it's better to add more patches here or to split all localization issues into separate tickets.

Comment #5

mkalkbrenner commented 30 July 2009 at 12:35

Status	File	Size
new	apachesolr_languages.patch	22.63 KB

Adjusted schema language patch according to #529606: schema.xml: update CharStream and EnglishPorter stemmer and introduced localized protwords.txt.

Comment #6

robertdouglass commented 13 August 2009 at 12:19

@mkalkbrenner #4 is at risk of getting lost in this issue, please open a new one. http://drupal.org/node/463886#comment-1664164

Comment #7

robertdouglass commented 13 August 2009 at 12:19

Version:	6.x-1.x-dev	» 6.x-2.x-dev

Comment #8

ducdebreme commented 24 August 2009 at 14:42

subscribing

Comment #9

ducdebreme commented 24 August 2009 at 15:03

We have been using Apachesolr for a year now. I really like the progress that the Apachesolr project has made.
But over time I discovered that the search works best only in English. Your post is the first one I have found that gives a reason for this.

We have content in English, French, German and Italian. The most criticized behavior of Apachesolr is the lack of plural/singular stemming for the non-English languages.

How did you arrive at the solution of patching Apachesolr? You said you had some multilingual projects; did you use this patch there? And are there any how-tos (I have not found one yet)?
And the patch does not work as is, right? It's only a proposal for how the schema.xml and the stopwords.txt might look, but there is no code that helps Solr select the right file, right?
I know there is a new book about Lucene: http://www.manning.com/hatcher3/

Comment #10

mkalkbrenner commented 2 October 2009 at 19:04

Good news: xi:include will be available in the next builds of solr:
https://issues.apache.org/jira/browse/SOLR-1167

Comment #11

digi24 commented 30 October 2009 at 00:38

Can you please advise me on this issue:
As far as I understand, xi:include allows me to include language specific stemmers, stop words etc. by altering the include file without otherwise altering the schema.xml.

But how would you ideally handle multilingual sites? Use one index and only one schema.xml, or use one index per language and modify the search functions to handle multiple indexes?

Comment #12

mkalkbrenner commented 30 October 2009 at 09:45

@digi24:
There was a discussion about multilingual search at DrupalCon Paris 2009. I think there's no common solution right now. I suggest setting up one separate index per language.

I hope to find some time to come up with a patch that explains the usage of xi:include soon ...

Comment #13

pwolanin commented 27 November 2009 at 21:03

Last week Robert and I discussed building up a set of schema files suitable for different languages. However, it would actually be smarter (and less work to keep in sync) if we could use XInclude directives to just substitute in different analyzers for the text type.

We could similarly possibly improve the solrconfig. There isn't any explicit documentation about this for the schema, but looking at the java source, the same syntax should work there as for solrconfig:

http://wiki.apache.org/solr/SolrConfigXml#XInclude
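
A minimal sketch of the XInclude mechanics, using Python's standard xml.etree.ElementInclude rather than Solr itself. It only shows how a single schema could pull in a language-specific analyzer. Whether Solr resolves such includes in schema.xml is exactly the open question above, and the file name analyzer.xml, the schema fragment and the analyzer definitions are assumptions made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Illustrative schema fragment: the analyzer is pulled in via xi:include.
SCHEMA = """<schema name="example" xmlns:xi="http://www.w3.org/2001/XInclude">
  <types>
    <fieldType name="text" class="solr.TextField">
      <xi:include href="analyzer.xml"/>
    </fieldType>
  </types>
</schema>"""

# Hypothetical per-language analyzer files, kept in memory for the demo.
ANALYZERS = {
    "en": """<analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>""",
    "de": """<analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>""",
}


def build_schema(langcode):
    """Expand the xi:include in SCHEMA with the analyzer for langcode."""
    def loader(href, parse, encoding=None):
        # Assumption: every include resolves to the chosen language's analyzer.
        return ET.fromstring(ANALYZERS[langcode])

    root = ET.fromstring(SCHEMA)
    ElementInclude.include(root, loader=loader)
    return root


if __name__ == "__main__":
    print(ET.tostring(build_schema("de"), encoding="unicode"))
```

The point of the sketch is that the schema stays identical across languages and only the included file changes.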

Comment #14

robertdouglass commented 8 May 2010 at 18:24

Title:	localize apachesolr	» localize apachesolr date formats
Category:	feature	» bug

Just spoke with mkalkbrenner and two of the three issues named are solved in the apachesolr_multilingual module. The one in this issue that still needs fixing is the date formats.

Comment #15

robertdouglass commented 8 May 2010 at 18:26

We should look at i18n variables to see if the problem can be mitigated.

Comment #16

mkalkbrenner commented 9 May 2010 at 08:29

Status:	Needs work	» Fixed

Language specific schema configuration and text files => provided by Apache Solr Multilingual
xi:include => obsolete, solved by a different approach in Apache Solr Multilingual
localized date facet => solved by #708424: Change gmdate() to Drupal format_date() in date facets to support localization

Comment #17

23 May 2010 at 08:30

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
