Search engines normally fold diacritics (á é í ó ú ö ü ñ, etc.) into their base equivalents (a e i o u n, etc.). This is the current behaviour of Drupal's core search (due to how MySQL indexing works) and also of Google, Yahoo, etc.

Comments

rba’s picture

The way to do this is to tweak the schema.xml provided by this module to suit your needs: the stop-word dictionary and your choice of filters per field (lowercasing, transliteration, plural-form support, etc.).

I think your main source of information is the Lucene documentation: technically, it's not *exactly* this module's job but Lucene's.

robertdouglass’s picture

@rba: you're correct that this is something to do at the Solr/Lucene level. I'm leaving this issue open in the hope that someone will come back here and document the process when they've met with success.

janusman’s picture

I did some research and it turns out the only Token Filter included in Solr for stripping diacritics is one for the ISO Latin-1 character set. So the module would (for now) have to feed Solr ISO Latin-1 equivalents of the UTF-8 text... eww!

You can turn it on selectively for each field in schema.xml. For example, you can turn it on for "text" type fields, but not for "string" fields.
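
As a rough illustration of enabling the filter per field type, a schema.xml excerpt might look like the sketch below. The tokenizer choice, the lowercase filter, and the attribute values are assumptions made for the example, not the module's shipped schema.

```xml
<!-- Sketch only: excerpt of the <types> section of schema.xml -->
<types>
  <!-- "string" fields are indexed verbatim: no tokenizing, no accent folding -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

  <!-- "text" fields pass through an analyzer chain that strips accents -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- assumed tokenizer; use whichever one the field type already has -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
```

Because the same analyzer runs at index time and at query time, both the indexed terms and the user's search terms lose their accents before they are compared.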

This brings up an interesting problem:

  • The module is displaying search results from the data stored in Solr (e.g.: node title)
  • If we strip diacritics (transliterate?) from certain fields before they are stored, the stripped text will also show up in search results. (e.g.: "Crüe" would get indexed AND show up as "Crue" or "Cruee", depending on your filter)

It is possible to store two versions of each field. For example: "text" and "text_filtered".

Say, if the user typed in the word "crüe", we could generate a query like this in Solr:

text_filtered:crue

which would match "crue", "crué", "crüe", etc. and still keep the original text "crüe" in the text: field.

We can also have exact matches fare better by internally building a query with what the user typed and its filtered version:

text_normal:crüe^10 OR text_filtered:crue

which would boost exact matches for "crüe" in the listing over matches for crúe, crue, etc.
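
A minimal sketch of the two-field layout this query assumes is shown below. The field names text_normal and text_filtered come from the query above; the field type text_folded is hypothetical and stands for any text type whose analyzer strips accents, and the other required parts of schema.xml are omitted.

```xml
<!-- Sketch only: other required schema.xml elements are omitted -->
<schema name="two-field-sketch" version="1.1">
  <fields>
    <!-- original text: stored for display, indexed for exact (boosted) matches -->
    <field name="text_normal" type="text" indexed="true" stored="true"/>
    <!-- accent-free copy: searchable, but never shown to the user -->
    <!-- "text_folded" is a hypothetical accent-stripping field type -->
    <field name="text_filtered" type="text_folded" indexed="true" stored="false"/>
  </fields>

  <!-- Solr fills text_filtered from text_normal at index time -->
  <copyField source="text_normal" dest="text_filtered"/>
</schema>
```

With copyField the module only has to send the original text once; the filtered variant is derived inside Solr.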

I'll be looking into this; for now I am taking this approach:

  • Indexed the "text" field as diacritic-free, by filtering the text in Drupal before posting it to Solr (hook_update_index).
  • Used theme_search_item() to take only the nids returned by Solr and show the current node teasers as search results, instead of what's stored in Solr.

If anyone wants more info for the above, please contact me.

janusman’s picture

Status: Active » Needs review
Attached file: new, 3.26 KB

I have a solution: it turns out that ISOLatin1AccentFilterFactory does work with UTF-8. So the only thing left is to actually add a field type definition and a field to Solr.

Attached is the patch for schema.xml for DRUPAL-5--1-0-ALPHA3. It adds a new field definition, a new field, copies content from other fields into that field, and makes that new field the default search field. This new field is not stored, only indexed, so search results still retain their original entities.

This is now working on our production site; searches for "méxico" and "mexico" return the same results (although the highlighting differs).

janusman’s picture

Version: 5.x-1.0-beta1 » 5.x-1.0-alpha3

robertdouglass’s picture

Can you comment on why you renamed the text field to "any"?

robertdouglass’s picture

Priority: Normal » Critical

janusman’s picture

In short, I made a choice, but there are other ways to do it =)

My thinking was this:

The current module implementation does not include term names in the index (that is, in what's put into $document->text).

So between patching both apachesolr.module and schema.xml and patching only schema.xml, I opted for the "just schema.xml" patch. (But perhaps it's the wrong decision.)

So I created a diacritic-free Solr field named "any", with type "text_normalized" (which has the ISOLatin1AccentFilterFactory filter), and then told Solr that default searches are done on "any" instead of the previous default field.

In my patch, schema.xml copies title, text and taxonomy_name fields to the new normalized "any" field, and also instructs Solr to use that field for default searching.
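
The arrangement described here can be sketched roughly as follows. This is not the attached patch: the names "any" and "text_normalized" and the three source fields come from the description above, while the tokenizer, the lowercase filter, and the attribute values are assumptions.

```xml
<!-- Sketch only: other required schema.xml elements are omitted -->
<schema name="any-field-sketch" version="1.1">
  <types>
    <fieldType name="text_normalized" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- assumed tokenizer and lowercase filter -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- indexed but not stored, so displayed results keep the original text;
         multiValued because several source fields are copied into it -->
    <field name="any" type="text_normalized" indexed="true" stored="false" multiValued="true"/>
  </fields>

  <!-- searches without an explicit field prefix go to "any" -->
  <defaultSearchField>any</defaultSearchField>

  <copyField source="title" dest="any"/>
  <copyField source="text" dest="any"/>
  <copyField source="taxonomy_name" dest="any"/>
</schema>
```

The source fields keep their own types, so a fielded query such as taxonomy_name:MEXICO still behaves as before.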

P.S.: There is still the question of whether or not to activate this for other fields. Right now the search "taxonomy_name:MEXICO" only matches the exact phrase MEXICO, because we are using string types, which are not tokenized or converted to lowercase, much less de-accented =) It's probably nothing to worry about, as users would not know that such a search is possible, but keep it in mind for future interface changes, like an "advanced search" page.

robertdouglass’s picture

I like the normalized text field. I don't mind changing to "any", but I don't think you copy enough of the fields. Shouldn't any text fields be copied?

   <field name="title" type="string" indexed="true" stored="true"/>
   <field name="body" type="text" indexed="false" stored="true"/>
   <field name="type" type="string" indexed="true" stored="true"/>
   <field name="name" type="string" indexed="true" stored="true"/>
   <field name="taxonomy_name" type="string" indexed="true" stored="true" multiValued="true"/>
 
   <dynamicField name="smfield*" type="string" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="tmfield*" type="text" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="ssfield*" type="string" indexed="true" stored="true" multiValued="false"/>
   <dynamicField name="tsfield*" type="text" indexed="true" stored="true" multiValued="false"/>

janusman’s picture

Attached file: new, 2.7 KB

Well then, following this, we can decide not to have apachesolr.module copy everything into the "text" field itself, but have schema.xml do it for us. We can thus do away with the $document->text = $text assignment (and the preceding construction of $text) in the indexing code.

A new patch for D5 (this only patches schema.xml) with all current text fields is included... I also added path.

Tested under 5.x-1.x-DEV, under Solr 1.2.0. http://bibdig.mty.itesm.mx/search/apachesolr_search/biochemistry

janusman’s picture

Version: 5.x-1.0-alpha3 » 6.x-1.x-dev
Attached file: new, 1.46 KB

Can we try just adding the ISOLatin1Accent filter to schema.xml? For now that would be enough to let searches match whether or not they include diacritics ("méxico" and "mexico" would return the same results).

Attached is a patch for schema.xml.

blackdog’s picture

Attached file: new, 1.25 KB

I've applied the patch in #12 and tried searching for words with different characters in them, but I can't seem to get it to work consistently, e.g.:

dét returns correct,
Första does not,
fånka does,
dü does

Updated patch without hunks.

pwolanin’s picture

Is this patch useful? This thread suggests a future Lucene filter that may be better, but development on it seems to have been idle.

http://www.nabble.com/Best-way-to-index-without-diacritics-td18935599.html

janusman’s picture

Issue tags: +transliteration

One possibility is that enabling this by default might be "dangerous", that is, it could make Solr return "wrong" results because of how the filter folds characters. To actually make a judgement, I'd say compare this to core search and see whether solr.ISOLatin1AccentFilterFactory differs too much from MySQL's diacritic matching.

Some alternatives to go on:
* Simply add instructions somewhere on how to comment/uncomment the appropriate lines in schema.xml
* Make different schemas that work better for Spanish and perhaps other languages (e.g.: schema-en.xml, schema-es.xml)
* Compare to MySQL search and, if the results are similar enough, include it as the default in schema.xml
* Use an external transliteration service at index and query time, like http://drupal.org/project/transliteration (which to me sounds logical, but is perhaps harder work)

At least for Spanish-language documents, adding in ISOLatin1AccentFilterFactory seems to work fine =)

Can we perhaps start out with the first, simple alternative (just mention it in the docs) and move on from there? =)

janusman’s picture

Another clue: ASCIIFoldingFilter has been committed to Lucene, but apparently not to Solr yet. It seems to replace ISOLatin1AccentFilter.

------------------------------------------------------------------------
r724053 | markrmiller | 2008-12-06 18:25:42 -0500 (Sat, 06 Dec 2008) | 1 line

LUCENE-1390: Added ASCIIFoldingFilter, a Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. ISOLatin1AccentFilter, which handles a subset of this filter, has been deprecated.
------------------------------------------------------------------------
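
For illustration only: later Solr releases expose this Lucene filter as solr.ASCIIFoldingFilterFactory, and in a build that includes it, swapping it in for the ISO Latin-1 filter might look like the sketch below. The field type name text_folded and the tokenizer are assumptions.

```xml
<!-- Sketch only: requires a Solr build that includes ASCIIFoldingFilterFactory -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- assumed tokenizer -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- takes the place of solr.ISOLatin1AccentFilterFactory -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```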

References:

http://www.lucidimagination.com/search/document/a6789c46dc940793/filters...

http://www.lucidimagination.com/search/document/aa42f6f3189dc792/filteri...

pwolanin’s picture

If Solr is following Lucene trunk, then it seems like we'll have this soon?

pwolanin’s picture

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
will work with:
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
according to erikhatcher

So we can basically substitute it into the example <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
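
A sketch of what that substitution could look like follows. The charFilter, tokenizer, and fieldType attributes are the ones quoted above; the lowercase filter is an assumption, and the mapping file is assumed to be present in the Solr conf directory next to schema.xml.

```xml
<!-- Sketch only: not the committed patch -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- char filters run on the raw character stream, so they must be
         declared before the tokenizer -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <!-- assumed: keep whatever token filters the field type already had -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the mapping is applied before tokenizing, the accents are already gone by the time any token filter sees the text.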

janusman’s picture

Attached file: new, 1.32 KB

Rolled patch for schema.xml

janusman’s picture

Forgot to add some information and testing instructions...

This patch was tested under the newest Solr nightly build (solr-2009-02-19.tgz). Remember to copy both solrconfig.xml and schema.xml into your Solr configuration =)

To test you need to stop Solr, patch schema.xml, restart Solr, and delete the Solr index.
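
For the last step, one way to empty the index is to post Solr's XML update messages to its update handler, as sketched below. The handler's URL depends on your installation and is not shown here. The two messages are sent as separate requests.

```xml
<!-- Message 1: delete every document in the index -->
<delete><query>*:*</query></delete>

<!-- Message 2: commit, so that the deletion takes effect -->
<commit/>
```

After this the index is empty, so the site's content has to be re-indexed before running the test searches.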

Tests done were searches for items with the words: ficción, niños, México. Returned results were the same when issuing search words with and without diacritics. (E.g., "ficcion" returns the same results as "ficción")

pwolanin’s picture

Status: Needs review » Fixed
Attached file: new, 2.75 KB

Committing this patch to 6.x.

Status: Fixed » Closed (fixed)
Issue tags: -transliteration

Automatically closed -- issue fixed for 2 weeks with no activity.