Handle searching with/without diacritics [#231200]

Comment	File	Size	Author
#22	iso-map-231200-22.patch	2.75 KB	pwolanin
#20	231200_20.patch	1.32 KB	janusman
#13	231200_D6_jan16.patch	1.25 KB	blackdog
#12	231200_D6_oct30.patch	1.46 KB	janusman
#10	diacritics_231200_D5.patch	2.7 KB	janusman
#4	schema_5--1-0-ALPHA3_accents.patch	3.26 KB	janusman

Comment #1

rba commented 27 March 2008 at 11:53

The way to do this is by tweaking the schema.xml provided by this module to your needs: the stop-word dictionary and your choice of filters per field (lowercasing, transliteration, plural-form support, etc.).

I think your main source of information is the Lucene documentation: technically, it's not *exactly* this module's job but Lucene's.

Comment #2

robertdouglass commented 28 March 2008 at 06:48

@rba: you're correct that this is something to do at the Solr/Lucene level. I'm leaving this issue open in the hope that someone will come back here and document the process when they've met with success.

Comment #3

janusman commented 28 May 2008 at 22:40

I did some research and it turns out the only (included) Token Filter in Solr for filtering diacritics is one for the ISOLatin character set. So the module would have (for now) to feed ISO-latin equivalents of UTF... eww!

You can turn it on selectively for each field in schema.xml. For example, you can turn it on for "text" type fields, but not for "string" fields.

This brings up an interesting problem:

The module is displaying search results from the data stored in Solr (e.g.: node title).
If we strip diacritics (transliterate?) from certain fields, the stripped text will also show up in search results (e.g.: "Crüe" would get indexed AND show up as "Crue" or "Cruee" depending on your filter).

It is possible to store two versions of each field. For example: "text" and "text_filtered".

Say, if the user typed in the word "crüe", we could generate a query like this in Solr:

text_filtered:crue

which would match crue, crué, crüe, etc. and still keep the original text crüe in the "text" field.

We can also have exact matches fare better by internally building a query with what the user typed and its filtered version:

text_normal:crüe^10 OR text_filtered:crue

which would boost exact matches for "crüe" in the listing over matches for crúe, crue, etc.
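
A minimal sketch of the two-field idea as a schema.xml fragment. The field names text_normal and text_filtered come from the queries above; the type names text_plain and text_folded and the analyzer chains are assumptions made for illustration, not something taken from the module's schema.

```xml
<schema name="two-field-sketch" version="1.1">
  <types>
    <!-- Keeps diacritics: used for exact (boosted) matches. Hypothetical type name. -->
    <fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- Strips diacritics at index and query time. Hypothetical type name. -->
    <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- Stored, so results can be displayed with the original accents -->
    <field name="text_normal" type="text_plain" indexed="true" stored="true"/>
    <!-- Indexed only: exists purely for accent-insensitive matching -->
    <field name="text_filtered" type="text_folded" indexed="true" stored="false"/>
  </fields>
  <!-- Solr fills the folded copy itself; the client only sends text_normal -->
  <copyField source="text_normal" dest="text_filtered"/>
</schema>
```

Because the same analyzer also runs on query terms, a query such as text_filtered:crüe should be folded by Solr itself, so the client would not strictly need to strip the accents before querying.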

I'll be looking into this; for now I am taking this approach:

Indexing the "text" field as diacritic-free, by filtering the text in Drupal before posting it to Solr (hook_update_index).
Using theme_search_item() to take only the returned nids and show the current node teasers as search results, instead of what's stored in Solr.

If anyone wants more info for the above, please contact me.

Comment #4

janusman commented 4 September 2008 at 19:58

Status:	Active	» Needs review

Status	File	Size
new	schema_5--1-0-ALPHA3_accents.patch	3.26 KB

Have solution; it turns out that ISOLatin1AccentFilterFactory does work with UTF. So the only thing left is to actually add a field definition and field to Solr.

Attached is the patch for schema.xml for DRUPAL-5--1-0-ALPHA3. It adds a new field type definition, adds a new field of that type, copies content from other fields into that field, and makes that new field the default search field. The new field is not stored, only indexed, so search results still display the original (accented) text.

This is working now on our production site; searches for méxico and mexico show the same results (although highlighting is different).

Comment #5

janusman commented 4 September 2008 at 20:31

Version:	5.x-1.0-beta1	» 5.x-1.0-alpha3

Comment #6

robertdouglass commented 5 September 2008 at 10:13

Can you comment on why you renamed the text field to "any"?

Comment #7

robertdouglass commented 5 September 2008 at 12:18

Priority:	Normal	» Critical

Comment #8

janusman commented 5 September 2008 at 14:45

In short, I made a choice, but there are other ways to do it =)

My thinking was this:

The current module implementation is not including term names in the index (what's put into $document->text).

So between patching both apachesolr.module and schema.xml, and patching only schema.xml, I opted for the "just schema.xml" patch. (But perhaps it's the wrong decision.)

So I created a diacritic-free Solr field named "any", with type "text_normalized" (which uses the ISOLatin1AccentFilterFactory filter), and then told Solr that default searches are done on "any" instead of the previous default field.

In my patch, schema.xml copies title, text and taxonomy_name fields to the new normalized "any" field, and also instructs Solr to use that field for default searching.
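
Roughly, the schema.xml changes described above could look like the following fragment. The names any and text_normalized and the three copied fields are taken from this comment; the tokenizer and the other filters in the chain are assumptions, and the actual patch may differ.

```xml
<schema name="any-field-sketch" version="1.1">
  <types>
    <!-- Diacritic-free text type; the chain around the accent filter is assumed -->
    <fieldType name="text_normalized" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- Indexed but not stored: results are still shown from the original fields.
         multiValued because several source fields are copied into it. -->
    <field name="any" type="text_normalized" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <!-- Unqualified queries now search the normalized field -->
  <defaultSearchField>any</defaultSearchField>
  <copyField source="title" dest="any"/>
  <copyField source="text" dest="any"/>
  <copyField source="taxonomy_name" dest="any"/>
</schema>
```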

P.S.: There is still the question of whether or not to activate this for other fields; the search "taxonomy_name:MEXICO" right now only matches the exact phrase MEXICO (we are using string types, which are not tokenized, nor converted to lowercase, much less de-accented) =) Probably nothing to worry about, as users would not know that this search is possible (but keep it in mind for future interface changes, like an "advanced search" page).

Comment #9

robertdouglass commented 7 September 2008 at 13:01

I like the normalized text field. I don't mind changing to "any", but I don't think you copy enough of the fields. Shouldn't any text fields be copied?

   <field name="title" type="string" indexed="true" stored="true"/>
   <field name="body" type="text" indexed="false" stored="true"/>
   <field name="type" type="string" indexed="true" stored="true"/>
   <field name="name" type="string" indexed="true" stored="true"/>
   <field name="taxonomy_name" type="string" indexed="true" stored="true" multiValued="true"/>
 
   <dynamicField name="smfield*"  type="string"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tmfield*"  type="text"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="ssfield*"  type="string"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tsfield*"  type="text"    indexed="true"  stored="true" multiValued="false"/>

Comment #10

janusman commented 10 September 2008 at 00:51

Status	File	Size
new	diacritics_231200_D5.patch	2.7 KB

Well then, following this, we can decide not to have apachesolr.module copy content into the "text" field itself, but have schema.xml do it for us. We can thus do away with the $document->text = $text assignment (and the preceding construction of $text) in the indexing code.

A new patch for D5 (this only patches schema.xml) with all current text fields is included... I also added path.
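
As an illustration only, copying "all current text fields" plus path into the normalized field might look like the fragment below. The field names are the ones listed in #9 plus path; exactly which fields the patch copies is not spelled out in this comment, so the selection here is an assumption, as is wildcard support in copyField sources for the Solr version in use.

```xml
<schema name="copyfield-sketch" version="1.1">
  <!-- Explicit fields from the listing in #9 -->
  <copyField source="title" dest="any"/>
  <copyField source="body" dest="any"/>
  <copyField source="name" dest="any"/>
  <copyField source="taxonomy_name" dest="any"/>
  <!-- Dynamic text fields; a glob in source is assumed to be supported -->
  <copyField source="tmfield*" dest="any"/>
  <copyField source="tsfield*" dest="any"/>
  <!-- Added in this patch according to the comment -->
  <copyField source="path" dest="any"/>
</schema>
```

Since copyField works on the raw submitted value, a source field such as body can be copied even though it is declared with indexed="false".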

Tested under 5.x-1.x-DEV, under Solr 1.2.0. http://bibdig.mty.itesm.mx/search/apachesolr_search/biochemistry

Comment #11

janusman commented 10 September 2008 at 14:21

Comment #12

janusman commented 30 October 2008 at 21:25

Version:	5.x-1.0-alpha3	» 6.x-1.x-dev

Status	File	Size
new	231200_D6_oct30.patch	1.46 KB

Can we try just adding the ISOLatinAccent filter to schema.xml? For now that would be enough to allow searches to match whether they have diacritics or not (méxico and mexico would bring the same results).

Attached is a patch for schema.xml.
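
For readers without the patch at hand, the simpler approach amounts to adding one filter line to the existing "text" field type. The sketch below assumes a shortened analyzer chain (the real schema.xml has more filters) and adds the accent filter to both the index and the query analyzer, so that indexed terms and search terms are folded the same way.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- New line: fold accented Latin-1 characters to their unaccented forms -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Must mirror the index analyzer, or folded terms will not match -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Content indexed before the change keeps its old terms, so it has to be re-indexed before searches behave consistently.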

Comment #13

blackdog commented 16 January 2009 at 10:45

Status	File	Size
new	231200_D6_jan16.patch	1.25 KB

I've applied the patch in #12 and tried searching for words with different characters in them, but can't seem to get it to work consistently, i.e.:

dét returns correct,
Första does not,
fånka does,
dü does

Updated patch without hunks.

Comment #14

pwolanin commented 17 January 2009 at 14:43

Is this patch useful? This thread suggests a future Lucene filter that may be better, but it seems development on it has been idle.

http://www.nabble.com/Best-way-to-index-without-diacritics-td18935599.html

Comment #15

janusman commented 30 January 2009 at 17:48

Issue tags:	+transliteration

One possibility is that leaving this on by default might be "dangerous", that is, it could make Solr return "wrong" results because of the filter's folding. To actually make a judgement, I'd compare this to core search and see whether solr.ISOLatin1AccentFilterFactory differs too much from MySQL diacritic matching.

Some alternatives to go on:
* Simply add instructions somewhere to comment/uncomment the appropriate lines in schema.xml
* Make different schemas that work better for Spanish and perhaps other languages (e.g.: schema-en.xml, schema-es.xml)
* Compare to MySQL search and, if the results are similar enough, include it as the default in schema.xml
* Use an external transliteration service at index and query time, like http://drupal.org/project/transliteration (which to me sounds logical but is perhaps harder work)

At least for Spanish-language documents, adding in ISOLatin1AccentFilterFactory seems to work fine =)

Can we perhaps start out with the first, simple alternative (just mention it in the docs) and move on from there? =)

Comment #16

janusman commented 5 February 2009 at 19:51

Another clue: ASCIIFoldingFilter has been committed to Lucene, but apparently not to Solr yet. It seems to replace ISOLatin1AccentFilter.

------------------------------------------------------------------------
r724053 | markrmiller | 2008-12-06 18:25:42 -0500 (Sat, 06 Dec 2008) | 1 line

LUCENE-1390: Added ASCIIFoldingFilter, a Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. ISOLatin1AccentFilter, which handles a subset of this filter, has been deprecated.
------------------------------------------------------------------------
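
For reference, Solr releases that ship a factory for this filter expose it as solr.ASCIIFoldingFilterFactory. At the time of this comment it was not yet available in Solr, so the fragment below is a forward-looking sketch; the type name text_ascii and the rest of the chain are assumptions.

```xml
<fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Folds non-ASCII letters, digits and symbols to ASCII where an equivalent exists -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```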

References:

http://www.lucidimagination.com/search/document/a6789c46dc940793/filters...

http://www.lucidimagination.com/search/document/aa42f6f3189dc792/filteri...

Comment #17

pwolanin commented 5 February 2009 at 22:49

If Solr is following Lucene trunk, then it seems like we'll have this soon?

Comment #18

pwolanin commented 7 February 2009 at 20:05

Damien pointed me to: https://issues.apache.org/jira/browse/SOLR-822

Comment #19

pwolanin commented 18 February 2009 at 16:29

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
will work with:
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
according to erikhatcher

so we can basically sub it into the example <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
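
Put together, the substitution might look like the sketch below. The charFilter line, the tokenizer, and the fieldType attributes are the ones quoted above; the query analyzer and the lowercase filter are assumptions, and the example "text" type has more filters than shown here. A charFilter has to be listed before the tokenizer, because it rewrites the character stream before tokens are produced.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Maps accented characters to plain ones using the shipped mapping file -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same mapping on query text so that "ficción" and "ficcion" produce the same term -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The mapping file has to sit in the Solr conf directory next to schema.xml for the mapping attribute to resolve.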

Comment #20

janusman commented 19 February 2009 at 20:56

Status	File	Size
new	231200_20.patch	1.32 KB

Rolled patch for schema.xml

Comment #21

janusman commented 19 February 2009 at 23:01

Forgot to add some information and testing instructions...

This patch was tested under the newest Solr nightly build (solr-2009-02-19.tgz). Remember to copy both solrconfig.xml and schema.xml into your Solr configuration =)

To test you need to stop Solr, patch schema.xml, restart Solr, and delete the Solr index.
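
If the index is deleted through Solr rather than by removing the data directory, the standard XML update messages can do it. This is a sketch that assumes the default update handler is reachable at /update; the two messages below are posted separately, one request each.

```xml
<!-- Message 1: delete every document in the index -->
<delete><query>*:*</query></delete>

<!-- Message 2: make the deletion visible to searches -->
<commit/>
```

After that, the content has to be re-indexed from Drupal so that it is analyzed with the patched schema.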

Tests done were searches for items with the words: ficción, niños, México. Returned results were the same when issuing search words with and without diacritics. (E.g., "ficcion" returns the same results as "ficción")

Comment #22

pwolanin commented 20 February 2009 at 02:09

Status:	Needs review	» Fixed

Status	File	Size
new	iso-map-231200-22.patch	2.75 KB

committing this patch to 6.x

Comment #23

6 March 2009 at 02:10

Status:	Fixed	» Closed (fixed)
Issue tags:	-transliteration

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #24