Search is splitting Indian-language words inappropriately. [#160794]

Hi,
The search module's search_simplify function splits Indic-language text inappropriately, so indexing and searching of Indic content is buggy. This happens around line 300 of the search.module file. The line below is responsible:
$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);

Reason: As far as we understand, the PREG_CLASS_SEARCH_EXCLUDE definition contains many Mn and Mc Unicode characters that are actively used in Indic words, although many Mn and Mc characters are already left out of the list. Example: 0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;Varia;;;

Examples of Mn and Mc characters that remain in the list: \x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}

Solution: Remove all Mn and Mc characters from the PREG_CLASS_SEARCH_EXCLUDE definition.
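
To make the report concrete, here is a minimal standalone sketch in Python (not Drupal code). It assumes only the subset of PREG_CLASS_SEARCH_EXCLUDE quoted above, not the full list, and the helper names simplify and simplify_keeping_marks are hypothetical. It checks that the quoted code points are all Mn or Mc, shows how replacing them with spaces breaks up the Hindi word हिन्दी, and shows one possible way to leave such marks alone.

```python
import re
import unicodedata

# Only the subset of PREG_CLASS_SEARCH_EXCLUDE quoted in the report
# (an assumption for illustration; the real list is much longer).
REPORTED_SUBSET = "\u0903\u093c\u093e-\u094d\u0951-\u0954"

# The same code points, listed individually for the category check.
REPORTED_CODE_POINTS = (
    [0x903, 0x93C]
    + list(range(0x93E, 0x94D + 1))
    + list(range(0x951, 0x954 + 1))
)


def is_combining_mark(ch):
    """True for nonspacing (Mn) and spacing combining (Mc) marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def simplify(text, exclude_class):
    """Mimic the reported line: each run of excluded characters becomes a space."""
    return re.sub("[" + exclude_class + "]+", " ", text)


def simplify_keeping_marks(text, exclude_class):
    """Hypothetical variant: excluded characters become spaces unless they are Mn/Mc."""
    pattern = re.compile("[" + exclude_class + "]")
    return "".join(
        " " if pattern.match(ch) and not is_combining_mark(ch) else ch
        for ch in text
    )


# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA, vowel sign II.
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"

if __name__ == "__main__":
    print(all(is_combining_mark(chr(cp)) for cp in REPORTED_CODE_POINTS))
    print(simplify(WORD, REPORTED_SUBSET).split())
    print(simplify_keeping_marks(WORD + "!", REPORTED_SUBSET + "!").split())
```

With the quoted subset, the first helper leaves only the three bare consonants as separate tokens, while the second keeps the word whole and still treats the added "!" as a separator.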

Comments

Comment #1

robertdouglass commented 15 April 2008 at 12:03

Version: 5.1 » 7.x-dev

This highlights the need for configurable processing pipelines. Lucene takes the approach of applying a series of analyzers and filters during indexing, and that pipeline is configurable. A simplified pipeline would suit Drupal well. Perhaps we could use the filter system and have input formats for processing search index text and queries?
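
As a rough illustration of the pipeline idea (a standalone Python sketch, not Drupal's filter system or Lucene's API; every function name here is hypothetical), the processing could be an ordered list of small text filters that a site swaps or reorders:

```python
import unicodedata


def lowercase(text):
    """Case-fold for scripts that have case; Indic scripts pass through unchanged."""
    return text.lower()


def strip_punctuation(text):
    """Replace characters in the Unicode punctuation categories (P*) with spaces.

    Combining marks (Mn/Mc) are not punctuation, so they are left in place.
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def run_pipeline(text, filters):
    """Apply each filter in order; the list itself is the configuration."""
    for text_filter in filters:
        text = text_filter(text)
    return text


# A hypothetical default; a site with other needs would supply its own list.
DEFAULT_PIPELINE = [lowercase, strip_punctuation, collapse_whitespace]

if __name__ == "__main__":
    print(run_pipeline("Hello, World!", DEFAULT_PIPELINE))
```

The same run_pipeline function would be used for both index text and queries, so both sides are normalized consistently.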

Comment #2

nitinksingh commented 15 April 2008 at 14:19

Yes, a configurable pipeline would be a great idea. I am not sure about using the filter system, though.

Comment #3

BlakeLucchesi commented 10 May 2008 at 19:40

Status: Active » Closed (won't fix)

This will be fixed by a new patch that uses input filters to process text for search and indexing. That would allow anyone with Unicode- or language-specific problems to disable the default input filters provided by search and assign their own.

http://drupal.org/node/257007