Hi,
The Search module's search_simplify() function splits Indic-language text incorrectly, so indexing and searching of Indic content is buggy. This happens around line 300 of search.module. The following line is responsible:
$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);

Reason: As far as we understand, PREG_CLASS_SEARCH_EXCLUDE contains many Mn and Mc Unicode characters that are actively used in Indic words, even though many other Mn and Mc characters are already left out of the list. Example: 0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;Varia;;;

Examples of Mn and Mc characters still in the list: \x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}
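
To make the effect concrete, here is a minimal sketch that reproduces the splitting outside Drupal. It assumes a shortened stand-in for the constant containing only the Devanagari ranges quoted above (the variable name is hypothetical, and the real PREG_CLASS_SEARCH_EXCLUDE is much longer), and it uses the Hindi word for "Hindi" as sample input.

```php
<?php
// Assumption: a shortened stand-in for PREG_CLASS_SEARCH_EXCLUDE that holds
// only the Devanagari mark ranges quoted in this report.
$exclude_with_marks = '\x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}';

// Sample word: U+0939 U+093F U+0928 U+094D U+0926 U+0940.
$word = "हिन्दी";

// Same pattern shape as the line in search_simplify(): every run of
// "excluded" characters is replaced by a single space.
$split = preg_replace('/[' . $exclude_with_marks . ']+/u', ' ', $word);

// List the combining marks (categories Mn and Mc) that the word contains.
preg_match_all('/[\p{Mn}\p{Mc}]/u', $word, $marks);

echo 'Before: ' . $word . "\n";
echo 'After:  ' . $split . "\n";
echo 'Marks:  ' . count($marks[0]) . "\n";
```

The two vowel signs and the virama are each turned into a space, so the single word is indexed as three separate consonants.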

Solution: Remove all Mn and Mc characters from the PREG_CLASS_SEARCH_EXCLUDE constant.

Comments

robertdouglass’s picture

Version: 5.1 » 7.x-dev

This highlights the need for configurable processing pipelines. Lucene takes the approach of applying a series of analyzers and filters during indexing. The pipeline is configurable. A simplified pipeline would suit Drupal well. Perhaps we could use the filter system and have input formats for processing search index text and queries?
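
As a rough illustration of the idea only, a pipeline could be an ordered list of callables that each take and return text. Every name below is hypothetical; nothing here is an existing Drupal or Lucene API.

```php
<?php
// Hypothetical helper: run the text through each step in order.
function example_search_pipeline_run(array $steps, $text) {
  foreach ($steps as $step) {
    $text = call_user_func($step, $text);
  }
  return $text;
}

// Hypothetical steps a site with Indic content might configure. \p{P}
// matches punctuation only, so combining marks (Mn, Mc) are left alone.
$indic_steps = array(
  function ($text) { return preg_replace('/\p{P}+/u', ' ', $text); },
  function ($text) { return preg_replace('/\s+/u', ' ', $text); },
  'trim',
);

$indexed = example_search_pipeline_run($indic_steps, "नमस्ते,  दुनिया! ");
echo $indexed . "\n";
```

Because the steps are plain data, a site could reorder them or swap out the word-splitting step without touching the indexing code.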

nitinksingh’s picture

Yes, a configurable pipeline would be a great idea. I am not sure about using the filter system, though.

BlakeLucchesi’s picture

Status: Active » Closed (won't fix)

This will be fixed by a new patch that uses input filters to process text for search and indexing. It would allow anyone with Unicode- or language-specific problems to disable the default input filters provided by search and assign their own.

http://drupal.org/node/257007

robertdouglass’s picture

Status: Closed (won't fix) » Postponed (maintainer needs more info)

Whether or not that pipeline will happen remains to be seen, so keeping this open.

sun.core’s picture

Status: Postponed (maintainer needs more info) » Postponed

Proper status?

jhodgdon’s picture

Status: Postponed » Active

That other issue got bumped to Drupal 8, so I'm reopening this one.

jhodgdon’s picture

Status: Active » Fixed

I believe this has been fixed.

The boundary character regular expression is now called 'PREG_CLASS_UNICODE_WORD_BOUNDARY', and it's near the top of unicode.inc.

The Mc and Mn characters are excluded from it, including the specific examples cited here.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.