Hi,
The Search module's search_simplify() function splits Indic-language text inappropriately, so indexing and searching of Indic content is buggy. This happens around line 300 of search.module. The line below is responsible:
$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);
Reason: As far as we understand, the PREG_CLASS_SEARCH_EXCLUDE constant holds many Mn and Mc Unicode characters that are actively used inside Indic words, although many other Mn and Mc characters are already left out of the list. An example of one that is already left out: 0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;Varia;;;
Examples of Mn and Mc characters still in the list -- \x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}
Solution: Remove all Mn and Mc characters from the PREG_CLASS_SEARCH_EXCLUDE constant.
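To illustrate the effect, here is a minimal sketch of the same preg_replace() call applied to a Devanagari word. The two character classes are hypothetical stand-ins built only from the ranges cited above plus the danda (U+0964); they are not the real value of PREG_CLASS_SEARCH_EXCLUDE. The "\u{...}" string escapes assume PHP 7 or later.

```php
<?php
// Hypothetical stand-in for an exclusion class that still lists the Devanagari
// marks cited above, plus the danda (U+0964), which is real punctuation.
// This is NOT the actual value of PREG_CLASS_SEARCH_EXCLUDE.
$class_with_marks = '\x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}\x{964}';

// The same stand-in with the Mn/Mc marks taken out.
$class_without_marks = '\x{964}';

// The word "हिन्दी" (Hindi) contains three marks: U+093F, U+094D, U+0940.
$text = "\u{0939}\u{093F}\u{0928}\u{094D}\u{0926}\u{0940}";

// Same replacement as in search_simplify(), using the stand-in classes.
$broken = preg_replace('/[' . $class_with_marks . ']+/u', ' ', $text);
$intact = preg_replace('/[' . $class_without_marks . ']+/u', ' ', $text);

// Count the whitespace-separated pieces each version would be indexed as.
$broken_words = preg_split('/\s+/u', $broken, -1, PREG_SPLIT_NO_EMPTY);
$intact_words = preg_split('/\s+/u', $intact, -1, PREG_SPLIT_NO_EMPTY);

echo count($broken_words), " piece(s) with marks in the class\n";
echo count($intact_words), " piece(s) with marks removed from the class\n";
```

With the marks in the class, every vowel sign and virama becomes a space, so the single word is indexed as isolated consonants and a query for the whole word cannot match it.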
Comments
Comment #1
robertdouglass commented: This highlights the need for configurable processing pipelines. Lucene takes the approach of applying a series of analyzers and filters during indexing, and that pipeline is configurable. A simplified pipeline would suit Drupal well. Perhaps we could use the filter system and have input formats for processing search index text and queries?
Comment #2
nitinksingh commented: Yes, a configurable pipeline would be a great idea. I am not sure about using the filter system, though.
Comment #3
BlakeLucchesi commented: This will be fixed with a new patch that uses input filters to process text for search and indexing. That would allow anyone with Unicode- or language-specific problems to disable the default input filters provided by search and assign their own.
http://drupal.org/node/257007
Comment #4
robertdouglass commented: Whether or not that pipeline will happen remains to be seen, so keeping this open.
Comment #5
sun.core commented: Proper status?
Comment #6
jhodgdon commented: That other issue got bumped to Drupal 8, so I'm reopening this one.
Comment #7
jhodgdon commented: I believe this has been fixed.
The boundary character regular expression is now called 'PREG_CLASS_UNICODE_WORD_BOUNDARY', and it's near the top of unicode.inc.
The Mc and Mn characters are excluded from it, including the specific examples cited here.
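For anyone who wants to double-check the categories, here is a rough sketch that tests each code point cited in the original report against PCRE's \p{Mn} and \p{Mc} properties. It does not inspect PREG_CLASS_UNICODE_WORD_BOUNDARY itself; it only confirms that the cited characters are marks. The helper name codepoint_to_utf8() is made up for this example.

```php
<?php
// Code points cited in the original report, expanded from its ranges.
$cited = array_merge(array(0x903, 0x93c), range(0x93e, 0x94d), range(0x951, 0x954));

// Hypothetical helper: build a UTF-8 character from a code point by decoding
// a numeric HTML entity, so the mbstring extension is not required.
function codepoint_to_utf8($cp) {
  return html_entity_decode('&#' . $cp . ';', ENT_NOQUOTES, 'UTF-8');
}

// Collect any cited code point that PCRE does not classify as Mn or Mc.
$not_marks = array();
foreach ($cited as $cp) {
  if (!preg_match('/^[\p{Mn}\p{Mc}]$/u', codepoint_to_utf8($cp))) {
    $not_marks[] = sprintf('U+%04X', $cp);
  }
}

if ($not_marks) {
  echo 'Not Mn/Mc: ', implode(', ', $not_marks), "\n";
}
else {
  echo "All cited code points are Mn or Mc\n";
}
```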