Search is splitting Indian-language words inappropriately.
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | search.module |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | won't fix |
Hi,
The search module's search_simplify() function splits Indic-language text inappropriately, so indexing and searching of Indic content is buggy. This happens around line 300 of the search.module file. The following line is responsible:
$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);
Reason: As far as we understand, PREG_CLASS_SEARCH_EXCLUDE contains many Unicode characters of category Mn (nonspacing mark) and Mc (spacing combining mark) that are actively used in Indic words, although many Mn and Mc characters are already left out of the list. Example of one that is left out: 0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;Varia;;;
Examples of Mn and Mc characters that remain in the list: \x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}
Solution: Remove all Mn and Mc characters from the PREG_CLASS_SEARCH_EXCLUDE constant.
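To illustrate the effect, here is a minimal sketch in Python (not the PHP from search.module). It only uses the code points quoted above; the SAMPLE_EXCLUDE list, the added danda (U+0964), and the helper names are assumptions made for the demo, not the real contents of PREG_CLASS_SEARCH_EXCLUDE.

```python
import re
import unicodedata

# Code points quoted in the report: U+0903, U+093C, U+093E-U+094D, U+0951-U+0954.
REPORTED = [0x903, 0x93C, *range(0x93E, 0x94E), *range(0x951, 0x955)]

# Illustrative stand-in for the full exclude list: the reported marks plus
# the danda U+0964, a punctuation sign (assumed here only for the demo).
SAMPLE_EXCLUDE = REPORTED + [0x964]


def build_pattern(code_points):
    """Compile a regex matching runs of the given code points."""
    chars = "".join(re.escape(chr(cp)) for cp in code_points)
    return re.compile("[" + chars + "]+")


def without_marks(code_points):
    """Drop code points whose general category is Mn or Mc."""
    return [cp for cp in code_points
            if unicodedata.category(chr(cp)) not in ("Mn", "Mc")]


def simplify(text, code_points):
    """Replace each run of excluded characters with one space."""
    return build_pattern(code_points).sub(" ", text)


word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
text = word + "\u0964"                          # followed by a danda

# With the marks in the exclude list, the word breaks into fragments.
print(simplify(text, SAMPLE_EXCLUDE).split())
# With Mn/Mc removed from the list, only the danda is replaced.
print(simplify(text, without_marks(SAMPLE_EXCLUDE)).split())
```

In this sketch the first call yields three separate consonants, because the vowel signs and the virama between them are replaced by spaces. The second call keeps the word whole, which is the behaviour the proposed solution is aiming for.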

#1
This highlights the need for configurable processing pipelines. Lucene takes the approach of applying a series of analyzers and filters during indexing, and that pipeline is configurable. A simplified pipeline would suit Drupal well. Perhaps we could use the filter system and have input formats for processing search index text and queries?
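As a rough sketch of what such a pipeline could look like (in Python, purely illustrative): a tokenizer followed by an ordered list of filters, where each site picks its own list. None of these names exist in Drupal or Lucene; make_pipeline, whitespace_tokenizer, lowercase and strip_punctuation are hypothetical.

```python
import unicodedata
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split text on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Remove punctuation (categories P*) but keep combining marks (Mn, Mc)."""
    cleaned = []
    for token in tokens:
        kept = "".join(c for c in token
                       if not unicodedata.category(c).startswith("P"))
        if kept:  # drop tokens that were punctuation only
            cleaned.append(kept)
    return cleaned


def make_pipeline(tokenizer: Tokenizer,
                  filters: List[TokenFilter]) -> Callable[[str], List[str]]:
    """Return a function that tokenizes text, then applies each filter in order."""
    def run(text: str) -> List[str]:
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return run


# Two configurations built from the same parts.
default_pipeline = make_pipeline(whitespace_tokenizer, [lowercase, strip_punctuation])
indic_pipeline = make_pipeline(whitespace_tokenizer, [strip_punctuation])

print(default_pipeline("Hello, World!"))
print(indic_pipeline("\u0939\u093f\u0928\u094d\u0926\u0940\u0964"))
```

The point of the design is that the same pipeline would run at index time and at query time, and a site with language-specific needs swaps filters instead of patching a shared regular expression.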
#2
Yes, a configurable pipeline would be a great idea. I am not sure about using the filter system, though.
#3
This will be fixed by a new patch that uses input filters to process text for search and indexing. That would allow anyone with Unicode- or language-specific problems to disable the default input filters provided by search and assign their own.
http://drupal.org/node/257007