Search is splitting Indian-language words inappropriately.

nitinksingh - July 20, 2007 - 11:30
Project:Drupal
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:won't fix
Description

Hi,
The search module's search_simplify() function splits Indic-language text inappropriately, so indexing and searching of Indic content is buggy. This happens around line 300 of the search.module file. The line below is responsible:
$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);

Reason: As far as we understand, the defined PREG_CLASS_SEARCH_EXCLUDE contains many Mn (nonspacing mark) and Mc (spacing combining mark) Unicode characters that are actively used in Indic words, although many Mn and Mc characters are already left out of the list. Example of one that is already left out: 0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;Varia;;;

Examples of Mn and Mc characters still in the list: \x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}

Solution: Remove all Mn and Mc characters from the PREG_CLASS_SEARCH_EXCLUDE constant.
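
To illustrate the splitting described above, here is a minimal sketch. It assumes a shortened, hypothetical stand-in for PREG_CLASS_SEARCH_EXCLUDE that holds only the Devanagari ranges quoted in this report, not Drupal's full definition, and it uses the Hindi word for "Hindi" as sample input. The variable names are for illustration only, and the second pattern is just one possible way to leave Mn and Mc characters alone.

```php
<?php
// Hypothetical stand-in for PREG_CLASS_SEARCH_EXCLUDE: only the Devanagari
// Mn/Mc ranges quoted in this report, not the real constant from search.module.
$exclude = '\x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}';

// Sample word "Hindi": HA, vowel sign I, NA, virama, DA, vowel sign II.
$word = 'हिन्दी';

// Same shape as the line in search_simplify(): every excluded character
// becomes a space, so the vowel signs and the virama break the word apart.
$split = preg_replace('/[' . $exclude . ']+/u', ' ', $word);
$tokens = preg_split('/\s+/u', trim($split));
// $tokens now holds three bare consonants: 'ह', 'न', 'द'.

// Sketch of the proposed fix: skip any character whose Unicode general
// category is Mn or Mc, so combining marks stay attached to their base letter.
$kept = preg_replace('/(?:(?![\p{Mn}\p{Mc}])[' . $exclude . '])+/u', ' ', $word);
// $kept is identical to $word because every character in $exclude is Mn or Mc.
```

With the original pattern the word is indexed as three separate one-letter tokens, which is why a search for the whole word fails.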

#1

robertDouglass - April 15, 2008 - 12:03
Version:5.1» 7.x-dev

This highlights the need for configurable processing pipelines. Lucene takes the approach of applying a series of analyzers and filters during indexing, and that pipeline is configurable. A simplified pipeline would suit Drupal well. Perhaps we could use the filter system and have input formats for processing search index text and queries?
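
A rough sketch of what such a pipeline could look like, under stated assumptions: the function name example_search_pipeline() and the step list are hypothetical and are not part of Drupal, the filter system, or Lucene. It also assumes PHP 7.4 or later and the mbstring extension.

```php
<?php
// Hypothetical sketch: none of these names exist in Drupal or Lucene.
// Runs the text through each configured step in order.
function example_search_pipeline(string $text, array $steps): string {
  foreach ($steps as $step) {
    $text = $step($text);
  }
  return $text;
}

// An illustrative default configuration. A site with Indic content could
// swap out or drop any step without touching the others.
$default_steps = [
  // Lowercase with multibyte awareness (requires mbstring).
  fn(string $t): string => mb_strtolower($t, 'UTF-8'),
  // Replace punctuation with spaces; combining marks (Mn, Mc) are untouched.
  fn(string $t): string => preg_replace('/\p{P}+/u', ' ', $t),
  // Collapse runs of whitespace and trim the ends.
  fn(string $t): string => trim(preg_replace('/\s+/u', ' ', $t)),
];

$indexed = example_search_pipeline('हिन्दी, Search!', $default_steps);
// $indexed is 'हिन्दी search': the Hindi word survives intact.
```

The point of the design is that the same configured step list would be applied both when indexing and when processing queries, so the two always agree on how text is normalized.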

#2

nitinksingh - April 15, 2008 - 14:19

Yes, a configurable pipeline would be a great idea. I am not sure about using the filter system, though.

#3

BlakeLucchesi - May 10, 2008 - 19:40
Status:active» won't fix

This will be fixed by a new patch that uses input filters to process text for search and indexing. That would allow anyone with Unicode- or language-specific problems to disable the default input filters provided by search and assign their own.

http://drupal.org/node/257007

 
 
