What are the steps required to reproduce the bug?

Assuming hook_preprocess() separates Thai words with spaces, let the search module index any node that contains Thai words, e.g., "ผู้ ใหญ่ บ้าน".

What behavior were you expecting?

Thai words should be indexed correctly, including their vowels, e.g., "ผู้ ใหญ่ บ้าน".

What happened instead?

All Thai combining vowels and tone marks are replaced by spaces because they are categorized as "Mn" in UnicodeData.txt, e.g., "ผ ใหญ บาน". In particular, the characters below should not be handled in Thai the way "Mn" and "Po" characters are handled in Latin-based languages.

0E31;THAI CHARACTER MAI HAN-AKAT;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI HAN-AKAT;;;;
0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;sara uue;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E47;THAI CHARACTER MAITAIKHU;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI TAI KHU;mai taikhu;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E49;THAI CHARACTER MAI THO;Mn;107;NSM;;;;;N;THAI TONE MAI THO;;;;
0E4A;THAI CHARACTER MAI TRI;Mn;107;NSM;;;;;N;THAI TONE MAI TRI;;;;
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107;NSM;;;;;N;THAI TONE MAI CHATTAWA;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
0E4D;THAI CHARACTER NIKHAHIT;Mn;0;NSM;;;;;N;THAI NIKKHAHIT;nikkhahit;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;;;;;N;THAI YAMAKKAN;;;;
0E4F;THAI CHARACTER FONGMAN;Po;0;L;;;;;N;THAI FONGMAN;;;;
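
The categories listed above can be cross-checked against the Unicode database that ships with Python. This is only a sketch for verification: the names THAI_MARKS and describe are hypothetical, and the values printed depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points taken from the list in this report: 0E31, 0E34-0E3A, 0E47-0E4F.
THAI_MARKS = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E50)]

def describe(code_point):
    """Return (hex code, name, general category, combining class) for one code point."""
    ch = chr(code_point)
    return (
        f"{code_point:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

rows = [describe(cp) for cp in THAI_MARKS]
for row in rows:
    # Print in a layout similar to UnicodeData.txt: code;name;category;class
    print(*row, sep=";")
```

Every row except 0E4F should report "Mn", which is why a class that excludes all "Mn" characters strips these marks from Thai words.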

Currently, PREG_CLASS_SEARCH_EXCLUDE includes the following Thai ranges, which cover the vowels above:

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

They should be replaced by the following:

\x{e3f}\x{e46}\x{e5a}\x{e5b}
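
The effect of the two character classes can be checked outside Drupal. The following is a minimal Python sketch, not the search module's code: it copies only the Thai portion of PREG_CLASS_SEARCH_EXCLUDE quoted above and assumes, as described in this report, that each run of excluded characters is replaced by a single space. The names CURRENT_THAI_EXCLUDE, PROPOSED_THAI_EXCLUDE, simplify, and count_marks are hypothetical.

```python
import re
import unicodedata

# Thai portion of the exclusion class as quoted in this report (assumption:
# the rest of PREG_CLASS_SEARCH_EXCLUDE does not affect Thai text).
CURRENT_THAI_EXCLUDE = re.compile("[\u0e31\u0e34-\u0e3f\u0e46-\u0e4f\u0e5a\u0e5b]+")
PROPOSED_THAI_EXCLUDE = re.compile("[\u0e3f\u0e46\u0e5a\u0e5b]+")

def simplify(text, exclude):
    """Replace each run of excluded characters with one space (hypothetical helper)."""
    return exclude.sub(" ", text)

def count_marks(text):
    """Count combining marks (general category Mn) in the text."""
    return sum(unicodedata.category(ch) == "Mn" for ch in text)

sample = "ผู้ ใหญ่ บ้าน"
with_current = simplify(sample, CURRENT_THAI_EXCLUDE)
with_proposed = simplify(sample, PROPOSED_THAI_EXCLUDE)

# Marks present in the input, after the current class, after the proposed class.
print(count_marks(sample), count_marks(with_current), count_marks(with_proposed))
```

With the current class no combining marks survive, while the proposed class leaves the sample unchanged.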

Note that I marked this as critical because the search module cannot be used with Thai content at all.


Comments

sugree’s picture

FileSize
1003 bytes
993 bytes

After a discussion with Thai linguistics experts, they suggested that I include 0E4F in the exclusion list. As a result, my patch will replace this:

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

by:

\x{e3f}\x{e46}\x{e4f}\x{e5a}\x{e5b}
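
As a quick check of this revised class, the Python sketch below (not Drupal code) applies only the Thai portion quoted above to a string that starts with FONGMAN (0E4F). It assumes excluded characters are replaced by a space; REVISED_THAI_EXCLUDE and strip_excluded are hypothetical names.

```python
import re

# Revised Thai portion of the exclusion class, as quoted in this comment.
REVISED_THAI_EXCLUDE = re.compile("[\u0e3f\u0e46\u0e4f\u0e5a\u0e5b]+")

def strip_excluded(text):
    """Replace each run of excluded characters with one space (hypothetical helper)."""
    return REVISED_THAI_EXCLUDE.sub(" ", text)

word = "ผู้ใหญ่บ้าน"
sample = "\u0e4f" + word  # FONGMAN followed by a word with vowel and tone marks

result = strip_excluded(sample)
print(repr(result))
```

The punctuation mark is dropped, while the word keeps its vowels and tone marks.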

sugree’s picture

Version: 6.6 » 7.x-dev

Moving this patch to D7.

Status: Needs review » Needs work

The last submitted patch failed testing.

sugree’s picture

Status: Needs work » Needs review
FileSize
1.15 KB

New patch against CVS.

catch’s picture

Status: Needs review » Needs work

This leaves a gap in the wrapping of those lines - should really move the other characters up instead.

Would it be possible to write a test for the search index with some Thai vowels as part of words to be indexed, to see if they're indexed properly as well?

sugree’s picture

Status: Needs work » Needs review
FileSize
2.37 KB
2.38 KB

@catch Thanks for the comment. Please take a look at this new patch.

I will write a simple test.

Status: Needs review » Needs work

The last submitted patch failed testing.

sugree’s picture

Status: Needs work » Needs review
FileSize
2.42 KB

Oops! Some patches have been committed recently. I am submitting a new patch here.

cburschka’s picture

Status: Needs review » Needs work

It looks good, but there should probably be a test included in the patch.

chx’s picture

Could you please check whether #604002: Poor search support of some Unicode scripts fixes your problem and, if so, set this to duplicate?

sugree’s picture

Status: Needs work » Closed (duplicate)

@chx Yes, it does. #604002: Poor search support of some Unicode scripts fixes this issue.