What are the steps required to reproduce the bug?
Assuming hook_preprocess() separates Thai words with spaces, let the search module index any node that contains Thai words, e.g., "ผู้ ใหญ่ บ้าน".
What behavior were you expecting?
Thai words should be indexed correctly, including their vowels, e.g., "ผู้ ใหญ่ บ้าน".
What happened instead?
All Thai vowels are replaced by spaces because they are categorized as "Mn" in UnicodeData.txt, e.g., "ผ ใหญ บาน". In particular, the characters below are categorized as "Mn" or "Po", but in Thai they should not be treated the way such characters are treated in Latin-based languages.
0E31;THAI CHARACTER MAI HAN-AKAT;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI HAN-AKAT;;;;
0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;sara uue;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E47;THAI CHARACTER MAITAIKHU;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI TAI KHU;mai taikhu;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E49;THAI CHARACTER MAI THO;Mn;107;NSM;;;;;N;THAI TONE MAI THO;;;;
0E4A;THAI CHARACTER MAI TRI;Mn;107;NSM;;;;;N;THAI TONE MAI TRI;;;;
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107;NSM;;;;;N;THAI TONE MAI CHATTAWA;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
0E4D;THAI CHARACTER NIKHAHIT;Mn;0;NSM;;;;;N;THAI NIKKHAHIT;nikkhahit;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;;;;;N;THAI YAMAKKAN;;;;
0E4F;THAI CHARACTER FONGMAN;Po;0;L;;;;;N;THAI FONGMAN;;;;
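The categories listed above can be checked against the Unicode database that ships with a language runtime. The following is only a minimal sketch in Python (not part of Drupal or of the patch); the variable names `code_points` and `categories` are made up for illustration.

```python
import unicodedata

# Code points from the UnicodeData.txt excerpt above:
# U+0E31, U+0E34..U+0E3A and U+0E47..U+0E4F (17 characters in total).
code_points = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E50)]

# Map each code point to its General_Category ("Mn", "Po", ...).
categories = {cp: unicodedata.category(chr(cp)) for cp in code_points}

for cp, cat in categories.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cat}")
```

Every character except U+0E4F should be reported as "Mn", which is why a rule that strips all nonspacing marks removes Thai vowels and tone marks.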
Currently, PREG_CLASS_SEARCH_EXCLUDE includes the following Thai ranges, which cover the vowels above:
\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}
They should be replaced with the following:
\x{e3f}\x{e46}\x{e5a}\x{e5b}
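To illustrate the difference between the two character classes, here is a rough sketch in Python rather than PHP. It assumes the search module replaces each run of excluded characters with a single space; the `simplify` helper and the constant names are hypothetical stand-ins, and the PCRE `\x{...}` escapes are rewritten as Python `\u....` escapes.

```python
import re

# Thai portion of the current exclusion class (from this report).
CURRENT_EXCLUDE = "\u0e31\u0e34-\u0e3f\u0e46-\u0e4f\u0e5a\u0e5b"
# Proposed replacement: vowels and tone marks are no longer excluded.
PROPOSED_EXCLUDE = "\u0e3f\u0e46\u0e5a\u0e5b"


def simplify(text, exclude_class):
    """Hypothetical stand-in: replace each run of excluded characters with a space."""
    return re.sub(f"[{exclude_class}]+", " ", text)


sample = "ผู้ ใหญ่ บ้าน"
stripped = simplify(sample, CURRENT_EXCLUDE)   # vowels and tone marks are lost
kept = simplify(sample, PROPOSED_EXCLUDE)      # text is left unchanged

print(repr(stripped))
print(repr(kept))
```

With the current class, the marks U+0E39, U+0E48 and U+0E49 in the sample are removed, while the proposed class leaves the sample intact.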
Note that I marked this issue critical because the search module cannot be used with Thai content at all.
Comment | File | Size | Author |
---|---|---|---|
#8 | thai-vowel-wrapped-1.284-D7.patch | 2.42 KB | sugree |
#6 | thai-vowel-wrapped-D6.patch | 2.38 KB | sugree |
#6 | thai-vowel-wrapped-D7.patch | 2.37 KB | sugree |
#4 | thai-vowel-7.x-dev.patch | 1.15 KB | sugree |
#1 | thai-vowel-6.patch | 993 bytes | sugree |
Comments
Comment #1
sugree commented: After I discussed this with a Thai linguistics expert, they suggested including 0E4F in the exclusion list. As a result, my patch replaces this:
\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}
with:
\x{e3f}\x{e46}\x{e4f}\x{e5a}\x{e5b}
Comment #2
sugree commented: Moving this patch to D7.
Comment #4
sugree commented: New patch against CVS.
Comment #5
catch commented: This leaves a gap in the wrapping of those lines; the other characters should really be moved up instead.
Would it be possible to write a test for the search index with some Thai vowels as part of words to be indexed, to see if they're indexed properly as well?
Comment #6
sugree commented: @catch Thanks for the comment. Please take a look at this new patch.
I will write a simple test.
Comment #8
sugree commented: Oops! Some patches were committed recently, so I am submitting a new patch here.
Comment #9
cburschka commented: It looks good, but there should probably be a test included in the patch.
Comment #10
chx commented: Could you please check whether #604002: Poor search support of some Unicode scripts fixes your problem and, if it does, mark this issue as a duplicate?
Comment #11
sugree commented: @chx Yes, it does. #604002: Poor search support of some Unicode scripts fixes this issue.