Thai vowels are excluded in search index [#335928]

What are the steps required to reproduce the bug?

Assuming hook_preprocess() separates Thai words with spaces, let the search module index any node that contains Thai words, e.g., "ผู้ ใหญ่ บ้าน".

What behavior were you expecting?

Thai words should be indexed correctly including their vowels, e.g., "ผู้ ใหญ่ บ้าน".

What happened instead?

All Thai vowels are replaced by spaces because they are categorized as "Mn" in UnicodeData.txt, e.g., "ผ ใหญ บาน". In particular, the characters below are categorized as "Mn" or "Po", but in Thai they should not be treated the way those categories are treated in Latin-based languages.

0E31;THAI CHARACTER MAI HAN-AKAT;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI HAN-AKAT;;;;
0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;sara uue;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E47;THAI CHARACTER MAITAIKHU;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI TAI KHU;mai taikhu;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E49;THAI CHARACTER MAI THO;Mn;107;NSM;;;;;N;THAI TONE MAI THO;;;;
0E4A;THAI CHARACTER MAI TRI;Mn;107;NSM;;;;;N;THAI TONE MAI TRI;;;;
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107;NSM;;;;;N;THAI TONE MAI CHATTAWA;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
0E4D;THAI CHARACTER NIKHAHIT;Mn;0;NSM;;;;;N;THAI NIKKHAHIT;nikkhahit;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;;;;;N;THAI YAMAKKAN;;;;
0E4F;THAI CHARACTER FONGMAN;Po;0;L;;;;;N;THAI FONGMAN;;;;
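
The categories in the excerpt above can be double-checked with a few lines of Python. This is only a sketch for verification, not Drupal code; it uses the standard unicodedata module, so the names and categories it prints come from the Unicode version bundled with the interpreter rather than from the UnicodeData.txt file quoted here.

```python
import unicodedata

# Code points taken from the excerpt above:
# U+0E31, U+0E34..U+0E3A and U+0E47..U+0E4F.
code_points = [0x0E31] + list(range(0x0E34, 0x0E3B)) + list(range(0x0E47, 0x0E50))

# Map each code point to its general category ("Mn", "Po", ...).
categories = {cp: unicodedata.category(chr(cp)) for cp in code_points}

for cp, cat in categories.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cat}")
```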

Currently, PREG_CLASS_SEARCH_EXCLUDE covers these characters with the following ranges.

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

It should be replaced by the following.

\x{e3f}\x{e46}\x{e5a}\x{e5b}
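
To illustrate the difference between the two character classes, here is a rough sketch in Python rather than PHP. The PCRE escapes such as \x{e31} are rewritten as \u0e31, only the Thai part of the class is modelled, and the helper name strip_excluded is made up for this example. It assumes that excluded characters are replaced by a space, as described above.

```python
import re

# Thai part of the exclusion class, with PCRE \x{e31} written as Python \u0e31.
CURRENT_THAI = "\u0e31\u0e34-\u0e3f\u0e46-\u0e4f\u0e5a\u0e5b"
PROPOSED_THAI = "\u0e3f\u0e46\u0e5a\u0e5b"


def strip_excluded(text, char_class):
    """Replace each run of excluded characters with one space.

    Hypothetical helper; assumed to approximate what the search
    module does with its exclusion class.
    """
    return re.sub("[" + char_class + "]+", " ", text)


sample = "ผู้ ใหญ่ บ้าน"
with_current = strip_excluded(sample, CURRENT_THAI)
with_proposed = strip_excluded(sample, PROPOSED_THAI)

print(with_current)   # upper/lower vowels and tone marks are lost
print(with_proposed)  # the words come through unchanged
```

With the current class the combining vowels and tone marks are dropped from every word, while the proposed class leaves the sample untouched.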

Note that I marked this critical because the search module cannot be used with Thai content at all.

Comment	File	Size	Author
#8	thai-vowel-wrapped-1.284-D7.patch	2.42 KB	sugree
#6	thai-vowel-wrapped-D6.patch	2.38 KB	sugree
#6	thai-vowel-wrapped-D7.patch	2.37 KB	sugree
#4	thai-vowel-7.x-dev.patch	1.15 KB	sugree
#1	thai-vowel-6.patch	993 bytes	sugree
#1	thai-vowel-7.patch	1003 bytes	sugree
	thai-vowel-7.patch	996 bytes	sugree
	thai-vowel-6.patch	986 bytes	sugree

Comments

Comment #1

sugree commented 19 November 2008 at 01:54

Status	File	Size
new	thai-vowel-7.patch	1003 bytes
new	thai-vowel-6.patch	993 bytes

After discussion with a Thai linguistics expert, they suggested that I include 0E4F in the exclusion list. As a result, my patch will replace this:

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

by:

\x{e3f}\x{e46}\x{e4f}\x{e5a}\x{e5b}

Comment #2

sugree commented 20 November 2008 at 06:28

Version:

6.6

» 7.x-dev

Moving this patch to D7.

Comment #3

20 November 2008 at 06:50

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #4

sugree commented 20 November 2008 at 17:42

Status:

Needs work

» Needs review

Status	File	Size
new	thai-vowel-7.x-dev.patch	1.15 KB

New patch against CVS.

Comment #5

catch commented 22 January 2009 at 00:24

Status:

Needs review

» Needs work

This leaves a gap in the wrapping of those lines - should really move the other characters up instead.

Would it be possible to write a test for the search index with some Thai vowels as part of words to be indexed, to see if they're indexed properly as well?

Comment #6

sugree commented 28 February 2009 at 15:50

Status:

Needs work

» Needs review

Status	File	Size
new	thai-vowel-wrapped-D7.patch	2.37 KB
new	thai-vowel-wrapped-D6.patch	2.38 KB

@catch Thanks for the comment. Please take a look at this new patch.

I will write a simple test.

Comment #7

28 February 2009 at 15:55

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #8

sugree commented 3 March 2009 at 02:02

Status:

Needs work

» Needs review

Status	File	Size
new	thai-vowel-wrapped-1.284-D7.patch	2.42 KB

Oops! Some patches have been committed recently. I am submitting a new patch here.

Comment #9

cburschka commented 26 April 2009 at 12:34

Status:

Needs review

» Needs work

It looks good, but there should probably be a test included in the patch.

Comment #10

chx commented 4 January 2010 at 22:58

Could you please check that #604002: Poor search support of some Unicode scripts fixes your problem and if yes then set this to duplicate?

Comment #11

sugree commented 24 January 2010 at 03:38

Status:

Needs work

» Closed (duplicate)

@chx Yes, it does. #604002: Poor search support of some Unicode scripts fixes this issue.
