Thai vowels are excluded from the search index

sugree - November 18, 2008 - 16:26
Project: Drupal
Version: 7.x-dev
Component: search.module
Category: bug report
Priority: critical
Assigned: Unassigned
Status: needs work
Description

What are the steps required to reproduce the bug?

Assuming hook_preprocess() separates Thai words with spaces, let the search module index any node that contains Thai words, e.g., "ผู้ ใหญ่ บ้าน".

What behavior were you expecting?

Thai words should be indexed correctly including their vowels, e.g., "ผู้ ใหญ่ บ้าน".

What happened instead?

All Thai vowels are replaced by spaces because they are categorized as "Mn" in UnicodeData.txt, e.g., "ผ ใหญ บาน". In particular, the characters below are classified as "Mn" or "Po", but in Thai they should not be treated the way such characters are treated in Latin-based languages.

0E31;THAI CHARACTER MAI HAN-AKAT;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI HAN-AKAT;;;;
0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;sara uue;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0E47;THAI CHARACTER MAITAIKHU;Mn;0;NSM;;;;;N;THAI VOWEL SIGN MAI TAI KHU;mai taikhu;;;
0E48;THAI CHARACTER MAI EK;Mn;107;NSM;;;;;N;THAI TONE MAI EK;;;;
0E49;THAI CHARACTER MAI THO;Mn;107;NSM;;;;;N;THAI TONE MAI THO;;;;
0E4A;THAI CHARACTER MAI TRI;Mn;107;NSM;;;;;N;THAI TONE MAI TRI;;;;
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107;NSM;;;;;N;THAI TONE MAI CHATTAWA;;;;
0E4C;THAI CHARACTER THANTHAKHAT;Mn;0;NSM;;;;;N;THAI THANTHAKHAT;;;;
0E4D;THAI CHARACTER NIKHAHIT;Mn;0;NSM;;;;;N;THAI NIKKHAHIT;nikkhahit;;;
0E4E;THAI CHARACTER YAMAKKAN;Mn;0;NSM;;;;;N;THAI YAMAKKAN;;;;
0E4F;THAI CHARACTER FONGMAN;Po;0;L;;;;;N;THAI FONGMAN;;;;
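
The categories quoted above can be checked independently of Drupal. The following is a minimal Python sketch, not part of the patch; it relies only on the standard unicodedata module, and the names REPORTED and categories are made up for illustration.

```python
import unicodedata

# Code points quoted from UnicodeData.txt in this report:
# U+0E31, U+0E34..U+0E3A and U+0E47..U+0E4F.
REPORTED = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E50)]


def categories(code_points):
    """Map each code point to its Unicode general category (e.g. 'Mn')."""
    return {cp: unicodedata.category(chr(cp)) for cp in code_points}


if __name__ == "__main__":
    # Print one line per code point: number, category, character name.
    for cp, cat in categories(REPORTED).items():
        print(f"U+{cp:04X} {cat} {unicodedata.name(chr(cp))}")
```

Every entry except U+0E4F should come back as "Mn"; U+0E4F is "Po".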

Currently, PREG_CLASS_SEARCH_EXCLUDE includes the vowels listed above:

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

It should be replaced by the following:

\x{e3f}\x{e46}\x{e5a}\x{e5b}
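
To illustrate the effect, here is a rough Python sketch that applies only the Thai portion of each class to the sample text. It is an approximation under stated assumptions, not the search module's code: it assumes every run of excluded characters becomes a single space (as described under "What happened instead?"), it ignores the rest of PREG_CLASS_SEARCH_EXCLUDE, and the names CURRENT_THAI, PROPOSED_THAI, SAMPLE and tokens are hypothetical. Python writes the code points as \uXXXX instead of \x{...}.

```python
import re

# Thai portion of the exclusion class only (assumption: the rest of the
# class does not matter for this sample).
CURRENT_THAI = "[\u0e31\u0e34-\u0e3f\u0e46-\u0e4f\u0e5a\u0e5b]+"
PROPOSED_THAI = "[\u0e3f\u0e46\u0e5a\u0e5b]+"

# The sample text from the report, written with escapes so the code
# points are unambiguous: three space-separated Thai words.
SAMPLE = "\u0e1c\u0e39\u0e49 \u0e43\u0e2b\u0e0d\u0e48 \u0e1a\u0e49\u0e32\u0e19"


def tokens(text, exclude_class):
    """Replace each run of excluded characters with a space, then split."""
    return re.sub(exclude_class, " ", text).split()


if __name__ == "__main__":
    print(tokens(SAMPLE, CURRENT_THAI))   # vowel signs and tone marks are lost
    print(tokens(SAMPLE, PROPOSED_THAI))  # the three words survive intact
```

In this sketch the tone mark inside the third word also acts as a boundary under the current class, so that word ends up as two tokens; with the proposed class the three words are returned unchanged.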

Note that I marked this critical because the search module cannot be used with Thai content at all.

Attachment | Size | Status | Test result
thai-vowel-6.patch | 986 bytes | Idle | Failed: Failed to apply patch.
thai-vowel-7.patch | 996 bytes | Idle | Failed: Failed to apply patch.

#1

sugree - November 19, 2008 - 01:54

After I discussed this with a Thai linguistics expert, they suggested keeping 0E4F in the exclusion list. As a result, my patch will replace this:

\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}

by:

\x{e3f}\x{e46}\x{e4f}\x{e5a}\x{e5b}
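
As a cross-check of which code points this revision actually removes from the exclusion list, here is a small Python sketch that expands both fragments into sets and compares them. The parser handles only the \x{...} and \x{...}-\x{...} forms used in this comment, and the names CURRENT, REVISED, ITEM, expand and freed are hypothetical.

```python
import re

# The two class fragments from this comment, copied verbatim.
CURRENT = r"\x{e31}\x{e34}-\x{e3f}\x{e46}-\x{e4f}\x{e5a}\x{e5b}"
REVISED = r"\x{e3f}\x{e46}\x{e4f}\x{e5a}\x{e5b}"

# Matches a single \x{...} code point or a \x{...}-\x{...} range.
ITEM = re.compile(r"\\x\{([0-9a-fA-F]+)\}(?:-\\x\{([0-9a-fA-F]+)\})?")


def expand(fragment):
    """Expand a class fragment into the set of code points it covers."""
    points = set()
    for lo, hi in ITEM.findall(fragment):
        start = int(lo, 16)
        end = int(hi, 16) if hi else start
        points.update(range(start, end + 1))
    return points


# Code points excluded today that the revised class would index.
freed = sorted(expand(CURRENT) - expand(REVISED))

if __name__ == "__main__":
    print(" ".join(f"U+{cp:04X}" for cp in freed))
```

The difference covers U+0E31, U+0E34 through U+0E3E and U+0E47 through U+0E4E, while U+0E4F stays excluded as the expert suggested.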

Attachment | Size | Status | Test result
thai-vowel-6.patch | 993 bytes | Idle | Failed: Failed to apply patch.
thai-vowel-7.patch | 1003 bytes | Idle | Failed: Failed to apply patch.

#2

sugree - November 20, 2008 - 06:28
Version: 6.6 » 7.x-dev

Moving this patch to D7.

#3

System Message - November 20, 2008 - 06:50
Status: needs review » needs work

The last submitted patch failed testing.

#4

sugree - November 20, 2008 - 17:42
Status: needs work » needs review

New patch against CVS.

Attachment | Size | Status | Test result
thai-vowel-7.x-dev.patch | 1.15 KB | Idle | Passed: 10908 passes, 0 fails, 0 exceptions

#5

catch - January 22, 2009 - 00:24
Status: needs review » needs work

This leaves a gap in the wrapping of those lines - should really move the other characters up instead.

Would it be possible to write a test for the search index with some Thai vowels as part of words to be indexed, to see if they're indexed properly as well?

#6

sugree - February 28, 2009 - 15:50
Status: needs work » needs review

@catch Thanks for the comment. Please take a look at this new patch.

I will write a simple test.

Attachment | Size | Status | Test result
thai-vowel-wrapped-D6.patch | 2.38 KB | Ignored | None
thai-vowel-wrapped-D7.patch | 2.37 KB | Idle | Failed: Failed to apply patch.

#7

System Message - February 28, 2009 - 15:55
Status: needs review » needs work

The last submitted patch failed testing.

#8

sugree - March 3, 2009 - 02:02
Status: needs work » needs review

Oops! Some patches have been committed recently, so I am submitting a new patch here.

Attachment | Size | Status | Test result
thai-vowel-wrapped-1.284-D7.patch | 2.42 KB | Idle | Passed: 10908 passes, 0 fails, 0 exceptions

#9

Arancaytar - April 26, 2009 - 12:34
Status: needs review » needs work

It looks good, but there should probably be a test included in the patch.

 
 
