So, I came across a few posts from people who say that they can't search with Unicode characters. I was pretty sure this was not true, and went searching for any issues in the queue that either detail the problem or show where it was fixed. I can't seem to find anything specific about this.

Here is the main forum thread that people seem to point to as evidence of it not working: http://drupal.org/node/178840 [#178840]

Perhaps someone more knowledgeable can point to the patch, or at least this will put the issue on the radar instead of letting it get lost in the forums.


Comments

kaakuu’s picture

Title: Search with Unicode characters does not work » Unicode Does Not Work?

This was confirmed long ago - answers or a solution seem to be lacking for such a major, important thing for Devanagari/Indic and similar Unicode sites.

Now apparently chx has a solution for this; I have requested that it be posted.
You can see this part of the thread at http://drupal.org/node/671566#comment-242514,
and since no one is attending to this issue you can post a request to chx for the facts on the solution. Thanks.

Edited: chx has looked into this and asked for some info on this bug (http://drupal.org/node/671566#comment-2426384).
@rcross - the following info is needed, so please help with it:
"crucial information like a) being in the issue b) the Unicode library from the status report page c) PCRE version (from the phpinfo linked from the status report page) d) OS."
chx also asked to "file a bug report" - so I made this issue: http://drupal.org/node/672430

I will also try to do a fresh install again and post this info as soon as I can.

kaakuu’s picture

The info from my side:

Drupal 6.15, usual LAMP stack (I tried this on three or four common, popular web hosts)
Unicode library - PHP Mbstring Extension
PCRE Library Version 7.8 2008-09-05
I have just tried again with a fresh install of Drupal with the above Unicode library and PCRE specifications.
I pasted the following Unicode text into a node:
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात orange
लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange

I indexed my site after making sure my search settings are okay.

Search can find the word orange but cannot find बाळाची.

Dave Reid’s picture

Title: Unicode Does Not Work? » Search with Unicode characters does not work
douggreen’s picture

Title: Unicode Does Not Work? » Search with Unicode characters does not work

In comment #26 of #218403: Duplicate entry errors in search indexer, Damien Tournoud suggests what needs to be done for 7.x:

One way to solve that bug is to set the {search_index}.word column to utf8_bin_ci. I just validated on a test site that it solves the problem.

But this would mean that we would differentiate between different versions of a word (accented/not accented, etc.).

... In fact, collation is not an enemy, it should be our friend. The implementation of the collation is difficult work, and moreover language-specific. The one-size-fits-all 'utf8_general_ci' is not optimal, but should work well for most Latin-based languages. Doing language-specific collation and stemming should be in our work plan for D7.

@kaakuu, does changing search_index to utf8_bin_ci solve the problem?
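For reference, the change being discussed amounts to altering the collation of the word column in the {search_index} table. A minimal sketch of how one might try it on a Drupal 6 site follows (only an illustration: MySQL has utf8_bin and utf8_general_ci but no collation named utf8_bin_ci, so utf8_bin is what actually gets tried below, and the column definition assumes the stock D6 search schema):

// Illustrative only: switch the word column of {search_index} to a binary collation.
db_query("ALTER TABLE {search_index} MODIFY word VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT ''");

As the comments below show, this turned out not to be the actual cause of the bug.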

kaakuu’s picture

@douggreen, the link says that it solved that issue, but apparently this is a different one.
Changing search_index to utf8_bin as you suggested, and various other such UTF collations, does not solve the problem. More specifically, Drupal throws an error message when asked to search for complex Unicode words in Indic, Devanagari, or similar Unicode text.

If there is a working demo example that shows changing search_index to utf8_bin_ci solves this issue, it would help us, because we could then try tweaking the various settings further. However, WordPress and others apparently just do this out of the box without any maneuvers.

To repeat: keep or change the search index to utf8__ as suggested, or various others.
Then paste this sample Unicode text, or any such text with complex words:
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात orange लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange
Now index your site, run cron, do whatever is needed, and search for words like त्यामुळे or बाळाची or त्यानंतर.

Dave Reid’s picture

The problem isn't the indexing or the database table encoding. Everything works properly. What's going on is that the three words at the end of #5 fail search.module's "You must include at least one positive keyword with 3 characters or more." check. I performed several successful searches with words like सशक्त.

kaakuu’s picture

@Dave When we search for words like त्यामुळे or बाळाची or त्यानंतर, the error message itself is critically erroneous, as the search term already includes "at least one positive keyword with 3 characters or more."

When you search for words like सशक्त you are probably searching for a simple word, which behaves like an English word. However, Devanagari, Indic, and similar Unicode scripts are actually full of complex words, and finding complex words in search is a critical necessity.

Can you find the words त्यामुळे or बाळाची or त्यानंतर or similar ones in the search index in the database? I am not sure I can find them there, but this may need more testing than the quick look I have had so far.

Let us, for example, change the sample text to
त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर
(Those who are testing, please do the test with this text from now on, as it is more representative of actual usage.)

Now, can you please re-test and find out whether search works for त्यानंतर त्यामुळे?
This search term includes "at least one positive keyword with 3 characters or more."

chx’s picture

Is this a duplicate and/or variant of #335928: Thai vowels are excluded in search index? Can someone check which characters are problematic here and whether we exclude them in error?

kaakuu’s picture

> Is this a duplicate and or variant

No, as far as I comprehend.

> which characters are problematic

It can be a vowel or a consonant - any character combination that occurs in a complex word.

I have just set up a demo WordPress site at http://unimode.wordpress.com/ and did the steps in #7 above (except that there indexing is automatic and one has to do nothing setup-wise; search just happens automatically). The results are as expected: WP finds the text that contains त्यानंतर त्यामुळे in a single go. Maybe coders can see how WP does this.

Please let us know if any more info is needed.

chx’s picture

By "which characters" I meant Unicode code points... a range of code points, actually. Then we can peek into the search module and compare.

Damien Tournoud’s picture

Title: Search with Unicode characters does not work » [Meta-Issue] Poor search support of some Unicode scripts

This has nothing to do with "Unicode". It's just that we support some scripts very poorly, mostly because of the lack of review and contributions from people actually using them. Let's make this a meta-issue.

chx’s picture

Status: Active » Needs review
FileSize
5.22 KB

http://php.net/manual/en/regexp.reference.unicode.php

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected.
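As a quick illustration of those escape sequences (a sketch, not part of the patch): with the /u modifier, \p{...} and \P{...} match characters by Unicode property, which is exactly what is needed to tell letters from combining marks here:

$word = 'बाळाची';
var_dump(preg_match('/^[\pL\pM]+$/u', $word)); // int(1): letters plus combining marks cover the whole word
var_dump(preg_match('/^\pL+$/u', $word));      // int(0): the matras are marks (M), not letters (L)

This only works when PCRE is compiled with Unicode property support, which is what the requirements check added by the patch tests for.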

chx’s picture

FileSize
6.1 KB

Added test for #7. Hopefully I got it right.

Status: Needs review » Needs work

The last submitted patch, search_php51.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
6.11 KB

Well guys, this is interesting, and thanks #7 for the interesting search string! I ran

$a = 'त्यानंतर त्यामुळे';
preg_match_all('/\pM/u', $a, $matches);
foreach ($matches[0] as $match) {
  for ($i = 0; $i < strlen($match); $i++) echo ord($match[$i]) . " ";
  echo "\n";
}

and it turns out

224 165 141
224 164 190
224 164 130
224 165 141
224 164 190
224 165 129
224 165 135

seven of them are M. Five are "Non-spacing mark (Mn)" and two are "Combining spacing mark (Mc)".

Going further, at http://www.fileformat.info/info/unicode/category/Mc/list.htm we find the vowel marks from several scripts, including Devanagari. Same for Mn.

Conclusion: Mc and Mn should not be excluded. The attached patch is the first actual change to the search module; it changes \pM to \p{Me}.

Edit: I have repeated the steps in D7 with unpatched search and got the favourite "gimme three" error message, because it excludes most of the characters. I applied the patch and the search succeeded.
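A rough illustration of why this matters (not code from the patch): if all marks (\pM) are treated as word boundaries and stripped, a Devanagari word loses its matras and virama, whereas excluding only enclosing marks (\p{Me}) leaves the word intact:

$word = 'बाळाची';
echo preg_replace('/[\pM]+/u', '', $word);    // बळच - only the base consonants survive
echo preg_replace('/[\p{Me}]+/u', '', $word); // बाळाची - unchanged

With enough characters stripped this way, the remaining keyword can fall below the three-character minimum, which is exactly the error message reported above.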

chx’s picture

Title: [Meta-Issue] Poor search support of some Unicode scripts » Poor search support of some Unicode scripts

Status: Needs review » Needs work

The last submitted patch, search_php51.patch, failed testing.

chx’s picture

Status: Needs work » Needs review

Let me guess. The bot install fails the requirement check. Could someone else please manually check?

chx’s picture

Further on, we can do this fix without Unicode properties... it's just a lot easier with properties.

Edit: we probably need to write a script that recompiles \pC|\p{Lm}|\p{Me}|\p{Nl}|\pP|\pS|\pZ into code points. Hopeless to do manually.

chx’s picture

FileSize
1.2 KB

Grab http://unicode.org/Public/UNIDATA/UnicodeData.txt; the attachment here is a PHP script producing an exclude list. I am currently including all letters (including Lm - that's new), Nd, No, Mc, Mn.
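For readers who cannot download the attachment, the general shape of such a generator is roughly the following (an illustrative sketch, not chx's actual script; it ignores the <...First>/<...Last> range entries in UnicodeData.txt):

// Categories kept in the index: letters, numbers, combining marks.
$include = array('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nd', 'No', 'Mc', 'Mn');
$exclude = array();
foreach (file('UnicodeData.txt') as $line) {
  $fields = explode(';', $line);
  // Field 0 is the code point in hex, field 2 is the General_Category value.
  if (isset($fields[2]) && !in_array($fields[2], $include)) {
    $exclude[] = hexdec($fields[0]);
  }
}
sort($exclude);
// Compress consecutive code points into \x{..}-\x{..} ranges for the regexp.
$ranges = array();
$start = $prev = array_shift($exclude);
foreach ($exclude as $cp) {
  if ($cp != $prev + 1) {
    $ranges[] = $start == $prev ? sprintf('\x{%X}', $start) : sprintf('\x{%X}-\x{%X}', $start, $prev);
    $start = $cp;
  }
  $prev = $cp;
}
$ranges[] = $start == $prev ? sprintf('\x{%X}', $start) : sprintf('\x{%X}-\x{%X}', $start, $prev);
print "'" . implode('', $ranges) . "'\n";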

chx’s picture

FileSize
968 bytes

A much nicer script for generating it.

naxoc’s picture

FileSize
6.68 KB

I tested the patch from #15 and it did fail the requirements check on install. I edited the implementation of hook_requirements a bit to make the install work. When running the test, it fails on what looks like some Japanese characters?

Query matching 'ドルーパル'
and
Query matching 'コーヒー'

Status: Needs review » Needs work

The last submitted patch, 604002.diff, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
7.07 KB

Now with comments.

chx’s picture

FileSize
7.21 KB

Symbols are now mostly moved to the index - Sk, however, is excluded. The rest have been clear for some time now: Letters, Numbers, and Marks are included; Other, Punctuation, and Separator are excluded. So this is hopefully the last one, if the comments are OK.

Status: Needs review » Needs work

The last submitted patch, search_unicode_pwnd.patch, failed testing.

chx’s picture

FileSize
7.22 KB

Now, come on. I left out a {} and you blow up? Bah :p

chx’s picture

Status: Needs work » Needs review
kaakuu’s picture

@chx - thanks a lot for your very detailed insight and work. It would greatly help if you could kindly post a zip or text of the whole search.module in its new form. It could then be tested in detail.

kaakuu’s picture

I actually found that (in the existing search.module, not the patch) the constant documented as "Matches Unicode character classes to exclude from the search index" seems to be causing the problem.

For example, changing the code to

define('PREG_CLASS_SEARCH_EXCLUDE',
  '\x{3289}');

/**
 * Matches all 'N' Unicode character classes (numbers)
 */
define('PREG_CLASS_NUMBERS',
  '\x{3289}\x{32b1}-\x{32bf}\x{ff10}-\x{ff19}');

/**
 * Matches all 'P' Unicode character classes (punctuation)
 */
define('PREG_CLASS_PUNCTUATION',
  '\x{3289}');

/**
 * Matches all CJK characters that are candidates for auto-splitting
 * (Chinese, Japanese, Korean).
 * Contains kana and BMP ideographs.
 */
define('PREG_CLASS_CJK', '\x{3289}');

actually improves the Unicode search by roughly 90% to 95%. It still fails on some words, which I cannot consistently reproduce. Probably with chx's patches this would work 100%.
I need the new search.module as text or a zip - if that's not entirely impossible, please post it.

I have no idea whether removing all those codes (quite a lot of them) has any security implications or not.

sun’s picture

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+function search_requirements($phase) {

Missing PHPDoc.

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+      'title' => $t('PHP PCRE unicode support'),

s/unicode/Unicode/

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+      'description' => t('The PCRE library your PHP is linked with does not support Unicode properties.'),

"The PCRE library, PHP is linked with, does not support Unicode properties."

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * See: http://unicode.org/glossary

s/See:/@see/

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * The index only contains the following character categories / properties.

s///and/

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * @TODO: Enhance based on http://unicode.org/reports/tr29/.

s/@TODO:/@todo/

Powered by Dreditor.

dmitrig01’s picture

@sun - I believe that in the third one (inserting commas), no commas should be inserted; only "your" should be replaced with "that".

Dave Reid’s picture

Yeah that suggestion is even odder.
"The PCRE library that PHP is linked with does not support Unicode properties."

chx’s picture

FileSize
6.98 KB

I have removed the test. We are not here to test PCRE tables for correctness. I have also moved all Symbol characters to the index. This is a matter of preference. For example, including Sc means that you can search separately for "100$" and "100¢", but "100" won't match them. Not including Sc would mean that searching for "100$" finds "100¢" too, which smells wrong to me. What do we want?

chx’s picture

FileSize
6.99 KB
sun’s picture

+++ modules/search/search.install	2010-01-09 13:46:16 +0000
@@ -7,6 +7,21 @@
+ * Implements hook_requirements.

Missing ().

+++ modules/search/search.install	2010-01-09 13:46:16 +0000
@@ -7,6 +7,21 @@
+      'description' => t('PCRE has not been compiled with Unicode property support. Please Google pcre unicode properties [your operating system] here for more or use PHP from php.net'),

"Google" needs to be removed, suggested search string should be in quotes to delimit it...

'Please search for "pcre unicode properties [your operating system]" on the net or install PHP from php.net.'

(also note trailing period)

Powered by Dreditor.

chx’s picture

FileSize
7.16 KB

Added sun's fixes, renamed Mark to the more precise Combining mark, and added some more explanation from the PHP manual.

Damien Tournoud’s picture

Status: Needs review » Reviewed & tested by the community

This is obviously not perfect (implementing word-splitting properly would require implementing the whole TR#29... and even some fancier machine-learning algorithms), but it is without any doubt an improvement over the current implementation.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Wow. That's an insanely awesome code clean-up. I definitely want this for 7.x. I'm not sure how Gabor will feel about changing requirements 15 point releases in, but I guess we can find out. :)

However, I would really like a version of this patch that includes the test. While today this is implemented in PCRE, tomorrow it might be something else, and since it took us literally like 4 years or so to finally get a well-written bug report that successfully nailed this (thanks for that, kaakuu), I do not feel comfortable without a test that ensures it does not break again.

kaakuu’s picture

Yay! #39 Webchick - Thanks!

Yes, it does need more testing. Apart from the above sample text, there seem to be at least three or more representative sample texts to test, besides checking whether anything else is broken. I did get one or two zero results with a few words, but that cannot be consistently reproduced. I wish I could do more tests, but I won't have time right now (until this month's third week, which is past the alpha release date).
Anyway, big thanks to chx for the very analytical insight and the ultimate help with this - it will be a big step forward.

rcross’s picture

Glad to see the power of the issue queue again. It's amazing how long something can sit festering in the forums, when a simple post in the queue actually gets things accomplished. Glad I could bring this to light, but kudos to everyone who did the heavy lifting.

chx’s picture

Re Drupal 6.x: as said above, we won't change requirements; we will use the ugly regexp. I already posted the script and instructions that compile the nice regexp into an ugly one. Tests... Hm. OK.

chx’s picture

@rcross, sorry, but it was not a simple post that got this rolling - it was an actual reproducible bug report!

chx’s picture

Status: Needs work » Needs review
FileSize
56.95 KB
1.28 KB

With tests. Also included is the script used to generate the UnicodeCharacters.txt file. It uses the Unicode character database linked above.

Edit: the unichr() function in the generator comes from Moodle, which is GPL.
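For context, a unichr()-style helper simply converts a Unicode code point into its UTF-8 byte sequence. A minimal version (a sketch, not the Moodle code used in the generator) looks roughly like this:

function unichr($cp) {
  // Encode a code point as UTF-8: 1 to 4 bytes depending on its range.
  if ($cp < 0x80) {
    return chr($cp);
  }
  if ($cp < 0x800) {
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
  }
  if ($cp < 0x10000) {
    return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
  }
  return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

echo unichr(0x092C); // ब (DEVANAGARI LETTER BA)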

chx’s picture

FileSize
91.02 KB

Hmmm, the patch did not add UnicodeCharacters.txt. I removed chr(0) from the beginning; that placated diff. I am testing \0 separately.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
125.04 KB
1.31 KB

Changed the generating script. The previous patch was too small :p

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
125.11 KB

Blargh, bah, bah! I have removed all the ASCII control characters hoping that patch won't die on me. I actually tested running patch now, too.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
86.77 KB

Maybe restricting to the BMP helps? (The original regexp only dealt with that anyway.) Note that all these patches pass just fine for me.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Further investigation shows that 2502 characters are wrongly classified by PCRE. Stay tuned. 2494 of them are Cn. Hm, I guess I need the Unicode 4.1.0 UnicodeData, maybe http://unicode.org/Public/4.1.0/ucd/UnicodeData.txt from here.

chx’s picture

Status: Needs work » Needs review
FileSize
106.64 KB
1.04 KB

Well, guys, PCRE is buggy. Who would have thought? Even rolling back to 4.1.0 found a few characters which are unassigned per PCRE. Also, I do not want to fudge around not knowing which Unicode version we are compatible with. PCRE 7.0 and 7.5 contained significant fixes/changes to which Unicode version is supported. So we are back to a per-codepoint regexp, but one that is far more precise than the one currently found in D7.

It must be noted that for the numbers-followed-by-punctuation case we still use the PCRE properties, and we do not have a test for it. However, during my testing I found extremely few N and P problems with PCRE, so I let it rest.

I *really* hope this passes. The previous tests failed exactly because of PCRE version mismatches, and therefore different behaviour between my computer, where I generate the data, and the testbot.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
135.31 KB
1.03 KB
1.31 KB

Bah, we saw that before, didn't we? I had restarted and forgot to exclude the bottom of the list. Issue summary:

We have excluded too many characters in search.module. We tried writing a much shorter regexp using PCRE properties, but it turned out that various PHP versions ship with various PCRE versions, supporting different Unicode versions and containing bugs in that support. So instead we generate our regexp ourselves. Then, using another script, we generate a text file containing the concatenation of every Unicode character in the Unicode 5.2.0 character database above U+001F (to avoid patch freaking out), in UTF-8 encoding. Then a test compares search_simplify() results with the previously stored search_simplify()'d version of this file.

We currently exclude 5321 characters out of the 21829 in the character database.
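In other words, the test boils down to a comparison along these lines (file names here are illustrative, not the actual test assets):

$all_chars = file_get_contents('UnicodeCharacters.txt');      // every character above U+001F, UTF-8
$expected  = file_get_contents('SearchSimplifyExpected.txt');  // previously stored search_simplify() output
$pass = search_simplify($all_chars) === $expected;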

chx’s picture

For comparison, here are the beginnings of the current regexp:

'\x{0}-\x{2f}\x{3a}-\x{40}\x{5b}-\x{60}\x{7b}-\x{bf}\x{d7}\x{f7}\x{2b0}-\x{385}'.

compare this to

  '\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
  '\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .   

It's clearly visible that the new regexp is much more fine-grained about what is excluded and what is included; however, the beginning is much the same.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
90.87 KB

I dunno. I am out of ideas. I am posting one with only the BMP (i.e. only up to U+FFFF), but my hopes are quite low at this point. Up until now I understood the testbot's problems.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
55.08 KB

Well, now we will see where this fails. I generated a file separated by chr(10) characters; the parts alternate between included and excluded characters. I got 334 passes, 0 fails, and 0 exceptions, and we will see what the testbot delivers. And it still only takes 4 seconds on my laptop.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
56.26 KB

Bot test.

chx’s picture

FileSize
56.26 KB

With fewer typos in testBotTellMeWhyDoYouFail.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
56.27 KB

Sigh.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

Heine’s picture

Status: Needs work » Needs review
FileSize
56.28 KB

Use \x syntax to see whether encoding issues between testbot and d.o. cause this.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests_5.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
55.34 KB

Poor, poor issue. mb_strtolower mixes up ohm with omega... and other snafus. There are a few Unicode characters where the lowercase character has a different number of UTF-8 bytes, so the above tests using strlen instead of drupal_strlen were doomed to failure. The previous tests were wrong because my machine did not have mbstring compiled in, so my machine generated uppercase characters while the testbot lowercased them, and the identity check failed...
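The ohm/omega snag mentioned above can be reproduced directly (a small illustration, assuming the mbstring extension is available):

$ohm = "\xE2\x84\xA6";                 // U+2126 OHM SIGN, 3 bytes in UTF-8
$lower = mb_strtolower($ohm, 'UTF-8'); // becomes U+03C9 GREEK SMALL LETTER OMEGA, 2 bytes
echo strlen($ohm) . ' ' . strlen($lower);               // 3 2 - byte counts differ
echo drupal_strlen($ohm) . ' ' . drupal_strlen($lower); // 1 1 - character counts agree

which is why the tests need drupal_strlen() (character count) rather than strlen() (byte count).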

aspilicious’s picture

It passes!

Heine’s picture

Status: Needs review » Reviewed & tested by the community

ahem.

chx’s picture

FileSize
56.99 KB

Reverted the number-punctuation regexp from properties to code points. At this point, the patch applies straight to D6 too.

webchick’s picture

Version: 7.x-dev » 6.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Excellent work! Not only do we fix a bug in Drupal core for a few billion people, but we also can file bug reports upstream for PCRE. :D While I was expecting tests that just ran a couple more strings through the existing search tests, chx tells me that these tests are bullet-proof and ensure we get no further regressions in this tweaky, obtuse area of code, which sounds great to me.

Committed to HEAD. Since this was fixed in such a way that it does not require changes to requirements/APIs, also moving down to 6.x for consideration.

Garrett Albright’s picture

Status: Patch (to be ported) » Needs review
FileSize
49.89 KB

EDIT: Ignore this stupid patch.

Garrett Albright’s picture

FileSize
10.74 KB

(Well, I'm not sure how I pulled that off, but anyway, here's a reroll with just the search-related stuff.)

D6 patch! Without tests, obviously, but I was able to successfully get results when using the Devanagari text in #5, and, if accepted, this also eliminates the need for a D6 port of #493770: Search incorrectly splits some katakana words (I was able to get expected results using some of the hiragana examples in that issue).

Garrett Albright’s picture

Patch to fix some niggly grammatical issues introduced in the D7 patch in #73.

kaakuu’s picture

Would it somehow be possible to kindly post search.module, patched and in its entirety, for D6 and D7 as a text attachment, please?

tstoeckler’s picture

Since the D7 patch was committed, you can just go to the Drupal project page (http://drupal.org/project/drupal) and download Drupal 7.

jhodgdon’s picture

Just a note that some of this code for D7 will be moved out of the search module and used in truncate_utf8(), if this issue gets fixed:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Also, see #56 above to learn how the Unicode character file was generated for the tests. Note that it ends up being alternating lines of word/boundary characters, which is how the test works (the latest version of the tests for D7 has a lot more comments in it on how they work).

Status: Needs review » Needs work

The last submitted patch, search-pedantic-grammar-D7.patch, failed testing.

jhodgdon’s picture

Version: 6.x-dev » 7.x-dev
Status: Needs work » Needs review

I just took a look at the patch in #77, since this is a D7 patch. As a note, it's not actually suggesting grammar fixes per se - it's line wrapping, extra spaces, and capitalization... let's see.

The first two sections are inconsistent:

  * Characters with the following General_category (gc) property values are
  * excluded from the search index. Also, they are used as word boundaries.
- * While this does not fully conform to the  Word Boundaries algorithm
- * described in http://unicode.org/reports/tr29, as PCRE does not contain the
- * Word_Break property table, this simpler algorithm has to do.
+ * While this does not fully conform to the Word Boundaries algorithm described
+ * in http://unicode.org/reports/tr29, as PCRE does not contain the Word_Break
+ * property table, this simpler algorithm has to do.
  * - Cc, Cf, Cn, Co, Cs: Other.
  * - Pc, Pd, Pe, Pf, Pi, Po, Ps: Punctuation.
  * - Sc, Sk, Sm, So: Symbols.
  * - Zl, Zp, Zs: Separators.
  *
  * Consequently, the index only contains characters with the following
- * General_category (gc) property values:
+ * General_Category (gc) property values:

This fixes General_category -> General_Category in only one of the two spots where it appears... I think we should just leave it as-is. Also, this hunk has moved to unicode.inc and has been reworded, so the other changes suggested here have already been taken care of.

-  // search behavior with acronyms and URLs. No need to use the unicode modifer
+  // search behavior with acronyms and URLs. No need to use the Unicode modifier

This is a suggestion to capitalize Unicode in search.module, but it's not complete and doesn't apply to the current code. There are two spots where this could be done. But they're in code comments (not docblocks) so I don't think this is very high priority. Let's leave it.

So I guess we can proceed to Drupal 6, and review the patch in #76. Setting status appropriately (will review that patch in a separate comment).

jhodgdon’s picture

Version: 7.x-dev » 6.x-dev

Whoops, wrong version.

jhodgdon’s picture

Status: Needs review » Needs work

I cannot get the patch in #76 to apply to the current Drupal 6. We need a new patch.

udvranto’s picture

subscribing

udvranto’s picture

I applied the patch manually to 6.20. It still does not work for Bengali characters. Do I need to update the index database?

jhodgdon’s picture

Yes, after applying the patch, you would definitely need to reindex your site, because this would change how your site is indexed as well as searched.

jhodgdon’s picture

Someone just reported another example of this for Tamil at #1108194: Drupal unicode search does not work! (closed as duplicate)

jhodgdon’s picture

Version: 6.x-dev » 7.x-dev
Issue summary: View changes
Status: Needs work » Fixed

I talked with Gabor (the Drupal 6 branch maintainer), and patches for D6 are really not being committed unless they're truly essential - we don't have a test system for Drupal 6 and it's too dangerous. So... putting this back to D7 / fixed.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.