In an attempt to get Chinese search working, I've added this module to my installation:

function cjksearch_search_preprocess($text)
{
  // Put a space on each side of every CJK character so that each character
  // is treated as a separate word by the indexer.
  $text = preg_replace('/(['. PREG_CLASS_CJK .'])/u', ' \\1 ', $text);
  return $text;
}

I also had to get rid of

  // $text = preg_replace_callback('/['. PREG_CLASS_CJK .']+/u', 'search_expand_cjk', $text);

because it was messing up the indexing process -- but that's a different issue.

The issue I'm reporting here is that after hours and hours of headaches, I discovered this.

In do_search(), the first call is to search_parse_query(). Let's say I searched for 村. The return value has $query[1][0] == ' 村 ', with an extra space before and after the character. The query that ends up getting run then has (d.data LIKE '% 村 %') -- which looks for the character surrounded by two spaces. That never appears in the indexed string, so no matter what, no results are returned.
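To illustrate, here is a simplified sketch of what I think is happening (my own variable names and sample data, not the actual core code):

$keyword = ' 村 ';                       // $query[1][0] as returned by search_parse_query()
$like = "% $keyword %";                  // '%  村  %' -- two spaces on each side of the character
$dataset = ' 我 住 在 村 里 ';             // search_dataset stores words separated by single spaces
var_dump(strpos($dataset, ' 村 ') !== false);   // true: single spaces are present
var_dump(strpos($dataset, '  村  ') !== false); // false: double spaces never occur, so the LIKE fails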

I've attached the hacks that I added to get the whole thing working.

Attachment: patch_14.txt (1.25 KB) by Wesley Tanaka

Comments

Wesley Tanaka

Status: Active » Needs work

Any suggestions on what the "correct" fix might be?

Steven

Status: Needs work » Closed (works as designed)

This is by design, to ensure that only whole tokens/words are matched. All words in the search_dataset table are separated by spaces, and each field starts and ends with a space (look at search_index() and the $accum variable).

If you index the phrase "the quick brown fox", then search_dataset will contain " the quick brown fox " (note the leading and trailing spaces). So, if you search for "fox", it will match on "% fox %" and the item is found. If you search for 'ox', it will match on "% ox %" and nothing is found. Your patch would cause "ox" to also match "fox".
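As a rough sketch of the idea (not the actual query builder, just an analogue of the LIKE matching):

$dataset = ' the quick brown fox ';             // search_index() pads each field with spaces
var_dump(strpos($dataset, ' fox ') !== false);  // true  -- "fox" matches '% fox %'
var_dump(strpos($dataset, ' ox ') !== false);   // false -- "ox" does not match '% ox %'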

The CJK expander takes a string of CJK values and expands it into overlapping pairs, separated by spaces. This should provide more accurate results than simply treating each unique CJK character as a separate token. It is also the approach taken by popular search software (e.g. Lucene).
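The expansion is roughly this (a simplified sketch with a made-up function name, not the exact core implementation):

function expand_cjk_pairs($text) {
  // Split the UTF-8 string into individual characters.
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  // Build overlapping pairs: ABCD becomes AB BC CD.
  $pairs = array();
  for ($i = 0; $i < count($chars) - 1; $i++) {
    $pairs[] = $chars[$i] . $chars[$i + 1];
  }
  return implode(' ', $pairs);
}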

If I had to guess, I'd say your problem is the minimum keyword length which is set too high. Especially if you are treating each character as a word, you'd need minimum keyword length 1. Though with the default code, and set at 2, it should work.

Wesley Tanaka

Status: Closed (works as designed) » Needs work

If I had to guess, I'd say your problem is the minimum keyword length which is set too high. Especially if you are treating each character as a word, you'd need minimum keyword length 1. Though with the default code, and set at 2, it should work.

Both of my minimum word length settings are set to 1. I did a complete re-index after changing them, and the index status has gone up to "100% indexed". The word field in {search_index} contains a lot of single Chinese characters.

This by design to ensure only whole tokens/words are matched.

I thought that the spaces in the '%% %s %%' pattern would deal with that, but I guess I was wrong.

The CJK expander takes a string of CJK values and expands it into overlapping pairs, separated by spaces. This should provide more accurate results than simply treating each unique CJK character as a separate token. It is also the approach taken by popular search software (e.g. Lucene).

From what I understand, in Drupal that would prevent me from searching for "chicken" (the single character 鸡), or any other meaningful one-character word.

The problem with the expander was that once I enabled my patch to add spaces around all Chinese characters, it would somehow eat up all my single characters. I haven't investigated this at all yet, but if I had to guess, it's just some minor bug -- the expander would just need to pass single-character words through untouched rather than trying to convert them?
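Something along these lines, maybe (an untested sketch; I'm assuming search_expand_cjk() only looks at $matches[0]):

// Untested sketch: pass single characters through untouched instead of
// handing them to the pair expander.
function cjksearch_expand_cjk($matches) {
  if (preg_match('/^.$/u', $matches[0])) {
    // A lone CJK character: keep it as its own token.
    return ' ' . $matches[0] . ' ';
  }
  // Otherwise fall back to the existing overlapping-pair expansion.
  return search_expand_cjk($matches);
}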

Steven

Status: Needs work » Fixed

After your recent explanation in private, I think I understand the issue:

When you let the preprocessor add spaces, the resulting words would always be treated as a phrase. E.g. with the 'overlapping pairs' preprocessor, searching for ABCD EFG (imagine these are CJK characters) meant searching for "AB BC CD" "EF FG".

That is an easy solution, but it is not ideal if the user does not use spaces when searching either. So I changed CVS so that the split words inherit their status from the original. Now, searching for ABCD EFG means searching for AB BC CD EF FG. Searching for "ABCD" EFG means searching for "AB BC CD" EF FG. Etc.

But your mail made me realize that the current hardcoding of overlapping pairs is a bit silly. If you have the minimum word length set to 3 or higher, no CJK would be indexed at all, because each CJK token was hardcoded to be 2 characters long. So I changed it so that it uses overlapping words of "minimum word length". If this is set to 2, it behaves as before (ABCDE to AB BC CD DE). If set to 3, for example, it will use 3 characters per piece (ABCDE to ABC BCD CDE). If set to 1, it will split into loose characters (ABCDE to A B C D E).
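Roughly like this (again a simplified sketch with a made-up function name; the real code lives in search_expand_cjk()):

function cjk_ngrams($text, $min) {
  // Split the UTF-8 string into individual characters.
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $n = count($chars);
  if ($n <= $min) {
    // Too short to split: keep the run as a single token.
    return $text;
  }
  // Overlapping pieces of $min characters each.
  $pieces = array();
  for ($i = 0; $i <= $n - $min; $i++) {
    $pieces[] = implode('', array_slice($chars, $i, $min));
  }
  return implode(' ', $pieces);
}
// cjk_ngrams('ABCDE', 2) => 'AB BC CD DE'
// cjk_ngrams('ABCDE', 3) => 'ABC BCD CDE'
// cjk_ngrams('ABCDE', 1) => 'A B C D E'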

So in fact, you can now do what you want to do without touching any code. Still, I added a checkbox to admin/settings/search to turn off the standard CJK code, so you can do anything you want.

In the end, the search_expand_cjk() idea is still a hack: the only way to properly search CJK is to use dictionary-based splitters. The default is simply meant so that search will return at least some useful CJK results rather than not working at all.

By the way, I think the problem with your own preprocessor was that it added a space to the beginning and end of the string as well.

Wesley Tanaka

Thanks, I'll try it out! Would it be possible to add one more setting, something that would effectively allow different minimum word lengths for CJK and non-CJK characters? If you think about it, that would make sense -- most Chinese "words" (both search-meaningful and not) are 1-3 characters long, with a rare few being 4 or 5, whereas most meaningful English search words are longer.

I was thinking about this last night, and I do think a 1-character minimum makes sense for Chinese as the default case:

There are a lot of meaningful nouns that are 1 character long in Chinese: cat, dog, bowl, fire, book, tree, road (the search I was trying when I first noticed this), light, door, lock, water...
Most common verbs and adjectives are also 1 character long: hot/cold, scalding, frozen, light/heavy, bright, quick, run, walk...

In the end, the search_expand_cjk() idea is still a hack: the only way to properly search CJK is to use dictionary-based splitters. The default is simply meant so that search will return at least some useful CJK results rather than not working at all.

I still disagree that a dictionary-based splitter is a magic wand for Chinese -- though here I may only be able to speak for Chinese-as-a-second-language speakers; I don't know how native speakers would search for things. With dictionary splitting, you will still run into a class of problems like the following. Consider these words:

鸡: chicken -- the whole, usually live, animal
烤鸡: roast chicken
小鸡: chick (baby chicken, lit. "small chicken")
鸡肉: chicken meat -- the sense in which you'd see "chicken" on a menu
鸡腿: drumstick
鸡蛋: egg
火鸡: turkey
鸡女: prostitute

Should a search for 鸡 return hits for only the first "word"? All of them? Only the ones that relate to actual chickens? I wouldn't want 鸡肉 to be excluded completely just because the dictionary splitter bound it as a separate word.

By the way, I think the problem with your own preprocessor was that it added a space to the beginning and end of the string as well.

So for future reference, the idea would be to do the trim() in the preprocessor before the string is returned? trim(preg_replace(...))
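i.e. my original hack adjusted like this (untested):

function cjksearch_search_preprocess($text) {
  // Pad each CJK character with spaces, then drop the padding that ends up
  // at the very start and end of the string.
  return trim(preg_replace('/(['. PREG_CLASS_CJK .'])/u', ' \\1 ', $text));
}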

Wesley Tanaka

So I changed it so that it uses overlapping words of "minimum word length".

I grabbed CVS (2005-11-29 15:11 +08:00) but didn't see this fix. Is it in there?

Anonymous

Status: Fixed » Closed (fixed)