In an attempt to get Chinese search working, I've added this module to my installation:
```php
function cjksearch_search_preprocess($text) {
  $text = preg_replace('/(['. PREG_CLASS_CJK .'])/u', ' \\1 ', $text);
  return $text;
}
```
I also had to get rid of
```php
// $text = preg_replace_callback('/['. PREG_CLASS_CJK .']+/u', 'search_expand_cjk', $text);
```
because it was messing up the indexing process -- but that's a different issue.
The issue I'm reporting here is that after hours and hours of headaches, I discovered this.
In do_search(), the first call is to search_parse_query(). Say I searched for 村. The return value from that has $query[1][0] == ' 村 ', with an extra space before and after the character. The query that ends up getting run then contains (d.data LIKE '% 村 %'), which looks for the character surrounded by two spaces. That substring never appears in the indexed string, so no matter what, no results are returned.
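To make the mismatch concrete, here is a minimal stand-alone sketch (not Drupal's actual code; the dataset contents are made up for illustration). The indexed data separates tokens with single spaces, so a pattern that effectively demands double spaces around the character can never match:

```php
<?php
// Hypothetical reconstruction of the mismatch, not Drupal's real code.
// search_index() stores tokens separated by single spaces:
$dataset = ' 天 村 口 ';

// A pattern asking for the character with single spaces around it,
// i.e. the substring " 村 ", does match the dataset:
var_dump(strpos($dataset, ' 村 ') !== false);  // bool(true)

// But when the preprocessor has already padded the query token to
// ' 村 ' and the query template wraps it in spaces again, the query
// effectively asks for "  村  " (double spaces), which never occurs:
var_dump(strpos($dataset, '  村  ') !== false); // bool(false)
```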
I've attached the hacks that I added to get the whole thing working.
| Comment | File | Size | Author |
|---|---|---|---|
| | patch_14.txt | 1.25 KB | Wesley Tanaka |
Comments
Comment #1
Wesley Tanaka commented
Any suggestions on what the "correct" fix might be?
Comment #2
Steven commented
This is by design, to ensure only whole tokens/words are matched. All words in the search_dataset table are separated by spaces, and each field starts and ends with a space (look at search_index(), the $accum variable).
If you index the phrase "the quick brown fox", then search_dataset will contain " the quick brown fox " (note the leading and trailing spaces). So, if you search for "fox", it will match on "% fox %" and the item is found. If you search for 'ox', it will match on "% ox %" and nothing is found. Your patch would cause "ox" to also match "fox".
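The whole-token scheme described above can be sketched in a few lines (a simplified stand-in; the real logic lives in search_index() and do_search(), with the match done in SQL via LIKE):

```php
<?php
// Sketch of whole-token matching against a space-delimited dataset.
// The dataset starts and ends with a space, and every word is
// surrounded by spaces, so wrapping the search word in spaces
// guarantees only whole tokens match.
$dataset = ' the quick brown fox ';

function matches_token($dataset, $word) {
  // Equivalent in spirit to SQL: data LIKE '% <word> %'
  return strpos($dataset, ' ' . $word . ' ') !== false;
}

var_dump(matches_token($dataset, 'fox')); // bool(true)  -- whole token
var_dump(matches_token($dataset, 'ox'));  // bool(false) -- partial token
```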
The CJK expander takes a string of CJK values and expands it into overlapping pairs, separated by spaces. This should provide more accurate results than simply treating each unique CJK character as a separate token. It is also the approach taken by popular search software (e.g. Lucene).
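The overlapping-pair expansion works roughly like this (a minimal sketch in the spirit of search_expand_cjk(), not the actual Drupal implementation):

```php
<?php
// Expand a run of characters into overlapping pairs (bigrams),
// separated by spaces, as the CJK expander does conceptually.
function expand_cjk_pairs($text) {
  // Split into individual (possibly multibyte) characters.
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $pairs = array();
  for ($i = 0; $i < count($chars) - 1; $i++) {
    $pairs[] = $chars[$i] . $chars[$i + 1];
  }
  return implode(' ', $pairs);
}

echo expand_cjk_pairs('ABCD'); // AB BC CD
```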
If I had to guess, I'd say your problem is the minimum keyword length which is set too high. Especially if you are treating each character as a word, you'd need minimum keyword length 1. Though with the default code, and set at 2, it should work.
Comment #3
Wesley Tanaka commented
Both my minimum word length settings are set to 1. I did a complete re-index after resetting them, and the index status has gone up to "100% indexed". {search_index}'s word field contains a lot of single-character Chinese characters.
I thought that the spaces in the '%% %s %%' pattern would deal with that, but I guess I was wrong.
From what I understand, in Drupal, that would prevent me from searching for "chicken" (the single character 鸡), or any other meaningful one-character word.
The problem with the expander was that once I enabled the patch to add spaces around all Chinese characters, it would somehow eat up all my single characters. I haven't investigated this at all yet, but if I had to guess, it's just some minor bug: it probably needs to pass single-character words through untouched rather than trying to expand them.
Comment #4
Steven commented
After your recent explanation in private, I think I understand the issue:
When you let the preprocessor add spaces, the resulting words would always be treated as a phrase. E.g. with the 'overlapping pairs' preprocessor, searching for `ABCD EFG` (imagine these are CJK characters) meant searching for `"AB BC CD" "EF FG"`. This is an easy solution, but is not ideal if the user does not use spaces when searching either. So, I changed CVS so that the split words inherit their status from the original. Now, searching for `ABCD EFG` means searching for `AB BC CD EF FG`. Searching for `"ABCD" EFG` means searching for `"AB BC CD" EF FG`. Etc.
But your mail made me realize that the current hardcoding of overlapping pairs is a bit silly. If you have the minimum word length set to 3 or higher, it would not index any CJK at all, because each CJK token was hardcoded to 2 characters long. So I changed it so that it uses overlapping words of "minimum word length" characters. If this is set to 2, it behaves as before (`ABCDE` becomes `AB BC CD DE`). If set to 3, for example, it will use 3 characters per piece (`ABCDE` becomes `ABC BCD CDE`). If set to 1, it will split into loose characters (`ABCDE` becomes `A B C D E`).
So in fact, you can now do what you want to do without touching any code. Still, I added a checkbox to admin/settings/search to turn off the standard CJK handling, so you can do anything you want.
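The generalized splitter described above can be sketched as follows (a hypothetical reconstruction of the CVS change, not the committed code): overlapping windows whose width equals the minimum word length setting.

```php
<?php
// Split a run of characters into overlapping pieces of $min_len
// characters each, mirroring the behavior Steven describes:
//   split_cjk('ABCDE', 2) -> 'AB BC CD DE'
//   split_cjk('ABCDE', 3) -> 'ABC BCD CDE'
//   split_cjk('ABCDE', 1) -> 'A B C D E'
function split_cjk($text, $min_len) {
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $n = count($chars);
  if ($n < $min_len) {
    return $text; // too short to split; pass through untouched
  }
  $pieces = array();
  for ($i = 0; $i <= $n - $min_len; $i++) {
    $pieces[] = implode('', array_slice($chars, $i, $min_len));
  }
  return implode(' ', $pieces);
}

echo split_cjk('ABCDE', 2) . "\n"; // AB BC CD DE
echo split_cjk('ABCDE', 3) . "\n"; // ABC BCD CDE
echo split_cjk('ABCDE', 1) . "\n"; // A B C D E
```

Note that with min_len 1 this degenerates to single characters, which is exactly the behavior Wesley's patch was after.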
In the end, the search_expand_cjk() idea is still a hack: the only way to properly search CJK is to use dictionary-based splitters. The default is simply meant so that search will return at least some useful CJK results rather than not working at all.
By the way, I think the problem with your own preprocessor was that it added a space to the beginning and end of the string as well.
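A corrected version of the preprocessor along those lines would trim the result before returning it (a sketch only; PREG_CLASS_CJK here is a tiny stand-in for Drupal's much larger CJK character class, so the example runs on its own):

```php
<?php
// Stand-in for Drupal's PREG_CLASS_CJK constant (the real one covers
// many more ranges; this is just CJK Unified Ideographs for the demo).
define('PREG_CLASS_CJK', '\x{4e00}-\x{9fff}');

function cjksearch_search_preprocess($text) {
  // Surround each CJK character with spaces, as in the original patch...
  $text = preg_replace('/(['. PREG_CLASS_CJK .'])/u', ' \\1 ', $text);
  // ...then trim, so the returned string no longer starts/ends with a
  // space that would break matching against '% word %' patterns.
  return trim($text);
}

echo cjksearch_search_preprocess('村'); // 村 (no surrounding spaces)
```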
Comment #5
Wesley Tanaka commented
Thanks, I'll try it out! Would it be possible to add one more setting, something that would effectively allow different minimum word lengths for CJK and non-CJK characters? If you think about it, that would make sense: most Chinese "words" (both search-meaningful and not) are 1-3 characters long, with a rare few being 4 or 5, whereas most meaningful English search words are longer.
I was thinking about this last night, and I do think a 1-character minimum is meaningful for Chinese as the default case. There are a lot of meaningful nouns that are 1 character long in Chinese: cat, dog, bowl, fire, book, tree, road (the search that made me notice this in the first place), light, door, lock, water...
Most common verbs and adjectives are also 1 character long: hot/cold, scalding, frozen, light/heavy, bright, quick, run, walk....
I still disagree that the dictionary-based splitter idea is a magic wand for Chinese -- but here I may only be able to speak for Chinese-as-a-second-language speakers; I don't know how native speakers would search for things. With dictionary splitting, you will still run into a class of problems like the following. Consider these words:
- 鸡: chicken -- the whole, usually live, animal
- 烤鸡: roast chicken
- 小鸡: chick (baby chicken; lit. "small chicken")
- 鸡肉: chicken meat -- in the sense you'd see "chicken" on a menu
- 鸡腿: drumstick
- 鸡蛋: egg
- 火鸡: turkey
- 鸡女: prostitute
Should a search for 鸡 return only hits for the first "word"? All of those? Only the ones related to actual chickens? I wouldn't want 鸡肉 to be excluded completely just because the dictionary splitter bound it as a separate word.
So for future reference, the idea would be to do the trim() in the preprocessor before the string was returned?
`trim(preg_replace(...))`
Comment #6
Wesley Tanaka commentedI grabbed CVS (2005-11-29 15:11 +08:00) but didn't see this fix. Is it in there?