By jb605
I am trying to set up a Chinese portal website using Drupal. In my experimenting I seem to have a problem searching Chinese. I know phpBB has no problem searching Chinese, so it should not be a PHP issue. Does anybody have any clue?
Thanks.
Comments
More information required
What Drupal version are you using? Do you have cron set up? What do you mean by "a problem"?
problem searching multibyte languages
Hi, thanks for getting back so quickly.
I just downloaded and installed Drupal from the front page of this site today; it says 4.2.0.
If I search for an English term from an article posted today, I can find it. But if I search for a Chinese term, nothing comes up.
I do have cron enabled. I also ran it by accessing cron.php.
I checked the search_index table via phpMyAdmin, which appears to hold all the terms that can be searched. It turns out it has very few Chinese terms in it. My site will be mainly in Chinese, so I need some way to search Chinese terms. Chinese characters are represented as two bytes.
BTW, I have changed the default encoding to gb2312, which will be the setting for all my pages. Is that a possible problem? I am not sure how staying with UTF-8 would help search multibyte characters, though, because indexing works by breaking text into words (using delimiters as word boundaries, which does not work for multibyte languages) and then putting each word into the search_index table.
I have played with phpBB before; I didn't pay much attention to this at the time, but it does search Chinese words and phrases. I just went back and checked the code: it indexes English words, plus any Chinese text that happens to fall within an English-style word boundary, but nothing at the level of actual Chinese characters or phrases. It tests the text, and if it is in a multibyte language it uses LIKE instead of the regular = as the matching criterion. Of course, that way it searches against the actual text of each post (node title/content here), because the multibyte text cannot be indexed.
What do you think? Is it possible that this could be considered here? Say, if the default language is a multibyte language, then use the different approach. I have seen a few Chinese websites out there using Drupal (apparently modified) that can search Chinese, but I don't see anybody discussing this here.
Of course, it would be great if we could search multibyte text while maintaining the same efficiency as the current search mechanism.
Thank you very much for reading this.
Database charset
What charset/encoding does your database use? I think this is a matter of changing the default charset of your database to match that of the pages generated by Drupal.
Could you type some multibyte Chinese characters that I could use for testing (or is that impossible using UTF-8)?
database charset
I don't know how to find out the database charset. I have used Chinese on a few sites with MySQL databases, and I never noticed needing to set the charset, so I assume the database just stores whatever I send in. If I feed it gb2312 (the common charset for simplified Chinese), it happily gives the data back exactly as I fed it in.
I can type Chinese here, but it will be in UTF-8, not gb2312. I think this is because IE or Windows tries to detect the charset of the page and then inputs the corresponding characters for that charset. The characters will be in the current charset of this page. If you view it with the same charset, it will still be readable, but if you view it with a different charset from the original, it will show up as garbage. Here is one sentence:
这是一个非常好的软件。
So input into the database, storage, and output are not the problem. Only the search is not working, because of the way the index is created.
More thoughts and pointers
Some more thoughts and pointers (but no fixes):
You should check whether both charsets match; if your database uses ISO-8859-1 and the data you want to search for is provided in gb2312, then such search might not yield (m)any results.
I think that is why searching for 这是一个非常好的软件。 on drupal.org yields no result. It is stored in the database as è¿æ¯ä¸ä¸ªé常好ç软件ã.
Maybe use this page as a starting point.
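The mismatch described above is easy to reproduce. As a quick illustration (not part of the original thread's tooling), here is a Python sketch showing how UTF-8 bytes read back as Latin-1 turn into exactly the kind of garbage quoted above, and why the damage is reversible as long as no bytes are dropped:

```python
# A UTF-8 encoded Chinese sentence (the test sentence from this thread).
original = "这是一个非常好的软件。"
utf8_bytes = original.encode("utf-8")

# Reading those bytes back as if they were Latin-1 produces mojibake,
# similar to the garbled string stored in the drupal.org database.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # starts with "è¿..."

# The round trip recovers the original, since Latin-1 maps every byte
# to a character and back without loss:
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)  # True
```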
using gb2312 in mysql with php
Thanks. I checked my database, and it has all kinds of charsets in it (more than 10, found using the command mysqladmin variables). I guess my hosting company just put in everything they could find. It has gb2312 among them, so I guess that is why my data can be retrieved from the database and shown. I looked at the data in my database through phpMyAdmin, and it is readable Chinese. However, I can search for some Chinese phrases, but not all.
I am not sure why the Chinese sentence is stored in your database in a different form; maybe it's because gb2312 is not a subset of UTF-8? My database has gb2312 as one of its charsets, so storing and retrieving are not a problem.
I will read in that page and see what I can do with the search.
Using UTF-8 is fine in most situations, but if a user has IE set to gb2312 because he only reads Chinese, then he will have problems. It would be fine if all users had auto-select encoding turned on in IE, but I am not sure how many people have their computers set up like that.
I will look into the multi-byte string support also.
Thank you very much for your information.
Search module
Take a look at the calls to preg_replace() in the search module near line 240, if you know some PHP. I think these might be the culprit. If you are not going to contribute a fix, I suggest filing a bug report along with a summary of (y)our findings. This will help us fix the problem. Without a bug report, we might forget to look into this.
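To illustrate why a word-boundary based tokenizer is the likely culprit, here is a rough sketch (in Python rather than PHP, and not Drupal's actual code) of what happens when an indexer splits text on word boundaries:

```python
import re

text = "Drupal search 这是一个非常好的软件"

# An indexer that only recognizes ASCII "words" (a crude stand-in for a
# Latin-1-era tokenizer) misses the Chinese text entirely:
ascii_words = re.findall(r"[A-Za-z0-9]+", text)
print(ascii_words)  # ['Drupal', 'search']

# Even a Unicode-aware \w+ does not help much: Chinese uses no spaces,
# so the entire phrase comes back as one giant "word" that users would
# never type verbatim into a search box.
unicode_words = re.findall(r"\w+", text)
print(unicode_words[-1])
```

Either way, the search_index table ends up with no usable Chinese terms, which matches what jb605 observed in phpMyAdmin.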
Two big problems
Yes, there are two big problems that are making me switch back to XOOPS:
1. The first letter of every title is uppercased, which causes an error with multibyte charsets.
2. Multibyte content cannot be searched.
I am waiting for your flexible solution. I think that when the default charset is changed, everything should be set up correctly by the system.
Thanks!
Legacy software and UTF8
The thing is, Unicode/UTF8 is the most portable way of encoding data. It's the best way of encoding stuff... if we were using Latin-1 or something, you wouldn't have been able to post your Chinese sentence in the first place. There's also a Russian thread going on here. UTF8 shows its advantages easily.
What happens when you paste text from a MBCS application into a Drupal form and submit it? It should normally work, but I'm not sure.
If you make a PHP script that simply echoes the data sent to it, you could use it as a wrapper around the browser and take advantage of the browser's anything-to-UTF-8 conversion. A bit complicated, but I don't think there's an easier way aside from using an external utility.
I found this utility:
http://freshmeat.net/projects/autoconvert/?topic_id=849
You will need to compile it, but it seems to do the conversions you need.
Googling for Big5 and UTF8 revealed a lot of Chinese pages too, but I couldn't make much of them ;).
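The core of what a converter like autoconvert (or iconv) does can be sketched in a few lines; this Python example is only an illustration of the decode-then-encode round trip, not the tool's actual pipeline:

```python
# Round-tripping text between gb2312 and UTF-8. The same sentence is
# stored as two different byte sequences depending on the charset,
# which is exactly why a search for one encoding finds nothing stored
# in the other.
sentence = "这是一个非常好的软件。"

gb_bytes = sentence.encode("gb2312")    # how a gb2312 page stores it
utf8_bytes = sentence.encode("utf-8")   # how a UTF-8 page stores it
print(gb_bytes != utf8_bytes)  # True

# Converting between charsets is just decode-then-encode:
converted = gb_bytes.decode("gb2312").encode("utf-8")
print(converted == utf8_bytes)  # True
```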
Default charset doesn't matter.
It doesn't matter what charset the user has IE set to (unless he's forcing charsets, which he shouldn't be). We're sending it charset utf-8 (both in the HTML meta tags, and the HTTP header), so it'll render fine anywhere, no matter what their default charset is - the specified one overrides that, always.
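For reference, the two places the charset is declared look roughly like this (a generic sketch of an HTTP response and meta tag, not Drupal's exact output):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```

Per the HTML and HTTP specifications, an explicitly declared charset takes precedence over the browser's default.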
You need to make sure your database supports UTF-8 (although that shouldn't actually affect searching anyway, just functions that do things like substring() within SQL, since those will get string lengths wrong).
I suspect that the issue is that something in Drupal is doing something to the search string in a non-multibyte compatible fashion.
multibyte charset search
I have figured out a workaround that does not use the search table, because I cannot get PHP to index Chinese yet; instead I search against the node table directly. See the bug report:
http://drupal.org/node/view/2142
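The workaround can be sketched as follows. This uses Python with SQLite purely for illustration; the actual site ran MySQL and the real query is in the bug report, so the simplified table here is hypothetical:

```python
import sqlite3

# Hypothetical, simplified stand-in for Drupal's node table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (nid INTEGER, title TEXT, body TEXT)")
conn.execute("INSERT INTO node VALUES (1, '软件介绍', '这是一个非常好的软件。')")
conn.execute("INSERT INTO node VALUES (2, 'English post', 'Plain English text.')")

# Instead of exact matches against a word index (which holds no Chinese
# terms), do a LIKE substring match against the node text itself:
keyword = "软件"
rows = conn.execute(
    "SELECT nid FROM node WHERE title LIKE ? OR body LIKE ?",
    (f"%{keyword}%", f"%{keyword}%"),
).fetchall()
print(rows)  # [(1,)]
```

This trades the index's efficiency for a full scan of the node table, which is why it only suits small or low-traffic sites, as noted later in the thread.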
Have to use gb2312 because I cannot input Chinese in UTF-8
I was trying to switch to UTF-8 for Chinese, but I just found out that I cannot input Chinese under the UTF-8 charset. All my programs that can edit text files treat my Chinese input as gb2312. I don't think Windows has a setting that lets me specify the charset when editing a text file. But if I use the gb2312 charset to open a text file with Chinese entered in Windows, I have no problem.
The only way I can enter Chinese is through the web interface. I used to be able to input Chinese into PHP code directly when using the gb2312 charset, but with UTF-8 it is not displayed correctly.
Not a Drupal problem?
Are you saying that it works fine as long as you are using Drupal, but that things go wrong as soon as you use another application (one that doesn't support UTF-8)?
it is not drupal problem
I understand that. I was just trying to explain why I cannot use UTF-8. It works fine as long as I use Drupal by itself, so specifying UTF-8 with Drupal is fine in theory. But life is never so simple. Just as Drupal has to work with SQL, PHP, and an operating system, I have to deal with text in situations other than Drupal itself, and I have to make all of these work together.
Tool?
What tool fails to work and how are you using it to input/export data? I'd like to know so I can understand the problem.
input/output chinese into/from mysql/Drupal
I use Gvim, the Windows version of vim, to edit text files. I use PuTTY to connect to the server, and on the Linux box I can use vim and type Chinese directly into it. I am working from the English version of Windows XP, with the default language for non-Unicode programs set to Simplified Chinese. So I can edit text files and PHP scripts on my computer with Gvim or even Notepad, input Chinese, upload the files to the server, and they work just fine. I guess Gvim and Notepad are probably considered non-Unicode programs. Maybe I am wrong, but I remember that if I set the non-Unicode default to English, neither Gvim nor Notepad shows Chinese properly.
Anyway, I think this is probably more a task for the administrator of a Drupal site than for the Drupal development group.
Reading English is really tiring
I have run into the same problem and have no idea where it goes wrong; nodes cannot be entirely in Chinese. I have been looking at this for two days and still cannot figure out why the garbled characters appear.
My setup: apache2 + php5 + mysql 4.2.12
jb605, may I send you an email?
Chinese character display in content
jb605, do you have any solution for displaying Chinese in content?
Phrase splitting and strtolower
A Japanese user had similar problems with search.module. Aside from encoding issues, the main problem was that, like Chinese, Japanese does not use spaces, so entire sentences are indexed rather than words. Unless I'm mistaken, Drupal only searches for identical word matches unless you explicitly tell it not to, using wildcards.
For Japanese, there is a utility called kakasi which splits sentences into words (actually I believe its main purpose is Kanji-to-Kana conversion, but it also does the splitting). Perhaps there is a similar utility for Chinese?
Another problem is the call to strtolower() to convert the search string to lowercase. It seems PHP automatically assumes a Latin-1 character set, and thus converts a bunch of bytes in the 128-255 range (which are used in the non-ASCII parts of UTF-8 encoding) to their Latin-1 uppercase/lowercase equivalents, e.g. Á (0xC1) to á (0xE1). I checked it using the following script:
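The script referred to here did not survive in this copy of the thread. As a rough equivalent, the following Python sketch counts which byte values change under a Latin-1 interpretation of lowercasing; this only approximates a byte-wise strtolower, and as the poster notes, the real PHP results depend on locale:

```python
# Check which byte values in the 128-255 range change when lowercased
# as Latin-1 characters. These are the bytes a byte-wise strtolower
# could silently rewrite inside UTF-8 sequences.
changed = []
for b in range(128, 256):
    ch = bytes([b]).decode("latin-1")
    if ch.lower() != ch:
        changed.append(b)

print(len(changed))                  # number of affected byte values
print(hex(changed[0]), hex(changed[-1]))
```

Under Latin-1 these are the uppercase letters À through Þ (0xC0-0xDE, excluding the × sign at 0xD7).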
The results probably depend on your system's language/locale settings, but over here I get a significant amount of conversions in the upper ASCII range.
This is not really a major problem, because the indexed words are lowercased too. It simply means that data in the search index is in a somewhat encoded form. On rare occasions it could cause extra unrelated search results, though often the strtolower'ed characters are either invalid or located in a completely different character range, making overlap with other unrelated words very unlikely. In fact, Chinese and Japanese characters seem mostly lowercase-safe. Russian, however, gets completely mangled.
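That claim can be checked with a small Python sketch simulating byte-wise Latin-1 lowercasing of UTF-8 data (again only an approximation of PHP's behavior, not the module's actual code):

```python
def bytewise_lower(s: str) -> bytes:
    """Lowercase a string's UTF-8 bytes as if each byte were Latin-1,
    approximating what a locale-unaware strtolower does."""
    return s.encode("utf-8").decode("latin-1").lower().encode("latin-1")

chinese = "这是一个非常好的软件"
russian = "Привет"

# Chinese UTF-8 bytes happen to avoid the Latin-1 uppercase range
# (0xC0-0xDE), so they survive untouched:
print(bytewise_lower(chinese) == chinese.encode("utf-8"))  # True

# Cyrillic lead bytes (0xD0/0xD1) fall inside that range and get
# "lowercased" to 0xF0/0xF1, corrupting the encoding:
print(bytewise_lower(russian) == russian.encode("utf-8"))  # False
```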
tools for splitting no-space languages
I guess there might be tools for splitting Chinese as well. Currently I just search against the node table directly, skipping the search index completely. That is OK for me because my site will not have many entries, nor is it very dynamic. But for people who expect a big database, indexing is probably the only way out.
how to do that?
Can you provide more information on how to do that? Thanks. I have been confused about Chinese search for a long time.