Hi,

We can easily post Unicode text in Drupal, and it looks fine so far. We did not use the translation facility, as the website will be in only one language (90% non-English). We are using 5.2 and it is very good, except for searching Unicode text. We even have taxonomy working in Unicode.

If we search for any Unicode text, Drupal does not show anything, but the same search in Google shows our website.

1) Can somebody tell me whether Drupal search can search Unicode text?

2) Another small problem is that the pathauto module does not create Unicode aliases, even though Unicode strings are permitted in URLs (except in domain names). That's not a big problem, though.

Am I missing anything?

Thanks,
-Amit

Comments

cog.rusty’s picture

Unicode search should work ok.

Take a look at your database using phpmyadmin, at the "Collation" column, to see if all your tables are in utf8.

desimind’s picture

I have the same problem with Unicode text. My database was created by Drupal itself; I double-checked it and it seems to be set correctly. I am able to save Unicode text, but search still does not return any results.

I am using Drupal 5.3, PHP 5.2.5, and MySQL 5.0.45 with the PHP mbstring extension, on SiteGround hosting.

I have tried the same thing on my own test server with Drupal 6.1, and it's the same problem. On this same server I have Joomla installed and Unicode search works perfectly there, so I take it the cause is not PHP, my web browser, or even the web server (Apache in this case).

When I search for Unicode text I always get the following error:

"You must include at least one positive keyword with 3 characters or more."

Any help would be much appreciated.

Samir

cog.rusty’s picture

It is possible that your content has not been indexed yet. Indexing happens gradually every time cron runs, and this is set up in the admin/settings/search page.

Try to run cron.php manually with your browser, to see if it makes a difference.

desimind’s picture

Thanks for your response c.r.

Yup, I had thought that could be it too, and I ran it several times before I checked this. In fact, I also added some English content and reran cron, and this newly added content shows up in search, but the old Unicode content still does not.

cog.rusty’s picture

Have you tried to check the collation of your database with phpmyadmin? Search depends on comparison, and comparison depends on collation.

Are the collations of all your tables utf8_general_ci (or even utf8_unicode_ci)?
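
For anyone who would rather script this check than click through phpMyAdmin, here is a minimal Python sketch. It assumes you have already fetched (table name, collation) pairs, for example from MySQL's information_schema.TABLES (columns TABLE_NAME and TABLE_COLLATION). The helper name and the sample rows are made up for illustration, and nothing here connects to a database.

```python
# Query shown for reference only; run it with whatever MySQL client you use.
COLLATION_QUERY = (
    "SELECT TABLE_NAME, TABLE_COLLATION "
    "FROM information_schema.TABLES WHERE TABLE_SCHEMA = %s"
)

def non_utf8_tables(rows):
    """Return names of tables whose collation is set and is not a utf8 one.

    Rows with a NULL collation (e.g. views) are skipped.
    """
    return [name for name, collation in rows
            if collation is not None and not collation.startswith("utf8")]

# Made-up sample rows, shaped like the result of COLLATION_QUERY.
sample = [
    ("node", "utf8_general_ci"),
    ("search_index", "utf8_general_ci"),
    ("legacy_table", "latin1_swedish_ci"),
]
print(non_utf8_tables(sample))  # ['legacy_table']
```

An empty list would mean every table already uses a utf8 collation.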

desimind’s picture

It's set to utf8_general_ci.

I have 3 installations of Drupal.

1. My laptop; Drupal 5.2
2. A Test server Drupal 6.1
3. A website hosted at siteground. Drupal 5.3

In all the three cases the tables were created by Drupal's install.php script.

Should I try setting it to utf8_unicode_ci?

Also the message I get when I search is "You must include at least one positive keyword with 3 characters or more."

Now, I don't claim to be a PHP expert, but I looked at the code for the search module and it seems this message is displayed before any database activity takes place. Up to the point where the message is displayed, the code seems to be parsing the input search text, setting up some arrays, etc.

cog.rusty’s picture

If the collation of *all* the tables is utf8_general_ci then it should be OK. The only difference with utf8_unicode_ci is how it sorts equivalent letters in some languages. So, I don't know what the problem is...

desimind’s picture

My site contains some text in an Indian language called Marathi.

Here is my search phrase: मराठी भाषांतर. In the search module this gets broken down into two words, and each word is passed into a function called search_simplify.

I added logging to this function and found that, at around line 367, there is the following call:

$text = preg_replace('/['. PREG_CLASS_SEARCH_EXCLUDE .']+/u', ' ', $text);

Before this line is executed the first word is मराठी, but afterwards it becomes मर ठ

Naturally, this text is not found in the database.

PREG_CLASS_SEARCH_EXCLUDE is a constant defined at the beginning of search.module, and it is causing some characters to be replaced with spaces...
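
To illustrate what appears to be happening, here is a small Python sketch (not Drupal's PHP code). It rests on two assumptions: that the exclusion class covers Unicode combining marks (general categories Mn, Mc, Me), which would include the Devanagari vowel signs, and that the minimum keyword length is 3, as the error message suggests. The function names are hypothetical.

```python
import unicodedata

MIN_WORD_LENGTH = 3  # assumption, taken from the wording of the error message

def strip_marks(text):
    """Replace every combining mark (category M*) with a space.

    This mimics, under the stated assumption, what the preg_replace call
    does when the excluded character class contains the vowel signs.
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("M") else ch
        for ch in text
    )

def keywords(query):
    """Split the simplified query on whitespace and keep only long-enough words."""
    return [w for w in strip_marks(query).split() if len(w) >= MIN_WORD_LENGTH]

print(strip_marks("मराठी").split())   # ['मर', 'ठ']
print(keywords("मराठी भाषांतर"))       # []
print(keywords("drupal search"))      # ['drupal', 'search']
```

If this is indeed the mechanism, every fragment of the Marathi phrase ends up shorter than 3 characters, so no keyword survives. That would explain why the "at least one positive keyword" message appears before any database query is made.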

Does anyone have any thoughts on how I could proceed from here?

desimind’s picture

I commented out the line that I reported in my previous post, and yet the search term was still not found, which led me to the database once again. I ran a simple query on the table "search_index" to see if any part of the search term actually existed in the DB, and to my surprise it did not, even though there were many Marathi words in the DB. Then I searched for a word which did exist in the table "search_index", and sure enough the search returned valid results.

I added another Page in Marathi, this time something really small (only a few lines), and re-ran cron.php. Then I searched for some text from this new page and the search seemed to function fine.

There is another problem I have experienced which I did not think was relevant to this search issue, but now I think it is. Here is how I reproduced it.

I deleted all content from my site (English and Marathi). I then added a small Marathi article (with an English title) and ran cron.php. Then I searched for some text from this article and the search worked fine. I had, of course, restored search.module to the original one that came with Drupal. Now here is where it gets tricky. I added a new Page with relatively large Unicode content in it; when I opened this content in WordPad on Windows, it was about 18 pages. I saved this Page and also set the "Promote on Front Page" flag. When I go to the front page, the summary shows up fine, but when I click the "read more" link, no content shows up. The page opens and shows only the title of this large article; nothing from the body shows up.

I ran cron.php and searched for some text from this new article but got no results. Also, the number of rows in the table "search_index" was the same before and after running cron.php. I re-ran cron.php many times (probably about 10 times) and still nothing was added to the table "search_index". I modified the search settings as per the documentation to index only 10 items and ran cron.php, but there was no difference; then I changed it to 500 items and ran cron.php, and there was still no difference.

I then added a third small piece of content, again in Marathi, and ran cron.php. I noticed that "search_index" did get updated; I then searched for some text from this new content and the search worked fine, but still nothing from the second, large content shows up. If I search for the title of the second article (which is in English), the search works fine, but it looks like nothing from its body has been indexed.

I also added the same article as a Story, and the same problem exists. I looked at the contents of the table called node_revisions (using phpMyAdmin) and the article seems to have been saved fully and properly.

I know this was a very long post, but I hope someone can suggest something I can try.

Thanks in anticipation.

desimind’s picture

I have a workaround for the situation above.

Let me summarize the problems first.

1. Marathi content longer than a certain length does not display at all. The display page shows up empty, even though editing and saving work just fine and the content is correctly stored in the database.
2. The indexing code of the core search seems to ignore this large Unicode text. Also, I am not sure, but it seems that not all words of the Marathi text get indexed.

Now for the workaround.

1. When storing large amounts of text (which do not show up properly), set the "Input Format" to "PHP code".
2. Use the Trip Search module, a.k.a. the SQL Search module. This does not rely on the core search module and goes directly to the database for searching.

Not all folks might want to hear this, but Joomla's support for Unicode seems to be much better; none of the problems I mentioned above seem to exist in Joomla. Unfortunately, my site is up and I have enough content in there that this workaround is quicker to implement than actually migrating to Joomla.

Once I get to understanding the Drupal code better, I will attempt to fix these issues and will be sure to post here. I only got one person to respond to my post so not sure if anyone else is interested in this issue. If you are, please get in touch with me and I can share more details, content and steps to reproduce the issues I faced. I could also let you see the problem on my test site.

Thanks!!!

cog.rusty’s picture

From what you found out, it seems to be a bug. It is a good idea to post an issue for this bug and describe what you found out so far, so that a developer can see it and work on it.

Forum posts lose attention more quickly, and developers read them only by chance; there are too many forum posts.

By the way, don't give anyone else access to the PHP input format, because anyone with it could change the admin's username/password with just one line of code.

desimind’s picture

Thanks CR. I will post a bug report.

As for providing access, I will be careful but I only offered it because this is a machine which I use for testing. So even if I was to reformat it, that would be OK :-)

hychanhan’s picture

That is not working for me at all. When I try to search with Khmer Unicode text, it always returns no results (0 results),
and this message appears: "You must include at least one positive keyword with 3 characters or more."

Do you have any idea?
Please kindly help me.

Thanks,

Regards,

ChanHan Hy

Email: hy.chanhan@gmail.com

lesmorgan’s picture

I can confirm that the same problem exists when searching for content in the Devanagari writing system, which is used for multiple Indic languages. Drupal will store Devanagari script and display it with no problems, but searching for it gives the "You must include at least one positive keyword with 3 characters or more." error message, as also reported for Khmer. I suspect this has something to do with complex script handling, since Unicode Devanagari (used for Marathi) is a syllabic script and not an alphabetic script. The Unicode implementation for Devanagari is documented at http://www.unicode.org/charts/PDF/U0900.pdf . Marathi is also written using Modi script, which I have not tested. Is there an outstanding bug number for this?
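
One quick way to see the syllabic structure is to list the Unicode general category of each code point in a word. Below is a minimal Python sketch; the helper name is made up, and the Khmer sample word is my own choice rather than one taken from the posts above.

```python
import unicodedata

def categories(text):
    """Return (code point, general category) for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.category(ch)) for ch in text]

# Marathi word in Devanagari: consonant letters are Lo, vowel signs are Mc.
print(categories("मराठी"))
# [('U+092E', 'Lo'), ('U+0930', 'Lo'), ('U+093E', 'Mc'), ('U+0920', 'Lo'), ('U+0940', 'Mc')]

# Khmer sample: the same mix of letters and combining marks shows up.
print(categories("ខ្មែរ"))
```

In both scripts the vowel signs are combining marks rather than letters, so any filter that treats marks as word separators would chop words into fragments of one or two characters.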