Add mbstring support to Drupal [#26688]

The search module contains some bugs when using a native UTF-8 database connection. To correct this, some of the strtolower calls should be changed to a UTF-8 compatible version of the function, such as mb_strtolower.
The following lines should be changed:
398:
- $word = strtolower($word);
+ $word = mb_strtolower($word, 'UTF-8');
493:
- $arguments[] = str_replace('?', '%', strtolower($word));
+ $arguments[] = str_replace('?', '%', mb_strtolower($word, 'UTF-8'));
497:
- $arguments[] = strtolower($word);
+ $arguments[] = mb_strtolower($word, 'UTF-8');
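
A quick illustration of why the change matters. This is only a sketch: the sample word is made up for the example, and the mb_strtolower call is guarded because mbstring may not be installed.

```php
<?php
// "ÉCOLE" encoded as UTF-8; the É is the two bytes 0xC3 0x89.
$word = "\xC3\x89COLE";

// strtolower() works byte by byte and only knows single-byte letters,
// so the two bytes of the É are left untouched.
$plain = strtolower($word);

// mb_strtolower() decodes the UTF-8 sequence and maps É (U+00C9) to
// é (U+00E9). NULL here simply means "mbstring is not available".
$multibyte = function_exists('mb_strtolower')
  ? mb_strtolower($word, 'UTF-8')
  : NULL;
```

With plain strtolower the indexed word and the search term can end up in different cases, which is how the search misses occur.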

Comment	File	Size	Author
#1	multibyte.patch	30.04 KB	Steven

Comments

Comment #1

Steven commented 15 July 2005 at 02:33

Title:	Search module UTF-8 compatibility	» Add mbstring support to Drupal
Version:	4.6.0	»

Status	File	Size
new	multibyte.patch	30.04 KB

This minor issue is part of a bigger problem: lack of clear string handling in Drupal. There was a discussion a while ago on the mailing list (someone can dig it up if interested) about mbstring. We agreed on a plan of action, so now, here's the patch.

Sorry for the long text, but it does not make sense to review this patch without knowing the how's and why's. ;)

Situation

Because PHP's string handling does not support Unicode out of the box, we need the mbstring extension to do stuff like case conversion. But, mbstring is not a required extension (and considering its state, it never will be) so we can't rely on it. But it would be nice to use it if it's there.

Another important thing to consider is that PHP 5.1 will be getting real Unicode support (see this comment for more info). While the functionality looks great, it introduces some new syntax elements and changes string handling in significant ways. Because we don't want to mess up Drupal's code too much when it happens, I decided that this was a good opportunity to move all the encoding-specific stuff into a unicode.inc file. That way the transition later will be easier. Also, the encoding-stuff has been growing steadily, so it's not a bad idea to separate it from the rest regardless.

My patch performs various checks on startup (in the common.inc stage, i.e. after bootstrap) to see if mbstring is present. Aside from the literal check, we also have to do some configuration checks. I also took the liberty of introducing a check for the search/PCRE-compatibility issue ("Characters > 255 are not yet supported..."). It's quite simple. Errors are reported on admin/settings, and I also added a small status item there with info about which string handling is used. It suggests that people install mbstring for better Unicode support if it's not yet present.
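
A rough sketch of what such a startup check could look like. The function name, constants and messages are hypothetical and are not taken from the patch; only the PHP calls themselves (preg_match, extension_loaded, ini_get, mb_internal_encoding) are real.

```php
<?php
// Hypothetical status codes, named for this example only.
define('EXAMPLE_UNICODE_ERROR', -1);
define('EXAMPLE_UNICODE_SINGLEBYTE', 0);
define('EXAMPLE_UNICODE_MULTIBYTE', 1);

/**
 * Decide which string handling to use.
 *
 * Returns array(status code, list of messages for the settings page).
 */
function example_unicode_check() {
  $messages = array();

  // If PCRE was built without UTF-8 support, the /u modifier fails to
  // compile. Matching a single character above 255 is a cheap probe.
  if (@preg_match('/[\x{0100}-\x{017F}]/u', "\xC4\x80") !== 1) {
    $messages[] = 'The PCRE library does not support UTF-8.';
    return array(EXAMPLE_UNICODE_ERROR, $messages);
  }

  // mbstring is optional: without it, fall back to the built-in routines.
  if (!extension_loaded('mbstring')) {
    $messages[] = 'Install mbstring for better Unicode support.';
    return array(EXAMPLE_UNICODE_SINGLEBYTE, $messages);
  }

  // Function overloading would hide the byte-oriented string functions,
  // which are still needed. (The setting no longer exists in PHP 8, in
  // which case ini_get() returns FALSE and the cast yields 0.)
  if ((int) ini_get('mbstring.func_overload') !== 0) {
    $messages[] = 'mbstring.func_overload must be disabled.';
    return array(EXAMPLE_UNICODE_ERROR, $messages);
  }

  mb_internal_encoding('UTF-8');
  return array(EXAMPLE_UNICODE_MULTIBYTE, $messages);
}
```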

Now, in order to use mbstring, there is the option of overloading the standard string functions automatically. This is not really appropriate for Drupal because it prevents access to the original string APIs. Furthermore, if mbstring can already handle UTF-8, then it does not make sense for us to call custom fallback routines. So, I decided to make wrappers for the string handling operations that need them (substr, strtolower, strtoupper, ucfirst). All of the wrappers deal in characters and character indices only, so for all intents and purposes they give you Unicode support already. This means the transition to PHP Unicode will be easier for module authors. Of course, when appropriate, you can still use the plain PHP string APIs. For example, when doing a strpos + substr.
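
To make the wrapper idea concrete, here is a minimal sketch. The example_* names are placeholders rather than the functions in the patch, and the fallbacks are deliberately simple: they assume valid UTF-8 and non-negative offsets.

```php
<?php
/**
 * Character-based substr(): $start and $length count characters, not bytes.
 */
function example_substr($text, $start, $length) {
  if (function_exists('mb_substr')) {
    return mb_substr($text, $start, $length, 'UTF-8');
  }
  // Fallback: split into characters using PCRE's UTF-8 mode, then slice.
  preg_match_all('/./us', $text, $matches);
  return implode('', array_slice($matches[0], $start, $length));
}

/**
 * Lowercase a UTF-8 string, using mbstring when it is there.
 */
function example_strtolower($text) {
  if (function_exists('mb_strtolower')) {
    return mb_strtolower($text, 'UTF-8');
  }
  // Fallback: only single-byte letters are converted.
  return strtolower($text);
}
```

Callers see the same character-based interface either way, so code written against the wrappers does not need to know whether mbstring is installed.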

I also added a call to setlocale() to do away with any remaining vagaries in non-mbstring string handling (which was locale dependent before). As a bonus, I was able to code a very simple routine for doing case conversion on the Latin-1 character set. This means the capabilities of non-mbstring string handling are back to the level of before we switched to UTF-8. Of course most non-English sites will still benefit from mbstring.
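
One way such a Latin-1 routine can work, as a sketch with a hypothetical function name: in UTF-8 the uppercase letters U+00C0 to U+00DE are encoded as 0xC3 followed by 0x80 to 0x9E, and each lowercase partner sits exactly 0x20 higher in the second byte.

```php
<?php
/**
 * Lowercase ASCII and Latin-1 letters in a UTF-8 string without mbstring.
 * Assumes $text is valid UTF-8.
 */
function example_latin1_strtolower($text) {
  // Pin the character type locale so strtolower() only touches A-Z.
  setlocale(LC_CTYPE, 'C');
  $text = strtolower($text);

  // 0xC3 0x97 is the multiplication sign (U+00D7), not a letter, so it
  // is left out of the range.
  return preg_replace_callback(
    '/\xC3[\x80-\x96\x98-\x9E]/',
    function ($match) {
      // Keep the lead byte, shift the second byte up by 0x20.
      return $match[0][0] . chr(ord($match[0][1]) + 0x20);
    },
    $text
  );
}
```

The byte pattern cannot match in the middle of another character, because 0xC3 only ever appears as a lead byte in valid UTF-8.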

Finally, I fixed some possible case conversion issues: in a few places, we compared the SQL function LOWER() with PHP's strtolower. Because the behaviour of either depends on the setup, we should never mix them. So now, you either put drupal_strtolower'd data in the database and drupal_strtolower your condition before you select it, or you use LOWER() on both the column and the value.
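
The two consistent options can be sketched as follows. The table and column names are invented for the example, the '%s' placeholder and {table} braces follow Drupal's db_query() convention, and the functions only build the query and its arguments rather than running it.

```php
<?php
// Option 1: case folding happens in PHP on both sides. The stored column
// already holds lowercased data, so the condition is lowercased the same
// way before it is compared.
function example_query_php_folded($word) {
  $key = function_exists('mb_strtolower')
    ? mb_strtolower($word, 'UTF-8')
    : strtolower($word);
  return array(
    "SELECT sid FROM {example_index} WHERE word = '%s'",
    array($key),
  );
}

// Option 2: case folding happens in the database on both sides. The
// original case stays in the column and PHP never changes the value.
function example_query_db_folded($name) {
  return array(
    "SELECT uid FROM {example_users} WHERE LOWER(name) = LOWER('%s')",
    array($name),
  );
}
```

What the sketch avoids is the mixed form, where one side is folded by PHP and the other by LOWER().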

Note: In theory I could alter routines like truncate_utf8() and mime_header_encode() to take more advantage of mbstring APIs... this might result in a (minor) speed up, but it would complicate the code further without additional features, and would mean more dependence on the PHP configuration. So I didn't do this.

Remaining questions

One issue to deal with is database encodings. If the database encoding is UTF-8, then the (VAR)CHAR column lengths are expressed in characters. In that case, strings should be chopped off based on character counts (i.e. using drupal_substr). However, if the database encoding is an old 8-bit one like Latin-1 (ISO-8859-1), then the column lengths are expressed in bytes. In that case, strings should be chopped off based on byte counts (i.e. using truncate_utf8).
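
The difference between the two kinds of truncation, as a sketch. The function names are placeholders for this example and are not the truncate_utf8 or drupal_substr implementations; both assume valid UTF-8.

```php
<?php
/**
 * Cut $text to at most $max_bytes bytes without splitting a character.
 */
function example_truncate_bytes($text, $max_bytes) {
  if (strlen($text) <= $max_bytes) {
    return $text;
  }
  $end = $max_bytes;
  // Continuation bytes have the form 10xxxxxx. If the first dropped byte
  // is one of them, the cut is inside a character: back up to its start.
  while ($end > 0 && (ord($text[$end]) & 0xC0) === 0x80) {
    $end--;
  }
  return substr($text, 0, $end);
}

/**
 * Cut $text to at most $max_chars characters.
 */
function example_truncate_chars($text, $max_chars) {
  preg_match('/^.{0,' . (int) $max_chars . '}/us', $text, $match);
  return $match[0];
}

// "aéb" is 3 characters but 4 bytes.
$text = "a\xC3\xA9b";
$by_bytes = example_truncate_bytes($text, 2);  // "a": é does not fit in 2 bytes
$by_chars = example_truncate_chars($text, 2);  // "aé": 2 characters, 3 bytes
```

For a column two units wide, the first result is right if the unit is bytes and the second if it is characters, which is why the database encoding has to be known.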

To make matters worse, the function call to get the database character set from MySQL was only recently added to PHP HEAD, so it is not usable for us. If we continue to use truncate_utf8, then we might chop off too much. If we switch to drupal_substr, then we might chop off too little.

Also, if the database character set is not UTF-8, then it means any Drupal function that relies on LOWER() will not work on non-ASCII data. Unfortunately we cannot rely on drupal_strtolower everywhere without wasting database space: often, we want to do a case-insensitive comparison, but still keep the original case of the data (e.g. a username).

Any ideas here?

Also, we got rid of ucfirst() in the page title a long time ago. Browser support for capitalization is still shady, so it might not be a bad idea to add it back. On the other hand, we could just capitalize the menu item titles directly.

Comment #2

Steven commented 15 July 2005 at 02:53

Component:	search.module	» base system

By the way, even if we don't end up using ucfirst in core, the function (and the other case conversions) are used a lot in contrib. In the interests of l10n, the idea of identically behaving, pluggable Unicode-aware replacements of PHP's borky functions is nice.

Comment #3

Poetro commented 15 July 2005 at 11:31

I would appreciate wrappers like drupal_strtolower; for module developers they would make string conversion uniform.
I think wrappers are needed for:

strtolower
strtoupper
ucfirst
ucwords
strpos
strrpos
stripos
substr
strlen
split
substr_count

And the wrapper in unicode.inc should decide which function to call.
For developers it would mean a search and replace of the previous functions with the drupal_* functions. This would let existing modules make full use of UTF-8 characters without their authors having to know how it works or what conversion the called function performs; they just get the result they wanted. When the mbstring functions are not available, there should be fallback functions that convert the text in the most acceptable way. In my language (Hungarian) the LOWER and UPPER SQL functions work fine, provided that UTF-8 is used natively in the database [SET NAMES 'utf8'; SET CHARACTER SET utf8], as mentioned in http://drupal.org/node/26549, which I think is advisable if you are using MySQL 4.1+. Even in this case, though, the wrappers should be used instead of PHP's built-in functions to avoid the problems those cause.
A good codebase would, I think, extend the API and help those who want to store data containing their native special characters in UTF-8.

Comment #4

Steven commented 15 July 2005 at 16:08

It doesn't make sense to replicate every PHP string function... a lot of them are not useful, and several wouldn't do anything differently.

For example, if you use strpos, you'll probably want to do a substr later. Or you just want to check if the result is FALSE or not. As this is done on known character boundaries with a known substring, the regular PHP functions (with byte offsets) can be used safely. Yes, I could write UTF-8 compatible versions which count characters, but they would be a lot slower and they would bloat the codebase unnecessarily.

And functions like split() and substr_count() behave exactly the same on bytes or characters. There is nothing to be gained from wrapping them.
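
A small demonstration of both points, with made-up sample strings. explode() stands in for split() here, since split() was removed in PHP 7.

```php
<?php
// "café = coffee": the é takes two bytes, so byte and character offsets
// differ from that point on.
$text = "caf\xC3\xA9 = coffee";

// strpos() returns a byte offset (5 here; the character offset is 4).
$pos = strpos($text, ' = ');

// Feeding that byte offset back into substr() is safe: the match can only
// start and end on character boundaries.
$key = substr($text, 0, $pos);
$value = substr($text, $pos + strlen(' = '));

// Counting and splitting on a known substring gives the same answer
// whether the string is viewed as bytes or as characters.
$count = substr_count("\xC3\xA9t\xC3\xA9", "\xC3\xA9");  // "été" has two é
$parts = explode(',', "\xC3\xA9,b");
```

This works because no UTF-8 character's byte sequence can appear inside another character's sequence, so a byte-level match is always a character-level match.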

A lot of the multibyte problems stem from the annoying properties of Japanese and Chinese multibyte encodings. UTF-8 does not suffer from them (which is why it is relatively easy to implement).

Comment #5

Steven commented 25 July 2005 at 20:37

Status:	Needs review	» Fixed

Committed to HEAD.

Version:		» 4.6.0
Status:	Fixed	» Closed (fixed)
