In Drupal, the substr() function is used in many places.

But it does not consider multi-byte strings.

In UTF-8, characters are encoded in 1 to 4 bytes. For example, U+0041 (the letter 'A') is encoded as 0x41, and U+AC00 (가, "ga" in Korean) is encoded as 0xEA 0xB0 0x80.

If $str holds the bytes 0x41 0x41 0xEA 0xB0 0x80 0x41 and you call substr($str, 0, 3), it returns a broken(!) string, 0x41 0x41 0xEA, which ends in the middle of a character. It should be trimmed to 0x41 0x41 or similar.
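A minimal sketch that reproduces the problem described above. The variable names are only for illustration, and the validity check relies on preg_match() with the /u modifier failing on byte strings that are not valid UTF-8.

```php
<?php
// "AA가A": two ASCII bytes, one 3-byte character (U+AC00), one ASCII byte.
$str = "AA\xEA\xB0\x80A";

// substr() counts bytes, so this cut lands inside the 3-byte character.
$cut = substr($str, 0, 3);

// Hex dump of the result: the trailing "ea" is an orphaned lead byte.
$hex = bin2hex($cut);

// preg_match() with /u returns 1 only if the subject is valid UTF-8.
$original_is_valid = (preg_match('//u', $str) === 1);
$cut_is_valid = (preg_match('//u', $cut) === 1);

echo $hex, "\n";
var_dump($original_is_valid, $cut_is_valid);
```

The original string passes the check and the truncated one does not, which is the breakage reported here.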

Comments

moshe weitzman’s picture

Version: » master
Priority: Critical » Normal

At the bottom of this PHP manual page, a Chinese user proposes a replacement for substr().

function dbyte_substr($str, $start, $len = '') {
    if ($len == '') {
        // No length given: return everything from $start onward.
        $outstr = substr($str, $start);
    } else {
        $outstr = substr($str, $start, $len);
        // If the last byte is a high byte (0x80-0xFF), treat it as the
        // first byte of a double-byte character and drop it.
        if (preg_match("/[\x80-\xFF]$/", $outstr)) {
            $outstr = substr($outstr, 0, -1);
        }
    }
    return $outstr;
}

I don't know how valid this solution is.

Anonymous’s picture

The suggested method works only for the EUC encoding.

This bug is not limited to Asian languages. Non-ASCII characters, such as the grave accent in French or the umlaut in German, also cause the problem.
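To illustrate the point for UTF-8, here is a small sketch that runs the proposed function (copied from the comment above, with layout tidied only) against UTF-8 input. The test strings are my own choice. Because the function drops exactly one trailing high byte, it can split a complete multi-byte character that happens to end at the cut.

```php
<?php
// The replacement proposed in the PHP manual comment, behavior unchanged.
function dbyte_substr($str, $start, $len = '') {
    if ($len == '') {
        $outstr = substr($str, $start);
    } else {
        $outstr = substr($str, $start, $len);
        // Drop one trailing byte if it is in the range 0x80-0xFF.
        if (preg_match("/[\x80-\xFF]$/", $outstr)) {
            $outstr = substr($outstr, 0, -1);
        }
    }
    return $outstr;
}

// Korean: "AA가A". Cutting at 5 bytes ends exactly after the complete 가,
// yet the function removes its last byte and leaves a broken sequence.
$korean = bin2hex(dbyte_substr("AA\xEA\xB0\x80A", 0, 5));

// French: "café" with é encoded as 0xC3 0xA9. The same thing happens.
$french = bin2hex(dbyte_substr("caf\xC3\xA9", 0, 5));

echo $korean, "\n", $french, "\n";
```

In both cases the result ends with an incomplete UTF-8 sequence, so the check is not a usable fix for UTF-8 text.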

cdpark’s picture

mb_strcut() is the solution. It is only supported in PHP 4 >= 4.0.6. Because it is part of an extension module, it may not be available.

http://www.php.net/manual/function.mb-strcut.php

We may need to backport (or reinvent) this routine.
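As a rough idea of what such a routine might look like, here is a sketch that uses mb_strcut() when the mbstring extension is loaded and otherwise backs up to a character boundary by hand. The function name utf8_truncate_bytes() is hypothetical, it is not the code eventually committed to Drupal, and it assumes valid UTF-8 input and a non-negative length.

```php
<?php
// Hypothetical helper: return at most $len bytes of $str, never ending
// in the middle of a UTF-8 character. Assumes $str is valid UTF-8.
function utf8_truncate_bytes($str, $len) {
    if (function_exists('mb_strcut')) {
        // mb_strcut() cuts by bytes but respects character boundaries.
        return mb_strcut($str, 0, $len, 'UTF-8');
    }
    if (strlen($str) <= $len) {
        return $str;
    }
    // Fallback: if the first excluded byte is a continuation byte
    // (10xxxxxx), the cut is inside a character, so move it backwards
    // until it sits in front of a lead byte or an ASCII byte.
    while ($len > 0 && (ord($str[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return substr($str, 0, $len);
}

$str = "AA\xEA\xB0\x80A"; // "AA가A"
$cut3 = bin2hex(utf8_truncate_bytes($str, 3));
$cut5 = bin2hex(utf8_truncate_bytes($str, 5));

echo $cut3, "\n", $cut5, "\n";
```

With a limit of 3 bytes the partial character is dropped entirely, and with a limit of 5 the complete character is kept.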

Bug #2230 is also related.

cdpark’s picture

Title: substr breaks utf-8 strings » reinvented(?) wheels.

Use this function in place of substr($str, 0, $length). It may solve the problem.

cdpark’s picture

Title: reinvented(?) wheels. » substr breaks utf-8 strings

al’s picture

The proper solution to this problem is to compile PHP with multibyte string support (--enable-mbstring) [see http://www.php.net/manual/en/ref.mbstring.php] and set mbstring.func_overload to 7 (overload all functions) in php.ini and/or .htaccess.

--enable-mbstring is supposed to be enabled by default on PHP 4.3+, but the comment at the bottom of that page seems to imply that it actually isn't.

moshe weitzman’s picture

Title: substr breaks utf-8 strings » Document fix for 'substr breaks utf-8 strings'
Version: master »
Category: bug » task

Al suggests that the fix for this requires no code change in Drupal. Changing the title to reflect that this is a documentation issue.

killes@www.drop.org’s picture

Fixed by Steven.

Anonymous’s picture