Document fix for 'substr breaks utf-8 strings'
Anonymous (not verified) - July 19, 2003 - 05:18
| Project: | Drupal |
| Component: | base system |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
Description
In Drupal, the substr() function is used in many places.
But it works on bytes and does not take multi-byte strings into account.
In UTF-8, each character is encoded as a sequence of 1 to 4 bytes (1 to 3 bytes for characters in the Basic Multilingual Plane). For example, 'U+0041' (the letter 'A') is encoded as "0x41", and 'U+AC00' (가 (=ga) in Korean) is encoded as "0xEA 0xB0 0x80".
If you call "substr('0x41 0x41 0xEA 0xB0 0x80 0x41', 0, 3)", it returns a broken(!) string, "0x41 0x41 0xEA", which ends in the middle of a character. It should be trimmed to "0x41 0x41" or something similar.
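
To make the report easy to reproduce, here is a minimal sketch of the same call with the bytes written out as a PHP string. The variable names are only for illustration, and the validity check assumes a PCRE build with UTF-8 support, which is what preg_match() needs for the /u modifier.

```php
<?php
// "AA" + U+AC00 + "A" as raw bytes: 0x41 0x41 0xEA 0xB0 0x80 0x41.
$str = "AA\xEA\xB0\x80A";

// substr() counts bytes, so a length of 3 keeps only the first byte
// of the 3-byte character.
$cut = substr($str, 0, 3);
echo bin2hex($cut), "\n"; // 4141ea

// preg_match() with the /u modifier fails (returns false) when the
// subject is not valid UTF-8, so it can serve as a rough validity check.
$cut_is_valid = preg_match('//u', $cut) === 1;
$trimmed_is_valid = preg_match('//u', substr($str, 0, 2)) === 1;

var_dump($cut_is_valid);     // bool(false)
var_dump($trimmed_is_valid); // bool(true)
```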

#1
At the bottom of this PHP manual page, a Chinese user proposes a replacement for substr():
function dbyte_substr($str, $start, $len = '') {
  if ($len == '') {
    $outstr = substr($str, $start);
  }
  else {
    $outstr = substr($str, $start, $len);
    // Check whether the end bound is the first byte of a double-byte character
    if (preg_match("/[\x80-\xFF]$/", $outstr)) {
      $outstr = substr($outstr, 0, -1);
    }
  }
  return $outstr;
}
I don't know how valid this solution is.
#2
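The suggested method works only for the EUC encoding. It strips at most one trailing byte, which is not enough for UTF-8, where a single character can span three bytes.
This bug is not limited to Asian languages. Non-ASCII characters, such as letters with a grave accent in French or an umlaut in German, also cause the problem.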
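
As a rough illustration of why the function from #1 does not help with UTF-8, the sketch below repeats that function in compact form and runs it on the string from the description. The $four and $five variables are only for illustration.

```php
<?php
// The replacement proposed in #1, repeated here so the example runs on its own.
function dbyte_substr($str, $start, $len = '') {
  if ($len == '') {
    return substr($str, $start);
  }
  $outstr = substr($str, $start, $len);
  // Drops exactly one byte whenever the last byte has the high bit set.
  if (preg_match("/[\x80-\xFF]$/", $outstr)) {
    $outstr = substr($outstr, 0, -1);
  }
  return $outstr;
}

// "AA" + U+AC00 + "A" as raw bytes: 0x41 0x41 0xEA 0xB0 0x80 0x41.
$str = "AA\xEA\xB0\x80A";

// Length 4 cuts after the second byte of the character; one byte is
// removed, but the lead byte 0xEA is left dangling.
$four = dbyte_substr($str, 0, 4);
echo bin2hex($four), "\n"; // 4141ea

// Length 5 ends exactly on a character boundary, yet the last byte is
// still removed, which breaks a string that was fine.
$five = dbyte_substr($str, 0, 5);
echo bin2hex($five), "\n"; // 4141eab0
```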
#3
mb_strcut() is the solution. It is only available in PHP 4 >= 4.0.6. Because it belongs to an extension module (mbstring), it may not be available on every installation.
http://www.php.net/manual/function.mb-strcut.php
We may need to backport (or reinvent) this routine.
Bug #2230 is also related.
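
If the routine had to be reinvented without mbstring, one possible approach is to move the cut point back while it lands on a UTF-8 continuation byte. This is only a sketch: utf8_truncate_bytes() is a hypothetical name, not an existing Drupal or PHP function. It assumes the input is valid UTF-8 and only handles truncation from the start of the string.

```php
<?php
// Hypothetical helper: return at most $len bytes from the start of $str
// without splitting a UTF-8 character. Assumes $str is valid UTF-8.
function utf8_truncate_bytes($str, $len) {
  if ($len <= 0) {
    return '';
  }
  if (strlen($str) <= $len) {
    return $str;
  }
  // $str[$len] is the first byte that would be dropped. If it is a
  // continuation byte (bit pattern 10xxxxxx), the cut falls inside a
  // character, so step back until it sits on a lead or ASCII byte.
  while ($len > 0 && (ord($str[$len]) & 0xC0) === 0x80) {
    $len--;
  }
  return substr($str, 0, $len);
}

// "AA" + U+AC00 + "A" as raw bytes: 0x41 0x41 0xEA 0xB0 0x80 0x41.
$str = "AA\xEA\xB0\x80A";

echo bin2hex(utf8_truncate_bytes($str, 3)), "\n"; // 4141
echo bin2hex(utf8_truncate_bytes($str, 4)), "\n"; // 4141
echo bin2hex(utf8_truncate_bytes($str, 5)), "\n"; // 4141eab080
```

The result is never longer than the requested number of bytes, so it can still be used where a byte limit matters, such as a fixed-width database column.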
#4
Instead of substr($str, 0, $length), use this function. It may solve the problem.
#5
#6
The proper solution to this problem is to compile PHP with multibyte string support (--enable-mbstring) [see http://www.php.net/manual/en/ref.mbstring.php] and to set mbstring.func_overload to 7 (overload all supported functions) in php.ini and/or .htaccess.
--enable-mbstring is supposed to be enabled by default in PHP 4.3+, but the comment at the bottom of that page seems to imply that it actually isn't.
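
Since it is unclear whether a given build has mbstring enabled, a quick check along these lines may help. It is a sketch using only extension_loaded() and ini_get(), and the variable names are for illustration. On PHP 8.0 and later the mbstring.func_overload directive no longer exists, so ini_get() returns false there.

```php
<?php
// extension_loaded() reports whether this PHP build has mbstring available.
$has_mbstring = extension_loaded('mbstring');

// ini_get() returns false when the directive is unknown, for example when
// mbstring is missing or on PHP 8.0+, where func_overload was removed.
$overload = ini_get('mbstring.func_overload');

echo 'mbstring loaded: ', $has_mbstring ? 'yes' : 'no', "\n";
echo 'mbstring.func_overload: ', $overload === false ? 'not available' : $overload, "\n";

// The value is a bit mask: 1 = mail(), 2 = string functions, 4 = regex
// functions, so 7 turns on all three. Bit 2 is the one that affects substr().
$strings_overloaded = $overload !== false && ((int) $overload & 2) === 2;
echo 'substr() overloaded: ', $strings_overloaded ? 'yes' : 'no', "\n";
```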
#7
Al suggests that the fix for this requires no code change in Drupal. Changing the title to reflect that this is a documentation issue.
#8
Fixed by Steven.
#9