Document fix for 'substr breaks utf-8 strings'

Anonymous (not verified) - July 19, 2003 - 05:18
Project: Drupal
Component: base system
Category: task
Priority: normal
Assigned: Unassigned
Status: closed
Description

In Drupal, the substr() function is used in many places.

But it does not account for multi-byte strings.

In UTF-8, each character is encoded as 1 to 4 bytes. For example, U+0041 (the letter 'A') is encoded as "0x41", and U+AC00 (가, "ga" in Korean) is encoded as "0xEA 0xB0 0x80".

If you call "substr('0x41 0x41 0xEA 0xB0 0x80 0x41', 0, 3)", it returns a broken(!) string, "0x41 0x41 0xEA", which ends in the middle of a character. It should be trimmed to "0x41 0x41" or something similar.
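A minimal sketch of the problem described above, assuming substr() is not overloaded by mbstring. The variable names are only illustrative. The string is written with explicit byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
// "AA" + U+AC00 (3 bytes in UTF-8) + "A", written as raw bytes.
$str = "AA\xEA\xB0\x80A";

// substr() counts bytes, so this cut lands inside the 3-byte character.
$cut = substr($str, 0, 3);

echo bin2hex($cut), "\n";                  // 4141ea

// preg_match() with the /u modifier fails on subjects that are not valid UTF-8.
var_dump(preg_match('//u', $str) === 1);   // bool(true)
var_dump(preg_match('//u', $cut) === 1);   // bool(false)
```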

#1

moshe weitzman - July 19, 2003 - 08:43
Version: <none> » HEAD
Priority: critical » normal

At the bottom of this PHP manual page, a Chinese user proposes a replacement for substr().

function dbyte_substr($str, $start, $len = '') {
    if ($len == '') {
        $outstr = substr($str, $start);
    } else {
        $outstr = substr($str, $start, $len);
        // Check whether the last byte is the first byte of a double-byte character
        if (preg_match("/[\x80-\xFF]$/", $outstr)) {
            $outstr = substr($outstr, 0, -1);
        }
    }
    return $outstr;
}

I don't know how valid this solution is.

#2

Anonymous - July 19, 2003 - 09:07

The suggested method works only for the EUC encoding.

This bug is not limited to Asian languages. Non-ASCII characters, such as the grave accent in French or the umlaut in German, also cause the problem.
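To illustrate why the method from #1 does not carry over to UTF-8, here is a small sketch that repeats that function and feeds it UTF-8 input. It assumes substr() is not overloaded by mbstring. The function strips at most one trailing byte, while a UTF-8 character can leave two dangling bytes, and it also strips the last byte of a character that was complete.

```php
<?php
// The function proposed in #1, repeated so that this example is self-contained.
function dbyte_substr($str, $start, $len = '') {
    if ($len == '') {
        $outstr = substr($str, $start);
    } else {
        $outstr = substr($str, $start, $len);
        if (preg_match("/[\x80-\xFF]$/", $outstr)) {
            $outstr = substr($outstr, 0, -1);
        }
    }
    return $outstr;
}

// Cut after 4 bytes: two bytes of U+AC00 remain and only one is removed.
echo bin2hex(dbyte_substr("AA\xEA\xB0\x80A", 0, 4)), "\n"; // 4141ea

// Cut exactly at a character boundary: the complete character loses its last byte.
echo bin2hex(dbyte_substr("\xEA\xB0\x80", 0, 3)), "\n";     // eab0
```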

#3

cdpark - July 19, 2003 - 22:43

mb_strcut() is the solution. It is only supported in PHP 4 >= 4.0.6. Because it is part of an extension module (mbstring), it may not be available.

http://www.php.net/manual/function.mb-strcut.php

We may need to backport (or reinvent) this routine.

Bug #2230 is also related.
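One possible shape for such a routine, as a rough sketch only: use mb_strcut() when the extension is present, and otherwise back the cut point up until it no longer falls on a UTF-8 continuation byte (10xxxxxx). The names utf8_safe_cut() and utf8_cut_fallback() are hypothetical and are not taken from Drupal or from any attachment in this thread. The sketch assumes valid UTF-8 input and that mbstring.func_overload is not in effect.

```php
<?php
// Hypothetical fallback: cut to at most $len bytes without splitting a character.
function utf8_cut_fallback($str, $len) {
    if ($len <= 0) {
        return '';
    }
    if (strlen($str) <= $len) {
        return $str;
    }
    // If the first byte after the cut is a continuation byte (10xxxxxx),
    // the cut is inside a character, so move it back one byte and check again.
    while ($len > 0 && (ord($str[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return substr($str, 0, $len);
}

// Hypothetical wrapper: prefer mb_strcut() when mbstring is available.
function utf8_safe_cut($str, $len) {
    if ($len > 0 && function_exists('mb_strcut')) {
        return mb_strcut($str, 0, $len, 'UTF-8');
    }
    return utf8_cut_fallback($str, $len);
}

echo bin2hex(utf8_cut_fallback("AA\xEA\xB0\x80A", 3)), "\n"; // 4141
echo bin2hex(utf8_cut_fallback("AA\xEA\xB0\x80A", 5)), "\n"; // 4141eab080
```

The result may be shorter than the requested length, which matches the "trimmed to 0x41 0x41" behaviour asked for in the original description.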

#4

cdpark - July 20, 2003 - 00:08
Title: substr breaks utf-8 strings » reinvented(?) wheels.

Instead of substr($str, 0, $length), use this function. It may solve the problem.

#5

cdpark - July 20, 2003 - 00:09
Title: reinvented(?) wheels. » substr breaks utf-8 strings

#6

al - July 21, 2003 - 08:13

The proper solution to this problem is to compile PHP with multibyte string support (--enable-mbstring) [see http://www.php.net/manual/en/ref.mbstring.php] and set mbstring.func_overload to 7 (overload all functions) in php.ini and/or .htaccess.

--enable-mbstring is supposed to be enabled by default on PHP 4.3+, but the comment at the bottom of that page seems to imply that it actually isn't.
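Since it is unclear whether mbstring is actually enabled on a given install, a quick runtime check may help. This is only a sketch, and mbstring_overload_status() is a hypothetical name. It relies on bit 2 of mbstring.func_overload being the one that covers the string functions such as substr(). Note that the setting was deprecated in PHP 7.2 and removed in PHP 8.0, where ini_get() returns false for it.

```php
<?php
// Hypothetical helper: report whether substr() and friends are overloaded by mbstring.
function mbstring_overload_status() {
    if (!extension_loaded('mbstring')) {
        return 'mbstring not loaded';
    }
    $value = ini_get('mbstring.func_overload');
    if ($value === false) {
        // The directive does not exist in this PHP version.
        return 'mbstring loaded, func_overload not available';
    }
    // Bit 1 = mail(), bit 2 = string functions, bit 4 = regex functions; 7 = all.
    return ((int) $value & 2)
        ? 'string functions overloaded'
        : 'string functions not overloaded';
}

echo mbstring_overload_status(), "\n";
```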

#7

moshe weitzman - October 27, 2003 - 13:51
Title: substr breaks utf-8 strings » Document fix for 'substr breaks utf-8 strings'
Version: HEAD » <none>
Category: bug report » task

Al suggests that the fix for this requires no code change in Drupal. Changing the title to reflect that this is a documentation issue.

#8

killes@www.drop.org - July 23, 2004 - 17:32

Fixed by Steven.

#9

Anonymous - August 6, 2004 - 19:20
 
 
