In file "includes/unicode.inc" there is a function "drupal_convert_to_utf8" which has the following code:
if (function_exists('iconv')) {
  $out = @iconv($encoding, 'utf-8', $data);
}
According to http://uk.php.net/function.iconv this returns the converted string, or FALSE on failure. The string is cut at the first illegal character.
I think it should be changed to either:
if (function_exists('iconv')) {
  $out = @iconv($encoding, 'utf-8//TRANSLIT', $data);
}
so that "when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters" instead of the string just being cut,
or
if (function_exists('iconv')) {
  $out = @iconv($encoding, 'utf-8//IGNORE', $data);
}
to ignore that single illegal character, but continue to convert whatever is past that point.
Either would do for me, but I guess it is better to ask those in the know which would be better.
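To make it easier to compare the three variants, here is a minimal sketch that runs the same input through each target string and prints the result as hex. The helper name compare_iconv_targets() and the sample input are my own assumptions for illustration and are not part of Drupal. What iconv() actually returns for illegal input depends on the PHP version and on the underlying iconv implementation (glibc, libiconv, and so on), so no expected output is shown; run it on the system in question.

```php
<?php
/**
 * Hypothetical helper (not part of Drupal): runs iconv() with each of the
 * three target strings discussed above and collects the results.
 *
 * @return array
 *   Target string => converted string, or FALSE if iconv() failed.
 */
function compare_iconv_targets($data, $encoding) {
  $results = array();
  foreach (array('utf-8', 'utf-8//TRANSLIT', 'utf-8//IGNORE') as $target) {
    // The @ suppresses the notice iconv() raises on illegal input, exactly
    // as drupal_convert_to_utf8() does.
    $results[$target] = @iconv($encoding, $target, $data);
  }
  return $results;
}

// Sample input (an assumption): the byte 0xFF is never valid in UTF-8.
$sample = "before\xFFafter";
foreach (compare_iconv_targets($sample, 'utf-8') as $target => $out) {
  echo $target, ' => ', ($out === FALSE ? 'FALSE' : bin2hex($out)), "\n";
}
```

If the plain 'utf-8' row comes back truncated or FALSE while the other rows keep the text after the bad byte, that is the data loss described above.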
(I have no idea what component to file this against. I have chosen "Base System")
| Comment | File | Size | Author |
|---|---|---|---|
| #1 | unicode.patch | 500 bytes | naheemsays |
Comments
Comment #1
naheemsays commented
Attached is a patch to change it to transliterate illegal characters.
Comment #2
naheemsays commented
Testing the above patch: iconv now behaves like mbstring, in that it no longer stops processing at the point of the illegal character.
Without the patch, there can be data loss if an illegal character is encountered, as no characters after it will be processed. Should it be marked critical? (The same patch would also need to be applied to previous versions of Drupal.)
Comment #3
robin monks commented
Is there a test case available with some text that shows this problem? It would really help those trying to review the patch.
Robin
Comment #4
naheemsays commented
I think the original issue arose when converting characters like en and em dashes, which were added on top of ISO 8859-1 as multi-byte sequences whose second byte on its own is not a valid character.
(My original test case was a backup database of around 100MB that I was converting from phpBB to Drupal...)
Comment #5
pillarsdotnet commented
A similar patch was marked WONTFIX in #1018840: Drupal provides no error-free way to sanitize text from untrusted sources.