In file "includes/unicode.inc" there is a function "drupal_convert_to_utf8" which has the following code:

  if (function_exists('iconv')) {
    $out = @iconv($encoding, 'utf-8', $data);
  }

According to http://uk.php.net/function.iconv this returns the converted string, or FALSE on failure. The string is cut at the first illegal character.

I think it should be changed to either:

  if (function_exists('iconv')) {
    $out = @iconv($encoding, 'utf-8//TRANSLIT', $data);
  }

So that "when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters" instead of just being cut.

or

  if (function_exists('iconv')) {
    $out = @iconv($encoding, 'utf-8//IGNORE', $data);
  }

to ignore that single illegal character, but continue to convert whatever is past that point.
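The three target specifications discussed above can be compared side by side with a small script. This is only a sketch: `compare_iconv_targets` and the sample string are made up for illustration and are not part of Drupal, and what each variant returns depends on the PHP version and the underlying iconv implementation, so no expected output is given here.

```php
<?php
// Hypothetical helper, not part of Drupal: run iconv() once per target spec.
function compare_iconv_targets($data, $encoding) {
  $results = array();
  foreach (array('utf-8', 'utf-8//TRANSLIT', 'utf-8//IGNORE') as $target) {
    // @ suppresses the notice iconv() raises on an illegal sequence,
    // mirroring the call in drupal_convert_to_utf8.
    $results[$target] = @iconv($encoding, $target, $data);
  }
  return $results;
}

// Sample input (an assumption): 0x96 on its own is not valid UTF-8.
$sample = "before \x96 after";
foreach (compare_iconv_targets($sample, 'utf-8') as $target => $out) {
  echo $target, ' => ', var_export($out, TRUE), "\n";
}
```

Running this on the server in question should show whether the text after the illegal byte survives for each variant, or whether iconv returns FALSE.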

Either would do for me, but I guess it is better to ask those in the know which would be better.

(I have no idea what component to file this against. I have chosen "Base System")

Comment  File           Size       Author
#1       unicode.patch  500 bytes  naheemsays

Comments

naheemsays

Status: Active » Needs review
Status  FileSize
new     500 bytes

Attached is a patch to change it to transliterate illegal characters.

naheemsays

Title: Function drupal_convert_to_utf8 cuts string at first illegal character. » Function drupal_convert_to_utf8 cuts string at first illegal character causing dataloss

I have tested the above patch: iconv now behaves like mbstring, in that it no longer stops processing at the first illegal character.

Without the patch, there can be data loss if an illegal character is encountered, as no characters after it will be processed. Should it be marked critical? (The same patch would also need to be applied to previous versions of Drupal.)

robin monks

Is there a test case available, with some text that shows this problem? It would really help those trying to review the patch.

Robin

naheemsays

I think the original issue arose when converting characters such as en and em dashes, which had been added to ISO 8859-1 text as multi-byte sequences, where the second byte on its own was not a valid character.

(My original test case was a backup database of around 100MB that I was converting from phpBB to Drupal...)
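A much smaller reproduction than a full database dump might look like the sketch below. The sample string with a stray 0x96 byte is an assumption about the kind of data involved, and `tail_survives` is a hypothetical helper, not a Drupal function. The script only reports whether the text after the illegal byte is still present, since the exact return values vary between PHP versions and iconv builds.

```php
<?php
// Hypothetical test helper: does the text after an illegal byte survive?
function tail_survives($converted, $tail) {
  return is_string($converted) && substr($converted, -strlen($tail)) === $tail;
}

// Assumption: a stray 0x96 byte inside data declared as UTF-8.
$sample = "before \x96 after";
$tail = ' after';

if (function_exists('iconv')) {
  // Same call shape as the unpatched drupal_convert_to_utf8.
  $out = @iconv('utf-8', 'utf-8', $sample);
  echo 'iconv:    ', tail_survives($out, $tail) ? 'tail kept' : 'tail lost', "\n";
}
if (function_exists('mb_convert_encoding')) {
  // mbstring, for comparison with the behaviour described in the thread.
  $out = @mb_convert_encoding($sample, 'utf-8', 'utf-8');
  echo 'mbstring: ', tail_survives($out, $tail) ? 'tail kept' : 'tail lost', "\n";
}
```

Changing the target in the iconv call to 'utf-8//TRANSLIT' or 'utf-8//IGNORE' would let a reviewer check the patched behaviour in the same way.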

pillarsdotnet

Status: Needs review » Closed (won't fix)