Looking at the CVS logs, this bug seems to be affecting 4.4.0 and HEAD too.

The problem is that the substr() call used to extract a subject from the comment string is not UTF-8 aware: it counts bytes, not characters, so it can split a string in the middle of a UTF-8 multibyte sequence. This produces broken output and strange display in some browsers (Mozilla 1.5 just hides the bogus title from me). Since PHP does not provide a generic solution for handling UTF-8 strings (except mbstring, which Drupal should not specify as a requirement IMHO), I guess we need to add some UTF-8 substring functionality to common.inc (so other modules can also use it).
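
To make the failure concrete, here is a minimal sketch of how a byte-based cut can break a string. The sample string and the helper name is_valid_utf8() are made up for illustration and are not taken from comment.module; the validity check relies on preg_match() returning false when the /u modifier is given malformed UTF-8.

```php
<?php
// Hypothetical helper (not part of Drupal): true if $bytes is well-formed UTF-8.
// An empty pattern with the /u modifier matches any valid UTF-8 subject and
// makes preg_match() return false on a malformed one.
function is_valid_utf8($bytes) {
  return preg_match('//u', $bytes) === 1;
}

// "Grüße" as UTF-8: "ü" and "ß" take two bytes each (7 bytes, 5 characters).
$subject = "Gr\xC3\xBC\xC3\x9Fe";

// substr() counts bytes, so a cut after 3 bytes keeps only half of "ü".
$broken = substr($subject, 0, 3);
// A cut after 4 bytes happens to land on a character boundary.
$intact = substr($subject, 0, 4);

var_dump(is_valid_utf8($broken));  // bool(false)
var_dump(is_valid_utf8($intact));  // bool(true)
```

Whether the cut lands on a boundary depends entirely on the input, which is why the bug only shows up with some subjects.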

I am willing to work on providing a patch, if this approach is acceptable.

Here is some explanation on how utf-8 multibyte sequences can be detected: http://www.frech.ch/man/man7/utf8.7.html
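
As a rough sketch of that detection rule: the high bits of each byte say whether it is plain ASCII, the first byte of a multibyte sequence, or a continuation byte. The function name utf8_byte_kind() is hypothetical, and the sketch only classifies bytes; it does not validate that a lead byte is followed by the right number of continuation bytes.

```php
<?php
// Hypothetical helper: classify a single byte of a UTF-8 string by its high bits.
function utf8_byte_kind($byte) {
  $value = ord($byte);
  if ($value < 0x80) {
    return 'ascii';         // 0xxxxxxx: a complete one-byte character
  }
  if ($value < 0xC0) {
    return 'continuation';  // 10xxxxxx: second or later byte of a sequence
  }
  return 'lead';            // 11xxxxxx: first byte of a multibyte sequence
}

// "a" followed by the euro sign (U+20AC), which is three bytes in UTF-8.
$kinds = array();
foreach (str_split("a\xE2\x82\xAC") as $byte) {
  $kinds[] = utf8_byte_kind($byte);
}
echo implode(' ', $kinds), "\n";  // ascii lead continuation continuation
```

A string may only be cut in front of a byte that is not a continuation byte, so a safe truncation can back up from the requested position until it finds one.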

Comment | File | Size | Author
#2 | truncate_utf8.patch | 1.12 KB | killes@www.drop.org

Comments

Steven commented:

I whipped up a UTF-8-safe truncator which should work for the problem areas.

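/**
 * UTF-8-safe string truncation.
 *
 * Truncates $string to at most $len bytes. If the cut position falls in the
 * middle of a UTF-8 sequence, scans backwards to the beginning of that
 * sequence and cuts there instead.
 */
function truncate_utf8($string, $len) {
  $slen = strlen($string);
  if ($slen <= $len) {
    return $string;
  }
  // An ASCII byte (< 0x80) or a lead byte (>= 0xC0) at the cut position starts
  // a new character, so cutting right before it is safe.
  $byte = ord($string[$len]);
  if ($byte < 0x80 || $byte >= 0xC0) {
    return substr($string, 0, $len);
  }
  // Otherwise step back to the lead byte of the split sequence and cut before
  // it. The $len > 0 check keeps malformed input from running off the start.
  while ($len > 0 && ord($string[--$len]) < 0xC0) {
  }
  return substr($string, 0, $len);
}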

Works fine for a couple of test strings here, don't have time to whip up a full patch.

killes@www.drop.org commented:

Assigned: Unassigned » killes@www.drop.org
Status | File | Size
new | truncate_utf8.patch | 1.12 KB

Here is a patch.

Steven commented:

Committed a modified version of this function to CVS: the problem applied to many more places than just comment.module.

Anonymous commented: