Although neither http://api.drupal.org/api/function/_xmlrpc/7 nor http://api.drupal.org/api/function/drupal_http_request/7 defines a character set, all data in Drupal is supposed to be UTF-8.

http://php.net/manual/en/function.htmlspecialchars.php, however, defaults to ISO-8859-1

Comments

sun’s picture

StatusFileSize
new829 bytes

This does not seem to fix my Unicode encoding problem, but fixing it would still be a good idea.

Furthermore, that's really one of most lovely comments I've ever read. Which blogging clients? And which versions? ...

#31301: Apostrophe returns '#039' on client using xmlrpc introduced the change from check_plain() to htmlspecialchars() in 2005, and, Steven was highly opposed to that.

5 years later, those blogging clients most probably no longer exist or hopefully have been fixed in the meantime.

dries’s picture

Status: Needs review » Reviewed & tested by the community

This looks like the right thing to do do me; we shouldn't work around broken clients because it breaks valid clients.

UTF-8 is the new ISO-8859-1, and has been for a while (at least in Drupal).

sun’s picture

Issue tags: +Needs backport to D6
dries’s picture

Oh my ...

We've been discussing this some more and found that not all valid UTF-8 characters are valid XML characters. See http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

Given that we're returning XML, additional filtering is required. This probably affects all of our XML-generating code; including format_rss_item().

chx’s picture

That's not true. All you have found is that not all valid UTF-8 byte streams are valid XML characters. Guess what? They are not Unicode characters either.

chx’s picture

Status: Reviewed & tested by the community » Needs work

OK so now I understand. What you can not do is to take an arbitrary bytestream, encode it with the algorithm called UTF-8 and expect the results to be valid characters by the Unicode standard. UTF-8 is just that, an encoding of a bunch bytes into another bunch of bytes. What we need here, I think is preg_replace('/.+/u', '\0', htmlspecialchars($xmlrpc_value->data, ENT_COMPAT, 'UTF-8')).

Also, don't we need tests?

Edit: or we actually need to take #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] and turn it into a preg to be sure.

sun’s picture

Shouldn't check_plain() do this also for XHTML strings?

damien tournoud’s picture

Status: Needs work » Reviewed & tested by the community

Everyone is tired? check_plain() is just perfectly what we need here.

If the input stream is not valid UTF-8, check_plain() will kill it. Which is what we want anyway.

sun’s picture

Status: Reviewed & tested by the community » Needs review

Hrm. The current function body:

function check_plain($text) {
  static $php525;

  if (!isset($php525)) {
    $php525 = version_compare(PHP_VERSION, '5.2.5', '>=');
  }
  if ($php525) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
  }
  return (preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : '';
}

If PHP >= 5.2.5, we don't perform that filtering, and even for lower versions, check_plain() actually checks the beginning of the string only, not caring for invalid characters elsewhere. Or at the very least, if check_plain() is supposed to filter out invalid characters, then it is not doing what we expect it should be doing on my local test site on PHP 5.2.6/Win32.

damien tournoud’s picture

If PHP >= 5.2.5, we don't perform that filtering,...

Because PHP does that for us since 5.2.5.

and even for lower versions, check_plain() actually checks the beginning of the string only, not caring for invalid characters elsewhere. Or at the very least, if check_plain() is supposed to filter out invalid characters, then it is not doing what we expect it should be doing on my local test site on PHP 5.2.6/Win32.

preg_match() will check the validity of the whole input string before doing the matching.

damien tournoud’s picture

So indeed, we are letting characters pass-thru that might not be valid in XML, ie. which are not in:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Most characters in [#x1-#x1F] are not valid, and we should strip them.

Question: should we also remove those from HTML output? It might be a costly operation.

sun’s picture

I've tried to look up what the HTML5 spec defines, but it does not seem to define precise character ranges, just Text must consist of Unicode characters, and must not contain U+0000 characters, noncharacters, or control characters (except spaces).

Technically, I think that check_plain() is responsible for removing invalid characters, also for HTML, even if it may be expensive.

Of course, that would mean that such characters would be stored, but never ever be displayed again. Thus, removing invalid characters prior to validation and prior to storage would...

sun’s picture

Issue tags: +Needs tests
StatusFileSize
new2.58 KB

I've manually tested drupal_validate_utf8() with various invalid byte strings now. preg_match() only "silently fails" in the documented edge-case of characters #xC0-#xFF, which actually are valid utf8 byte sequences, if I get this right.

Also tried to come up with a unit test, but that's slightly over my head.

Status: Needs review » Needs work

The last submitted patch, drupal.xmlrpc-charset.13.patch, failed testing.

damien tournoud’s picture

+    // @todo chr() #fail
+    $invalid = array_merge(
+      $invalid,
+      array_map('chr', range(hexdec('D800'), hexdec('DFFF')))
+    );

You have to encode those in UTF-8. chr($code) is the UTF-8 representation of $code only for the lowest 7 bits.

damien tournoud’s picture

It seems that the XML extension has a nice utility function to do that:

http://php.net/manual/en/function.utf8-encode.php

smk-ka’s picture

I've tried to look up what the HTML5 spec defines, but it does not seem to define precise character ranges, just Text must consist of Unicode characters, and must not contain U+0000 characters, noncharacters, or control characters (except spaces).

I'd say the specs are actually pretty precise: any noncharacter or control characters need to be encoded as numeric entities (see 8.1.4 Character references). So we're not talking about removing any characters, but properly encoding them.
Scratch that, it clearly says they're not even allowed as entities.

smk-ka’s picture

Probably related: folks over at Views Bonus Pack (XML export) are struggling with a similar issue: #603420: XML export mangles <, >, and &

sun’s picture

Assigned: Unassigned » sun

Assigning to myself for now to not lose track of this bug.

sun’s picture

Status: Needs work » Needs review
Issue tags: -Needs backport to D6, -Needs tests

#13: drupal.xmlrpc-charset.13.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs backport to D6, +Needs tests

The last submitted patch, drupal.xmlrpc-charset.13.patch, failed testing.

sun’s picture

Version: 7.x-dev » 8.x-dev
Issue tags: +Needs backport to D7

utf8_encode() didn't really work for me to generate Unicode characters. Searching for alternatives, the simplest and most sane code I found is the unichr() implementation from a php.net comment on chr():

function unichr($u) {
  return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
sun’s picture

Assigned: sun » Unassigned

I'm not using XML-RPC anymore. :)

greggles’s picture

Version: 8.0.x-dev » 7.x-dev
Issue summary: View changes

After #1285726-48: Remove XML-RPC moving this to Drupal 7.

But it could also be appropriate to move it to the contrib xmlrpc module.

Status: Needs work » Closed (outdated)

Automatically closed because Drupal 7 security and bugfix support has ended as of 5 January 2025. If the issue verifiably applies to later versions, please reopen with details and update the version.