XML-RPC request (string) values are not safe for UTF-8 [#882298]

Comment	File	Size	Author
#13	drupal.xmlrpc-charset.13.patch	2.58 KB	sun
#1	drupal.xmlrpc-charset.1.patch	829 bytes	sun
	drupal.xmlrpc-charset.0.patch	1.01 KB	sun

Comment #1

sun commented 13 August 2010 at 15:22

Status	File	Size
new	drupal.xmlrpc-charset.1.patch	829 bytes

This does not seem to fix my Unicode encoding problem, but fixing it would still be a good idea.

Furthermore, that's really one of the most lovely comments I've ever read. Which blogging clients? And which versions? ...

#31301: Apostrophe returns '#039' on client using xmlrpc introduced the change from check_plain() to htmlspecialchars() in 2005, and Steven was highly opposed to that.

5 years later, those blogging clients most probably no longer exist or hopefully have been fixed in the meantime.
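
For context, the practical difference between the two calls is the handling of apostrophes. A minimal sketch, assuming check_plain() boils down to htmlspecialchars() with ENT_QUOTES and UTF-8; the sample string is made up:

```php
<?php
// Hypothetical sample value, as it might arrive from a blogging client.
$value = "Sun's <b>post</b>";

// ENT_COMPAT escapes double quotes only, so the apostrophe is left alone.
$compat = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES also escapes the apostrophe as &#039;.
$quotes = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

echo $compat, "\n", $quotes, "\n";
```

A client that does not decode entities would display the second form with a literal '#039', which matches the symptom named in #31301.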

Comment #2

dries commented 13 August 2010 at 18:42

Status:	Needs review	» Reviewed & tested by the community

This looks like the right thing to do to me; we shouldn't work around broken clients because doing so breaks valid clients.

UTF-8 is the new ISO-8859-1, and has been for a while (at least in Drupal).

Comment #3

sun commented 13 August 2010 at 18:54

Issue tags:	+Needs backport to D6

Comment #4

dries commented 14 August 2010 at 01:21

Oh my ...

We've been discussing this some more and found that not all valid UTF-8 characters are valid XML characters. See http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

Given that we're returning XML, additional filtering is required. This probably affects all of our XML-generating code, including format_rss_item().
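
A small sketch of the problem, using only PHP built-ins and a made-up sample string: a control character such as U+0001 is valid UTF-8 and survives escaping untouched, although it is outside the XML Char production.

```php
<?php
// U+0001 is a single valid UTF-8 byte, but not an allowed XML 1.0 character.
$text = "a\x01b";

// The string is valid UTF-8: a /u pattern matches without error.
$is_valid_utf8 = (preg_match('//u', $text) === 1);

// Escaping handles markup characters only; the control byte passes through.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$still_has_control_char = (strpos($escaped, "\x01") !== FALSE);

var_dump($is_valid_utf8, $still_has_control_char);
```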

Comment #5

chx commented 14 August 2010 at 03:20

That's not true. All you have found is that not all valid UTF-8 byte streams are valid XML characters. Guess what? They are not Unicode characters either.

Comment #6

chx commented 14 August 2010 at 03:31

Status:	Reviewed & tested by the community	» Needs work

OK so now I understand. What you cannot do is take an arbitrary byte stream, encode it with the algorithm called UTF-8, and expect the results to be valid characters by the Unicode standard. UTF-8 is just that, an encoding of a bunch of bytes into another bunch of bytes. What we need here, I think, is preg_replace('/.+/u', '\0', htmlspecialchars($xmlrpc_value->data, ENT_COMPAT, 'UTF-8')).

Also, don't we need tests?

Comment #7

sun commented 14 August 2010 at 10:56

Shouldn't check_plain() do this also for XHTML strings?

Comment #8

damien tournoud commented 14 August 2010 at 11:19

Status:	Needs work	» Reviewed & tested by the community

Everyone is tired? check_plain() is just perfectly what we need here.

If the input stream is not valid UTF-8, check_plain() will kill it. Which is what we want anyway.
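
To illustrate, a sketch of what happens on current PHP versions, where htmlspecialchars() returns an empty string for input that is not valid in the given encoding unless ENT_IGNORE or ENT_SUBSTITUTE is set. The sample strings are made up:

```php
<?php
// "\xFF" can never appear in well-formed UTF-8.
$invalid = "abc\xFFdef";
$valid = "abc <def>";

// The call check_plain() makes, with explicit flags and charset.
$a = htmlspecialchars($invalid, ENT_QUOTES, 'UTF-8');
$b = htmlspecialchars($valid, ENT_QUOTES, 'UTF-8');

var_dump($a); // The whole string is discarded, not only the bad byte.
var_dump($b);
```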

Comment #9

sun commented 14 August 2010 at 12:34

Status:	Reviewed & tested by the community	» Needs review

Hrm. The current function body:

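function check_plain($text) {
  static $php525;

  // Cache the PHP version check across calls.
  if (!isset($php525)) {
    $php525 = version_compare(PHP_VERSION, '5.2.5', '>=');
  }
  // On PHP >= 5.2.5, htmlspecialchars() is called directly.
  if ($php525) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
  }
  // On older versions, escape only if the /u pattern matches; else return ''.
  return (preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : '';
}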

If PHP >= 5.2.5, we don't perform that filtering, and even for lower versions, check_plain() actually checks the beginning of the string only, not caring for invalid characters elsewhere. Or at the very least, if check_plain() is supposed to filter out invalid characters, then it is not doing what we expect it should be doing on my local test site on PHP 5.2.6/Win32.

Comment #10

damien tournoud commented 14 August 2010 at 12:47

> If PHP >= 5.2.5, we don't perform that filtering,...

Because PHP does that for us since 5.2.5.

> and even for lower versions, check_plain() actually checks the beginning of the string only, not caring for invalid characters elsewhere. Or at the very least, if check_plain() is supposed to filter out invalid characters, then it is not doing what we expect it should be doing on my local test site on PHP 5.2.6/Win32.

preg_match() will check the validity of the whole input string before doing the matching.
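
This can be checked with a short sketch. The sample strings are made up, and the exact return value for the invalid case is what current PHP versions report (older versions may differ):

```php
<?php
// The invalid byte sits at the end, far from what '^.' actually matches.
$invalid = "abc\xFF";
$valid = "abc";

// With /u, PCRE validates the entire subject before trying to match.
$r_invalid = preg_match('/^./us', $invalid);
$error = preg_last_error();
$r_valid = preg_match('/^./us', $valid);

var_dump($r_invalid, $error === PREG_BAD_UTF8_ERROR, $r_valid);
```

Because the match does not return 1 for the invalid string, the `== 1` comparison in check_plain() takes the empty-string branch.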

Comment #11

damien tournoud commented 14 August 2010 at 14:00

So indeed, we are letting characters pass through that might not be valid in XML, i.e. which are not in:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Most characters in [#x1-#x1F] are not valid, and we should strip them.
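
The stripping step could look roughly like this. The function name is hypothetical (it is not part of Drupal), and the character class is simply the complement of the Char production quoted above:

```php
<?php
/**
 * Hypothetical helper: removes characters not allowed by XML 1.0.
 *
 * Returns NULL when $text is not valid UTF-8, because preg_replace() with
 * the /u modifier fails on such input.
 */
function example_strip_invalid_xml_chars($text) {
  $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
  return preg_replace($pattern, '', $text);
}

// Tab, newline and regular text survive; U+0001 and U+000B are removed.
$clean = example_strip_invalid_xml_chars("a\x01b\tc\x0Bd\n");
var_dump($clean);
```

A caller would still need to decide how to treat the NULL result for invalid UTF-8, for example by falling back to an empty string.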

Question: should we also remove those from HTML output? It might be a costly operation.

Comment #12

sun commented 14 August 2010 at 15:31

I've tried to look up what the HTML5 spec defines, but it does not seem to define precise character ranges, just: "Text must consist of Unicode characters, and must not contain U+0000 characters, noncharacters, or control characters (except spaces)."

Technically, I think that check_plain() is responsible for removing invalid characters, also for HTML, even if it may be expensive.

Of course, that would mean that such characters would be stored, but never ever be displayed again. Thus, removing invalid characters prior to validation and prior to storage would...

Comment #13

sun commented 15 August 2010 at 14:28

Issue tags:	+Needs tests

Status	File	Size
new	drupal.xmlrpc-charset.13.patch	2.58 KB

I've manually tested drupal_validate_utf8() with various invalid byte strings now. preg_match() only "silently fails" in the documented edge case of characters #xC0-#xFF, which actually are valid UTF-8 byte sequences, if I get this right.

Also tried to come up with a unit test, but that's slightly over my head.

Comment #14

15 August 2010 at 15:10

Status:	Needs review	» Needs work

The last submitted patch, drupal.xmlrpc-charset.13.patch, failed testing.

Comment #15

damien tournoud commented 15 August 2010 at 15:46

+    // @todo chr() #fail
+    $invalid = array_merge(
+      $invalid,
+      array_map('chr', range(hexdec('D800'), hexdec('DFFF')))
+    );

You have to encode those in UTF-8. chr($code) is the UTF-8 representation of $code only when $code fits in the lowest 7 bits (0-127).
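
For reference, a sketch of encoding a code point as UTF-8 by hand. The function name is hypothetical, and it deliberately does not reject surrogates, since the test data above needs exactly those invalid sequences:

```php
<?php
/**
 * Hypothetical helper: encodes a code point using the UTF-8 bit layout.
 */
function example_codepoint_to_utf8($code) {
  if ($code < 0x80) {
    return chr($code);
  }
  if ($code < 0x800) {
    return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
  }
  if ($code < 0x10000) {
    return chr(0xE0 | ($code >> 12))
      . chr(0x80 | (($code >> 6) & 0x3F))
      . chr(0x80 | ($code & 0x3F));
  }
  return chr(0xF0 | ($code >> 18))
    . chr(0x80 | (($code >> 12) & 0x3F))
    . chr(0x80 | (($code >> 6) & 0x3F))
    . chr(0x80 | ($code & 0x3F));
}

// chr() alone yields one raw byte, which is not UTF-8 above 0x7F.
var_dump(chr(0xE9) === "\xE9");
var_dump(example_codepoint_to_utf8(0xE9) === "\xC3\xA9");

// An encoded surrogate is rejected by PCRE in /u mode.
$surrogate = example_codepoint_to_utf8(0xD800);
var_dump(bin2hex($surrogate), preg_match('//u', $surrogate) !== 1);
```

With a helper like this, the range in the patch could be mapped through it instead of through chr().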

Comment #16

damien tournoud commented 15 August 2010 at 15:52

It seems that the XML extension has a nice utility function to do that:

http://php.net/manual/en/function.utf8-encode.php

Comment #17

smk-ka commented 17 August 2010 at 16:27

> I've tried to look up what the HTML5 spec defines, but it does not seem to define precise character ranges, just: "Text must consist of Unicode characters, and must not contain U+0000 characters, noncharacters, or control characters (except spaces)."

I'd say the specs are actually pretty precise: any noncharacter or control characters need to be encoded as numeric entities (see 8.1.4 Character references). So we're not talking about removing any characters, but properly encoding them.
Scratch that, it clearly says they're not even allowed as entities.

Comment #18

smk-ka commented 17 August 2010 at 15:12

Probably related: folks over at Views Bonus Pack (XML export) are struggling with a similar issue: #603420: XML export mangles <, >, and &

Comment #19

sun commented 22 September 2010 at 17:18

Assigned:	Unassigned	» sun

Assigning to myself for now to not lose track of this bug.

Comment #20

sun commented 13 January 2011 at 13:26

Status:	Needs work	» Needs review
Issue tags:	-Needs backport to D6, -Needs tests

#13: drupal.xmlrpc-charset.13.patch queued for re-testing.

Comment #21

13 January 2011 at 14:02

Status:	Needs review	» Needs work
Issue tags:		+Needs backport to D6, +Needs tests

The last submitted patch, drupal.xmlrpc-charset.13.patch, failed testing.

Comment #22

sun commented 11 July 2011 at 13:51

Version:	7.x-dev	» 8.x-dev
Issue tags:		+Needs backport to D7

utf8_encode() didn't really work for me to generate Unicode characters. Searching for alternatives, the simplest and most sane code I found is the unichr() implementation from a php.net comment on chr():

function unichr($u) {
  return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}

Comment #23

sun commented 22 July 2012 at 23:36

Assigned:	sun	» Unassigned

I'm not using XML-RPC anymore. :)

Comment #24

greggles commented 1 August 2014 at 12:48

Version:	8.0.x-dev	» 7.x-dev
Issue summary:	View changes

After #1285726-48: Remove XML-RPC moving this to Drupal 7.

But it could also be appropriate to move it to the contrib xmlrpc module.

Comment #25

1 August 2014 at 12:48

Status:	Needs work	» Closed (outdated)

Automatically closed because Drupal 7 security and bugfix support has ended as of 5 January 2025. If the issue verifiably applies to later versions, please reopen with details and update the version.
