Although neither http://api.drupal.org/api/function/_xmlrpc/7 nor http://api.drupal.org/api/function/drupal_http_request/7 defines a character set, all data in Drupal is supposed to be UTF-8.
http://php.net/manual/en/function.htmlspecialchars.php, however, defaults to ISO-8859-1.
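As a rough illustration of the change being asked for, the sketch below passes the character set explicitly instead of relying on the default. The function name example_xmlrpc_escape() is hypothetical and is not taken from any of the attached patches; the call itself is the one quoted later in this thread.

```php
<?php
// Hypothetical helper (not from the patch): escape a value for XML output
// while telling htmlspecialchars() explicitly that the input is UTF-8.
function example_xmlrpc_escape($data) {
  // ENT_COMPAT converts double quotes and leaves single quotes alone.
  return htmlspecialchars($data, ENT_COMPAT, 'UTF-8');
}
```

With the charset named explicitly, multi-byte UTF-8 sequences pass through unchanged and only the markup characters are escaped.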
| Comment | File | Size | Author |
|---|---|---|---|
| #13 | drupal.xmlrpc-charset.13.patch | 2.58 KB | sun |
| #1 | drupal.xmlrpc-charset.1.patch | 829 bytes | sun |
| | drupal.xmlrpc-charset.0.patch | 1.01 KB | sun |
Comments
Comment #1
sun
This does not seem to fix my Unicode encoding problem, but fixing it would still be a good idea.
Furthermore, that's really one of the most lovely comments I've ever read. Which blogging clients? And which versions? ...
#31301: Apostrophe returns '#039' on client using xmlrpc introduced the change from check_plain() to htmlspecialchars() in 2005, and Steven was strongly opposed to it.
Five years later, those blogging clients most probably no longer exist, or have hopefully been fixed in the meantime.
Comment #2
dries commented
This looks like the right thing to do to me; we shouldn't work around broken clients because it breaks valid clients.
UTF-8 is the new ISO-8859-1, and has been for a while (at least in Drupal).
Comment #3
sun
Comment #4
dries commented
Oh my ...
We've been discussing this some more and found that not all valid UTF-8 characters are valid XML characters. See http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
Given that we're returning XML, additional filtering is required. This probably affects all of our XML-generating code; including format_rss_item().
Comment #5
chx commented
That's not true. All you have found is that not all valid UTF-8 byte streams are valid XML characters. Guess what? They are not Unicode characters either.
Comment #6
chx commented
OK, so now I understand. What you cannot do is take an arbitrary byte stream, encode it with the algorithm called UTF-8, and expect the results to be valid characters by the Unicode standard. UTF-8 is just that, an encoding of a bunch of bytes into another bunch of bytes. What we need here, I think, is
preg_replace('/.+/u', '\0', htmlspecialchars($xmlrpc_value->data, ENT_COMPAT, 'UTF-8')).
Also, don't we need tests?
Edit: or we actually need to take #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] and turn it into a preg to be sure.
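One way the character ranges above could be turned into a preg is sketched below. This is only an illustration under assumptions: the function name example_strip_invalid_xml_chars() is hypothetical, and returning an empty string for input that is not valid UTF-8 is a choice made for this sketch, not something any patch in this issue does.

```php
<?php
// Hypothetical helper: drop every character that falls outside the XML 1.0
// Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
// [#x10000-#x10FFFF]).
function example_strip_invalid_xml_chars($text) {
  // The negated class matches anything NOT allowed by the production.
  // The /u modifier makes PCRE treat pattern and subject as UTF-8.
  $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
  $result = preg_replace($pattern, '', $text);
  // With /u, preg_replace() returns NULL when the subject is not valid
  // UTF-8. Assumption for this sketch: treat such input as empty.
  return $result === NULL ? '' : $result;
}
```

Tab, line feed and carriage return survive; the other control characters below #x20 are removed.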
Comment #7
sun
Shouldn't check_plain() do this also for XHTML strings?
Comment #8
damien tournoud commented
Is everyone tired? check_plain() is exactly what we need here.
If the input stream is not valid UTF-8, check_plain() will kill it. Which is what we want anyway.
Comment #9
sun
Hrm. The current function body:
If PHP >= 5.2.5, we don't perform that filtering, and even for lower versions, check_plain() actually checks the beginning of the string only, ignoring invalid characters elsewhere. Or at the very least, if check_plain() is supposed to filter out invalid characters, then it is not doing what we expect on my local test site on PHP 5.2.6/Win32.
Comment #10
damien tournoud commented
Because PHP has done that for us since 5.2.5.
preg_match() will check the validity of the whole input string before doing the matching.
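The point about preg_match() can be tried out with a small sketch. The helper name example_is_valid_utf8() is hypothetical; the pattern only anchors at the first character, yet the /u modifier makes PCRE validate the entire subject before matching.

```php
<?php
// Hypothetical helper: report whether $text is a valid UTF-8 byte string.
function example_is_valid_utf8($text) {
  // An empty string has nothing to match, so treat it as valid up front.
  if (strlen($text) == 0) {
    return TRUE;
  }
  // '/^./us' matches just one character, but with /u preg_match() returns
  // FALSE (not 0 or 1) if ANY part of the subject is invalid UTF-8.
  return preg_match('/^./us', $text) == 1;
}
```

Note that a string containing a control character such as #x1 still counts as valid UTF-8 here, so this check alone says nothing about whether the text is valid XML.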
Comment #11
damien tournoud commented
So indeed, we are letting characters pass through that might not be valid in XML, i.e. which are not in:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Most characters in [#x1-#x1F] are not valid, and we should strip them.
Question: should we also remove those from HTML output? It might be a costly operation.
Comment #12
sun
I've tried to look up what the HTML5 spec defines, but it does not seem to define precise character ranges, only that text must consist of Unicode characters, and must not contain U+0000 characters, noncharacters, or control characters (except spaces).
Technically, I think that check_plain() is responsible for removing invalid characters, also for HTML, even if it may be expensive.
Of course, that would mean that such characters would be stored, but never ever be displayed again. Thus, removing invalid characters prior to validation and prior to storage would...
Comment #13
sun
I've manually tested drupal_validate_utf8() with various invalid byte strings now. preg_match() only "silently fails" in the documented edge case of characters #xC0-#xFF, which actually are valid UTF-8 byte sequences, if I get this right.
Also tried to come up with a unit test, but that's slightly over my head.
Comment #15
damien tournoud commented
You have to encode those in UTF-8.
chr($code) is the UTF-8 representation of $code only for the lowest 7 bits.
Comment #16
damien tournoud commented
It seems that the XML extension has a nice utility function to do that:
http://php.net/manual/en/function.utf8-encode.php
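To illustrate the remark in #15 that chr($code) only equals the UTF-8 representation for 7-bit values, here is a minimal sketch of encoding a code point by hand. The function name example_unichr() is hypothetical and the code is not taken from any patch in this issue. It assumes $code is an integer between 0 and 0x10FFFF and does not reject surrogates.

```php
<?php
// Hypothetical helper: return the UTF-8 byte sequence for a Unicode code
// point. Assumes 0 <= $code <= 0x10FFFF; surrogates are not rejected.
function example_unichr($code) {
  if ($code < 0x80) {
    // 7 bits: a single byte, identical to chr($code).
    return chr($code);
  }
  if ($code < 0x800) {
    // Up to 11 bits: two bytes, 110xxxxx 10xxxxxx.
    return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
  }
  if ($code < 0x10000) {
    // Up to 16 bits: three bytes, 1110xxxx 10xxxxxx 10xxxxxx.
    return chr(0xE0 | ($code >> 12))
      . chr(0x80 | (($code >> 6) & 0x3F))
      . chr(0x80 | ($code & 0x3F));
  }
  // Up to 21 bits: four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
  return chr(0xF0 | ($code >> 18))
    . chr(0x80 | (($code >> 12) & 0x3F))
    . chr(0x80 | (($code >> 6) & 0x3F))
    . chr(0x80 | ($code & 0x3F));
}
```

For example, example_unichr(0xE9) yields the two bytes 0xC3 0xA9, whereas chr(0xE9) yields the single byte 0xE9, which on its own is not valid UTF-8.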
Comment #17
smk-ka commented
I'd say the specs are actually pretty precise: any noncharacters or control characters need to be encoded as numeric entities (see 8.1.4 Character references). So we're not talking about removing any characters, but properly encoding them.
Scratch that, it clearly says they're not even allowed as entities.
Comment #18
smk-ka commented
Probably related: folks over at Views Bonus Pack (XML export) are struggling with a similar issue: #603420: XML export mangles <, >, and &
Comment #19
sun
Assigning to myself for now so as not to lose track of this bug.
Comment #20
sun
#13: drupal.xmlrpc-charset.13.patch queued for re-testing.
Comment #22
sun
utf8_encode() didn't really work for me to generate Unicode characters. Searching for alternatives, the simplest and most sane code I found is the unichr() implementation from a php.net comment on chr():
Comment #23
sun
I'm not using XML-RPC anymore. :)
Comment #24
greggles
After #1285726-48: Remove XML-RPC, moving this to Drupal 7.
But it could also be appropriate to move it to the contrib xmlrpc module.