While writing a couple of unit tests for #296314 (TestingParty08: format_rss_item needs a test), I discovered that core's format_xml_elements() blindly trusts that the element names passed in are safe.
check_plain() is not suitable for this task: it encodes special characters as entities, which introduces & characters, and & cannot appear in XML element names. Instead, this patch checks the validity of element names with the following regex: /^[\w:][\w\-:]*$/
This will prevent elements with invalid names from being written to the outgoing XML (for example, a full set of extra script tags, or an element name with a < or > in it).
Unfortunately, it will also prevent any non-ASCII Unicode characters from being used, so 'mötley:crüë' will be left out as well.
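To make the behaviour described above concrete, here is a minimal sketch of the check. The helper name example_is_safe_element_name() is hypothetical and is not the code from the attached patch; only the regex is taken from this post.

```php
<?php
// Hypothetical helper; not the function from the attached patch.
function example_is_safe_element_name(string $name): bool {
  // Same pattern as quoted above: ASCII word characters and colons,
  // with hyphens also allowed after the first character.
  return preg_match('/^[\w:][\w\-:]*$/', $name) === 1;
}

var_dump(example_is_safe_element_name('dc:creator'));  // bool(true)
var_dump(example_is_safe_element_name('a><script>'));  // bool(false)
// Without the /u modifier, the bytes of the accented letters are not matched by \w.
var_dump(example_is_safe_element_name("m\u{F6}tley:cr\u{FC}\u{EB}"));  // bool(false)
```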
I'd appreciate any guidance on a better way to handle this.
| Comment | File | Size | Author |
|---|---|---|---|
| | invalid_element_names.patch | 1.29 KB | Steven Merrill |
Comments
Comment #1
c960657 commented: The list of allowed characters in tag names is bigger than that (though most sane people will probably avoid them):
http://www.w3.org/TR/REC-xml/#sec-starttags
http://www.w3.org/TR/REC-xml/#NT-NameStartChar
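As a rough illustration of what following the spec would look like, the sketch below transcribes the NameStartChar and NameChar ranges from XML 1.0 (Fifth Edition) into a PCRE pattern. The function name is hypothetical, and the subject is assumed to be UTF-8.

```php
<?php
// Character ranges transcribed from the XML 1.0 (Fifth Edition) Name
// production. The function name is hypothetical.
function example_is_xml_name(string $name): bool {
  $start = 'A-Za-z_:\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}'
    . '\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}'
    . '\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}';
  // NameChar adds hyphen, dot, digits, and a few combining ranges.
  $rest = $start . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';
  // /u: treat pattern and subject as UTF-8; \z: reject a trailing newline.
  return preg_match('/^[' . $start . '][' . $rest . ']*\z/u', $name) === 1;
}

var_dump(example_is_xml_name("m\u{F6}tley:cr\u{FC}\u{EB}"));  // bool(true)
var_dump(example_is_xml_name('1abc'));                        // bool(false)
var_dump(example_is_xml_name('a<b'));                         // bool(false)
```

Note that this accepts names most feed consumers will never see, which is the trade-off against the much shorter ASCII-only pattern.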
Comment #2
mfer commented: Have you looked into using \pL\pN_ instead of \w? \w is letters, numbers, and _. \pL is Unicode letters, \pN is Unicode numbers, and then _ is added explicitly. Note, I have not tested these.
Comment #3
Steven Merrill commented: I did try \pL and \pN in a regex with the /u modifier, to no avail. (I was testing 'mötley:crüë' as one of the test cases, and it would always fail.)
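One possible explanation, offered only as an assumption since the failing test is not shown: with the /u modifier, preg_match() returns false (rather than 0) when the subject is not valid UTF-8, for instance if the test file was saved as ISO-8859-1. The sketch below uses a hypothetical helper to show the difference; the choice to disallow digits as the first character is mine, not mfer's.

```php
<?php
// Hypothetical helper illustrating the \pL / \pN suggestion; not from the patch.
function example_is_unicode_element_name(string $name): ?bool {
  // \pL = any Unicode letter, \pN = any Unicode number.
  $result = preg_match('/^[\pL_:][\pL\pN_\-:]*\z/u', $name);
  // With /u, preg_match() returns false (not 0) on invalid UTF-8 input,
  // so report that case separately as null.
  return $result === false ? null : $result === 1;
}

$utf8 = "m\u{F6}tley:cr\u{FC}\u{EB}";  // 'mötley:crüë' encoded as UTF-8
$latin1 = "m\xF6tley:cr\xFC\xEB";      // the same text as ISO-8859-1 bytes

var_dump(example_is_unicode_element_name($utf8));    // bool(true)
var_dump(example_is_unicode_element_name($latin1));  // NULL
```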
Comment #4
valthebald commented: The question is what to do if $key (or $value['key']) contains invalid characters. Discard the key? Or replace it with a "safe" value? I would vote for using a safe value instead of complete ignorance.
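The "safe value" option could look roughly like the sketch below. Everything here is an assumption for illustration: the helper name, the use of '_' as the replacement character, and the rule of returning NULL when nothing usable remains.

```php
<?php
// Hypothetical helper; the '_' replacement character is an arbitrary choice.
function example_safe_element_name(string $name): ?string {
  // Replace every byte outside the conservative ASCII set with '_'.
  $safe = preg_replace('/[^\w\-:]/', '_', $name);
  if ($safe === null || $safe === '') {
    return null;  // Nothing usable: the caller would have to discard the element.
  }
  // A name must not start with a hyphen or a digit, so prefix those cases.
  if (preg_match('/^[\-0-9]/', $safe) === 1) {
    $safe = '_' . $safe;
  }
  return $safe;
}

var_dump(example_safe_element_name('a><script>'));  // string(10) "a__script_"
var_dump(example_safe_element_name('-x'));          // string(3) "_-x"
```

One drawback of replacing rather than discarding is that two different invalid keys can collapse into the same element name.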
Comment #5
valthebald commented: Sorry, bumping version.
Comment #6
diceu commented: A related but different issue is the set of characters allowed in RSS/XML content:
http://www.w3.org/TR/xml11/#charsets
$output = preg_replace('%[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}]%u', '', $output);
I don't know how many invalid characters are in 'mötley:crüë', but I believe invalid characters should be removed, since that is the normal behavior of most syndication software anyway. Trying to find a safe representation for some of the really off-the-wall control characters, like bell (^G), would be ridiculous; removal ("complete ignorance") is the only acceptable "safe" option that fits most of these cases.
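A self-contained sketch of that removal approach follows. The helper name is hypothetical. The ranges follow the XML 1.0 Char production and, as an addition to the line quoted above, also keep the supplementary planes (U+10000 to U+10FFFF). Code points above 0xFF need the \x{...} form together with the /u modifier.

```php
<?php
// Hypothetical helper; ranges follow the XML 1.0 Char production.
function example_strip_invalid_xml_chars(string $text): string {
  $clean = preg_replace(
    '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
    '',
    $text
  );
  // preg_replace() returns null on invalid UTF-8; fall back to an empty string.
  return $clean ?? '';
}

// \x07 is the bell character (^G); the accented letter is valid and is kept.
$dirty = "Ring\x07 the bell: m\u{F6}tley";
var_dump(example_strip_invalid_xml_chars($dirty) === "Ring the bell: m\u{F6}tley");  // bool(true)
```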
Comment #7
dawehner commented: This function no longer exists; see https://www.drupal.org/node/2468139