While writing a couple of unit tests over at #296314: TestingParty08: format_rss_item needs a test, I discovered that core's format_xml_elements() blindly trusts that the element names passed in are safe.

check_plain() is not an acceptable method for this task: it encodes special characters as entities, every entity starts with an & character, and & cannot appear in an XML element name. Instead, this patch checks the validity of element names with the following regex: /^[\w:][\w\-:]*$/ .

This will prevent elements with invalid names from being written to the outgoing XML (for example, a name that carries a full set of extra script tags, or one with a < or > in it).

Unfortunately, it will also prevent any Unicode characters from being used, such that 'mötley:crüë' will also be left out.
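To illustrate the check described above, here is a minimal sketch. The helper name is_safe_element_name() is hypothetical and is not taken from the patch; only the regex comes from this post. It assumes the input is a UTF-8 string and that PCRE is using its default character tables, so \w covers only ASCII letters, digits, and _.

```php
<?php
/**
 * Hypothetical helper (not part of core or of the patch): returns TRUE when
 * $name passes the ASCII-only element name check proposed in this issue.
 */
function is_safe_element_name($name) {
  // Without the /u modifier, \w matches ASCII letters, digits and "_" under
  // the default tables, so the bytes of UTF-8 encoded letters are rejected.
  return preg_match('/^[\w:][\w\-:]*$/', $name) === 1;
}
```

With this sketch, 'title' and 'dc:creator' are accepted, while '<script>' and 'mötley:crüë' are rejected. Note that a name starting with a digit, such as '1abc', also passes this regex even though XML does not allow it.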

I'd appreciate any guidance on a better way to handle this.

File: invalid_element_names.patch
Size: 1.29 KB
Author: Steven Merrill

Comments

c960657 commented:

Status: Needs review » Needs work

The list of allowed characters in tag names is bigger than that (though most sane people will probably avoid them):
http://www.w3.org/TR/REC-xml/#sec-starttags
http://www.w3.org/TR/REC-xml/#NT-NameStartChar
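As a rough sketch of what a check based on those productions could look like, the following builds a pattern from the NameStartChar and NameChar ranges of XML 1.0 (Fifth Edition). The function names are hypothetical, the input is assumed to be valid UTF-8, and this has not been tested against core.

```php
<?php
/**
 * Hypothetical helper: builds a PCRE pattern for the XML 1.0 Name production.
 * The /u modifier is needed so that \x{...} can address code points > 0xFF.
 */
function xml_name_pattern() {
  // NameStartChar ranges from the XML 1.0 (Fifth Edition) grammar.
  $start = 'A-Z_a-z:\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}\x{370}-\x{37D}'
    . '\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}\x{2C00}-\x{2FEF}'
    . '\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}';
  // NameChar adds "-", ".", digits, U+00B7 and two ranges of combining marks.
  $rest = $start . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';
  return '/^[' . $start . '][' . $rest . ']*\z/u';
}

/**
 * Hypothetical helper: TRUE when $name is a syntactically valid XML 1.0 name.
 */
function is_xml_name($name) {
  // preg_match() returns FALSE (not 0) when $name is not valid UTF-8.
  return preg_match(xml_name_pattern(), $name) === 1;
}
```

Under these assumptions 'mötley:crüë' and 'a-b.c' are accepted, while '<script>', '1abc', and the empty string are rejected. The colon is allowed by the grammar, but whether a name with several colons is acceptable depends on namespace rules that this sketch does not check.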

mfer commented:

Have you looked into using \pL\pN_ instead of \w? \w matches letters, numbers, and _. \pL matches Unicode letters and \pN matches Unicode numbers, so _ has to be added explicitly. Note: I have not tested these.

Steven Merrill commented:

I did try \pL and \pN in a regex with the /u modifier, to no avail. (I was testing 'mötley:crüë' as one of the test cases, and it would always fail.)
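For reference, a minimal sketch of the Unicode property approach discussed above. The function name is hypothetical, and the explanation of the failure is only one possibility that this thread does not confirm: with the /u modifier, preg_match() returns FALSE instead of matching when the subject is not valid UTF-8, for example when the test string is stored in ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper: element name check using Unicode property classes.
 * Assumes PCRE was built with Unicode property support.
 */
function is_unicode_element_name($name) {
  // \pL = any Unicode letter, \pN = any Unicode number; "_" is listed
  // explicitly because it belongs to neither class.
  $result = preg_match('/^[\pL_:][\pL\pN_\-:]*\z/u', $name);
  // $result is 1 on a match, 0 on no match, and FALSE on an error such as
  // a subject that is not valid UTF-8.
  return $result === 1;
}
```

In this sketch a UTF-8 encoded 'mötley:crüë' is accepted. The same text encoded as ISO-8859-1 is rejected, and preg_last_error() then reports PREG_BAD_UTF8_ERROR, so checking that value can show whether the encoding of the test string is the problem.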

valthebald commented:

Assigned: Unassigned » valthebald
Status: Needs work » Active

The question is what to do if $key (or $value['key']) contains invalid characters: discard the key, or replace it with a "safe" value? I would vote for using a safe value instead of complete ignorance.
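A minimal sketch of the "safe value" option, assuming invalid characters are replaced with an underscore. The function name and the choice of replacement character are hypothetical; nothing like this exists in core, and it reuses the ASCII-only character set from the original patch.

```php
<?php
/**
 * Hypothetical helper: turns an arbitrary key into an ASCII-safe element
 * name instead of dropping the element.
 */
function sanitize_element_name($name) {
  // Replace every byte outside the ASCII-safe set with "_".
  $clean = preg_replace('/[^\w\-:]/', '_', $name);
  // A name may not be empty and may not start with "-", so prefix those
  // cases with "_" as well.
  if ($clean === '' || $clean[0] === '-') {
    $clean = '_' . $clean;
  }
  return $clean;
}
```

Here '<script>' becomes '_script_' and '-foo' becomes '_-foo'. One drawback of this design is that distinct keys can collapse to the same element name, which discarding the key avoids.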

valthebald commented:

Version: 7.x-dev » 8.x-dev
Issue tags: +Needs backport to D7

Sorry, bumping the version.

diceu commented:

A related but different issue is the set of characters allowed in RSS/XML:

http://www.w3.org/TR/xml11/#charsets
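// Code points above 0xFF need the \x{...} form and the /u modifier; $output must be valid UTF-8.
$output = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', '', $output);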

I don't know how many invalid characters are in 'mötley:crüë', but I believe that invalid characters should be removed, since that is the normal behavior of most syndication anyway. Also, finding a safe representation for some of the really off-the-wall control characters, like bell (^G), would be ridiculous. Removal ("complete ignorance") is the only acceptable "safe" option that fits most of these issues.
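A small sketch of that removal, wrapped in a hypothetical function. In PCRE, \xD7FF is read as the byte \xD7 followed by the literal text "FF", so the ranges are written here in the \x{...} form with the /u modifier. The supplementary range \x{10000}-\x{10FFFF} is included because the XML Char production allows it. Returning an empty string for input that is not valid UTF-8 is an assumption of this sketch, not something the thread decided.

```php
<?php
/**
 * Hypothetical helper: removes characters that XML does not allow in
 * document content.
 */
function strip_invalid_xml_chars($text) {
  $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}'
    . '\x{10000}-\x{10FFFF}]/u';
  $clean = preg_replace($pattern, '', $text);
  // preg_replace() returns NULL on failure, e.g. when $text is not valid
  // UTF-8; this sketch treats that as "nothing usable".
  return $clean === NULL ? '' : $clean;
}
```

With this sketch the bell character is dropped from "a\x07b", giving 'ab', while tabs, newlines, and UTF-8 encoded letters such as those in 'mötley:crüë' pass through unchanged.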

dawehner commented:

Version: 8.0.x-dev » 7.x-dev
Issue summary: View changes

This function no longer exists; see https://www.drupal.org/node/2468139

Status: Active » Closed (outdated)

Automatically closed because Drupal 7 security and bugfix support has ended as of 5 January 2025. If the issue verifiably applies to later versions, please reopen with details and update the version.