When using the filter 'limit allowed HTML tags' for text formats, the attribute xml:lang transforms to lang.

For example <blockquote xml:lang="en">, output as <blockquote lang="en">.

Note that when the 'Correct faulty and chopped off HTML' is turned on (default setting), it by surprise corrects to <blockquote lang="en" xml:lang="en"> (that is correct behavior as the use of both xml:lang and lang is recommended)

The same issue for for xml:id, this will output as id. (Unluckily 'Correct faulty and chopped off HTML' doesn't correct this one :).

Steps to reproduce:
1. install clean Drupal 8
3. add page with body text '<em xml:id="1234">test</em>
4. it will show as <em id="1234">test</em>

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Hanno’s picture

Issue summary: View changes

minor typo

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Priority: Minor » Normal
Status: Active » Needs review
Issue tags: +language of parts
FileSize
783 bytes

Did some research and it happens in _filter_xss_attributes($attr) in common.inc,
in particular in this regular expression: '/^([-a-zA-Z]+)/'.

Changing this to '/^([-a-zA-Z:]+)/' works.

Is this a safe solution? If ':' is only used with the combination 'xml:', this expression could be probably more narrowed? Patch provided but needs review.

Status: Needs review » Needs work

The last submitted patch, patch_commit_faadf30b7d1c.patch, failed testing.

Hanno’s picture

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Title: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss » 'xml:lang' transforms to 'lang' due to 'limit allowed HTML tags'

Not exclusively related to the filter module

Hanno’s picture

Title: 'xml:lang' transforms to 'lang' due to 'limit allowed HTML tags' » attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss
Component: filter.module » base system

better title

Gábor Hojtsy’s picture

Title: 'xml:lang' transforms to 'lang' due to 'limit allowed HTML tags' » attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss
Status: Needs work » Needs review

Setting for testing of last patch.

Gábor Hojtsy’s picture

Issue tags: +D8MI

Also tagging.

Gábor Hojtsy’s picture

Does this even apply to Drupal 8? Drupal 8 now outputs HTML 5 (see http://groups.drupal.org/node/187004) and reading the spec (http://www.whatwg.org/specs/web-apps/current-work/#the-lang-and-xml:lang...), it says we should not use it in a forward looking manner.

Authors must not use the lang attribute in the XML namespace on HTML elements in HTML documents. To ease migration to and from XHTML, authors may specify an attribute in no namespace with no prefix and with the literal localname "xml:lang" on HTML elements in HTML documents, but such attributes must only be specified if a lang attribute in no namespace is also specified, and both attributes must have the same value when compared in an ASCII case-insensitive manner.

The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing.

However, the HTML 5 patch seems to have carried over xml:lang without adding lang. (At #1077566: Convert html.tpl.php to HTML5).

Status: Needs review » Needs work

The last submitted patch, patch_commit_faadf30b7d1c.patch, failed testing.

Gábor Hojtsy’s picture

Status: Needs work » Needs review

I've opened #1330922: Drupal 8 HTML 5 page template should use lang not xml:lang for the xml:lang -> lang work/question.

Gábor Hojtsy’s picture

Version: 8.x-dev » 7.x-dev
Issue tags: -D8MI

All right, so seems like we are not going to use XHTML(5) in Drupal 8, so this will not apply there as far as I see. It does apply to Drupal 7 and 6 though.

Hanno’s picture

interesting about html5 and the lang attribute.
While it make sense to only apply this for D7, in my opinion a filter for cross site scripting shouldn't alter attributes from XHTML-style to HTML5 style. Instead, that could be done in the Correct faulty HTML filter.

And there is a bug for the Correct faulty HTML filter. It changes lang='xx' to lang='xx' xml:lang='xx' (i.e. it will print these both attributes). While this is correct XHTML for D7, from it doesn't appear correct html5 for D8 (see comment #8).

EDIT: created an issue: #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5

Gábor Hojtsy’s picture

Well, yeah, depending on how you look at it. Security functions should filter data for the type of input/output expected (depending on where the filtering happens), and since we handle HTML5 in Drupal 8, we should set the tight filter for that, right? We don't really know what kind of namespaced values might cause issues for clients, so we set our whitelist to the tightest possible. I think that makes sense from a security point of view. I agree on the point of #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 as well though and will definitely be needed for converting existing data.

Hanno’s picture

Do we know if contributed modules are using filter_xss for other filtering than html5? When views or other modules handle XML-related data using this filter, it might give unexpected results in some cases.
Curious to know if the colon isn't allowed by accident. Maybe this issue can be handled together with this issue: #928948: _filter_xss_attributes should ignore attribute values for "value" and "flashvars".

wanderingstan’s picture

Just my two cents, but I'm hitting this problem even in Drupal 6: I need to allow the "fb:profile-pic" tag, but filter_xss won't allow any tag with a colon in it's name. Seems this is a bug, as the HTML5 spec clearly allows for tags (and attribute) namespaces.

Hanno’s picture

Well, good to report another case. Found also this duplicate issue #952964: String stripped from title and alt attribute if contains a colon

wanderingstan’s picture

For now I "fixed" the problem with a (gasp) patch to _filter_xss_split() in filter.module (D6), adding colons and hyphens as allowed characters. So "fb:profile-pic" is recognized as a tag name.

Line 1059:

if (!preg_match('%^(?:<\s*(/\s*)?([a-zA-Z0-9:-]+)([^>]*)>?|(<!--.*?-->))$%', $string, $matches)) {
//                                          ^^ added here.
iStryker’s picture

To me, this will never get fixed. xml:XXX is not a safe attribute. See https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Preventio...

iStryker’s picture

Issue summary: View changes

steps to reproduce added

Hanno’s picture

joseph.olstad’s picture

disabling "Limit allowed HTML tags and correct faulty HTML" does avert this

With that said, I'd like to push this issue along and help resolve this for everyone.

joseph.olstad’s picture

Version: 7.x-dev » 9.3.x-dev
joseph.olstad’s picture

Version: 9.3.x-dev » 7.x-dev

ok some confusion on my part, xhtml spec says xml:lang is desired, but it is not wanted in HTML5

http://www.w3.org/TR/xhtml1/#C_7

I will create a new issue for D9.3.x if one doesn't already exist.