Character double-encoding/erroneous rawurlencode() [#1672932]

Since Pathologic runs as the last filter, other sanitizing filters may have run before Pathologic, such as replacing "&" with "&" in HTML attributes. Additionally, WYSIWYG editors automatically perform this replacement (e.g. CKEditor and inserted links, image sources).

When URLs are passed through Pathologic (and subsequently run through url()), both paths and query parameters are ultimately run through rawurlencode(), turning a valid (htmlspecialchars()-escaped) URL containing e.g. "&" to "&amp%3B" (the semicolon is URL-encoded as %3B). The situation is exacerbated for query parameters due to how parse_str() splits the query array and url() recombines it:

$parts = array('query' => 'foo=bar&amp;bar=baz');
parse_str($parts['query'], $parts['qparts']);
// Because parse_str() simply explodes on "&", $parts['qparts'] becomes:
// Array
// (
//     [foo] => bar
//     [amp;bar] => baz
// )

The fix is (inside _pathologic_replace()):

// Now parse the URL
---  $parts = parse_url($matches[2]);
+++  $parts = parse_url(htmlspecialchars_decode($matches[2]));

Now that $parts['query'] no longer has a key of "amp;bar", when url() recombines the query string, all items are simply implode('&', $query)'d with both keys and values rawurlencode()'d (note: not safe for attribute values!):

This is no good for href/src/etc. attribute values:
?foo=bar&bar=baz

So, add check_url() on the final output (which should be done anyway, since we're always rewriting the direct HTML attribute value, and url() or something in hook_url_outbound_alter() may have returned an unfit string for an attribute value):

+++  // Query parameters OR hook_url_outbound_alter() may have made this URL unsafe
+++  // for the attribute locations we are putting it in (href|src|action|longdesc),
+++  // so run check_url() against it:
+++  $url = check_url($url);
+++
  // $matches[1] will be the tag attribute; src, href, etc.
  return "{$matches[1]}=\"{$url}";

Arguably, the check_url() should be added to the code regardless of the double-encode fix, since we're outputting a modified/rebuilt attribute value and nothing else is sanitizing it for us.

Comment	File	Size	Author
#14	pathologic-double-rawurlencode-fix.patch	667 bytes	jay.dansand
#11	Socially-Awkward-Penguin-Meme-Template-Blank.jpg	87.39 KB	Garrett Albright
#6	facepalm.jpg	23.54 KB	jay.dansand
#5	epic-jackie-chan-template.png	31.72 KB	Garrett Albright

Comments

Comment #1

Garrett Albright commented 8 July 2012 at 04:29

Hmm. Is there a legitimate case where characters in paths would be HTML-encoded before hitting Pathologic? I'm pretty sure that's your main problem here.

Comment #2

jay.dansand commented 9 July 2012 at 13:34

Sure, for instance on our site there are a few images with "&" in the name. It's a valid (though very annoying) character, and according to the RFCs it just needs to be escaped as per normal href/src/etc. attribute escaping. The Media module escapes the URL fine for the IMG's src attribute, but Pathologic deconstructs it and after url() runs, it outputs a mixed HTML/URI-encoded src attribute that (for obvious reasons) will never load.

I could put a filter in Media somewhere to remove non-word (i.e. [A-Za-z0-9_]) characters from filenames, but that seems an unwarranted band-aid (and difficult, since each Media plugin may need its own patch, such as Media Browser Plus which handles its own uploads); the characters are valid and there are plenty of corner cases where they make a lot of sense to keep.

On a separate note, I really think check_url() should be run anyway (even if nothing else about my fix is implemented): it's entirely possible url() returns a string unfit for an attribute value - that's okay (and intentional), because url() can't know its output context (sometimes the output isn't meant for an HTML attribute value). But, since we're only rewriting attributes with Pathologic, we know our context and the onus is on us to properly sanitize it.

Comment #3

Garrett Albright commented 9 July 2012 at 13:40

I'm still confused. You say the Media module URL encodes URLs, which is fine, but your problem in the OP seems to stem from the & character being HTML encoded.

Comment #4

jay.dansand commented 9 July 2012 at 13:56

Sorry for the ambiguity... it HTML-encodes them, so you end up with "&"

Comment #5

Garrett Albright commented 9 July 2012 at 14:09

Status	File	Size
new	epic-jackie-chan-template.png	31.72 KB

what

Comment #6

jay.dansand commented 9 July 2012 at 15:50

Status	File	Size
new	facepalm.jpg	23.54 KB

The Internet is hard :( Luckily, this problem isn't too difficult to solve. htmlspecialchars_decode() above and check_url() below the parse_str() and url() methods and we're done. We don't need to worry about anything else for proper attribute values, unless of course some stupid editor html_entities() stuff that should be left as UTF-8 like é or something. We do have some faculty (and courses) with e.g. ø in their name, but in that case, there's only one answer...

Comment #7

Garrett Albright commented 10 July 2012 at 09:57

Look, by your own admission, the problem is that the paths in your content are getting HTML encoded somehow, right? So your solution is to have me change the code of my module to undo it, rather than figure out what's causing it and stopping it from happening in the first place on your site?

In other words, I ask again; is there any rational case in which having HTML encoded characters in a path is expected behavior?

Comment #8

jay.dansand commented 10 July 2012 at 14:58

Whoa hey, I think we've had a misunderstanding. We're all friends here, I'm just trying to help push up an update that fixes things on my site and makes the module more HTML compliant. I'm not meaning to step on toes; if this were Joomla, I'd just fix the problem and keep the patch to myself, but hey, we're a community here, and I thought that's how the Drupal community is meant to work. If I'm wrong, no problem, I'll just bugger off.

I do need to post one correction: Media is (most of the time) URL-encoding the link, which means it gets double-URL-encoded after secondary processing through url(), not HTML-encoded and then URL-encoded. CKEditor's link inserter is the thing HTML-encoding first. Either way, both URLs are invalid once Pathologic runs.

The HTML/URL-encoding is happening on media insertion (or link insertion in CKEditor or other WYSIWYG), but that's irrelevant. If it didn't happen there, we'd have to write an output filter that did it (since these characters need to be encoded in attributes) and either way HTML attributes would end up with HTML-encoded characters when they enter Pathologic. I'm not asking for Pathologic to undo the HTML-encoding, quite the opposite: the attribute values need to be HTML-encoded at the end of processing, so Patholigic needs to output properly encoded values too. Hence my mention of adding check_url() at the end of processing, regardless of whether or not the rest of my fix is implemented.

is there any rational case in which having HTML encoded characters in a path is expected behavior?

Pathologic isn't dealing with paths at this point, but with paths *as attributes* - so any & in the path (which is valid) will at this point be HTML-encoded (except when it's URL-encoded). I think that point is the source of our miscommunication here.

So, restated, the problem is:

Ampersands are valid in URLs.
As attribute values (for src="...", etc.) they need to be encoded as & or %26 (HTML spec would seem to argue for & but that's another issue).
Image assets are a good example (on our site), which occasionally have ampersands in them. Media URL-encodes the ampersands, so the inserted IMG or whatnot has ampersands (encoded as %26) in the src attribute. CKEditor/other WYSIWYG will create links to these things with href="...&..." in them.

That's valid HTML. There's no wrongdoing on Media's part, or on the part of the content maintainers for uploading images with these characters. There's also no good fix up the chain by just REGEXPing offending characters out of URLs on creation (or out of filenames on upload), because "offending" is relative to the output context, and I'm not sure anyone could reasonably be expected to catch all of the endpoints that create URLs with these characters. Also, the characters are valid anyway, so why try to restrict their use (even if you could, which you can't)?

url() and parse_str() do not handle this because they are context-agnostic. They can't just go encoding things without knowing how they're to be used; this is proper behavior on their part. However, Pathologic is specifically geared for an HTML attribute context, so properly supporting HTML attribute values is expected behavior and should be implemented.

Ultimately, the way that these HTML/URL-encoded characters get added to the URL attribute shouldn't matter, & > < " are valid in URLs and in attribute contexts when encoded, so it'd be awesome if we could continue to use them along with Pathologic. Pathologic needs to be aware of its output context - it is dealing with finished HTML attributes; they should be expected to sometimes have encoded values from up the chain, and always they should be check_url()'d before being outputted (the docs concur. See also the comment by check_plain() in l())

Comment #9

Garrett Albright commented 10 July 2012 at 15:17

Okay, I just looked at l()'s code, and yeah, it is indeed ultimately running paths through htmlspecialchars(). Not that this suddenly makes sense to me. The fact that href attributes are URLs (and can't validly be anything else) is why it's just fine to have an unescaped ampersand in there, and if there's any encoding going on at all it should be URL encoding.

Can you find a reputable source (outside of Drupal core), though, that states otherwise? I searched around for a while and couldn't find anything suggesting that HTML encoding src or href attributes makes sense. Please prove me wrong. Otherwise I'm inclined to file a patch to Drupal core to stop l() from doing that.

Comment #10

jay.dansand commented 10 July 2012 at 17:03

The HTML 2.0 RFC (1866), specifically 3.2.1 Data Characters, formally states what PHP's htmlspecialchars() notes say: & > < should be escaped in all HTML (and XML) contexts, and " in any (not just href/src) attribute context (and any validator in strict HTML/XHTML mode will complain if they aren't.) Those characters are considered delimiters and must be encoded when their actual literal value is desired.

You can find supporting info here:
http://www.w3.org/TR/xhtml1/guidelines.html#C_12
http://dev.w3.org/html5/markup/syntax.html#syntax-text

Or, try to pass this invalid XHTML through any validator (such as W3C's):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
	<head>
		<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
		<title>Validation Fail</title>
	</head>
	<body>
		<p>A < B</p>
		<p>A & B</p>
		<p><a href="a & b.jpg">A link href fail</a></p>
		<p><a href="?a=a&b=b">Another, different fail (incomplete/bad entity reference)</a></p>
	</body>
</html>

The reason for HTML-encoding versus URL-encoding is telling a browser to go to http://google.com/search?q=blah&foo=bar is fundamentally different from telling the browser to go to http://google.com/search?q=blah%26foo=bar. The former will actually work and search for "blah", whereas the second will actually search for "blah&foo=bar" (an erroneous result).

URL-encoding in this context is really only meant for parameter values, not the separators between them. &'s in path components or as delimiters between GET parameters really shouldn't be %26. They work in the former case (though a literal & does too) and they never work in the latter case. For information about URL- versus HTML-entity-encoding, see RFC 3986, specifically 2.4 "When to Encode or Decode". Since url() will (if dealt with properly) URL-encode any GET values, then we shouldn't worry about those but instead should focus on HTML-encoding the delimiters/etc. for safe inclusion as attribute values (which is why Drupal does this in l() and why check_url() exists - so that the URL is "... encoded for output to an HTML attribute value").

Note that the RFC also says

Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

which gets back to the issue of ensuring we decode the URL before dealing with it in Pathologic, especially before passing it through url() (since that'll always end up in percent-encoding an already percent-encoded string). The same theory applies to the HTML-encoded references as well.

I'm not very good at explaining it; I hope that all answers your questions in one way or another.

Comment #11

Garrett Albright commented 12 July 2012 at 06:18

Status	File	Size
new	Socially-Awkward-Penguin-Meme-Template-Blank.jpg	87.39 KB

Okay, you win. My code is in error and has been in error since pretty much forever. I'll start working in fixes based on your suggested changes.

Well, this is embarrassing.

Comment #12

jay.dansand commented 12 July 2012 at 12:52

I can roll a patch based on beta5 to deal with both decoding issues; I didn't do a patch the first time 'round because I was waiting on any code changes from my other issue (now perfectly dealt with in beta5, thank you!)

It looks to me like the original fix posted works for the HTML-encoded bits, but after parse_str() explodes the query we need to foreach() and rawurldecode() both $key and $value. That seems like a symmetric reversal of any upstream url()/Media/WYSIWYG treatment, based on how they rawurlencode() query parameters.

Then we're done: the url() call at the end of Pathologic will re-rawurlencode() the query components, so that's handled out of the box for us; we just need to add the check_url() at the end to re-HTML-encode any remaining special characters (URL-delimiters, etc.)

Sound right? I haven't had my morning coffee yet.

Comment #13

Garrett Albright commented 12 July 2012 at 13:17

Don't bother making a patch; I've already pushed code to address this, so please grab a dev release (or check out the code with Git if you're so inclined) and give it a try if you can (and then maybe make a patch off of that if it's still broke, I guess).

I'll probably make a new release over the weekend.

Comment #14

jay.dansand commented 12 July 2012 at 14:09

Status	File	Size
new	pathologic-double-rawurlencode-fix.patch	667 bytes

Looks like URL-encoding can still be double-encoded for path pieces. I had thought query pieces would be double-encoded too, but apparently parse_str() takes care of that.

Here's the test content I've been using to break the output:

<a href="/test?foo=bar&test2=test3">test</a>
<a href="/test?foo=bar&amp;test2=test3">test</a>
<a href="/test?foo=bar&test2=test3%26and%20such">test</a>
<a href="/test?foo=bar&amp;test2=test3%26and%20such">test</a>
<a href="/test?foo=bar&test2=test3 such">test</a>
<a href="/test?foo=bar&amp;test2=test3 such">test</a>

<img alt="" class="media-image" height="270" src="/sites/default/files/foo%20%26%20bar.jpg" width="480" />

with / set as a base path, of course.

Here's a one-liner (plus comments) patch against the latest dev that fixes it for me.

Comment #15

jay.dansand commented 18 July 2012 at 13:29

Status:

Active

» Reviewed & tested by the community

Commit c966aae against 7.x-2.x-dev (as of July 16th 2012) looks good with respect to this issue. Thanks much!

Comment #16

Garrett Albright commented 27 July 2012 at 06:50

Status:

Reviewed & tested by the community

» Closed (fixed)

7.x-2.0 released! Closing this issue.

Comment #16.0

Garrett Albright commented 27 July 2012 at 06:50

Issue summary:

View changes

Changed overly-simplistic str_replace() to htmlspecialchars_decode().

Character double-encoding/erroneous rawurlencode()