Since Pathologic runs as the last filter, other sanitizing filters may have run before Pathologic, such as replacing "&" with "&" in HTML attributes. Additionally, WYSIWYG editors automatically perform this replacement (e.g. CKEditor and inserted links, image sources).
When URLs are passed through Pathologic (and subsequently run through url()), both paths and query parameters are ultimately run through rawurlencode(), turning a valid (htmlspecialchars()-escaped) URL containing e.g. "&" to "&%3B" (the semicolon is URL-encoded as %3B). The situation is exacerbated for query parameters due to how parse_str() splits the query array and url() recombines it:
$parts = array('query' => 'foo=bar&bar=baz');
parse_str($parts['query'], $parts['qparts']);
// Because parse_str() simply explodes on "&", $parts['qparts'] becomes:
// Array
// (
// [foo] => bar
// [amp;bar] => baz
// )
The fix is (inside _pathologic_replace()):
// Now parse the URL
--- $parts = parse_url($matches[2]);
+++ $parts = parse_url(htmlspecialchars_decode($matches[2]));
Now that $parts['query'] no longer has a key of "amp;bar", when url() recombines the query string, all items are simply implode('&', $query)'d with both keys and values rawurlencode()'d (note: not safe for attribute values!):
This is no good for href/src/etc. attribute values:
?foo=bar&bar=baz
So, add check_url() on the final output (which should be done anyway, since we're always rewriting the direct HTML attribute value, and url() or something in hook_url_outbound_alter() may have returned an unfit string for an attribute value):
+++ // Query parameters OR hook_url_outbound_alter() may have made this URL unsafe
+++ // for the attribute locations we are putting it in (href|src|action|longdesc),
+++ // so run check_url() against it:
+++ $url = check_url($url);
+++
// $matches[1] will be the tag attribute; src, href, etc.
return "{$matches[1]}=\"{$url}";
Arguably, the check_url() should be added to the code regardless of the double-encode fix, since we're outputting a modified/rebuilt attribute value and nothing else is sanitizing it for us.
| Comment | File | Size | Author |
|---|---|---|---|
| #14 | pathologic-double-rawurlencode-fix.patch | 667 bytes | jay.dansand |
| #11 | Socially-Awkward-Penguin-Meme-Template-Blank.jpg | 87.39 KB | Garrett Albright |
| #6 | facepalm.jpg | 23.54 KB | jay.dansand |
| #5 | epic-jackie-chan-template.png | 31.72 KB | Garrett Albright |
Comments
Comment #1
Garrett Albright commentedHmm. Is there a legitimate case where characters in paths would be HTML-encoded before hitting Pathologic? I'm pretty sure that's your main problem here.
Comment #2
jay.dansand commentedSure, for instance on our site there are a few images with "&" in the name. It's a valid (though very annoying) character, and according to the RFCs it just needs to be escaped as per normal href/src/etc. attribute escaping. The Media module escapes the URL fine for the IMG's src attribute, but Pathologic deconstructs it and after url() runs, it outputs a mixed HTML/URI-encoded src attribute that (for obvious reasons) will never load.
I could put a filter in Media somewhere to remove non-word (i.e. [A-Za-z0-9_]) characters from filenames, but that seems an unwarranted band-aid (and difficult, since each Media plugin may need its own patch, such as Media Browser Plus which handles its own uploads); the characters are valid and there are plenty of corner cases where they make a lot of sense to keep.
On a separate note, I really think check_url() should be run anyway (even if nothing else about my fix is implemented): it's entirely possible url() returns a string unfit for an attribute value - that's okay (and intentional), because url() can't know its output context (sometimes the output isn't meant for an HTML attribute value). But, since we're only rewriting attributes with Pathologic, we know our context and the onus is on us to properly sanitize it.
Comment #3
Garrett Albright commentedI'm still confused. You say the Media module URL encodes URLs, which is fine, but your problem in the OP seems to stem from the & character being HTML encoded.
Comment #4
jay.dansand commentedSorry for the ambiguity... it HTML-encodes them, so you end up with "&"
Comment #5
Garrett Albright commentedComment #6
jay.dansand commentedThe Internet is hard :( Luckily, this problem isn't too difficult to solve. htmlspecialchars_decode() above and check_url() below the parse_str() and url() methods and we're done. We don't need to worry about anything else for proper attribute values, unless of course some stupid editor html_entities() stuff that should be left as UTF-8 like é or something. We do have some faculty (and courses) with e.g. ø in their name, but in that case, there's only one answer...
Comment #7
Garrett Albright commentedLook, by your own admission, the problem is that the paths in your content are getting HTML encoded somehow, right? So your solution is to have me change the code of my module to undo it, rather than figure out what's causing it and stopping it from happening in the first place on your site?
In other words, I ask again; is there any rational case in which having HTML encoded characters in a path is expected behavior?
Comment #8
jay.dansand commentedWhoa hey, I think we've had a misunderstanding. We're all friends here, I'm just trying to help push up an update that fixes things on my site and makes the module more HTML compliant. I'm not meaning to step on toes; if this were Joomla, I'd just fix the problem and keep the patch to myself, but hey, we're a community here, and I thought that's how the Drupal community is meant to work. If I'm wrong, no problem, I'll just bugger off.
I do need to post one correction: Media is (most of the time) URL-encoding the link, which means it gets double-URL-encoded after secondary processing through url(), not HTML-encoded and then URL-encoded. CKEditor's link inserter is the thing HTML-encoding first. Either way, both URLs are invalid once Pathologic runs.
The HTML/URL-encoding is happening on media insertion (or link insertion in CKEditor or other WYSIWYG), but that's irrelevant. If it didn't happen there, we'd have to write an output filter that did it (since these characters need to be encoded in attributes) and either way HTML attributes would end up with HTML-encoded characters when they enter Pathologic. I'm not asking for Pathologic to undo the HTML-encoding, quite the opposite: the attribute values need to be HTML-encoded at the end of processing, so Patholigic needs to output properly encoded values too. Hence my mention of adding check_url() at the end of processing, regardless of whether or not the rest of my fix is implemented.
Pathologic isn't dealing with paths at this point, but with paths *as attributes* - so any & in the path (which is valid) will at this point be HTML-encoded (except when it's URL-encoded). I think that point is the source of our miscommunication here.
So, restated, the problem is:
That's valid HTML. There's no wrongdoing on Media's part, or on the part of the content maintainers for uploading images with these characters. There's also no good fix up the chain by just REGEXPing offending characters out of URLs on creation (or out of filenames on upload), because "offending" is relative to the output context, and I'm not sure anyone could reasonably be expected to catch all of the endpoints that create URLs with these characters. Also, the characters are valid anyway, so why try to restrict their use (even if you could, which you can't)?
url() and parse_str() do not handle this because they are context-agnostic. They can't just go encoding things without knowing how they're to be used; this is proper behavior on their part. However, Pathologic is specifically geared for an HTML attribute context, so properly supporting HTML attribute values is expected behavior and should be implemented.
Ultimately, the way that these HTML/URL-encoded characters get added to the URL attribute shouldn't matter, & > < " are valid in URLs and in attribute contexts when encoded, so it'd be awesome if we could continue to use them along with Pathologic. Pathologic needs to be aware of its output context - it is dealing with finished HTML attributes; they should be expected to sometimes have encoded values from up the chain, and always they should be check_url()'d before being outputted (the docs concur. See also the comment by check_plain() in l())
Comment #9
Garrett Albright commentedOkay, I just looked at l()'s code, and yeah, it is indeed ultimately running paths through htmlspecialchars(). Not that this suddenly makes sense to me. The fact that href attributes are URLs (and can't validly be anything else) is why it's just fine to have an unescaped ampersand in there, and if there's any encoding going on at all it should be URL encoding.
Can you find a reputable source (outside of Drupal core), though, that states otherwise? I searched around for a while and couldn't find anything suggesting that HTML encoding src or href attributes makes sense. Please prove me wrong. Otherwise I'm inclined to file a patch to Drupal core to stop l() from doing that.
Comment #10
jay.dansand commentedThe HTML 2.0 RFC (1866), specifically 3.2.1 Data Characters, formally states what PHP's htmlspecialchars() notes say: & > < should be escaped in all HTML (and XML) contexts, and " in any (not just href/src) attribute context (and any validator in strict HTML/XHTML mode will complain if they aren't.) Those characters are considered delimiters and must be encoded when their actual literal value is desired.
You can find supporting info here:
http://www.w3.org/TR/xhtml1/guidelines.html#C_12
http://dev.w3.org/html5/markup/syntax.html#syntax-text
Or, try to pass this invalid XHTML through any validator (such as W3C's):
The reason for HTML-encoding versus URL-encoding is telling a browser to go to http://google.com/search?q=blah&foo=bar is fundamentally different from telling the browser to go to http://google.com/search?q=blah%26foo=bar. The former will actually work and search for "blah", whereas the second will actually search for "blah&foo=bar" (an erroneous result).
URL-encoding in this context is really only meant for parameter values, not the separators between them. &'s in path components or as delimiters between GET parameters really shouldn't be %26. They work in the former case (though a literal & does too) and they never work in the latter case. For information about URL- versus HTML-entity-encoding, see RFC 3986, specifically 2.4 "When to Encode or Decode". Since url() will (if dealt with properly) URL-encode any GET values, then we shouldn't worry about those but instead should focus on HTML-encoding the delimiters/etc. for safe inclusion as attribute values (which is why Drupal does this in l() and why check_url() exists - so that the URL is "... encoded for output to an HTML attribute value").
Note that the RFC also says
which gets back to the issue of ensuring we decode the URL before dealing with it in Pathologic, especially before passing it through url() (since that'll always end up in percent-encoding an already percent-encoded string). The same theory applies to the HTML-encoded references as well.
I'm not very good at explaining it; I hope that all answers your questions in one way or another.
Comment #11
Garrett Albright commentedOkay, you win. My code is in error and has been in error since pretty much forever. I'll start working in fixes based on your suggested changes.
Well, this is embarrassing.
Comment #12
jay.dansand commentedI can roll a patch based on beta5 to deal with both decoding issues; I didn't do a patch the first time 'round because I was waiting on any code changes from my other issue (now perfectly dealt with in beta5, thank you!)
It looks to me like the original fix posted works for the HTML-encoded bits, but after parse_str() explodes the query we need to foreach() and rawurldecode() both $key and $value. That seems like a symmetric reversal of any upstream url()/Media/WYSIWYG treatment, based on how they rawurlencode() query parameters.
Then we're done: the url() call at the end of Pathologic will re-rawurlencode() the query components, so that's handled out of the box for us; we just need to add the check_url() at the end to re-HTML-encode any remaining special characters (URL-delimiters, etc.)
Sound right? I haven't had my morning coffee yet.
Comment #13
Garrett Albright commentedDon't bother making a patch; I've already pushed code to address this, so please grab a dev release (or check out the code with Git if you're so inclined) and give it a try if you can (and then maybe make a patch off of that if it's still broke, I guess).
I'll probably make a new release over the weekend.
Comment #14
jay.dansand commentedLooks like URL-encoding can still be double-encoded for path pieces. I had thought query pieces would be double-encoded too, but apparently parse_str() takes care of that.
Here's the test content I've been using to break the output:
with
/set as a base path, of course.Here's a one-liner (plus comments) patch against the latest dev that fixes it for me.
Comment #15
jay.dansand commentedCommit c966aae against 7.x-2.x-dev (as of July 16th 2012) looks good with respect to this issue. Thanks much!
Comment #16
Garrett Albright commented7.x-2.0 released! Closing this issue.
Comment #16.0
Garrett Albright commentedChanged overly-simplistic str_replace() to htmlspecialchars_decode().