Replace the underlying URL-encoding function -- currently PHP's urlencode() -- with the long-available rawurlencode().

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

David Strauss’s picture

Assigned: Unassigned » David Strauss

This is a critical issue because the current URL-encoding scheme fails when encoding spaces for URLs (especially user-uploaded files) on RFC 1738-compliant web servers.

David Strauss’s picture

Status: Active » Needs review
David Strauss’s picture

This is a particularly safe patch, especially combined with how Drupal uses urldecode() and not rawurldecode(). The output from rawurlencode() is properly decoded by both urldecode() and rawurldecode(). The output from urlencode() is only decoded properly by urldecode().

This is a classic case of tightening postconditions while keeping preconditions stable.

David Strauss’s picture

FileSize
720 bytes

Plus signs were originally allowed unencoded in a URL:
"Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

However, the plus sign has been empirically rendered unsafe under this clause:
"Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters."

Many decoders convert the plus sign to a space. The solution is to use the unambiguous "%20" and "%2B", which rawurlencode() does.

Also, updated patch per klando's review.

klando’s picture

looks good, but still untested here.

cburschka’s picture

Patch applies cleanly, neither of the two lines changed contain a code-style problem.

Prior to patch, a test node was created with the following alias in the text field: test url+plus"quote/content
Which, database side, was converted to test url plus"quote/content
The page links to the following URL in the node's menu entry: test+url+plus%22quote/content
Manually entering the following URL will also show the page: test%20url+plus%22quote/content

After the patch, a test node with a similar alias was created. The plus characters in the form were again converted to spaces in the database.

The menu now links to test%20url%20plus%22quote/content
Manually entering the previous URL will also work: test+url+plus%22quote/content

This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there. That matter is for another patch, of course.

David Strauss’s picture

"This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there."

Allowing plus signs in paths would create ambiguity. Many web servers will convert the plusses to spaces before Drupal receives the path. There would be no way to know whether the user had entered a real space or a plus sign.

catch’s picture

One issue with converting plus signs to spaces, is it's impossible to put taxonomy/term/1+2+3+4 into the path alias form (and others probably) - and therefore impossible to alias urls like that unless you urlencode the pluses manually.

Since there's provision in core for paths with + signs, users should be able to put them in IMO.

cburschka’s picture

"This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there."

Allowing plus signs in paths would create ambiguity. Many web servers will convert the plusses to spaces before Drupal receives the path. There would be no way to know whether the user had entered a real space or a plus sign.

The new function escapes the plus character when generating the URL, to a non-ambiguous value of %2B. So if you actually feel like entering "+" in the alias text field (if the path module didn't replace it with " " on saving), the URL in all site links will still be non-ambiguous. The behavior of the non-compliant web-server only comes into play if a user manually accesses the URL without escaping the + character...

David Strauss’s picture

Can someone who's not me RTBC this? This change seems quite uncontroversial.

cburschka’s picture

Status: Needs review » Reviewed & tested by the community

In that case, the patch looks fine from what I saw in my test.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Good catch, David. I've committed this to CVS HEAD. Thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

nasi’s picture

Could this patch be back-ported to 5.x as it would be equally useful in that code stream?

drumm’s picture

Version: 6.x-dev » 5.x-dev
Status: Closed (fixed) » Patch (to be ported)
drumm’s picture

Status: Patch (to be ported) » Fixed

Patch still applies against Drupal 5, committed.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

mattconnolly’s picture

Status: Closed (fixed) » Needs review
FileSize
634 bytes

It seems that not everything is being correctly double escaped for clean urls. This is particularly evident using clean urls AND private uploads AND file names with "+" (or any other symbol that is altered by rawurlencode() ).

This patch double escapes the '%' always, instead of for only the selected '#' (%23) and '&' (%26) characters as the previous code had done.

-Matt

drumm’s picture

Version: 5.x-dev » 7.x-dev

This needs review in the current development version.

drumm’s picture

Version: 7.x-dev » 6.x-dev

Actually, the patch in #4 was applied to Drupal HEAD and 5.x, but not 6.x. I think this needs a lot of testing before being released with 5.x or 6.x, so I am rolling it back. The double-slash encoding change happened in a 5.x maintenance release and did cause problems, so I am cautious about other changes.

Matt- I highly recommend opening a new issue for your patch.

drumm’s picture

Version: 6.x-dev » 5.x-dev
Status: Needs review » Patch (to be ported)

Minor correction, this change was released with 6.0 since it was committed to HEAD before Drupal 6 was branched. I still will not re-commit this patch since there have been no 5.x reviews.

Gábor Hojtsy’s picture

Version: 5.x-dev » 7.x-dev
Status: Patch (to be ported) » Needs review

Note that 6.x was released with this fix: http://cvs.drupal.org/viewvc.py/drupal/drupal/includes/common.inc?view=d...

Matt, your issue might apply there as well, then?

Gábor Hojtsy’s picture

Version: 7.x-dev » 5.x-dev
Status: Needs review » Patch (to be ported)

Cross-posted.

mattconnolly’s picture

Yes. My issue should be applied there also.

Note: My issue was not the difference between urlencode() and rawurlencode(). It has to do with double encoding the '%' symbol when clean urls is enabled.

Because mod_rewrite decodes the url query, anything in the query needs to be double encoded. The old code, and the code in the CVS patch you showed, only double encodes '%2f', '%26' and '%23' (most commonly used in URLs). However, it should double encode ALL %, which is what my patch is about.

Actually - reading this thread again, should I start a new issue to differentiate my patch from the urlencode() vs rawulrlencode() change?

drumm’s picture

Actually - reading this thread again, should I start a new issue to differentiate my patch from the urlencode() vs rawulrlencode() change?

Yes, you should.

mattconnolly’s picture

drumm: Thanks, new issue created: http://drupal.org/node/284899

Damien Tournoud’s picture

Status: Patch (to be ported) » Closed (fixed)

Closing this issue, then.

gollyg’s picture

Status: Closed (fixed) » Active

Has this patch been lost in the upgrade to 5.9 and 5.10? The current version of Drupal uses urlencode() rather than rawurlencode()

Damien Tournoud’s picture

Status: Active » Needs review

In #20, drumm rolled back the change [1], because he though this was not tested enough. Now that there are lots of Drupal 6 deployments with this, maybe we should reconsider.

drumm’s picture

Priority: Critical » Normal

Yes, this needs through review since it affects all URL generation, and URLs are important.

NancyDru’s picture

Having fought for a couple of weeks with an issue revolving around drupal_urlencode() and valid_url(), it finally dawned on me that there seems to be a bit of a flaw in this whole function. Urlencode() (whatever flavor you want to use) should only be done to the query portion of the URL, not the whole thing.

  $y = parse_url($url);
  $url = ($y['scheme'] ? $y['scheme'] .'://' : NULL) . $y['host'] . $y['path'] . ($y['query'] ? '&'. urlencode($y['query']) : NULL);
chOP’s picture

I've applied the patch from #4 then #18 to an ourbrisbane.com development environment using Drupal 5.11 and tested this without seeing any problems so far. We use a fair few contrib modules as well as in-house coded modules and themes.

Can we expect this to be applied to the Drupal 5.x HEAD again?

I'm from http://www.ourbrisbane.com and have been running into caching hassles with the use of '+' or '%2B' instead of '%20' for spaces in URLs.

mattconnolly’s picture

chop: the issue you are having is not to do with urlencode vs rawurlencode. It has to do with the fact that the encoding effectively needs to be done twice when using clean urls.

See this issue: http://drupal.org/node/284899

chOP’s picture

Thanks @mattconnolly. I see how full and proper double encoding, when clean urls are enabled, should fix the problem. The use of rawurlencode is also an improvement.

Will the patch for the issue at http://drupal.org/node/284899 be backported to Drupal 5.x?

esteewhy’s picture

Another compliance-related sub-issue:

According to RFC1738:

The characters "abcdef" may also be used in hexadecimal encodings.

Also, there exists at least one PHP implementation (Phalanger for .Net), that has a rawurlencode function which encodes characters in lowercase hexadecimals. (For example, / (forward slash) is encoded as %2f, not as %2F expected by Drupal.

This behavior at least breaks activemenu.module functionality, which is then unable to prepare a correct POST request to get a sub-menu content asynchronously.

The proposed fix is to replace str_replace function call by a case-insensitive str_ireplace.

drumm’s picture

Status: Needs review » Closed (works as designed)

#284899: Drupal url problem with clean urls will probably not be backported, it is simply too big of a change.

I don't think I'll commit this for Drupal 5 since this is way too likely to cause subtle issues and nothing seems particularly broken.