Make drupal_urlencode RFC 1738-compliant

David Strauss - November 10, 2007 - 21:35
Project:Drupal
Version:5.x-dev
Component:base system
Category:bug report
Priority:normal
Assigned:David Strauss
Status:by design
Description

Replace the underlying URL-encoding function -- currently PHP's urlencode() -- with the long-available rawurlencode().

AttachmentSizeStatusTest resultOperations
rawurlencode.patch650 bytesIgnoredNoneNone

#1

David Strauss - November 10, 2007 - 21:37
Assigned to:Anonymous» David Strauss

This is a critical issue because the current URL-encoding scheme fails when encoding spaces for URLs (especially user-uploaded files) on RFC 1738-compliant web servers.

#2

David Strauss - November 10, 2007 - 21:37
Status:active» needs review

#3

David Strauss - November 10, 2007 - 22:02

This is a particularly safe patch, especially combined with how Drupal uses urldecode() and not rawurldecode(). The output from rawurlencode() is properly decoded by both urldecode() and rawurldecode(). The output from urlencode() is only decoded properly by urldecode().

This is a classic case of tightening postconditions while keeping preconditions stable.

#4

David Strauss - November 10, 2007 - 22:21

Plus signs were originally allowed unencoded in a URL:
"Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

However, the plus sign has been empirically rendered unsafe under this clause:
"Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters."

Many decoders convert the plus sign to a space. The solution is to use the unambiguous "%20" and "%2B", which rawurlencode() does.

Also, updated patch per klando's review.

AttachmentSizeStatusTest resultOperations
rawurlencode.patch720 bytesIgnoredNoneNone

#5

klando - November 10, 2007 - 22:41

looks good, but still untested here.

#6

Arancaytar - November 11, 2007 - 19:57

Patch applies cleanly, neither of the two lines changed contain a code-style problem.

Prior to patch, a test node was created with the following alias in the text field: test url+plus"quote/content
Which, database side, was converted to test url plus"quote/content
The page links to the following URL in the node's menu entry: test+url+plus%22quote/content
Manually entering the following URL will also show the page: test%20url+plus%22quote/content

After the patch, a test node with a similar alias was created. The plus characters in the form were again converted to spaces in the database.

The menu now links to test%20url%20plus%22quote/content
Manually entering the previous URL will also work: test+url+plus%22quote/content

This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there. That matter is for another patch, of course.

#7

David Strauss - November 11, 2007 - 20:48

"This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there."

Allowing plus signs in paths would create ambiguity. Many web servers will convert the plusses to spaces before Drupal receives the path. There would be no way to know whether the user had entered a real space or a plus sign.

#8

catch - November 11, 2007 - 22:29

One issue with converting plus signs to spaces, is it's impossible to put taxonomy/term/1+2+3+4 into the path alias form (and others probably) - and therefore impossible to alias urls like that unless you urlencode the pluses manually.

Since there's provision in core for paths with + signs, users should be able to put them in IMO.

#9

Arancaytar - November 12, 2007 - 13:51

"This won't break any pages, but I suggest that once it is in place, path.module should stop converting plus characters to spaces, as the ambiguity between the two is no longer there."

Allowing plus signs in paths would create ambiguity. Many web servers will convert the plusses to spaces before Drupal receives the path. There would be no way to know whether the user had entered a real space or a plus sign.

The new function escapes the plus character when generating the URL, to a non-ambiguous value of %2B. So if you actually feel like entering "+" in the alias text field (if the path module didn't replace it with " " on saving), the URL in all site links will still be non-ambiguous. The behavior of the non-compliant web-server only comes into play if a user manually accesses the URL without escaping the + character...

#10

David Strauss - November 13, 2007 - 04:59

Can someone who's not me RTBC this? This change seems quite uncontroversial.

#11

Arancaytar - November 13, 2007 - 08:36
Status:needs review» reviewed & tested by the community

In that case, the patch looks fine from what I saw in my test.

#12

Dries - November 13, 2007 - 12:35
Status:reviewed & tested by the community» fixed

Good catch, David. I've committed this to CVS HEAD. Thanks.

#13

Anonymous - November 27, 2007 - 12:43
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

#14

nasi - May 9, 2008 - 22:47

Could this patch be back-ported to 5.x as it would be equally useful in that code stream?

#15

drumm - May 26, 2008 - 03:32
Version:6.x-dev» 5.x-dev
Status:closed» patch (to be ported)

#16

drumm - May 26, 2008 - 03:35
Status:patch (to be ported)» fixed

Patch still applies against Drupal 5, committed.

#17

Anonymous (not verified) - June 9, 2008 - 03:41
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

#18

mattconnolly - July 1, 2008 - 09:46
Status:closed» needs review

It seems that not everything is being correctly double escaped for clean urls. This is particularly evident using clean urls AND private uploads AND file names with "+" (or any other symbol that is altered by rawurlencode() ).

This patch double escapes the '%' always, instead of for only the selected '#' (%23) and '&' (%26) characters as the previous code had done.

-Matt

AttachmentSizeStatusTest resultOperations
url-encode.patch634 bytesIgnoredNoneNone

#19

drumm - July 6, 2008 - 00:12
Version:5.x-dev» 7.x-dev

This needs review in the current development version.

#20

drumm - July 9, 2008 - 19:33
Version:7.x-dev» 6.x-dev

Actually, the patch in #4 was applied to Drupal HEAD and 5.x, but not 6.x. I think this needs a lot of testing before being released with 5.x or 6.x, so I am rolling it back. The double-slash encoding change happened in a 5.x maintenance release and did cause problems, so I am cautious about other changes.

Matt- I highly recommend opening a new issue for your patch.

#21

drumm - July 9, 2008 - 19:41
Version:6.x-dev» 5.x-dev
Status:needs review» patch (to be ported)

Minor correction, this change was released with 6.0 since it was committed to HEAD before Drupal 6 was branched. I still will not re-commit this patch since there have been no 5.x reviews.

#22

Gábor Hojtsy - July 9, 2008 - 19:48
Version:5.x-dev» 7.x-dev
Status:patch (to be ported)» needs review

Note that 6.x was released with this fix: http://cvs.drupal.org/viewvc.py/drupal/drupal/includes/common.inc?view=d...

Matt, your issue might apply there as well, then?

#23

Gábor Hojtsy - July 9, 2008 - 19:55
Version:7.x-dev» 5.x-dev
Status:needs review» patch (to be ported)

Cross-posted.

#24

mattconnolly - July 10, 2008 - 11:52

Yes. My issue should be applied there also.

Note: My issue was not the difference between urlencode() and rawurlencode(). It has to do with double encoding the '%' symbol when clean urls is enabled.

Because mod_rewrite decodes the url query, anything in the query needs to be double encoded. The old code, and the code in the CVS patch you showed, only double encodes '%2f', '%26' and '%23' (most commonly used in URLs). However, it should double encode ALL %, which is what my patch is about.

Actually - reading this thread again, should I start a new issue to differentiate my patch from the urlencode() vs rawulrlencode() change?

#25

drumm - July 16, 2008 - 18:41

Actually - reading this thread again, should I start a new issue to differentiate my patch from the urlencode() vs rawulrlencode() change?

Yes, you should.

#26

mattconnolly - July 20, 2008 - 08:35

drumm: Thanks, new issue created: http://drupal.org/node/284899

#27

Damien Tournoud - July 20, 2008 - 09:11
Status:patch (to be ported)» closed

Closing this issue, then.

#28

gollyg - September 11, 2008 - 12:26
Status:closed» active

Has this patch been lost in the upgrade to 5.9 and 5.10? The current version of Drupal uses urlencode() rather than rawurlencode()

#29

Damien Tournoud - September 11, 2008 - 12:48
Status:active» needs review

In #20, drumm rolled back the change [1], because he though this was not tested enough. Now that there are lots of Drupal 6 deployments with this, maybe we should reconsider.

#30

drumm - September 15, 2008 - 06:08
Priority:critical» normal

Yes, this needs through review since it affects all URL generation, and URLs are important.

#31

NancyDru - September 15, 2008 - 17:19

Having fought for a couple of weeks with an issue revolving around drupal_urlencode() and valid_url(), it finally dawned on me that there seems to be a bit of a flaw in this whole function. Urlencode() (whatever flavor you want to use) should only be done to the query portion of the URL, not the whole thing.

  $y = parse_url($url);
  $url = ($y['scheme'] ? $y['scheme'] .'://' : NULL) . $y['host'] . $y['path'] . ($y['query'] ? '&'. urlencode($y['query']) : NULL);

#32

chOP - October 24, 2008 - 05:58

I've applied the patch from #4 then #18 to an ourbrisbane.com development environment using Drupal 5.11 and tested this without seeing any problems so far. We use a fair few contrib modules as well as in-house coded modules and themes.

Can we expect this to be applied to the Drupal 5.x HEAD again?

I'm from http://www.ourbrisbane.com and have been running into caching hassles with the use of '+' or '%2B' instead of '%20' for spaces in URLs.

#33

mattconnolly - November 4, 2008 - 00:27

chop: the issue you are having is not to do with urlencode vs rawurlencode. It has to do with the fact that the encoding effectively needs to be done twice when using clean urls.

See this issue: http://drupal.org/node/284899

#34

chOP - November 4, 2008 - 02:24

Thanks @mattconnolly. I see how full and proper double encoding, when clean urls are enabled, should fix the problem. The use of rawurlencode is also an improvement.

Will the patch for the issue at http://drupal.org/node/284899 be backported to Drupal 5.x?

#35

esteewhy - August 5, 2009 - 13:21

Another compliance-related sub-issue:

According to RFC1738:

The characters "abcdef" may also be used in hexadecimal encodings.

Also, there exists at least one PHP implementation (Phalanger for .Net), that has a rawurlencode function which encodes characters in lowercase hexadecimals. (For example, / (forward slash) is encoded as %2f, not as %2F expected by Drupal.

This behavior at least breaks activemenu.module functionality, which is then unable to prepare a correct POST request to get a sub-menu content asynchronously.

The proposed fix is to replace str_replace function call by a case-insensitive str_ireplace.

#36

drumm - August 8, 2009 - 03:07
Status:needs review» by design

#284899: Drupal url problem with clean urls will probably not be backported, it is simply too big of a change.

I don't think I'll commit this for Drupal 5 since this is way too likely to cause subtle issues and nothing seems particularly broken.

 
 

Drupal is a registered trademark of Dries Buytaert.