Handle ampersands in search queries and other URLs when clean URLs are on

drumm - June 13, 2006 - 22:17
Project:Drupal
Version:4.7.2
Component:search.module
Category:bug report
Priority:critical
Assigned:drumm
Status:by design
Description

When you do a search containing '&', it is encoded as '%26' in the URL. mod_rewrite decodes this and puts anything after '&' in the query string. This patch encodes the '%' to '%25', giving us '%2526', a double-encoded ampersand.

Makes searches and other paths containing '&' work.

AttachmentSizeStatusTest resultOperations
common.inc.diff.txt783 bytesIgnoredNoneNone

#1

Steven - June 14, 2006 - 06:28

The same problem exists for #. I was going to look into it, but I'd much rather prefer a fix in the rewrite rules rather than producing such ugly URLs. Unfortunately it might be hardcoded behaviour.

Shouldn't the comment say: "... to counter-act the extra decoding in mod_rewrite." ?

#2

drumm - June 21, 2006 - 04:10

I haven't looked at #. This is meant to solve &.

I spent some time reading over mod_rewrite rules and am fairly certain that this is not fixable within the confines of .htaccess. Here is the relevant bug over on Apache's side: http://issues.apache.org/bugzilla/show_bug.cgi?id=32328#c8 (doesn't seem too likely to be fixed.)

#3

Steven - June 21, 2006 - 13:34

You're right, it cannot be fixed in .htaccess. However, the patch on that Apache issue you linked to would not help. There is already a suitable RewriteMap to use to undo most of the damage (escape), but in order to use it, you need a global RewriteMap directive in httpd.conf (not even per VirtualHost or Directory). That rules it out for us:

# httpd.conf
RewriteMap escape int:escape

# .htaccess
RewriteRule ^(.*)$ index.php?q=${escape:$1} [L,QSA]

I updated that Apache Bugzilla report all the same. It's another shining example of why blindly applying text transformation functions is the best way to screw yourself over.

So, I did a test and aside from & and #, no other characters need to be escaped (*). Patch attached. I'm not sure why you had that odd (bool) cast and TRUE check. Isn't that what if ($var) implicitly does? I also merged the second str_replace into the first.

(*) For some values of 'no'.

AttachmentSizeStatusTest resultOperations
mod_rewrite.sucks.patch1.16 KBIgnoredNoneNone

#4

drumm - June 22, 2006 - 08:57
Status:needs review» reviewed & tested by the community

Looks okay to me. I was copying the clean URL test from earlier in the code, which I thought was a bit weird, but decided to be consistent with it. This is good too.

#5

Dries - June 22, 2006 - 12:30

<?php
+ * - To avoid problems with mod_rewrites built-in unescaping, we double-escape
+ *   ampersands and hashes, when clean URLs are used.
?>

Could we clarify 'problems' in the PHPdoc. It would be nice to be a little bit more specific (not verbose).

#6

drumm - June 23, 2006 - 08:48

How are these comments?

AttachmentSizeStatusTest resultOperations
common.inc.diff_0.txt1.28 KBIgnoredNoneNone

#7

drumm - June 23, 2006 - 22:32

Tested on a 4.7 install and works well.

#8

Dries - June 25, 2006 - 15:03

drumm: that's more clear, thanks.

I think "mod_rewrite's unescapes" should be "mod_rewrite unescapes" though (no "'s").

Feel free to commit.

#9

Dries - June 25, 2006 - 15:04

(I wonder how this affect IIS or Lighttpd, but I guess we'll figure that out ...)

#10

drumm - July 2, 2006 - 01:20
Status:reviewed & tested by the community» fixed

Committed to HEAD.

#11

Anonymous - July 16, 2006 - 01:34
Status:fixed» closed

#12

Creazion - July 22, 2006 - 12:40
Version:<none>» 4.7.2
Status:closed» needs review

Hi,

why so much code, the patch at the attachment does the same with less lines?

AttachmentSizeStatusTest resultOperations
common.inc_27.patch439 bytesIgnoredNoneNone

#13

Dries - July 22, 2006 - 15:56
Category:bug report» task
Status:needs review» fixed

Creazion: your patch looks incomplete. You missed the '#'.

#14

Creazion - July 22, 2006 - 20:07
Status:fixed» needs review

Hi Dries,

sorry i made a wrong patch with diff on windows. The attached patch includes the '#'.

AttachmentSizeStatusTest resultOperations
common.inc.2.patch421 bytesIgnoredNoneNone

#15

drumm - July 22, 2006 - 21:24
Status:needs review» needs work

- This misses ampersand encoding, which has the problem as #.
- The extra encoding should only happen when clean urls are on since this is a workaround for a bug in mod_rewrite.
- This needs to be a patch against HEAD.

#16

Steven - July 22, 2006 - 21:32
Status:needs work» won't fix

Creazion: your patch does not do the same as what was committed. The goal is to double encode ampersands and hashes. Yours does not encode them at all.

#17

killes@www.drop.org - August 6, 2006 - 18:05
Status:won't fix» fixed

drumm's patch was also committed to 4.7

#18

Anonymous - August 20, 2006 - 18:15
Status:fixed» closed

#19

onionweb - August 21, 2006 - 15:58
Version:4.7.2» 4.7.3
Status:closed» active

Shouldn't this have looked more like:

function drupal_urlencode($text) {
if (variable_get('clean_url', '0')) {
return str_replace(array('%2F', '%26', '%23'),
array('/', '%2526', '%2523'),
urlencode($text));
}
else {
$text = str_replace('%2F', '/', urlencode($text));
return str_replace('%23', '#', urlencode($text));

}
}

Otherwise all the #'s are '%23, such as in the "login or register to post comments" links when clean urls is enabled.

#20

onionweb - August 21, 2006 - 16:16

hmm... well that bit I posted above doesn't actually fix the problem, but on drupal.org, when you login from a link below a post, you get page not found because of the %23.

#21

onionweb - August 22, 2006 - 11:30
Category:task» bug report

changed this from task to bug since this committed patch created a bug in the login.

#22

AjK - September 18, 2006 - 10:55
Priority:normal» critical

Raising awareness (critical) for this is it's easily reproduced on d.o. and clearly broken.

#23

NaX - September 28, 2006 - 16:16

You guys are forgetting about "Named Anchors" double-encoded # mean you cant have a Named Anchor as a menu item.

I think when the requested page is the search page then double encode # but every where else leave #

<?php
function drupal_urlencode($text) {
 
$hash = (arg(0) == 'search' ? '%2523' : '#');
  if (
variable_get('clean_url', '0')) {
    return
str_replace(array('%2F', '%26', '%23'),
                       array(
'/', '%2526', $hash),
                      
urlencode($text));
  }
  else {
    return
str_replace('%2F', '/', urlencode($text));
  }
}
?>

The problem with this solution is that any Named Anchor menu links breaks on the search page and when clicked on goes to page not found. Maybe a better solution would be to look at replacing # with double-encoded # somehow only when it comes from the search form id=search_form and form id=search or in the search_menu function ('path' => 'search/'. $name . $keys).

#24

Steven - September 29, 2006 - 15:48

In 5.0, this code has been changed to:

<?php
function drupal_urlencode($text) {
  if (
variable_get('clean_url', '0')) {
    return
str_replace(array('%2F', '%26', '%23'),
                       array(
'/', '%2526', '%2523'),
                      
urlencode($text));
  }
  else {
    return
str_replace('%2F', '/', urlencode($text));
  }
}
?>

Does this need to be backported? Is there a problem still? Are we sure this is not just a case of some places lacking a call to drupal_urlencode() ?

#25

NaX - October 8, 2006 - 22:09
Status:active» needs work

This patch makes it possible to both search for special characters (#,&) and allow named anchors in menu items (node/*#name). It is not very elegantly implemented and I suggest somebody that understands the workings of the core modules better than me look to implement it in a better way. But you can get the general idea of what I was trying to achieve.

This is the first patch I have created that involves multiple files, hope I did it correctly.
I hope you find it useful.

AttachmentSizeStatusTest resultOperations
url_special_char.patch2.45 KBIgnoredNoneNone

#26

Steven - October 10, 2006 - 05:23
Status:needs work» active

This patch is a completely wrong approach to solving this issue. Search.module should receive no special treatment whatsoever.

Again: is there still a bug that needs to be addressed in the latest 4.7 release?

#27

pwolanin - October 11, 2006 - 23:56

Yes, this is an issue and can be seen now on drupal.org if you log out. The "login to comment" links look like:

http://drupal.org/user/login?destination=comment/reply/88728%2523comment...

Where the '#' has been encoded

#28

NaX - October 12, 2006 - 17:31

Instead of trying to change the drupal_urlencode function maybe we should be focusing on the search module as it seams that is where the problem lies when it comes to special characters.

Here is another go.

<?php
function drupal_urlencode($text) {
 
$search = array('%2F', '%23');
 
$replace = array('/', '#');
  if (
variable_get('clean_url', '0')) {
    
$search[] = '%26';
    
$replace[] ='%2526';
  }
  return
str_replace($search, $replace, urlencode($text));
}
?>

But all these solutions means their is code in the drupal_urlencode function just for the search module maybe we should focus on the data the search module receives rather than all the current solutions.

<?php
function drupal_urlencode($text) {
  return
str_replace(array('%2F', '%23'), array('/', '#'), urlencode($text));
}
?>

#29

Steven - October 14, 2006 - 04:27
Status:active» by design

Did you bother to actually take a look at the URL in question?

http://drupal.org/user/login?destination=comment/reply/88728%2523comment...

This means: We are on the "user/login" page. After we finish logging in, we want to proceed to "comment/reply/88728#comment...". In other words, the #comment fragment identifier is part of the destination value, not part of the normal URL. If it was not escaped, it would be ignored by PHP.

By design.

#30

valiant - April 22, 2007 - 05:12
Title:Handle ampersands in search queries and other URLs when clean URLs are on» Use THE_REQUEST

see:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32328#c12

If you need the unescaped uri with all its consequences, use the ENV
THE_REQUEST, which contains the full untouched request string like
GET /foo%20bar?foo=bar HTTP/1.1

That works for us in Gallery 2. We're using THE_REQUEST for a long time now, with success.

e.g.

    RewriteCond %{THE_REQUEST} /gallery2/tag/([^?/]+)
    RewriteRule .   /gallery2/main.php?g2_view=tags.VirtualAlbum&g2_tagName=%1   [QSA,L]

#31

asimmonds - April 22, 2007 - 05:41
Title:Use THE_REQUEST» Handle ampersands in search queries and other URLs when clean URLs are on

Reverting title

 
 

Drupal is a registered trademark of Dries Buytaert.