Download & Extend

valid_url() doesn't accept URLs with @ in it

Project:Drupal core
Version:7.x-dev
Component:comment.module
Category:support request
Priority:normal
Assigned:Unassigned
Status:needs work

Issue Summary

Using a URL such as this for the homepage in a comment form
http://www.flickr.com/photos/16637594@N00/

and you get this error message:
"The URL of your homepage is not valid. Remember that it must be fully qualified, i.e. of the form http://example.com/directory"

However, if i remove the @ symbol, the URL is accepted (but wrong)

Comments

#1

Title:Homepage field doesn't accept URLs with @ in it» valid_url() doesn't accept URLs with @ in it
Version:4.6.5» x.y.z

can verify this.

#2

patch to common.inc to include the @ sign

as in, replace

$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';

with

$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';
AttachmentSizeStatusTest resultOperations
common_14.patch448 bytesIgnored: Check issue status.NoneNone

#3

Just wanted to point that adding the atsign, as it is the current regexp used for URL validation, might be dangerous, or at least opens up something that it was not possible.

The atsign, if entered just after the protocol (ie. part of the domain) is used to specify username:password. Sometimes this is used to fool people with crafted URLs such as:

http://www.example.com\use-a-long-enough-string-here@www.badguy.com/malware.php

#4

Priority:normal» minor

I am changing the priority to Minor.

According to http://www.ietf.org/rfc/rfc1738.txt the atsign is a reserved character.

To embed the atsign in a url, a percent sign followed by the hexadecimal code should be used. Infact this URL coding standard should be enforced/required for all special characters.

So

could be replaced by

In my opinion the code, as is _without_ patch works by design and should _not_ be patched to resolve this issue.

However, the same RFC points out that there is more than a couple of potocols. The list includes

ftp
File Transfer protocol
http
Hypertext Transfer Protocol
gopher
The Gopher protocol
mailto
Electronic mail address
news
USENET news
nntp
USENET news using NNTP access
telnet
Reference to interactive sessions
wais
Wide Area Information Servers
file
Host-specific file names
prospero
Prospero Directory Service

and that these protocols normally have the following schemes

  • //<user>:<password>@<host>:<port>/<url-path>

To be more correct the valid_url should be compatible with this scheme. As a practical example adding mailto support would allow the weblink module to maintain e-mail addresses.

Paddy.

#5

I have developed on the valid_url to be more compliant with the IETF RFC (http://www.ietf.org/rfc/rfc1738.txt).

The code is as follows

function valid_url($url, $absolute = FALSE) {
  $allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';

  $lowalpha = 'a-z';
  $hialpha  = 'A-Z';
  $alpha    = $lowalpha . $hialpha;
  $digit    = '0-9';
  $alphadigit = $alpha . $digit;
  $safe     = '\$\-\_\.\+';
  $extra    = '\!\*\'\(\)\,';
  $national = '\{\}\|\\\^\~\[\]\`';
  $punctuation = '\<\>\#\%\"';

  $reserved = '\;\/\?\:\@\&\=';
  $hex      = $digit . 'A-Fa-f';
  $escape   = '\%[' . $hex . ']{2}';
 
  $unreserved = $alpha . $digit . $safe . $extra;
  $uchar    = $unreserved . $escape;
  $xchar    = $unreserved . $reserved . $escape;
 
  $user     = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
  $password = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
  $login    = '(' . $user . '(\:' . $password . ')?\@)?';

  $toplabel = '([' . $alpha . ']|([' . $alpha . '][' . $alphadigit . '\-]*[' . $alphadigit .']))';
  $domainlabel = '([' . $alphadigit . ']|([' . $alphadigit . ']([' . $alphadigit . '\-])*[' . $alphadigit . ']))';
  $hostname = '(' . $domainlabel . '\.)*' . $toplabel;
  $port     = '[' . $digit . ']+';
  $hostnumber = '[' . $digit . ']{1,3}' . '(\.[' . $digit . ']{1,3}){3}';
  $host     = '(' . $hostname . '|' . $hostnumber . ')';
  $hostport = $host . '(\:' . $port . ')?';
 
  $search   = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
  $hsegment = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
  $hpath    = $hsegment . '(\/' . $hsegment . ')*' . '(\?' . $search . ')?';
  $proper_url = $login . $hostport . '(\/' . $hpath . ')?';

  $protocols = '(http|https|ftp):\/\/';

  if ($absolute) {
    watchdog('php', preg_match("/^(http|https|ftp):\/\/". $proper_url ."$/i", $url) ? 'Matched' : 'Oops', 0);
    return preg_match('/^' . $protocols . $proper_url. '$/i', $url);
  }
  else {
    return preg_match('/^'. $hpath .'$/i', $url);
  }
}

The resultant regular expression is

(([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+(\:([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+)?\@)?((([a-zA-Z0-9]|([a-zA-Z0-9]([a-zA-Z0-9\-])*[a-zA-Z0-9]))\.)*([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9]))|[0-9]{1,3}(\.[0-9]{1,3}){3})(\:[0-9]+)?(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*)*(\?([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))+)?)?

so I would be loath to simplify the code.

It works for the test cases

I will continue to test and debug.

I will provide a patch when thouroughly tested and working as expected.

Paddy.

#6

Sorry - please remove the call to watchdog().

#7

Status:active» needs review

But we have the code for this already... no need to reinvent the wheel.

AttachmentSizeStatusTest resultOperations
valid_url.patch1.11 KBIgnored: Check issue status.NoneNone

#8

Changed $path to $url in valid_url and patch works againts HEAD.

Both:

http://www.flickr.com/photos/16637594@N00/
and
http://www.flickr.com/photos/16637594%40N00/

are accepted as valid_url now.

AttachmentSizeStatusTest resultOperations
valid_url_2.patch1.21 KBIgnored: Check issue status.NoneNone

#9

Status:needs review» reviewed & tested by the community

then it's RTBC

#10

Previous patch was not rolled againts head. This should be final.

AttachmentSizeStatusTest resultOperations
valid_url_3.patch1.37 KBIgnored: Check issue status.NoneNone

#11

Looking at this patch; don't we declare $allowed_characters twice?

#12

Status:reviewed & tested by the community» needs work

every patch but the latest contained this line:
-  $allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';
so yes, now we define it twice but it should not be so.

#13

Status:needs work» reviewed & tested by the community

My bad, missed out one line when rerolled. This hopefully is the final final..

AttachmentSizeStatusTest resultOperations
valid_url_4.patch1.39 KBIgnored: Check issue status.NoneNone

#14

Status:reviewed & tested by the community» needs work

This doesn't seem to fix the actual issue, @ not being allowed.

The relevant RFC, http://www.ietf.org/rfc/rfc2396.txt ("revises and replaces" 1738) states @ is allowed in path segments.

#15

Status:needs work» needs review

Changed $path to $url in valid_url.

AttachmentSizeStatusTest resultOperations
valid_url_5.patch1.37 KBIgnored: Check issue status.NoneNone

#16

Rerolled from root to latest HEAD.

AttachmentSizeStatusTest resultOperations
valid_url_6.patch1.4 KBIgnored: Check issue status.NoneNone

#17

Version:x.y.z» 5.x-dev
Status:needs review» needs work

patching file includes/common.inc
Hunk #1 FAILED at 689.
Hunk #2 succeeded at 1141 with fuzz 2 (offset 121 lines).
1 out of 2 hunks FAILED -- saving rejects to file includes/common.inc.rej

I'm confused about why this isn't simply adding @ to the allowed characters list. What do all of the code changes do?

#18

Status:needs work» needs review

Rerolled for HEAD.

From #3, just adding @ to the allowed character list might open up a security loophole. And for the whole purpose of the patch, I don't get the whole idea either; just fixing syntax errors with what chx came up with.

AttachmentSizeStatusTest resultOperations
valid_url_7.patch1.41 KBIgnored: Check issue status.NoneNone

#19

From #3, just adding @ to the allowed character list might open up a security loophole. And for the whole purpose of the patch, I don't get the whole idea either; just fixing syntax errors with what chx came up with.

Phishing is not a reason to disallow valid URLs blindly. Flickr is a perfect example of legit sites using @.

Also, filter_xss_bad_protocol() does not guarantee anything other than the protocol. We should still at least validate the characters in absolute URLs to ensure at least a high degree of URL validity.

#20

Version:5.x-dev» 6.x-dev

I'd like to see this solved on the development version first.

#21

Status:needs review» needs work

One out of two hunks failed.

#22

Version:6.x-dev» 7.x-dev

I just ran into this. We should solve in in D7 and then backport the fix.

#23

I wonder if the '@' character would pass the PHP-5 function that we used to replace the internals of valid_email_address:

<?php
filter_var
($url, FILTER_VALIDATE_URL);
?>

#24

@Dave Reid I believe you mean the issue #308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL

I'm wondering the same thing. What level of url checking do we expect out of valid_url()?

#25

Status:needs work» fixed

This was fixed when #308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL went in.

#26

Status:fixed» needs work

@mfer That didn't actually get committed, it was won't fix. :)

#27

Status:needs work» fixed

oops, I meant #124492: valid_url() does not support all valid URL characters fixed this issue.

#28

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

#29

Why won't this url work????
Figured out that it is the "=\" in the following url
http://wirelessweek.com/rss.aspx?type=cat&path=\carriers

Confused with what to use from the above comments...

Can someone please help me out?
Thank you.

Using Drupal 7 - Aggregator (Core module)
Found this in "common.inc"

function valid_url($url, $absolute = FALSE) {
  if ($absolute) {
    return (bool)preg_match("
      /^                                                      # Start at the beginning of the text
      (?:ftp|https?|feed):\/\/                                # Look for ftp, http, https or feed schemes
      (?:                                                     # Userinfo (optional) which is typically
        (?:(?:[\w\.\-\+!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*      # a username or a username and password
        (?:[\w\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@          # combination
      )?
      (?:
        (?:[a-z0-9\-\.]|%[0-9a-f]{2})+                        # A domain name or a IPv4 address
        |(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\])         # or a well formed IPv6 address
      )
      (?::[0-9]+)?                                            # Server port number (optional)
      (?:[\/|\?]
        (?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})   # The path and query (optional)
      *)?
    $/xi", $url);
  }
  else {
    return (bool)preg_match("/^(?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})+$/i", $url);
  }
}

#30

Category:bug report» support request
Status:closed (fixed)» needs work

Help, tried all of these, cannot get the "=\" to work in the query portion of the url

What I need=
http://wirelessweek.com?type=cat&path=\carriers

What I tried
---------------
http://wirelessweek.com?type=cat&path=%92carriers --> ascii
http://wirelessweek.com?type=cat&path=&92;carriers --> html
http://wirelessweek.com?type=cat&path=0x5Ccarriers --> hex

None of these will pull a feed from the above source.
Willing to make a change to the common.inc file, but don't understand all the short hand used in the drupal 7 version of this file.

#31

Priority:minor» normal

BTW: just edited common.inc and set value $url="http://example.com" and was able to pass validation.

HOWEVER, now when I run the aggregator update it now gives a message saying "Mismatched tag" on line 143.
Module I am using is Core Drupal 7 - Aggregator.

Note: this same link worked in Drupal 6.22
What do I have to get it to work now that I upgraded to Drupal 7?