Using a URL such as this for the homepage in a comment form
http://www.flickr.com/photos/16637594@N00/

and you get this error message:
"The URL of your homepage is not valid. Remember that it must be fully qualified, i.e. of the form http://example.com/directory"

However, if i remove the @ symbol, the URL is accepted (but wrong)

Comments

killes@www.drop.org’s picture

Title: Homepage field doesn't accept URLs with @ in it » valid_url() doesn't accept URLs with @ in it
Version: 4.6.5 » x.y.z

can verify this.

paddy_deburca’s picture

StatusFileSize
new448 bytes

patch to common.inc to include the @ sign

as in, replace

$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';

with

$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';
markus_petrux’s picture

Just wanted to point that adding the atsign, as it is the current regexp used for URL validation, might be dangerous, or at least opens up something that it was not possible.

The atsign, if entered just after the protocol (ie. part of the domain) is used to specify username:password. Sometimes this is used to fool people with crafted URLs such as:

http://www.example.com\use-a-long-enough-string-here@www.badguy.com/malware.php

paddy_deburca’s picture

Priority: Normal » Minor

I am changing the priority to Minor.

According to http://www.ietf.org/rfc/rfc1738.txt the atsign is a reserved character.

To embed the atsign in a url, a percent sign followed by the hexadecimal code should be used. Infact this URL coding standard should be enforced/required for all special characters.

So

could be replaced by

In my opinion the code, as is _without_ patch works by design and should _not_ be patched to resolve this issue.

However, the same RFC points out that there is more than a couple of potocols. The list includes

ftp
File Transfer protocol
http
Hypertext Transfer Protocol
gopher
The Gopher protocol
mailto
Electronic mail address
news
USENET news
nntp
USENET news using NNTP access
telnet
Reference to interactive sessions
wais
Wide Area Information Servers
file
Host-specific file names
prospero
Prospero Directory Service

and that these protocols normally have the following schemes

  • //<user>:<password>@<host>:<port>/<url-path>

To be more correct the valid_url should be compatible with this scheme. As a practical example adding mailto support would allow the weblink module to maintain e-mail addresses.

Paddy.

paddy_deburca’s picture

I have developed on the valid_url to be more compliant with the IETF RFC (http://www.ietf.org/rfc/rfc1738.txt).

The code is as follows

function valid_url($url, $absolute = FALSE) {
  $allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';

  $lowalpha = 'a-z';
  $hialpha  = 'A-Z';
  $alpha    = $lowalpha . $hialpha;
  $digit    = '0-9';
  $alphadigit = $alpha . $digit;
  $safe     = '\$\-\_\.\+';
  $extra    = '\!\*\'\(\)\,';
  $national = '\{\}\|\\\^\~\[\]\`';
  $punctuation = '\<\>\#\%\"';

  $reserved = '\;\/\?\:\@\&\=';
  $hex      = $digit . 'A-Fa-f';
  $escape   = '\%[' . $hex . ']{2}';
  
  $unreserved = $alpha . $digit . $safe . $extra;
  $uchar    = $unreserved . $escape;
  $xchar    = $unreserved . $reserved . $escape;
  
  $user     = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
  $password = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
  $login    = '(' . $user . '(\:' . $password . ')?\@)?';

  $toplabel = '([' . $alpha . ']|([' . $alpha . '][' . $alphadigit . '\-]*[' . $alphadigit .']))';
  $domainlabel = '([' . $alphadigit . ']|([' . $alphadigit . ']([' . $alphadigit . '\-])*[' . $alphadigit . ']))';
  $hostname = '(' . $domainlabel . '\.)*' . $toplabel;
  $port     = '[' . $digit . ']+';
  $hostnumber = '[' . $digit . ']{1,3}' . '(\.[' . $digit . ']{1,3}){3}';
  $host     = '(' . $hostname . '|' . $hostnumber . ')';
  $hostport = $host . '(\:' . $port . ')?';
  
  $search   = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
  $hsegment = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
  $hpath    = $hsegment . '(\/' . $hsegment . ')*' . '(\?' . $search . ')?';
  $proper_url = $login . $hostport . '(\/' . $hpath . ')?';

  $protocols = '(http|https|ftp):\/\/';

  if ($absolute) {
    watchdog('php', preg_match("/^(http|https|ftp):\/\/". $proper_url ."$/i", $url) ? 'Matched' : 'Oops', 0);
    return preg_match('/^' . $protocols . $proper_url. '$/i', $url);
  }
  else {
    return preg_match('/^'. $hpath .'$/i', $url);
  }
}

The resultant regular expression is

(([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+(\:([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+)?\@)?((([a-zA-Z0-9]|([a-zA-Z0-9]([a-zA-Z0-9\-])*[a-zA-Z0-9]))\.)*([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9]))|[0-9]{1,3}(\.[0-9]{1,3}){3})(\:[0-9]+)?(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*)*(\?([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))+)?)?

so I would be loath to simplify the code.

It works for the test cases

I will continue to test and debug.

I will provide a patch when thouroughly tested and working as expected.

Paddy.

paddy_deburca’s picture

Sorry - please remove the call to watchdog().

chx’s picture

Status: Active » Needs review
StatusFileSize
new1.11 KB

But we have the code for this already... no need to reinvent the wheel.

edmund.kwok’s picture

StatusFileSize
new1.21 KB

Changed $path to $url in valid_url and patch works againts HEAD.

Both:

http://www.flickr.com/photos/16637594@N00/
and
http://www.flickr.com/photos/16637594%40N00/

are accepted as valid_url now.

chx’s picture

Status: Needs review » Reviewed & tested by the community

then it's RTBC

edmund.kwok’s picture

StatusFileSize
new1.37 KB

Previous patch was not rolled againts head. This should be final.

dries’s picture

Looking at this patch; don't we declare $allowed_characters twice?

chx’s picture

Status: Reviewed & tested by the community » Needs work

every patch but the latest contained this line:
- $allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';
so yes, now we define it twice but it should not be so.

edmund.kwok’s picture

Status: Needs work » Reviewed & tested by the community
StatusFileSize
new1.39 KB

My bad, missed out one line when rerolled. This hopefully is the final final..

drumm’s picture

Status: Reviewed & tested by the community » Needs work

This doesn't seem to fix the actual issue, @ not being allowed.

The relevant RFC, http://www.ietf.org/rfc/rfc2396.txt ("revises and replaces" 1738) states @ is allowed in path segments.

edmund.kwok’s picture

Status: Needs work » Needs review
StatusFileSize
new1.37 KB

Changed $path to $url in valid_url.

edmund.kwok’s picture

StatusFileSize
new1.4 KB

Rerolled from root to latest HEAD.

drumm’s picture

Version: x.y.z » 5.x-dev
Status: Needs review » Needs work

patching file includes/common.inc
Hunk #1 FAILED at 689.
Hunk #2 succeeded at 1141 with fuzz 2 (offset 121 lines).
1 out of 2 hunks FAILED -- saving rejects to file includes/common.inc.rej

I'm confused about why this isn't simply adding @ to the allowed characters list. What do all of the code changes do?

edmund.kwok’s picture

Status: Needs work » Needs review
StatusFileSize
new1.41 KB

Rerolled for HEAD.

From #3, just adding @ to the allowed character list might open up a security loophole. And for the whole purpose of the patch, I don't get the whole idea either; just fixing syntax errors with what chx came up with.

Steven’s picture

From #3, just adding @ to the allowed character list might open up a security loophole. And for the whole purpose of the patch, I don't get the whole idea either; just fixing syntax errors with what chx came up with.

Phishing is not a reason to disallow valid URLs blindly. Flickr is a perfect example of legit sites using @.

Also, filter_xss_bad_protocol() does not guarantee anything other than the protocol. We should still at least validate the characters in absolute URLs to ensure at least a high degree of URL validity.

drumm’s picture

Version: 5.x-dev » 6.x-dev

I'd like to see this solved on the development version first.

catch’s picture

Status: Needs review » Needs work

One out of two hunks failed.

mfer’s picture

Version: 6.x-dev » 7.x-dev

I just ran into this. We should solve in in D7 and then backport the fix.

dave reid’s picture

I wonder if the '@' character would pass the PHP-5 function that we used to replace the internals of valid_email_address:

filter_var($url, FILTER_VALIDATE_URL);
mfer’s picture

@Dave Reid I believe you mean the issue #308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL

I'm wondering the same thing. What level of url checking do we expect out of valid_url()?

mfer’s picture

Status: Needs work » Fixed
dave reid’s picture

Status: Fixed » Needs work

@mfer That didn't actually get committed, it was won't fix. :)

mfer’s picture

Status: Needs work » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

thesurger’s picture

Why won't this url work????
Figured out that it is the "=\" in the following url
http://wirelessweek.com/rss.aspx?type=cat&path=\carriers

Confused with what to use from the above comments...

Can someone please help me out?
Thank you.

Using Drupal 7 - Aggregator (Core module)
Found this in "common.inc"

function valid_url($url, $absolute = FALSE) {
  if ($absolute) {
    return (bool)preg_match("
      /^                                                      # Start at the beginning of the text
      (?:ftp|https?|feed):\/\/                                # Look for ftp, http, https or feed schemes
      (?:                                                     # Userinfo (optional) which is typically
        (?:(?:[\w\.\-\+!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*      # a username or a username and password
        (?:[\w\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@          # combination
      )?
      (?:
        (?:[a-z0-9\-\.]|%[0-9a-f]{2})+                        # A domain name or a IPv4 address
        |(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\])         # or a well formed IPv6 address
      )
      (?::[0-9]+)?                                            # Server port number (optional)
      (?:[\/|\?]
        (?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})   # The path and query (optional)
      *)?
    $/xi", $url);
  }
  else {
    return (bool)preg_match("/^(?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})+$/i", $url);
  }
}
thesurger’s picture

Category: bug » support
Status: Closed (fixed) » Needs work

Help, tried all of these, cannot get the "=\" to work in the query portion of the url

What I need=
http://wirelessweek.com?type=cat&path=\carriers

What I tried
---------------
http://wirelessweek.com?type=cat&path=%92carriers --> ascii
http://wirelessweek.com?type=cat&path=&92;carriers --> html
http://wirelessweek.com?type=cat&path=0x5Ccarriers --> hex

None of these will pull a feed from the above source.
Willing to make a change to the common.inc file, but don't understand all the short hand used in the drupal 7 version of this file.

thesurger’s picture

Priority: Minor » Normal

BTW: just edited common.inc and set value $url="http://example.com" and was able to pass validation.

HOWEVER, now when I run the aggregator update it now gives a message saying "Mismatched tag" on line 143.
Module I am using is Core Drupal 7 - Aggregator.

Note: this same link worked in Drupal 6.22
What do I have to get it to work now that I upgraded to Drupal 7?

pasqualle’s picture

Status: Needs work » Closed (fixed)