Posted by Shiny on December 19, 2005 at 12:43am
| Project: | Drupal core |
| Version: | 7.x-dev |
| Component: | comment.module |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
Issue Summary
Using a URL such as this for the homepage in a comment form
http://www.flickr.com/photos/16637594@N00/
and you get this error message:
"The URL of your homepage is not valid. Remember that it must be fully qualified, i.e. of the form http://example.com/directory"
However, if i remove the @ symbol, the URL is accepted (but wrong)
Comments
#1
can verify this.
#2
patch to common.inc to include the @ sign
as in, replace
$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';with
$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';#3
Just wanted to point that adding the atsign, as it is the current regexp used for URL validation, might be dangerous, or at least opens up something that it was not possible.
The atsign, if entered just after the protocol (ie. part of the domain) is used to specify username:password. Sometimes this is used to fool people with crafted URLs such as:
http://www.example.com\use-a-long-enough-string-here@www.badguy.com/malware.php
#4
I am changing the priority to Minor.
According to http://www.ietf.org/rfc/rfc1738.txt the atsign is a reserved character.
To embed the atsign in a url, a percent sign followed by the hexadecimal code should be used. Infact this URL coding standard should be enforced/required for all special characters.
So
could be replaced by
In my opinion the code, as is _without_ patch works by design and should _not_ be patched to resolve this issue.
However, the same RFC points out that there is more than a couple of potocols. The list includes
and that these protocols normally have the following schemes
To be more correct the valid_url should be compatible with this scheme. As a practical example adding mailto support would allow the weblink module to maintain e-mail addresses.
Paddy.
#5
I have developed on the valid_url to be more compliant with the IETF RFC (http://www.ietf.org/rfc/rfc1738.txt).
The code is as follows
function valid_url($url, $absolute = FALSE) {
$allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+\@]';
$lowalpha = 'a-z';
$hialpha = 'A-Z';
$alpha = $lowalpha . $hialpha;
$digit = '0-9';
$alphadigit = $alpha . $digit;
$safe = '\$\-\_\.\+';
$extra = '\!\*\'\(\)\,';
$national = '\{\}\|\\\^\~\[\]\`';
$punctuation = '\<\>\#\%\"';
$reserved = '\;\/\?\:\@\&\=';
$hex = $digit . 'A-Fa-f';
$escape = '\%[' . $hex . ']{2}';
$unreserved = $alpha . $digit . $safe . $extra;
$uchar = $unreserved . $escape;
$xchar = $unreserved . $reserved . $escape;
$user = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
$password = '([' . $unreserved . '\;\?\&\=]|('. $escape . '))+';
$login = '(' . $user . '(\:' . $password . ')?\@)?';
$toplabel = '([' . $alpha . ']|([' . $alpha . '][' . $alphadigit . '\-]*[' . $alphadigit .']))';
$domainlabel = '([' . $alphadigit . ']|([' . $alphadigit . ']([' . $alphadigit . '\-])*[' . $alphadigit . ']))';
$hostname = '(' . $domainlabel . '\.)*' . $toplabel;
$port = '[' . $digit . ']+';
$hostnumber = '[' . $digit . ']{1,3}' . '(\.[' . $digit . ']{1,3}){3}';
$host = '(' . $hostname . '|' . $hostnumber . ')';
$hostport = $host . '(\:' . $port . ')?';
$search = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
$hsegment = '([' . $unreserved . '\;\:\@\&\=]|('. $escape . '))*';
$hpath = $hsegment . '(\/' . $hsegment . ')*' . '(\?' . $search . ')?';
$proper_url = $login . $hostport . '(\/' . $hpath . ')?';
$protocols = '(http|https|ftp):\/\/';
if ($absolute) {
watchdog('php', preg_match("/^(http|https|ftp):\/\/". $proper_url ."$/i", $url) ? 'Matched' : 'Oops', 0);
return preg_match('/^' . $protocols . $proper_url. '$/i', $url);
}
else {
return preg_match('/^'. $hpath .'$/i', $url);
}
}
The resultant regular expression is
(([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+(\:([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\?\&\=]|(\%[0-9A-Fa-f]{2}))+)?\@)?((([a-zA-Z0-9]|([a-zA-Z0-9]([a-zA-Z0-9\-])*[a-zA-Z0-9]))\.)*([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9]))|[0-9]{1,3}(\.[0-9]{1,3}){3})(\:[0-9]+)?(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*(\/([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))*)*(\?([a-zA-Z0-9\$\-\_\.\+\!\*'\(\)\,\;\:\@\&\=]|(\%[0-9A-Fa-f]{2}))+)?)?so I would be loath to simplify the code.
It works for the test cases
I will continue to test and debug.
I will provide a patch when thouroughly tested and working as expected.
Paddy.
#6
Sorry - please remove the call to watchdog().
#7
But we have the code for this already... no need to reinvent the wheel.
#8
Changed $path to $url in valid_url and patch works againts HEAD.
Both:
http://www.flickr.com/photos/16637594@N00/
and
http://www.flickr.com/photos/16637594%40N00/
are accepted as valid_url now.
#9
then it's RTBC
#10
Previous patch was not rolled againts head. This should be final.
#11
Looking at this patch; don't we declare $allowed_characters twice?
#12
every patch but the latest contained this line:
- $allowed_characters = '[a-z0-9\/:_\-_\.\?\$,~=#&%\+]';so yes, now we define it twice but it should not be so.
#13
My bad, missed out one line when rerolled. This hopefully is the final final..
#14
This doesn't seem to fix the actual issue, @ not being allowed.
The relevant RFC, http://www.ietf.org/rfc/rfc2396.txt ("revises and replaces" 1738) states @ is allowed in path segments.
#15
Changed $path to $url in valid_url.
#16
Rerolled from root to latest HEAD.
#17
patching file includes/common.inc
Hunk #1 FAILED at 689.
Hunk #2 succeeded at 1141 with fuzz 2 (offset 121 lines).
1 out of 2 hunks FAILED -- saving rejects to file includes/common.inc.rej
I'm confused about why this isn't simply adding @ to the allowed characters list. What do all of the code changes do?
#18
Rerolled for HEAD.
From #3, just adding @ to the allowed character list might open up a security loophole. And for the whole purpose of the patch, I don't get the whole idea either; just fixing syntax errors with what chx came up with.
#19
Phishing is not a reason to disallow valid URLs blindly. Flickr is a perfect example of legit sites using @.
Also, filter_xss_bad_protocol() does not guarantee anything other than the protocol. We should still at least validate the characters in absolute URLs to ensure at least a high degree of URL validity.
#20
I'd like to see this solved on the development version first.
#21
One out of two hunks failed.
#22
I just ran into this. We should solve in in D7 and then backport the fix.
#23
I wonder if the '@' character would pass the PHP-5 function that we used to replace the internals of valid_email_address:
<?phpfilter_var($url, FILTER_VALIDATE_URL);
?>
#24
@Dave Reid I believe you mean the issue #308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL
I'm wondering the same thing. What level of url checking do we expect out of valid_url()?
#25
This was fixed when #308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL went in.
#26
@mfer That didn't actually get committed, it was won't fix. :)
#27
oops, I meant #124492: valid_url() does not support all valid URL characters fixed this issue.
#28
Automatically closed -- issue fixed for 2 weeks with no activity.
#29
Why won't this url work????
Figured out that it is the "=\" in the following url
http://wirelessweek.com/rss.aspx?type=cat&path=\carriers
Confused with what to use from the above comments...
Can someone please help me out?
Thank you.
Using Drupal 7 - Aggregator (Core module)
Found this in "common.inc"
function valid_url($url, $absolute = FALSE) {if ($absolute) {
return (bool)preg_match("
/^ # Start at the beginning of the text
(?:ftp|https?|feed):\/\/ # Look for ftp, http, https or feed schemes
(?: # Userinfo (optional) which is typically
(?:(?:[\w\.\-\+!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)* # a username or a username and password
(?:[\w\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@ # combination
)?
(?:
(?:[a-z0-9\-\.]|%[0-9a-f]{2})+ # A domain name or a IPv4 address
|(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\]) # or a well formed IPv6 address
)
(?::[0-9]+)? # Server port number (optional)
(?:[\/|\?]
(?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2}) # The path and query (optional)
*)?
$/xi", $url);
}
else {
return (bool)preg_match("/^(?:[\w#!:\.\?\+=&@$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})+$/i", $url);
}
}
#30
Help, tried all of these, cannot get the "=\" to work in the query portion of the url
What I need=
http://wirelessweek.com?type=cat&path=\carriers
What I tried
---------------
http://wirelessweek.com?type=cat&path=%92carriers --> ascii
http://wirelessweek.com?type=cat&path=&92;carriers --> html
http://wirelessweek.com?type=cat&path=0x5Ccarriers --> hex
None of these will pull a feed from the above source.
Willing to make a change to the common.inc file, but don't understand all the short hand used in the drupal 7 version of this file.
#31
BTW: just edited common.inc and set value $url="http://example.com" and was able to pass validation.
HOWEVER, now when I run the aggregator update it now gives a message saying "Mismatched tag" on line 143.
Module I am using is Core Drupal 7 - Aggregator.
Note: this same link worked in Drupal 6.22
What do I have to get it to work now that I upgraded to Drupal 7?