valid_url() does not support all valid URL characters
m3avrck - March 3, 2007 - 22:36
| Project: | Drupal |
| Version: | 5.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
| Issue tags: | IDN |
Description
I'm doing more thorough testing of my SimpleFeed module and I've noticed that it has had trouble creating some feed items. Well it turns out the culprit is valid_url() -- I have some feeds that have crazy URLs in theme, with characters like "!" which are not in the valid_url() check in Drupal.
While I will update my module to *not* use valid_url(), I suggest we fix this in core.
After some research:
(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
It was constructed according to the BNF grammar given in RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt).Or a bit simpler:
(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?Making a patch is trivial. Agreeing on the right approach is not. :-)

#1
We need a test suite that has a sampling of all the URLs we want to be valid or invalid.
#2
A bit OT: Are we moving towards using Simpletest as the testing suite for Drupal? I wanted to write some tests for another contrib module and wanted to make sure I was forward-looking? I suppose we could add some URL tests to the core tests included in Simpletest.
#3
Simpletest++. We intend to extend that platform some during SoC, so writing tests for it is forward looking. Care to make a test suite for this issue?
#4
This is a bug, as has been stated in #247880: valid_url needs to support more characters.
First tests have been committed on behalf of #283806: Notice appears when drupal_http_request called with an invalid URL.
#5
See also #295021: Extend valid_url() for a more radical revamp of valid_url().
#6
Test available for valid_url at #295021: Extend valid_url() as well.
#7
This issue, where valid_url() fails on many valid urls because it isn't checking the characters the way it should is rearing its ugly head in multiple issues and multiple ways. It looks like we need a better valid_url function.
See these issues plus this one of a few different proposed solutions:
#41651: valid_url() doesn't accept URLs with @ in it
#124492: valid_url() does not support all valid URL characters
#308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL
Before chasing after any of these possible solutions we need some guidance on what kind of solution we should use. Very loose checking, strict checking, somewhere in the middle? There have been multiple proposed solutions. What type of solution would be good?
Thoughts? I'm willing to take on this patch if we have a general path forward.
#8
Summary: currently the function valid_url fails on some valid urls. For example, http://www.flickr.com/people/93099231@N00/ would fail because of the @ symbol but is a legitimate (and common) address. This is a bug in the regex.
The attached patch updates the regex and provides tests. The regex is similar to the second one proposed by m3avrck above but is stricter and fixes an issue in the regex.
#9
Tasks:
- Make the regex rfc compliant. (or darn close)
- Add some more tests. Especially difficult ones.
It seems valid_url is only called for form validation in core.
If someone needs to test for a url but doesn't need strict validation filter_var with FILTER_VALIDATE_URL can do the job.
#10
The current regex I'm working with is:
<?phppreg_match('
/^ # Start at the beginning of the text
(?:ftp|http|https):\/\/ # Look for ftp, http, or https
(?: # Username:password combinations (optional)
[\w\.\-\+]+ # A username
:{0,1} # an optional colon to separate the username and password
[\w\.\-\+]*@ # A password
)?
(?:[a-z0-9\-\.]+) # The domain limiting it to just allowed characters
(?::[0-9]+)? # Server port number (optional)
(?: # The path (optional)
\/| # a forward slash
\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)| # or a forward slash followed by a full path
\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+) # or a question mark followed by key value pairs
)?$
/xi', $url);
?>
Thoughts?
#11
An updated and simpler regex.
#12
The last submitted patch failed testing.
#13
Looks like this failed because profile.test uses the randonName method to generate a url name and randonName includes the character _ which is invalid in the domain part of the url. That test will need to be updated as well.
#14
An updated patch that fixes the profile module test. Otherwise it's the same as the last one.
#15
Whether an URL is considered "valid" depends on the context. I suggest adding to the Doxygen comment which specification the URLs are checked against, e.g. RFC 3986 (another possibility is the definition in the HTML5 draft specification), and what additional requirements there are, e.g. the restriction to only the ftp, http and https schemes.
These URLs are valid according to RFC 3986 (AFAICT) but are rejected by the function:
http://example.org?http://john%20doe:secret:foo@example.org/
http://example.org/~,$'*;
http://caf%C3%A9.example.org
+ :{0,1}?instead of{0,1}like elsewhere in the pattern.+ 'sudomain.example.com',+ \/| # a forward slash+ (?:\/|\?)(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+) # or a forward slash or question mark followed by a full path
What is the purpose of having these two alternative lines, rather than getting rid of the first and then replacing the
+quantifier with*? Most of the backslashes are not necessary between square brackets[…]– I think they reduce readability. Also, I don't think it is appropriate to refer to the string following a question mark as a full path.#16
@c960657 - Thanks for the feedback.
This patch updates per the comments above as well as added additional tests to cover more possibilities. RFC 3986 is noted in the valid_url header comment as well.
Patch Description: The current version of valid_url in core fails for many valid urls. This is in part due to a spec change and in part due to the regular expression not covering all url variations. This patch updates the regular expression to cover more cases and provides test for many cases.
#17
+ $this->assertFalse($valid_url, $test_url .t(' is NOT a valid url.'));There are a number of such instances in the patch. We should correctly use t().
#18
oops. Fixed in attached patch. Passes the url through t() in the test.
#19
+ return (bool)preg_match("+ /^ # Start at the beginning of the text
+ (?:ftp|https?):\/\/ # Look for ftp, http, or https
+ (?: # Userinfo (optional)
+ (?:[\w\.\-\+%!$&'\(\)*\+,;=]+:)*
+ [\w\.\-\+%!$&'\(\)*\+,;=]+@
+ )?
+ (?:[a-z0-9\-\.%]+) # The domain
+ (?::[0-9]+)? # Server port number (optional)
+ (?:[\/|\?][\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]*)? # The path (optional)
+ $/xi", $url);
where did this regex come from? it'd be great to have a link to it's source.
+ /**+ * Implementation of getInfo().
+ */
should be removed.
+ return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);could really use a comment explaining it's purpose.
+ // _ is an invalid character for domain names.It might be better to spell out that the underscore is not valid in domain names.
+ function testValidUrl() {Needs a PHPDoc comment.
I'd split testValidUrl() up into separate tests:
#20
I'd like a summary of what characters are allowed under that RFC as comments so we can have an idea what you think it does incase there's a problem with it later.
#21
@drewish - I wrote this regular expression. I couldn't find one for RFC 3986 that would work. So, the source is here and me. That's why I would like someone else to look over the regular expression.
Is there a reason to document
+ return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);when the current regular expression in this space isn't documented? I love documentation but I don't want to overdo it. :)
This patch fixes the other issues and breaks the tests up into 4 tests (2 more invalid url cases were added).
#22
@drewish - a listing of all valid characters depends on the part and would be quite a comment block. For the userinfo, domain, path, and more the characters are different.
Would it be worth the large comment block when there is a spec?
#23
Code and tests look good to me. It is also properly documented. I'm not a regex wizard but this looks like RTBC to me.
#24
I shared the regular expression (and provided a more generic URI one) at http://mattfarina.com/2009/01/08/rfc-3986-url-validation in hopes someone can use it, break it, or share one that's better.
#25
False negative: http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.htmlFalse positivie: http://somedomain.com/ind%ex.html
I don't understand why we need to limit valid urls to http, https and ftp. Checking the accepted scheme should be the job of another function.
#26
@Heine thanks for the feedback.
I left the scheme setup as it had previously been in drupal. It would be trivial to change and have a valid regex that accepts all valid schemes. But, I'm not sure it's appropriate to have scheme checking beyond either the ones we already have or a properly formatted scheme in this issue. When I post my next update I'll post 2 patches. One that keeps the scheme limit the same and one that handles a well formed scheme per the spec.
Handling the % symbol and hex codes will take a little reword of the regex. The % is only valid when it's with a properly formatted hex code. Right now the % is just allowed.
#27
I'd like to toss in a vote for not limiting the schemes to /(ftp|https?)/. I occasionally use links with irc: and aim: schemes, and mailto: and javascript: still have their uses. Also, Tor and I believe some other anonymizing software use their own schemes as well. Limiting the scheme will cause frustrating false negatives.
That aside, good work on that regular expression, Matt. That must have taken some work to iron out. As a distraction, here's some interesting research on the ol' "Now you have two problems" quote.
#28
Note that filter_xss_bad_protocol() and _filter_url() also contain lists of allowed schemes.
See also #295021: Extend valid_url() for additional ideas for limiting the schemes etc. Most of this is outside the scope of this issue, though.
#29
I contest the notion that limiting the scheme (or not) is out of scope. valid_url should return TRUE for valid URLs, FALSE for invalid, and not make a value judgement about permissible schemes.
#30
@Heine I understand where you are coming from. I would like configurable schemes as well. This issue is filed as a bug (which I agree with). The current valid_url fails for many valid urls. Adding configurable schemes would be nice but is feature creep into a bug issue.
Let's get this one fixed and then we can add configurable schemes via #295021: Extend valid_url(). The tests should go in with this patch and well support the correct characters. We can work on extending the function there.
Since this is a bug I would like to get it backported to drupal 5 and 6.
#31
I don't think anyone is suggesting that the accepted schemes should be configurable; rather, that any scheme which contains valid characters should be accepted.
#32
I committed the patch in #21 to CVS HEAD. Thanks mfer.
We can continue the conversation about schemes and fixing the remaining failures using follow-up patches. :-)
#33
I'm still working to fix the 2 issues listed in #25. In the meantime another type of character has come up. What about unicode characters used internationally?
For example, http://例え.テスト/.
There are international domain names, there are url paths with international characters, like ü, on wikipedia.
Should we support these or should we only support their ascii representation?
#34
Would converting them to their ASCII representation and then doing the tests against that be a practical option?
#35
This updated fixes the two issues Heine found in #25. That is, it no longer allows the % character. Instead it looks for "% HEXDIGIT HEXDIGIT". It, also, allows for well formed IPv6. That is in the format
[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]Where a HEXDIGIT can happen 0 to 4 times between : and the whole thing is enclosed in [].
Added additional tests for these cases.
#36
Looks good to me, although it is a tiny bit light on documentation, IMO.
#37
Added some additional comments to the regex. Is there someplace specific I can add more documented detail?
#38
Great! Committed to CVS HEAD. I guess we can mark this 'fixed' now. Good work.
#39
Can we get this bug fixed in drupal 6 as well? The attached patch is a direct backport of the drupal 7 valid_url function.
#40
Why does "The path and query (optional)" has the ! twice (and also the preg in the other conditional case)? Do I miss something? (This might also need cleanup in Drupal 7).
#41
Nice catch Gábor.
The patch that ends in D7 removes the additional !. The patch that ends in D6 is a direct backport of the D7 function. Ran the Valid Url tests and they all pass. Also, manually tested the ! in the path.
#42
#43
As for a game plan moving forward, I would like to support international domain names (which currently work) among other things. Details and a path forward are at #295021: Extend valid_url()
#44
Committed the follow-up to D7. Moving back to D6. Thanks.
#45
Committed to Drupal 6. I expect we need to backport this to 5.
#46
Here is a patch against drupal 5. Same function as D6 and D7 just coppied over.
#47
Committed to 5.x.
#48
Automatically closed -- issue fixed for 2 weeks with no activity.
#49
Regex have bugs with IDN. See #368472: valid_url() marks correct IDN domains as invalid.
#50