Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
I'm doing more thorough testing of my SimpleFeed module and I've noticed that it has had trouble creating some feed items. Well it turns out the culprit is valid_url() -- I have some feeds that have crazy URLs in theme, with characters like "!" which are not in the valid_url() check in Drupal.
While I will update my module to *not* use valid_url(), I suggest we fix this in core.
After some research:
(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
It was constructed according to the BNF grammar given in RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt).
Or a bit simpler:
(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?
Making a patch is trivial. Agreeing on the right approach is not. :-)
Comment | File | Size | Author |
---|---|---|---|
#46 | valid_url-124492-45-D5.patch | 2.03 KB | mfer |
#41 | valid_url-124492-41-D7.patch | 937 bytes | mfer |
#41 | valid_url-124492-41-D6.patch | 2.06 KB | mfer |
#39 | valid_url-124492-39.patch | 2.06 KB | mfer |
#37 | valid_url-124492-37.patch | 3.25 KB | mfer |
Comments
Comment #1
robertDouglass CreditAttribution: robertDouglass commentedWe need a test suite that has a sampling of all the URLs we want to be valid or invalid.
Comment #2
RobRoy CreditAttribution: RobRoy commentedA bit OT: Are we moving towards using Simpletest as the testing suite for Drupal? I wanted to write some tests for another contrib module and wanted to make sure I was forward-looking? I suppose we could add some URL tests to the core tests included in Simpletest.
Comment #3
robertDouglass CreditAttribution: robertDouglass commentedSimpletest++. We intend to extend that platform some during SoC, so writing tests for it is forward looking. Care to make a test suite for this issue?
Comment #4
sunThis is a bug, as has been stated in #247880: valid_url needs to support more characters.
First tests have been committed on behalf of #283806: Notice appears when drupal_http_request called with an invalid URL.
Comment #5
c960657 CreditAttribution: c960657 commentedSee also #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs for a more radical revamp of valid_url().
Comment #6
grndlvl CreditAttribution: grndlvl commentedTest available for valid_url at #295021: Extend valid_url() as well.
Comment #7
mfer CreditAttribution: mfer commentedThis issue, where valid_url() fails on many valid urls because it isn't checking the characters the way it should is rearing its ugly head in multiple issues and multiple ways. It looks like we need a better valid_url function.
See these issues plus this one of a few different proposed solutions:
#41651: valid_url() doesn't accept URLs with @ in it
#124492: valid_url() does not support all valid URL characters
#308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL
Before chasing after any of these possible solutions we need some guidance on what kind of solution we should use. Very loose checking, strict checking, somewhere in the middle? There have been multiple proposed solutions. What type of solution would be good?
Thoughts? I'm willing to take on this patch if we have a general path forward.
Comment #8
mfer CreditAttribution: mfer commentedSummary: currently the function valid_url fails on some valid urls. For example, http://www.flickr.com/people/93099231@N00/ would fail because of the @ symbol but is a legitimate (and common) address. This is a bug in the regex.
The attached patch updates the regex and provides tests. The regex is similar to the second one proposed by m3avrck above but is stricter and fixes an issue in the regex.
Comment #9
mfer CreditAttribution: mfer commentedTasks:
- Make the regex rfc compliant. (or darn close)
- Add some more tests. Especially difficult ones.
It seems valid_url is only called for form validation in core.
If someone needs to test for a url but doesn't need strict validation filter_var with FILTER_VALIDATE_URL can do the job.
Comment #10
mfer CreditAttribution: mfer commentedThe current regex I'm working with is:
Thoughts?
Comment #11
mfer CreditAttribution: mfer commentedAn updated and simpler regex.
Comment #13
mfer CreditAttribution: mfer commentedLooks like this failed because profile.test uses the randonName method to generate a url name and randonName includes the character _ which is invalid in the domain part of the url. That test will need to be updated as well.
Comment #14
mfer CreditAttribution: mfer commentedAn updated patch that fixes the profile module test. Otherwise it's the same as the last one.
Comment #15
c960657 CreditAttribution: c960657 commentedWhether an URL is considered "valid" depends on the context. I suggest adding to the Doxygen comment which specification the URLs are checked against, e.g. RFC 3986 (another possibility is the definition in the HTML5 draft specification), and what additional requirements there are, e.g. the restriction to only the ftp, http and https schemes.
These URLs are valid according to RFC 3986 (AFAICT) but are rejected by the function:
I suggest using
?
instead of{0,1}
like elsewhere in the pattern.Did you mean subdomain?
What is the purpose of having these two alternative lines, rather than getting rid of the first and then replacing the
+
quantifier with*
? Most of the backslashes are not necessary between square brackets[…]
– I think they reduce readability. Also, I don't think it is appropriate to refer to the string following a question mark as a full path.Comment #16
mfer CreditAttribution: mfer commented@c960657 - Thanks for the feedback.
This patch updates per the comments above as well as added additional tests to cover more possibilities. RFC 3986 is noted in the valid_url header comment as well.
Patch Description: The current version of valid_url in core fails for many valid urls. This is in part due to a spec change and in part due to the regular expression not covering all url variations. This patch updates the regular expression to cover more cases and provides test for many cases.
Comment #17
Dries CreditAttribution: Dries commented+ $this->assertFalse($valid_url, $test_url .t(' is NOT a valid url.'));
There are a number of such instances in the patch. We should correctly use t().
Comment #18
mfer CreditAttribution: mfer commentedoops. Fixed in attached patch. Passes the url through t() in the test.
Comment #19
drewish CreditAttribution: drewish commentedwhere did this regex come from? it'd be great to have a link to it's source.
should be removed.
could really use a comment explaining it's purpose.
It might be better to spell out that the underscore is not valid in domain names.
Needs a PHPDoc comment.
I'd split testValidUrl() up into separate tests:
Comment #20
drewish CreditAttribution: drewish commentedI'd like a summary of what characters are allowed under that RFC as comments so we can have an idea what you think it does incase there's a problem with it later.
Comment #21
mfer CreditAttribution: mfer commented@drewish - I wrote this regular expression. I couldn't find one for RFC 3986 that would work. So, the source is here and me. That's why I would like someone else to look over the regular expression.
Is there a reason to document
when the current regular expression in this space isn't documented? I love documentation but I don't want to overdo it. :)
This patch fixes the other issues and breaks the tests up into 4 tests (2 more invalid url cases were added).
Comment #22
mfer CreditAttribution: mfer commented@drewish - a listing of all valid characters depends on the part and would be quite a comment block. For the userinfo, domain, path, and more the characters are different.
Would it be worth the large comment block when there is a spec?
Comment #23
Dries CreditAttribution: Dries commentedCode and tests look good to me. It is also properly documented. I'm not a regex wizard but this looks like RTBC to me.
Comment #24
mfer CreditAttribution: mfer commentedI shared the regular expression (and provided a more generic URI one) at http://mattfarina.com/2009/01/08/rfc-3986-url-validation in hopes someone can use it, break it, or share one that's better.
Comment #25
Heine CreditAttribution: Heine commentedI don't understand why we need to limit valid urls to http, https and ftp. Checking the accepted scheme should be the job of another function.
Comment #26
mfer CreditAttribution: mfer commented@Heine thanks for the feedback.
I left the scheme setup as it had previously been in drupal. It would be trivial to change and have a valid regex that accepts all valid schemes. But, I'm not sure it's appropriate to have scheme checking beyond either the ones we already have or a properly formatted scheme in this issue. When I post my next update I'll post 2 patches. One that keeps the scheme limit the same and one that handles a well formed scheme per the spec.
Handling the % symbol and hex codes will take a little reword of the regex. The % is only valid when it's with a properly formatted hex code. Right now the % is just allowed.
Comment #27
Garrett Albright CreditAttribution: Garrett Albright commentedI'd like to toss in a vote for not limiting the schemes to /(ftp|https?)/. I occasionally use links with irc: and aim: schemes, and mailto: and javascript: still have their uses. Also, Tor and I believe some other anonymizing software use their own schemes as well. Limiting the scheme will cause frustrating false negatives.
That aside, good work on that regular expression, Matt. That must have taken some work to iron out. As a distraction, here's some interesting research on the ol' "Now you have two problems" quote.
Comment #28
c960657 CreditAttribution: c960657 commentedNote that filter_xss_bad_protocol() and _filter_url() also contain lists of allowed schemes.
See also #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs for additional ideas for limiting the schemes etc. Most of this is outside the scope of this issue, though.
Comment #29
Heine CreditAttribution: Heine commentedI contest the notion that limiting the scheme (or not) is out of scope. valid_url should return TRUE for valid URLs, FALSE for invalid, and not make a value judgement about permissible schemes.
Comment #30
mfer CreditAttribution: mfer commented@Heine I understand where you are coming from. I would like configurable schemes as well. This issue is filed as a bug (which I agree with). The current valid_url fails for many valid urls. Adding configurable schemes would be nice but is feature creep into a bug issue.
Let's get this one fixed and then we can add configurable schemes via #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs. The tests should go in with this patch and well support the correct characters. We can work on extending the function there.
Since this is a bug I would like to get it backported to drupal 5 and 6.
Comment #31
Garrett Albright CreditAttribution: Garrett Albright commentedI don't think anyone is suggesting that the accepted schemes should be configurable; rather, that any scheme which contains valid characters should be accepted.
Comment #32
Dries CreditAttribution: Dries commentedI committed the patch in #21 to CVS HEAD. Thanks mfer.
We can continue the conversation about schemes and fixing the remaining failures using follow-up patches. :-)
Comment #33
mfer CreditAttribution: mfer commentedI'm still working to fix the 2 issues listed in #25. In the meantime another type of character has come up. What about unicode characters used internationally?
For example, http://例え.テスト/.
There are international domain names, there are url paths with international characters, like ü, on wikipedia.
Should we support these or should we only support their ascii representation?
Comment #34
Garrett Albright CreditAttribution: Garrett Albright commentedWould converting them to their ASCII representation and then doing the tests against that be a practical option?
Comment #35
mfer CreditAttribution: mfer commentedThis updated fixes the two issues Heine found in #25. That is, it no longer allows the % character. Instead it looks for "% HEXDIGIT HEXDIGIT". It, also, allows for well formed IPv6. That is in the format
Where a HEXDIGIT can happen 0 to 4 times between : and the whole thing is enclosed in [].
Added additional tests for these cases.
Comment #36
Dries CreditAttribution: Dries commentedLooks good to me, although it is a tiny bit light on documentation, IMO.
Comment #37
mfer CreditAttribution: mfer commentedAdded some additional comments to the regex. Is there someplace specific I can add more documented detail?
Comment #38
Dries CreditAttribution: Dries commentedGreat! Committed to CVS HEAD. I guess we can mark this 'fixed' now. Good work.
Comment #39
mfer CreditAttribution: mfer commentedCan we get this bug fixed in drupal 6 as well? The attached patch is a direct backport of the drupal 7 valid_url function.
Comment #40
Gábor HojtsyWhy does "The path and query (optional)" has the ! twice (and also the preg in the other conditional case)? Do I miss something? (This might also need cleanup in Drupal 7).
Comment #41
mfer CreditAttribution: mfer commentedNice catch Gábor.
The patch that ends in D7 removes the additional !. The patch that ends in D6 is a direct backport of the D7 function. Ran the Valid Url tests and they all pass. Also, manually tested the ! in the path.
Comment #42
mfer CreditAttribution: mfer commentedComment #43
mfer CreditAttribution: mfer commentedAs for a game plan moving forward, I would like to support international domain names (which currently work) among other things. Details and a path forward are at #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs
Comment #44
Dries CreditAttribution: Dries commentedCommitted the follow-up to D7. Moving back to D6. Thanks.
Comment #45
Gábor HojtsyCommitted to Drupal 6. I expect we need to backport this to 5.
Comment #46
mfer CreditAttribution: mfer commentedHere is a patch against drupal 5. Same function as D6 and D7 just coppied over.
Comment #47
drummCommitted to 5.x.
Comment #49
hass CreditAttribution: hass commentedRegex have bugs with IDN. See #368472: valid_url() marks correct IDN domains as invalid.
Comment #50
alexanderpas CreditAttribution: alexanderpas commented