valid_url() does not support all valid URL characters [#124492]

Comment	File	Size	Author
#46	valid_url-124492-45-D5.patch	2.03 KB	mfer
#41	valid_url-124492-41-D7.patch	937 bytes	mfer
#41	valid_url-124492-41-D6.patch	2.06 KB	mfer
#39	valid_url-124492-39.patch	2.06 KB	mfer
#37	valid_url-124492-37.patch	3.25 KB	mfer
#35	valid_url-124492-35.patch	3.18 KB	mfer
#21	valid_url-124492-20.patch	5.97 KB	mfer
#18	valid_url-124492-18.patch	5.23 KB	mfer
#16	valid_url-124492-16.patch	5.17 KB	mfer
#14	valid_url-124492-14.patch	5.05 KB	mfer
#11	valid_url-124492-10.patch	4.27 KB	mfer
#8	valid_url-124492-8.patch	3.44 KB	mfer

Comment #1

robertdouglass commented 3 March 2007 at 22:50

We need a test suite that has a sampling of all the URLs we want to be valid or invalid.

Log in or register to post comments

A bit OT: Are we moving towards using Simpletest as the testing suite for Drupal? I wanted to write some tests for another contrib module and wanted to make sure I was forward-looking? I suppose we could add some URL tests to the core tests included in Simpletest.

Log in or register to post comments

Comment #3

robertdouglass commented 4 March 2007 at 19:42

Simpletest++. We intend to extend that platform some during SoC, so writing tests for it is forward looking. Care to make a test suite for this issue?

Log in or register to post comments

Comment #4

sun

German

Karlsruhe

commented 13 August 2008 at 19:02

Title:	improve valid_url() checking	» valid_url() does not support all valid URL characters
Version:	6.x-dev	» 7.x-dev
Category:	task	» bug

This is a bug, as has been stated in #247880: valid_url needs to support more characters.
First tests have been committed on behalf of #283806: Notice appears when drupal_http_request called with an invalid URL.

Log in or register to post comments

Comment #5

c960657 commented 13 August 2008 at 21:55

See also #295021: Expand/augment UrlHelper::isValid() to improve correctness and handle more URL types for a more radical revamp of valid_url().

Log in or register to post comments

Comment #6

grndlvl commented 29 August 2008 at 21:53

Status:

Active

» Closed (duplicate)

Test available for valid_url at #295021: Extend valid_url() as well.

Log in or register to post comments

Comment #7

mfer commented 2 December 2008 at 03:15

Status:

Closed (duplicate)

» Active

This issue, where valid_url() fails on many valid urls because it isn't checking the characters the way it should is rearing its ugly head in multiple issues and multiple ways. It looks like we need a better valid_url function.

See these issues plus this one of a few different proposed solutions:
#41651: valid_url() doesn't accept URLs with @ in it
#124492: valid_url() does not support all valid URL characters
#308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL

Before chasing after any of these possible solutions we need some guidance on what kind of solution we should use. Very loose checking, strict checking, somewhere in the middle? There have been multiple proposed solutions. What type of solution would be good?

Thoughts? I'm willing to take on this patch if we have a general path forward.

Log in or register to post comments

Comment #8

mfer commented 3 December 2008 at 00:40

Status:

Active

» Needs review

Status	File	Size
new	valid_url-124492-8.patch	3.44 KB

Summary: currently the function valid_url fails on some valid urls. For example, http://www.flickr.com/people/93099231@N00/ would fail because of the @ symbol but is a legitimate (and common) address. This is a bug in the regex.

The attached patch updates the regex and provides tests. The regex is similar to the second one proposed by m3avrck above but is stricter and fixes an issue in the regex.

Log in or register to post comments

Comment #9

mfer commented 3 December 2008 at 04:26

Status:

Needs review

» Needs work

Tasks:
- Make the regex rfc compliant. (or darn close)
- Add some more tests. Especially difficult ones.

It seems valid_url is only called for form validation in core.

If someone needs to test for a url but doesn't need strict validation filter_var with FILTER_VALIDATE_URL can do the job.

Log in or register to post comments

Comment #10

mfer commented 5 December 2008 at 00:57

The current regex I'm working with is:

preg_match('
      /^                                   # Start at the beginning of the text
      (?:ftp|http|https):\/\/              # Look for ftp, http, or https
      (?:                                  # Username:password combinations (optional)
        [\w\.\-\+]+                        # A username
        :{0,1}                             # an optional colon to separate the username and password
        [\w\.\-\+]*@                       # A password
      )?
      (?:[a-z0-9\-\.]+)                    # The domain limiting it to just allowed characters
      (?::[0-9]+)?                         # Server port number (optional)
      (?:                                  # The path (optional)
        \/|                                # a forward slash
        \/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)| # or a forward slash followed by a full path
        \?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)  # or a question mark followed by key value pairs
       )?$
     /xi', $url);

Thoughts?

Log in or register to post comments

Comment #11

mfer commented 6 December 2008 at 03:32

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-10.patch	4.27 KB

An updated and simpler regex.

Log in or register to post comments

Comment #12

6 December 2008 at 03:40

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #13

mfer commented 6 December 2008 at 21:41

Looks like this failed because profile.test uses the randonName method to generate a url name and randonName includes the character _ which is invalid in the domain part of the url. That test will need to be updated as well.

Log in or register to post comments

Comment #14

mfer commented 6 December 2008 at 21:58

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-14.patch	5.05 KB

An updated patch that fixes the profile module test. Otherwise it's the same as the last one.

Log in or register to post comments

Comment #15

c960657 commented 6 December 2008 at 23:11

Status:

Needs review

» Needs work

Whether an URL is considered "valid" depends on the context. I suggest adding to the Doxygen comment which specification the URLs are checked against, e.g. RFC 3986 (another possibility is the definition in the HTML5 draft specification), and what additional requirements there are, e.g. the restriction to only the ftp, http and https schemes.

These URLs are valid according to RFC 3986 (AFAICT) but are rejected by the function:

http://example.org?
http://john%20doe:secret:foo@example.org/
http://example.org/~,$'*;
http://caf%C3%A9.example.org

+        :{0,1}

I suggest using ? instead of {0,1} like elsewhere in the pattern.

+      'sudomain.example.com',

Did you mean subdomain?

+        \/|                                       # a forward slash
+        (?:\/|\?)(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)  # or a forward slash or question mark followed by a full path

What is the purpose of having these two alternative lines, rather than getting rid of the first and then replacing the + quantifier with * ? Most of the backslashes are not necessary between square brackets […] – I think they reduce readability. Also, I don't think it is appropriate to refer to the string following a question mark as a full path.

Log in or register to post comments

Comment #16

mfer commented 29 December 2008 at 20:53

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-16.patch	5.17 KB

@c960657 - Thanks for the feedback.

This patch updates per the comments above as well as added additional tests to cover more possibilities. RFC 3986 is noted in the valid_url header comment as well.

Patch Description: The current version of valid_url in core fails for many valid urls. This is in part due to a spec change and in part due to the regular expression not covering all url variations. This patch updates the regular expression to cover more cases and provides test for many cases.

Log in or register to post comments

Comment #17

dries commented 6 January 2009 at 13:31

Status:

Needs review

» Needs work

+ $this->assertFalse($valid_url, $test_url .t(' is NOT a valid url.'));
There are a number of such instances in the patch. We should correctly use t().

Log in or register to post comments

Comment #18

mfer commented 7 January 2009 at 23:19

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-18.patch	5.23 KB

oops. Fixed in attached patch. Passes the url through t() in the test.

Log in or register to post comments

Comment #19

drewish commented 8 January 2009 at 00:43

Status:

Needs review

» Needs work

+    return (bool)preg_match("
+      /^                                                # Start at the beginning of the text
+      (?:ftp|https?):\/\/                               # Look for ftp, http, or https
+      (?:                                               # Userinfo (optional)
+        (?:[\w\.\-\+%!$&'\(\)*\+,;=]+:)*
+        [\w\.\-\+%!$&'\(\)*\+,;=]+@
+      )?
+      (?:[a-z0-9\-\.%]+)                                # The domain
+      (?::[0-9]+)?                                      # Server port number (optional)
+      (?:[\/|\?][\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]*)? # The path (optional)
+    $/xi", $url);

where did this regex come from? it'd be great to have a link to it's source.

+  /**
+   * Implementation of getInfo().
+   */

should be removed.

+    return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);

could really use a comment explaining it's purpose.

+      // _ is an invalid character for domain names.

It might be better to spell out that the underscore is not valid in domain names.

+  function testValidUrl() {

Needs a PHPDoc comment.

I'd split testValidUrl() up into separate tests:

testValidAbsolute()
testInvalidAbsolute()
testValidRelative()
testInvalidRelative()
testRandomGarbage() <-- I'd add this and put in some random perl code and make sure it barfs.

Log in or register to post comments

Comment #20

drewish commented 8 January 2009 at 00:47

I'd like a summary of what characters are allowed under that RFC as comments so we can have an idea what you think it does incase there's a problem with it later.

Log in or register to post comments

Comment #21

mfer commented 8 January 2009 at 01:25

Status	File	Size
new	valid_url-124492-20.patch	5.97 KB

@drewish - I wrote this regular expression. I couldn't find one for RFC 3986 that would work. So, the source is here and me. That's why I would like someone else to look over the regular expression.

Is there a reason to document

+    return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);

when the current regular expression in this space isn't documented? I love documentation but I don't want to overdo it. :)

This patch fixes the other issues and breaks the tests up into 4 tests (2 more invalid url cases were added).

Log in or register to post comments

Comment #22

mfer commented 8 January 2009 at 01:29

Status:

Needs work

» Needs review

@drewish - a listing of all valid characters depends on the part and would be quite a comment block. For the userinfo, domain, path, and more the characters are different.

Would it be worth the large comment block when there is a spec?

Log in or register to post comments

Comment #23

dries commented 8 January 2009 at 10:20

Code and tests look good to me. It is also properly documented. I'm not a regex wizard but this looks like RTBC to me.

Log in or register to post comments

Comment #24

mfer commented 8 January 2009 at 13:18

I shared the regular expression (and provided a more generic URI one) at http://mattfarina.com/2009/01/08/rfc-3986-url-validation in hopes someone can use it, break it, or share one that's better.

Log in or register to post comments

Comment #25

heine commented 8 January 2009 at 14:27

False negative: http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html
False positivie: http://somedomain.com/ind%ex.html

I don't understand why we need to limit valid urls to http, https and ftp. Checking the accepted scheme should be the job of another function.

Log in or register to post comments

Comment #26

mfer commented 8 January 2009 at 15:10

Status:

Needs review

» Needs work

@Heine thanks for the feedback.

I left the scheme setup as it had previously been in drupal. It would be trivial to change and have a valid regex that accepts all valid schemes. But, I'm not sure it's appropriate to have scheme checking beyond either the ones we already have or a properly formatted scheme in this issue. When I post my next update I'll post 2 patches. One that keeps the scheme limit the same and one that handles a well formed scheme per the spec.

Handling the % symbol and hex codes will take a little reword of the regex. The % is only valid when it's with a properly formatted hex code. Right now the % is just allowed.

Log in or register to post comments

Comment #27

Garrett Albright commented 8 January 2009 at 17:22

I'd like to toss in a vote for not limiting the schemes to /(ftp|https?)/. I occasionally use links with irc: and aim: schemes, and mailto: and javascript: still have their uses. Also, Tor and I believe some other anonymizing software use their own schemes as well. Limiting the scheme will cause frustrating false negatives.

That aside, good work on that regular expression, Matt. That must have taken some work to iron out. As a distraction, here's some interesting research on the ol' "Now you have two problems" quote.

Log in or register to post comments

Comment #28

c960657 commented 8 January 2009 at 16:56

Note that filter_xss_bad_protocol() and _filter_url() also contain lists of allowed schemes.

See also #295021: Expand/augment UrlHelper::isValid() to improve correctness and handle more URL types for additional ideas for limiting the schemes etc. Most of this is outside the scope of this issue, though.

Log in or register to post comments

Comment #29

heine commented 8 January 2009 at 17:02

I contest the notion that limiting the scheme (or not) is out of scope. valid_url should return TRUE for valid URLs, FALSE for invalid, and not make a value judgement about permissible schemes.

Log in or register to post comments

Comment #30

mfer commented 8 January 2009 at 17:25

@Heine I understand where you are coming from. I would like configurable schemes as well. This issue is filed as a bug (which I agree with). The current valid_url fails for many valid urls. Adding configurable schemes would be nice but is feature creep into a bug issue.

Let's get this one fixed and then we can add configurable schemes via #295021: Expand/augment UrlHelper::isValid() to improve correctness and handle more URL types. The tests should go in with this patch and well support the correct characters. We can work on extending the function there.

Since this is a bug I would like to get it backported to drupal 5 and 6.

Log in or register to post comments

Comment #31

Garrett Albright commented 8 January 2009 at 18:48

I don't think anyone is suggesting that the accepted schemes should be configurable; rather, that any scheme which contains valid characters should be accepted.

Log in or register to post comments

Comment #32

dries commented 8 January 2009 at 19:15

I committed the patch in #21 to CVS HEAD. Thanks mfer.

We can continue the conversation about schemes and fixing the remaining failures using follow-up patches. :-)

Log in or register to post comments

Comment #33

mfer commented 9 January 2009 at 10:33

I'm still working to fix the 2 issues listed in #25. In the meantime another type of character has come up. What about unicode characters used internationally?

For example, http://例え.テスト/.

There are international domain names, there are url paths with international characters, like ü, on wikipedia.

Should we support these or should we only support their ascii representation?

Log in or register to post comments

Comment #34

Garrett Albright commented 9 January 2009 at 16:38

Would converting them to their ASCII representation and then doing the tests against that be a practical option?

Log in or register to post comments

Comment #35

mfer commented 10 January 2009 at 15:56

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-35.patch	3.18 KB

This updated fixes the two issues Heine found in #25. That is, it no longer allows the % character. Instead it looks for "% HEXDIGIT HEXDIGIT". It, also, allows for well formed IPv6. That is in the format

[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]

Where a HEXDIGIT can happen 0 to 4 times between : and the whole thing is enclosed in [].

Added additional tests for these cases.

Log in or register to post comments

Comment #36

dries commented 10 January 2009 at 20:02

Looks good to me, although it is a tiny bit light on documentation, IMO.

Log in or register to post comments

Comment #37

mfer commented 10 January 2009 at 20:54

Status	File	Size
new	valid_url-124492-37.patch	3.25 KB

Added some additional comments to the regex. Is there someplace specific I can add more documented detail?

Log in or register to post comments

Comment #38

dries commented 11 January 2009 at 08:39

Status:

Needs review

» Fixed

Great! Committed to CVS HEAD. I guess we can mark this 'fixed' now. Good work.

Log in or register to post comments

Comment #39

mfer commented 12 January 2009 at 01:08

Version:	7.x-dev	» 6.x-dev
Status:	Fixed	» Reviewed & tested by the community

Status	File	Size
new	valid_url-124492-39.patch	2.06 KB

Can we get this bug fixed in drupal 6 as well? The attached patch is a direct backport of the drupal 7 valid_url function.

Log in or register to post comments

Comment #40

gábor hojtsy

he/him

Hungarian

Hungary

commented 12 January 2009 at 15:34

Status:

Reviewed & tested by the community

» Needs work

Why does "The path and query (optional)" has the ! twice (and also the preg in the other conditional case)? Do I miss something? (This might also need cleanup in Drupal 7).

Log in or register to post comments

Comment #41

mfer commented 13 January 2009 at 00:15

Status:

Needs work

» Needs review

Status	File	Size
new	valid_url-124492-41-D6.patch	2.06 KB
new	valid_url-124492-41-D7.patch	937 bytes

Nice catch Gábor.

The patch that ends in D7 removes the additional !. The patch that ends in D6 is a direct backport of the D7 function. Ran the Valid Url tests and they all pass. Also, manually tested the ! in the path.

Log in or register to post comments

Comment #42

mfer commented 13 January 2009 at 00:15

Version:

6.x-dev

» 7.x-dev

Log in or register to post comments

Comment #43

mfer commented 13 January 2009 at 12:07

As for a game plan moving forward, I would like to support international domain names (which currently work) among other things. Details and a path forward are at #295021: Expand/augment UrlHelper::isValid() to improve correctness and handle more URL types