I'm doing more thorough testing of my SimpleFeed module and I've noticed that it has had trouble creating some feed items. Well, it turns out the culprit is valid_url() -- I have some feeds that have crazy URLs in them, with characters like "!" which are not in the valid_url() check in Drupal.

While I will update my module to *not* use valid_url(), I suggest we fix this in core.

After some research:

(([a-zA-Z][0-9a-zA-Z+\-\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\.\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\.\-_!~*'()%]+)?

It was constructed according to the BNF grammar given in RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt).

Or a bit simpler:

(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?
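
For context, here is a minimal sketch of how the simpler pattern could be exercised in plain PHP. The anchors, delimiters, /i modifier, and the helper name are assumptions for illustration, not part of the proposal itself:

// Hypothetical helper to try out the simpler pattern above; this is not
// the actual valid_url() implementation.
function _test_url_pattern($url) {
  $pattern = '/^(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?$/i';
  return (bool) preg_match($pattern, $url);
}

// Example from later in this issue: the @ in the path is accepted here.
var_dump(_test_url_pattern('http://www.flickr.com/people/93099231@N00/'));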

Making a patch is trivial. Agreeing on the right approach is not. :-)


Comments

robertDouglass’s picture

We need a test suite that has a sampling of all the URLs we want to be valid or invalid.

RobRoy’s picture

A bit OT: Are we moving towards using Simpletest as the testing suite for Drupal? I wanted to write some tests for another contrib module and wanted to make sure I was being forward-looking. I suppose we could add some URL tests to the core tests included in Simpletest.

robertDouglass’s picture

Simpletest++. We intend to extend that platform some during SoC, so writing tests for it is forward-looking. Care to make a test suite for this issue?

sun’s picture

Title: improve valid_url() checking » valid_url() does not support all valid URL characters
Version: 6.x-dev » 7.x-dev
Category: task » bug

This is a bug, as has been stated in #247880: valid_url needs to support more characters.
First tests have been committed as part of #283806: Notice appears when drupal_http_request called with an invalid URL.

c960657’s picture

grndlvl’s picture

Status: Active » Closed (duplicate)

Test available for valid_url at #295021: Extend valid_url() as well.

mfer’s picture

Status: Closed (duplicate) » Active

This issue, where valid_url() fails on many valid URLs because it isn't checking characters the way it should, is rearing its ugly head in multiple issues and multiple ways. It looks like we need a better valid_url() function.

See these issues, in addition to this one, for a few different proposed solutions:
#41651: valid_url() doesn't accept URLs with @ in it
#124492: valid_url() does not support all valid URL characters
#308977: Replace valid_url with filter_var and FILTER_VALIDATE_URL

Before chasing after any of these possible solutions, we need some guidance on what kind of checking we want: very loose, strict, or somewhere in the middle? Multiple solutions have been proposed. Which type of solution would be good?

Thoughts? I'm willing to take on this patch if we have a general path forward.

mfer’s picture

Status: Active » Needs review
FileSize
3.44 KB

Summary: currently valid_url() fails on some valid URLs. For example, http://www.flickr.com/people/93099231@N00/ would fail because of the @ symbol, even though it is a legitimate (and common) address. This is a bug in the regex.

The attached patch updates the regex and provides tests. The regex is similar to the second one proposed by m3avrck above, but is stricter and fixes an issue in that regex.

mfer’s picture

Status: Needs review » Needs work

Tasks:
- Make the regex rfc compliant. (or darn close)
- Add some more tests. Especially difficult ones.

It seems valid_url() is only called for form validation in core.

If someone needs to test a URL but doesn't need strict validation, filter_var() with FILTER_VALIDATE_URL can do the job.
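
For example, a loose check along these lines (FILTER_VALIDATE_URL has its own quirks, which are discussed in the linked issues, so this is only a sketch of the non-strict option):

// filter_var() returns the URL on success and FALSE on failure.
$url = 'http://example.com/path?page=1';
if (filter_var($url, FILTER_VALIDATE_URL) !== FALSE) {
  // The URL is well formed enough for non-strict purposes.
}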

mfer’s picture

The current regex I'm working with is:

preg_match('
      /^                                   # Start at the beginning of the text
      (?:ftp|http|https):\/\/              # Look for ftp, http, or https
      (?:                                  # Username:password combinations (optional)
        [\w\.\-\+]+                        # A username
        :{0,1}                             # an optional colon to separate the username and password
        [\w\.\-\+]*@                       # A password
      )?
      (?:[a-z0-9\-\.]+)                    # The domain limiting it to just allowed characters
      (?::[0-9]+)?                         # Server port number (optional)
      (?:                                  # The path (optional)
        \/|                                # a forward slash
        \/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)| # or a forward slash followed by a full path
        \?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)  # or a question mark followed by key value pairs
       )?$
     /xi', $url);

Thoughts?

mfer’s picture

Status: Needs work » Needs review
FileSize
4.27 KB

An updated and simpler regex.

Status: Needs review » Needs work

The last submitted patch failed testing.

mfer’s picture

Looks like this failed because profile.test uses the randomName() method to generate a URL name, and randomName() can include the character _, which is invalid in the domain part of a URL. That test will need to be updated as well.
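
A minimal sketch of the kind of adjustment the test might need; the str_replace() approach and the exact string built here are my assumptions, not necessarily what the attached patch does:

// Build a homepage value without underscores, since _ is not valid in the
// domain part of a URL but randomName() may produce it.
$web_address = 'http://www.' . str_replace('_', '', $this->randomName(10)) . '.org';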

mfer’s picture

Status: Needs work » Needs review
FileSize
5.05 KB

An updated patch that fixes the profile module test. Otherwise it's the same as the last one.

c960657’s picture

Status: Needs review » Needs work

Whether a URL is considered "valid" depends on the context. I suggest adding to the Doxygen comment which specification the URLs are checked against, e.g. RFC 3986 (another possibility is the definition in the HTML5 draft specification), and what additional requirements there are, e.g. the restriction to only the ftp, http, and https schemes.

These URLs are valid according to RFC 3986 (AFAICT) but are rejected by the function:

http://example.org?
http://john%20doe:secret:foo@example.org/
http://example.org/~,$'*;
http://caf%C3%A9.example.org
+        :{0,1}

I suggest using ? instead of {0,1} like elsewhere in the pattern.

+      'sudomain.example.com',

Did you mean subdomain?

+        \/|                                       # a forward slash
+        (?:\/|\?)(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)  # or a forward slash or question mark followed by a full path

What is the purpose of having these two alternative lines, rather than getting rid of the first and then replacing the + quantifier with * ? Most of the backslashes are not necessary between square brackets […] – I think they reduce readability. Also, I don't think it is appropriate to refer to the string following a question mark as a full path.
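
In other words, the two lines could collapse into a single alternative, roughly like this (a sketch of the suggestion, not a line from the patch):

      (?:\/|\?)[\w#!:.?+=&%@\-\/()]*   # a slash or question mark, optionally followed by path/query characters

This also has the side effect of accepting a bare trailing ?, as in http://example.org? above.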

mfer’s picture

Status: Needs work » Needs review
FileSize
5.17 KB

@c960657 - Thanks for the feedback.

This patch updates per the comments above, and adds additional tests to cover more possibilities. RFC 3986 is noted in the valid_url() header comment as well.

Patch Description: The current version of valid_url() in core fails for many valid URLs. This is partly due to a spec change and partly because the regular expression does not cover all URL variations. This patch updates the regular expression to cover more cases and provides tests for many of them.

Dries’s picture

Status: Needs review » Needs work

+ $this->assertFalse($valid_url, $test_url .t(' is NOT a valid url.'));
There are a number of such instances in the patch. We should correctly use t().
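
For reference, the usual pattern is to pass the variable part through a placeholder rather than concatenating it, roughly like this (a sketch, not the exact line from the follow-up patch):

$this->assertFalse($valid_url, t('%url is not a valid URL.', array('%url' => $test_url)));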

mfer’s picture

Status: Needs work » Needs review
FileSize
5.23 KB

Oops. Fixed in the attached patch. Passes the URL through t() in the test.

drewish’s picture

Status: Needs review » Needs work
+    return (bool)preg_match("
+      /^                                                # Start at the beginning of the text
+      (?:ftp|https?):\/\/                               # Look for ftp, http, or https
+      (?:                                               # Userinfo (optional)
+        (?:[\w\.\-\+%!$&'\(\)*\+,;=]+:)*
+        [\w\.\-\+%!$&'\(\)*\+,;=]+@
+      )?
+      (?:[a-z0-9\-\.%]+)                                # The domain
+      (?::[0-9]+)?                                      # Server port number (optional)
+      (?:[\/|\?][\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]*)? # The path (optional)
+    $/xi", $url);

where did this regex come from? it'd be great to have a link to its source.

+  /**
+   * Implementation of getInfo().
+   */

should be removed.

+    return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);

could really use a comment explaining its purpose.

+      // _ is an invalid character for domain names.

It might be better to spell out that the underscore is not valid in domain names.

+  function testValidUrl() {

Needs a PHPDoc comment.

I'd split testValidUrl() up into separate tests (a rough skeleton follows the list):

  • testValidAbsolute()
  • testInvalidAbsolute()
  • testValidRelative()
  • testInvalidRelative()
  • testRandomGarbage() <-- I'd add this and put in some random Perl code and make sure it barfs.
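
A rough skeleton of how that split might look; the class name, the sample data, and the specific assertions are placeholders rather than the committed tests:

class ValidUrlTestCase extends DrupalWebTestCase {
  /**
   * Absolute URLs that should pass valid_url($url, TRUE).
   */
  function testValidAbsolute() {
    foreach (array('http://www.example.com/', 'ftp://user:pass@example.com:21/') as $url) {
      $this->assertTrue(valid_url($url, TRUE), t('%url is a valid URL.', array('%url' => $url)));
    }
  }

  /**
   * Absolute URLs that should fail valid_url($url, TRUE).
   */
  function testInvalidAbsolute() {
    foreach (array('http://www.ex ample.com/', 'just-a-string') as $url) {
      $this->assertFalse(valid_url($url, TRUE), t('%url is not a valid URL.', array('%url' => $url)));
    }
  }

  /**
   * Relative URLs that should pass valid_url($url).
   */
  function testValidRelative() {
    foreach (array('node/add', 'foo/bar?baz=1#anchor') as $url) {
      $this->assertTrue(valid_url($url), t('%url is a valid relative URL.', array('%url' => $url)));
    }
  }

  /**
   * Relative URLs that should fail valid_url($url).
   */
  function testInvalidRelative() {
    foreach (array('ex^mple', 'space here') as $url) {
      $this->assertFalse(valid_url($url), t('%url is not a valid relative URL.', array('%url' => $url)));
    }
  }
}
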
drewish’s picture

I'd like a summary of what characters are allowed under that RFC as comments, so we can have an idea what you think it does in case there's a problem with it later.

mfer’s picture

FileSize
5.97 KB

@drewish - I wrote this regular expression. I couldn't find one for RFC 3986 that would work. So, the source is this issue, and me. That's why I would like someone else to look over the regular expression.

Is there a reason to document

+    return (bool)preg_match("/^[\w#!:\.\?\+=&%@!$'~*,;\/\(\)\[\]\-]+$/i", $url);

when the current regular expression in this space isn't documented? I love documentation but I don't want to overdo it. :)

This patch fixes the other issues and breaks the tests up into four tests (two more invalid URL cases were added).

mfer’s picture

Status: Needs work » Needs review

@drewish - a listing of all valid characters depends on the URL part and would be quite a comment block. The allowed characters differ for the userinfo, domain, path, and so on.

Would it be worth the large comment block when there is a spec?

Dries’s picture

Code and tests look good to me. It is also properly documented. I'm not a regex wizard but this looks like RTBC to me.

mfer’s picture

I shared the regular expression (and provided a more generic URI one) at http://mattfarina.com/2009/01/08/rfc-3986-url-validation in hopes someone can use it, break it, or share one that's better.

Heine’s picture

False negative: http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html
False positive: http://somedomain.com/ind%ex.html

I don't understand why we need to limit valid urls to http, https and ftp. Checking the accepted scheme should be the job of another function.

mfer’s picture

Status: Needs review » Needs work

@Heine thanks for the feedback.

I left the scheme setup as it had previously been in Drupal. It would be trivial to change and have a valid regex that accepts all valid schemes. But I'm not sure it's appropriate, in this issue, to have scheme checking beyond either the schemes we already have or a properly formatted scheme. When I post my next update I'll post two patches: one that keeps the scheme limit the same, and one that handles a well-formed scheme per the spec.

Handling the % symbol and hex codes will take a little rework of the regex. The % is only valid when it is followed by a properly formatted hex code. Right now the % is simply allowed anywhere.
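
The percent-encoding rule from RFC 3986 boils down to a fragment along these lines (illustrative only; the updated patch may group it differently):

// A percent sign must be followed by exactly two hex digits, e.g. %C3%A9.
// Combined with the /i modifier, uppercase hex digits also match.
$pct_encoded = '%[0-9a-f]{2}';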

Garrett Albright’s picture

I'd like to toss in a vote for not limiting the schemes to /(ftp|https?)/. I occasionally use links with irc: and aim: schemes, and mailto: and javascript: still have their uses. Also, Tor and I believe some other anonymizing software use their own schemes as well. Limiting the scheme will cause frustrating false negatives.

That aside, good work on that regular expression, Matt. That must have taken some work to iron out. As a distraction, here's some interesting research on the ol' "Now you have two problems" quote.

c960657’s picture

Note that filter_xss_bad_protocol() and _filter_url() also contain lists of allowed schemes.

See also #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs for additional ideas for limiting the schemes etc. Most of this is outside the scope of this issue, though.

Heine’s picture

I contest the notion that limiting the scheme (or not) is out of scope. valid_url should return TRUE for valid URLs, FALSE for invalid, and not make a value judgement about permissible schemes.

mfer’s picture

@Heine I understand where you are coming from. I would like configurable schemes as well. This issue is filed as a bug (which I agree with). The current valid_url() fails for many valid URLs. Adding configurable schemes would be nice, but it is feature creep in a bug issue.

Let's get this one fixed and then we can add configurable schemes via #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs. The tests should go in with this patch, along with support for the correct characters. We can work on extending the function there.

Since this is a bug, I would like to get it backported to Drupal 5 and 6.

Garrett Albright’s picture

I don't think anyone is suggesting that the accepted schemes should be configurable; rather, that any scheme which contains valid characters should be accepted.

Dries’s picture

I committed the patch in #21 to CVS HEAD. Thanks mfer.

We can continue the conversation about schemes and fixing the remaining failures using follow-up patches. :-)

mfer’s picture

I'm still working to fix the two issues listed in #25. In the meantime, another type of character has come up: what about Unicode characters used internationally?

For example, http://例え.テスト/.

There are international domain names, and there are URL paths with international characters, like ü, on Wikipedia.

Should we support these, or should we only support their ASCII representation?

Garrett Albright’s picture

Would converting them to their ASCII representation and then doing the tests against that be a practical option?
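
For the domain part, PHP can produce the ASCII (Punycode) form of an internationalized domain name, something like the sketch below. This assumes the intl (or IDN) extension providing idn_to_ascii() is available, which is not a given for Drupal core:

// Convert an internationalized domain name to its ASCII (Punycode) form
// before running it through the validation regex.
$ascii_host = idn_to_ascii('例え.テスト');
// Expected: 'xn--r8jz45g.xn--zckzah'

// Non-ASCII path characters can be percent-encoded instead.
$encoded = rawurlencode('ü');
// Expected: '%C3%BC'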

mfer’s picture

Status: Needs work » Needs review
FileSize
3.18 KB

This update fixes the two issues Heine found in #25. That is, it no longer allows a bare % character. Instead it looks for "% HEXDIGIT HEXDIGIT". It also allows for well-formed IPv6 addresses, in the format

[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]

where a HEXDIGIT can appear 0 to 4 times between each : and the whole thing is enclosed in [].

Added additional tests for these cases.
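
As a rough illustration, the bracketed IPv6 form described above corresponds to a fragment along these lines (a sketch of the idea, not necessarily the exact committed pattern):

// Matches e.g. [FEDC:BA98:7654:3210:FEDC:BA98:7654:3210] when used with /i:
// groups of 0-4 hex digits separated by colons, wrapped in square brackets.
$ipv6_host = '\[(?:[0-9a-f]{0,4}:)+[0-9a-f]{0,4}\]';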

Dries’s picture

Looks good to me, although it is a tiny bit light on documentation, IMO.

mfer’s picture

FileSize
3.25 KB

Added some additional comments to the regex. Is there someplace specific where I can add more detailed documentation?

Dries’s picture

Status: Needs review » Fixed

Great! Committed to CVS HEAD. I guess we can mark this 'fixed' now. Good work.

mfer’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Reviewed & tested by the community
FileSize
2.06 KB

Can we get this bug fixed in Drupal 6 as well? The attached patch is a direct backport of the Drupal 7 valid_url() function.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Needs work

Why does "The path and query (optional)" has the ! twice (and also the preg in the other conditional case)? Do I miss something? (This might also need cleanup in Drupal 7).

mfer’s picture

Status: Needs work » Needs review
FileSize
2.06 KB
937 bytes

Nice catch Gábor.

The patch that ends in D7 removes the additional !. The patch that ends in D6 is a direct backport of the D7 function. Ran the Valid Url tests and they all pass. Also, manually tested the ! in the path.

mfer’s picture

Version: 6.x-dev » 7.x-dev
mfer’s picture

As for a game plan moving forward, I would like to support international domain names (which currently work) among other things. Details and a path forward are at #295021: filter_var() with FILTER_VALIDATE_URL accepts malformed URLs and rejects not all valid URLs

Dries’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs review » Reviewed & tested by the community

Committed the follow-up to D7. Moving back to D6. Thanks.

Gábor Hojtsy’s picture

Version: 6.x-dev » 5.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Committed to Drupal 6. I expect we need to backport this to 5.

mfer’s picture

Status: Patch (to be ported) » Reviewed & tested by the community
FileSize
2.03 KB

Here is a patch against Drupal 5. Same function as D6 and D7, just copied over.

drumm’s picture

Status: Reviewed & tested by the community » Fixed

Committed to 5.x.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

hass’s picture

alexanderpas’s picture

Issue tags: +IDN