I want to write a regexp that validates web links. It's going pretty well so far, but I've identified a couple of cases where the regexp fails to reject a bad URL. Here's the code:


  if (!preg_match(
    // The protocols: http://
    '/^((https|http|ftp|news):\/\/)?'.
    // domains
    '(([a-z]([a-z0-9\-_]*\.)+)'.
    '(aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|[a-z]{2})'.
    '(\/[a-z0-9_\-\.~]+)*'.
    '(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.\/%=&]*)?)?)'.
    // OR ip addresses
    '|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'.
    // port number
    '(:([0-9]{1,4}))'.
    // forward slash 0 or 1 times
    '((\/)|(\/(.*)))?'.
    // end of the expression, case insensitive
    '$/i', $text, $m)) {
    return false;
  } 

And here are the two examples that should be rejected but are accepted:

$text = 'drupal.org:';
$text = 'http://www.yahoo.com:80abc';

Thanks for any suggestions!

Comments

robertdouglass

$text = 'drupal.org.';

- Robert Douglass

scroogie

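Try adding parentheses around the OR block so that it looks like
( domain | ipaddress )
As written, the | splits the whole expression in two, so the ^ applies only to the domain branch and the port and $ apply only to the IP branch. The port should be optional for both the IP and the domain.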
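The grouping described above could look roughly like the sketch below. The function name valid_url() and the variable names are made up for illustration; the sub-patterns are copied from the original post, only rearranged so that the host alternatives sit inside one group and the port is optional for both.

```php
<?php
// Sketch only: valid_url() and the variable names are hypothetical.
// The sub-patterns are taken from the original post and regrouped.
function valid_url($text) {
    // optional protocol
    $protocol = '((https|http|ftp|news):\/\/)?';
    // domain name ending in a listed TLD or a two-letter code
    $domain = '[a-z]([a-z0-9\-_]*\.)+' .
        '(aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|[a-z]{2})';
    // dotted IP address
    $ip = '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}';
    // port, now optional for both kinds of host
    $port = '(:[0-9]{1,4})?';
    // path and query string, as in the original domain branch
    $path = '(\/[a-z0-9_\-\.~]+)*' .
        '(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.\/%=&]*)?)?';

    // ^ and $ now wrap the whole expression; the | only chooses the host
    $pattern = '/^' . $protocol . '(' . $domain . '|' . $ip . ')' .
        $port . $path . '$/i';

    return preg_match($pattern, $text) === 1;
}
```

With this arrangement 'drupal.org:', 'drupal.org.' and 'http://www.yahoo.com:80abc' are rejected, while 'drupal.org' and 'http://www.yahoo.com:80' still pass.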

What is the part '(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.\/%=&]*)?)?)' used for? It seems redundant at first sight.

dado

I often use this site as a starting point:
http://regexlib.com/Search.aspx?k=url
You might peruse the regexps on that page. Each regex tells you what it does and doesn't match.

I like to use The Regex Coach for testing and developing my regular expressions.
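For a quick check without a separate tool, a small script can print which part of the input the pattern actually consumed. The sketch below runs the pattern from the original post, unchanged, against the three problem inputs from this thread; matched_part() is a hypothetical helper name, not an existing function.

```php
<?php
// The pattern from the original post, unchanged.
$pattern =
    '/^((https|http|ftp|news):\/\/)?' .
    '(([a-z]([a-z0-9\-_]*\.)+)' .
    '(aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|[a-z]{2})' .
    '(\/[a-z0-9_\-\.~]+)*' .
    '(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.\/%=&]*)?)?)' .
    '|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' .
    '(:([0-9]{1,4}))' .
    '((\/)|(\/(.*)))?' .
    '$/i';

// Hypothetical helper: returns the matched substring, or null if no match.
function matched_part($pattern, $text) {
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$inputs = array('drupal.org:', 'drupal.org.', 'http://www.yahoo.com:80abc');
foreach ($inputs as $text) {
    // var_export() shows null and strings unambiguously
    printf("%-28s matched: %s\n", $text, var_export(matched_part($pattern, $text), true));
}
```

For all three inputs the matched part is shorter than the input ('drupal.org', 'drupal.org' and 'http://www.yahoo.com'), which is a hint that the match is not anchored at the end of the string.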
dado