URL validation broken with DNS suffix append

crackerjackmack - June 23, 2008 - 20:34
Project:Link
Version:5.x-2.2
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed
Description

The attached patch fixes two issues I ran into during site maintenance.

  1. Domains ending in .local domains are considered invalid. This is standard for Zeroconf and most Active Directory deployments
  2. In locations where DNS search suffix's are deployed (via DHCP or Active Directory), the matching tool assumed a FQDN is required.

As for #2, the code is backwards compatible with FQDN, validation still. Tested and properly verifies for combinations of host.domain.tld and domain.tld as well as just host.

The purpose for this patch is to support more infrastructure sites in a corporate environment as well as supporting all RFC web standards.

AttachmentSize
link.module.patch1.01 KB

#1

quicksketch - June 23, 2008 - 22:17

RegExs are notoriously hard to describe, but if you could tell me the difference between these changed I'd appreciate it. The original code is "one or more", while the new code looks like a single expression, followed by a "zero" or more, seemingly identical behaviors.

Original: ([a-z0-9\-_\[\]]*\.))+)
Updated:  ([a-z0-9\-_\[\]])*)(\.(([a-z0-9\-_\[\]])+\.)*)

#2

crackerjackmack - June 24, 2008 - 00:15

-  $domain = '((([a-z0-9]([a-z0-9\-_\[\]]*\.))+)('. LINK_DOMAINS .'|[a-z]{2}))';
+  $domain = '(([a-z0-9]([a-z0-9\-_\[\]])*)(\.(([a-z0-9\-_\[\]])+\.)*('. LINK_DOMAINS .'|[a-z]{2}))?)';

I'll do my best in my conversion from regex2english....

The first regex entry basically says

Match any letter beginning with a number or letter, followed by any number of valid characters ending with a dot, one or more times. This statement would get be followed by any of these strings or exactly two letters.

The second regex more or less says

Match any letter beginning with a number or letter, possibly followed by any number of valid characters more than once. After which, only match any valid characters followed by a dot none or more times followed by one of these strings or exactly two letters when a dot exist after the first condition.

The second regex is obviously much more complex but it allows for intranet sites who heavily rely on DNS suffix searching for websites to properly use the links module. Because I can match any of the following domains using the second regex statement

  • a
  • a.com
  • a.google.com
  • a.labs.google.com
  • a.google fails, as it should

I hope I did a good regex2english conversion for you

-- Kevin Landreth

#3

quicksketch - June 24, 2008 - 00:42

Ah I see now. Basically allowing domains such as "localhost" or any domain that doesn't actually have a top-level .com-like domain.

The trouble could be with the auto-cleanup of URLs. For example, the URL node/10 should be rendered as an "internal" link, prefixed with the site name. If "node" is suddenly a valid domain, all the internal links on that site have suddenly started pointing to http://node/10, which could be rather devastating. If this is the case, it'd cause link.module to loose the ability entirely to support internal URLs. Could you test and see if this has occured?

#4

crackerjackmack - June 24, 2008 - 02:57

Ah yes, you are right, bad testing on my behalf. Let me rework this some.

Basically the only way I can absolutely match a short DNS name is if the protocol is supplied. But since link_validate_url() does a weak for a external_url based off of FQDN, shouldn't it require a protocol since it's attempting to validate? shouldn't that logic be moved to link_cleanup_url() ?

#5

quicksketch - June 24, 2008 - 04:52

So here's the rundown on how link cleans up URLs (the most logical I could think of at the time):

Local path into URL: node/10 => http://example.com/node/10 (this works for any local path, except those containing periods in the first word)
Very Abbreviated domain into URL: google.com => http://google.com
Abbreviated domain into URL: www.google.com => http://www.google.com
Full URL: http://www.google.com => http://www.google.com

Your suggesting to allow protocol://domain structure right? I think that should be fine, though more and more link module is hardly doing any sort of URL validation, we need to start looking into varying levels of validation (allow local only, remote only, or both, sort of options).

#6

crackerjackmack - June 24, 2008 - 13:28

The only reason I submitted a patch was that link_validate_url() was returning false, preventing me from submitting http://host/page

The following code from validate_url() isn't reflective of your statement.

  if (preg_match($external_pattern . $end, $text)) {
    return LINK_EXTERNAL;
  }
  elseif (preg_match($internal_pattern . $end, $text)) {
    return LINK_INTERNAL;
  }
  elseif (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
    return LINK_EMAIL;
  }
  elseif (strpos($text, '<front>') === 0) {
    return LINK_FRONT;
  }
  return FALSE;

Would you say that it would be best to change the evaluation process for this ? If so, I re-rolled the patch to include the logic you stated above. What would happen is that host/dir would be evaluated as a internal link instead of http://host/dir because of the similarities to internal link. I think the occurrence of this would be low and users would quickly edit the links to include http:// or other protocols.

I believe that going this route will have the lowest user impact.

AttachmentSize
link.module.patch 2.44 KB

#7

quicksketch - June 29, 2008 - 21:44
Status:needs review» fixed

The above series of regexs to determine an internal or external link previously did work as I described, but only before the changes to the regexs that you suggested. Because "node/10" did not register as an external link before your changes, it was safe to do the external check first, then the internal check. Now that "node/10" is counted as a valid external URL, switching the logic around is necessary, as you included in your patch.

Regardless of how things did or did not work, this patch looks fine and works as described. I committed it and it'll be in the 2.3 version coming out later today.

#8

Anonymous (not verified) - July 13, 2008 - 21:52
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

 
 

Drupal is a registered trademark of Dries Buytaert.