URL validation broken with DNS suffix append
crackerjackmack - June 23, 2008 - 20:34
| Project: | Link |
| Version: | 5.x-2.2 |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
Description
The attached patch fixes two issues I ran into during site maintenance.
- Domains ending in .local domains are considered invalid. This is standard for Zeroconf and most Active Directory deployments
- In locations where DNS search suffix's are deployed (via DHCP or Active Directory), the matching tool assumed a FQDN is required.
As for #2, the code is backwards compatible with FQDN, validation still. Tested and properly verifies for combinations of host.domain.tld and domain.tld as well as just host.
The purpose for this patch is to support more infrastructure sites in a corporate environment as well as supporting all RFC web standards.
| Attachment | Size |
|---|---|
| link.module.patch | 1.01 KB |

#1
RegExs are notoriously hard to describe, but if you could tell me the difference between these changed I'd appreciate it. The original code is "one or more", while the new code looks like a single expression, followed by a "zero" or more, seemingly identical behaviors.
Original: ([a-z0-9\-_\[\]]*\.))+)Updated: ([a-z0-9\-_\[\]])*)(\.(([a-z0-9\-_\[\]])+\.)*)
#2
- $domain = '((([a-z0-9]([a-z0-9\-_\[\]]*\.))+)('. LINK_DOMAINS .'|[a-z]{2}))';+ $domain = '(([a-z0-9]([a-z0-9\-_\[\]])*)(\.(([a-z0-9\-_\[\]])+\.)*('. LINK_DOMAINS .'|[a-z]{2}))?)';
I'll do my best in my conversion from regex2english....
The first regex entry basically says
The second regex more or less says
The second regex is obviously much more complex but it allows for intranet sites who heavily rely on DNS suffix searching for websites to properly use the links module. Because I can match any of the following domains using the second regex statement
I hope I did a good regex2english conversion for you
-- Kevin Landreth
#3
Ah I see now. Basically allowing domains such as "localhost" or any domain that doesn't actually have a top-level .com-like domain.
The trouble could be with the auto-cleanup of URLs. For example, the URL node/10 should be rendered as an "internal" link, prefixed with the site name. If "node" is suddenly a valid domain, all the internal links on that site have suddenly started pointing to http://node/10, which could be rather devastating. If this is the case, it'd cause link.module to loose the ability entirely to support internal URLs. Could you test and see if this has occured?
#4
Ah yes, you are right, bad testing on my behalf. Let me rework this some.
Basically the only way I can absolutely match a short DNS name is if the protocol is supplied. But since link_validate_url() does a weak for a external_url based off of FQDN, shouldn't it require a protocol since it's attempting to validate? shouldn't that logic be moved to link_cleanup_url() ?
#5
So here's the rundown on how link cleans up URLs (the most logical I could think of at the time):
Local path into URL: node/10 => http://example.com/node/10 (this works for any local path, except those containing periods in the first word)
Very Abbreviated domain into URL: google.com => http://google.com
Abbreviated domain into URL: www.google.com => http://www.google.com
Full URL: http://www.google.com => http://www.google.com
Your suggesting to allow protocol://domain structure right? I think that should be fine, though more and more link module is hardly doing any sort of URL validation, we need to start looking into varying levels of validation (allow local only, remote only, or both, sort of options).
#6
The only reason I submitted a patch was that link_validate_url() was returning false, preventing me from submitting http://host/page
The following code from validate_url() isn't reflective of your statement.
if (preg_match($external_pattern . $end, $text)) {return LINK_EXTERNAL;
}
elseif (preg_match($internal_pattern . $end, $text)) {
return LINK_INTERNAL;
}
elseif (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
return LINK_EMAIL;
}
elseif (strpos($text, '<front>') === 0) {
return LINK_FRONT;
}
return FALSE;
Would you say that it would be best to change the evaluation process for this ? If so, I re-rolled the patch to include the logic you stated above. What would happen is that host/dir would be evaluated as a internal link instead of http://host/dir because of the similarities to internal link. I think the occurrence of this would be low and users would quickly edit the links to include http:// or other protocols.
I believe that going this route will have the lowest user impact.
#7
The above series of regexs to determine an internal or external link previously did work as I described, but only before the changes to the regexs that you suggested. Because "node/10" did not register as an external link before your changes, it was safe to do the external check first, then the internal check. Now that "node/10" is counted as a valid external URL, switching the logic around is necessary, as you included in your patch.
Regardless of how things did or did not work, this patch looks fine and works as described. I committed it and it'll be in the 2.3 version coming out later today.
#8
Automatically closed -- issue fixed for two weeks with no activity.