Add Internationalized domain names (IDN) support to core
Bence - February 9, 2008 - 16:28
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
| Issue tags: | IDN |
Description
URLs containing accented characters don't turn into links:
http://www.magyarország.hu/
http://www.magyarorszag.hu/
Notice the character á in the first URL. This is a valid and working URL.
(Same issue in Drupal 5.7)

#1
#2
This patch adds the ability to match a Unicode letter (\p{L}) in a URL (not in e-mails).
At least http://www.magyarország.hu/ should now match.
I can't say it's a good solution.
Would it be better to use \p{M}?
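
To illustrate the \p{L} versus \p{M} question, here is a minimal JavaScript sketch. The two patterns and the helper matchesWhole() are hypothetical and are not the patterns from the patch. It assumes the accented letter may arrive either precomposed (U+00E1) or as a plain "a" followed by a combining acute accent (U+0301); in the second case \p{L} alone stops at the combining mark.

```javascript
// Hypothetical patterns for illustration only; not the ones used in the patch.
// Host characters: any Unicode letter, digits, dot and hyphen.
const lettersOnly = /https?:\/\/[\p{L}0-9.-]+\/?/u;
// Same, but combining marks (\p{M}) are accepted as well.
const lettersAndMarks = /https?:\/\/[\p{L}\p{M}0-9.-]+\/?/u;

// True if the pattern matches the whole string rather than only a prefix of it.
function matchesWhole(pattern, text) {
  const m = pattern.exec(text);
  return m !== null && m[0] === text;
}

// The same URL with a precomposed letter and with a letter plus combining accent.
const precomposed = "http://www.magyarorsz\u00E1g.hu/";
const decomposed = "http://www.magyarorsza\u0301g.hu/";

console.log(matchesWhole(lettersOnly, precomposed));
console.log(matchesWhole(lettersOnly, decomposed));
console.log(matchesWhole(lettersAndMarks, decomposed));
```

Under these assumptions, \p{L} is enough for precomposed input, while decomposed input would also need \p{M} (or normalization before matching).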
#3
Note that e-mail addresses may contain non-ASCII letters, too. And a wide range of Unicode letters are allowed (maybe Chinese characters, too?).
http://en.wikipedia.org/wiki/Internationalized_domain_name
However, each TLD supports a different set of characters. A .de domain with a Chinese character isn't valid. And I think (but I'm not sure) that generic top-level domains can contain only ASCII letters, numbers and hyphens.
In practice, I think it is far too complex to cover all these exceptions. However, we do need to check whether the domain is under a generic top-level domain and, if so, not allow international characters, because generic top-level domains are the most common. Checking whether a specific country code top-level domain can contain a specific character is very complex.
#4
You suggest not allowing non-ASCII letters in generic top-level domains (http://www.iana.org/gtld/gtld.htm).
You suggest allowing non-ASCII letters in other domains.
Am I right?
OK, I'll try to make a new patch.
#5
Yes, the filter must also check the top-level domain: if it is a generic top-level domain, then the domain can contain only ASCII letters, numbers and hyphens. So the filter must not turn this into a link, since it's an invalid domain: http://www.drupál.org (Generic top-level domains may also contain accented letters.) However, a country code top-level domain can contain international letters, so if a domain has a country code top-level domain and contains non-ASCII letters, then this MAY be a valid link, so turn it into a link.
The filter should also check the country code top-level domain further, but this is very complicated. (For example, German domains cannot contain Russian letters, and vice versa.)
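
The rule as first proposed in this comment could be sketched roughly like this in JavaScript. The names GENERIC_TLDS, isAsciiHost() and mayLinkHost() are hypothetical, and the TLD list is a short illustrative sample rather than the IANA list. The sketch does not attempt any per-registry character checks, and it does not reflect the caveat noted above that generic top-level domains may also contain accented letters.

```javascript
// Assumption: a short sample list; the real list of generic TLDs is kept by IANA.
const GENERIC_TLDS = new Set(["com", "org", "net", "info", "biz"]);

// True if the host consists only of ASCII letters, digits, hyphens and dots.
function isAsciiHost(host) {
  return /^[a-z0-9.-]+$/i.test(host);
}

// True if the host may be turned into a link under the proposed rule:
// ASCII only under generic TLDs, Unicode letters allowed everywhere else.
function mayLinkHost(host) {
  const labels = host.toLowerCase().split(".");
  const tld = labels[labels.length - 1];
  if (GENERIC_TLDS.has(tld)) {
    return isAsciiHost(host);
  }
  // Other TLDs: accept Unicode letters and marks; registry rules are not checked.
  return /^[\p{L}\p{M}0-9.-]+$/u.test(host);
}

console.log(mayLinkHost("www.drup\u00E1l.org"));
console.log(mayLinkHost("www.magyarorsz\u00E1g.hu"));
```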
#6
subscribe
#7
I found this URL validation regex in jQuery... maybe of interest; it looks much heavier than the above.
// http://docs.jquery.com/Plugins/Validation/Methods/url
var domain = /^(https?|ftp|news|nntp|telnet|irc|ssh|sftp|webcal):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)*(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?$/i.exec(window.location);
valid_url() also has bugs with IDN, and there could be more places... Changing to a more general title to get more attention.
#8
#9
We already have many \p regex character classes defined as constants in search.module.
#10
Subscribing