Remove restrictions on path aliases (support IRIs)

greggles - November 18, 2006 - 04:23
Project: Drupal
Version: 5.x-dev
Component: path.module
Category: bug report
Priority: normal
Assigned: Unassigned
Status: closed
Description

In my brief testing it's impossible to create a URL alias that includes characters which should be allowed.

In IRC UnConeD also pointed out that "core is broken" in this regard.

#1

Steven - November 18, 2006 - 05:01
Title: support IRI » Remove restrictions on path aliases (support IRIs)
Version: 6.x-dev » 5.x-dev
Status: active » patch (code needs review)

Things to know:

  • All menu paths are urlencoded on output (by Drupal) when placed in the GET query.
  • All GET values (including the menu path) are urldecoded on input (by PHP).

This means that the URLs resulting from user-defined menu paths and aliases will always be valid, even for menu paths that use punctuation like "#" or "!" or arbitrary Unicode characters.

e.g.

Path/Alias = blog/Bunnies are made of people!?
Resulting URI = http://example.com/base-path/?q=blog/Bunnies+are+made+of+people%21%3F

Path/Alias = blog/My résumé
Resulting URI = http://example.com/base-path/?q=blog/My+r%C3%A9sum%C3%A9

Path/Alias = blog/アニメ
Resulting URI = http://example.com/base-path/?q=blog/%E3%82%A2%E3%83%8B%E3%83%A1
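The round trip described above can be sketched outside of Drupal. The snippet below is only an illustration in Python, not the code Drupal or PHP actually runs: `encode_path` and `decode_path` are hypothetical helper names, and keeping "/" unescaped is an assumption made to match the example URIs shown above.

```python
from urllib.parse import quote_plus, unquote_plus

BASE = "http://example.com/base-path/?q="


def encode_path(path: str) -> str:
    """Percent-encode a path or alias for use as the q= value.

    Spaces become '+', other reserved or non-ASCII characters become
    %XX escapes of their UTF-8 bytes. '/' is left alone (assumption,
    chosen to match the example URIs in this thread).
    """
    return quote_plus(path, safe="/")


def decode_path(encoded: str) -> str:
    """Reverse encode_path, as the server does when reading the GET value."""
    return unquote_plus(encoded)


for alias in ["blog/Bunnies are made of people!?", "blog/My résumé", "blog/アニメ"]:
    encoded = encode_path(alias)
    # Whatever the alias contains, decoding returns the original text.
    assert decode_path(encoded) == alias
    print(BASE + encoded)
```

Because encoding happens on output and decoding on input, the stored alias never needs to be restricted to URL-safe characters.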

In spite of this, path.module requires that path aliases contain only characters valid in relative URLs. This makes no sense. The attached patch removes this restriction.

This is a necessary step towards allowing e.g. pathauto to support arbitrary languages. The current practice of transliterating letters to ASCII and removing accents is a hack: it produces 'prettier' URLs, but ones that are less meaningful to search engines. It is also useless for languages that do not use the Latin script.

Note that the 'odd' escapes for the Unicode characters above are perfectly normal. This is the standard used for IRIs (the i18n'd form of URIs, see RFC 3987) and supported by all the major browsers and search engines.
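To show where those escapes come from, here is a simplified sketch of the IRI-to-URI mapping: each non-ASCII character is replaced by the percent-encoded bytes of its UTF-8 form, and ASCII is left untouched. `iri_to_uri` is a hypothetical name, and the sketch deliberately ignores the other parts of the RFC 3987 mapping (Unicode normalization, internationalized host names).

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character.

    Simplified illustration only: ASCII characters, including existing
    %XX escapes and delimiters such as '?' and '/', pass through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/base-path/?q=blog/アニメ"))
```

A browser that displays the Unicode form in its address bar is applying the reverse of this mapping for presentation; the request it sends still uses the escaped form.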

However, because of phishing abuse, some browsers will not show the Unicode characters in some or all IRIs in the address bar and/or status bar. e.g. Japanese Wikipedia on Google.

Attachment    Size
iri.patch     2.15 KB

#2

chx - November 18, 2006 - 08:34
Status: patch (code needs review) » patch (reviewed & tested by the community)

Lovely patch. Less restrictions, more features, less code, more comments.

#3

Dries - November 21, 2006 - 19:39
Status: patch (reviewed & tested by the community) » fixed

Committed to CVS HEAD! :)

#4

Anonymous - December 5, 2006 - 19:45
Status: fixed » closed
 
 
