Since PHP 4.4.0 and 5.1.0, there is a better way to create clean URLs. The function pathauto_cleanstring() could be as simple as this (assuming '-' is the separator, which is better for Google than the default underscore '_'):
function pathauto_cleanstring($string)
{
  $url = $string;
  $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
  $url = trim($url, "-");
  $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
  $url = strtolower($url);
  $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
  return $url;
}
The obvious advantage is that you can never forget a translation pair (for example, the 4.7 release is currently missing the š=>s conversion).
The script comes from Jakub Vrana and was originally published here: http://php.vrana.cz/vytvoreni-pratelskeho-url.php (the article is in Czech only).
Comments
Comment #1
greggles: This seems great, but it will have to wait until those versions of PHP become standard, which they currently aren't: http://drupal.org/requirements
Comment #2
nicholasthompson: According to the docs, preg_replace has been present since 3.0.9 and iconv since 4.0.5. The others have been core functions since version 3.
Comment #3
greggles: @nicholasThompson - do you have a citation?
iconv was the item that didn't look fully supported to me.
Comment #4
nicholasthompson: iconv: http://uk.php.net/manual/en/function.iconv.php
preg_replace: http://uk.php.net/manual/en/function.preg-replace.php
Functions like iconv_strlen() are PHP 5 only, but iconv() itself is available in PHP 4.0.5 and above.
Comment #5
FiReaNGeL: In my experience, iconv //TRANSLIT isn't foolproof; some characters (some of the really odd ones) will just disappear instead of being replaced. I think a transliteration table like the one we've been using is safer (as long as it's complete).
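For comparison, here is a minimal sketch of the table-based approach described in this comment. The names example_translit_table() and example_table_cleanstring() and the short table are hypothetical and deliberately incomplete; this is not pathauto's actual table or code, only an illustration of the idea.

```php
<?php
// Hypothetical, deliberately incomplete table for illustration only.
// A real table would need many more pairs.
function example_translit_table()
{
  return array(
    'Ä' => 'A', 'ä' => 'a', 'Ö' => 'O', 'ö' => 'o',
    'Ü' => 'U', 'ü' => 'u', 'ß' => 'ss',
    'Š' => 'S', 'š' => 's', 'Č' => 'C', 'č' => 'c',
    'Ž' => 'Z', 'ž' => 'z', 'É' => 'E', 'é' => 'e',
  );
}

// Hypothetical helper: table lookup first, then ASCII-only cleanup.
function example_table_cleanstring($string, $separator = '-')
{
  // strtr() with an array replaces whole UTF-8 byte sequences, longest key first.
  $url = strtr($string, example_translit_table());
  $url = strtolower($url);
  // Every run of characters that is not an ASCII letter, digit or '_'
  // (including bytes of characters missing from the table) becomes one separator.
  $url = preg_replace('~[^a-z0-9_]+~', $separator, $url);
  return trim($url, $separator);
}
```

With this sketch, a character that is missing from the table is replaced by the separator rather than transliterated, so the result is only as good as the table is complete.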
Comment #6
greggles: FiReaNGeL - it seems like obscure problems in the iconv code are things we could report as bugs to the PHP project to have them fixed.
I'm quite interested in this solution because the other solutions are difficult to update and get right.
Comment #7
greggles: Here's an (untested) patch that implements this in a slightly different order and with some extra items that pathauto needs to consider, like user configuration for apostrophes and maxlength.
I'm curious about the trim($url,$separator) because the current pathauto has always used a ctype_alnum and preg_replace function. The ctype_alnum isn't supported on all platforms (see http://drupal.org/node/20289).
Tests and critiques welcome (I'm about to do one myself).
Comment #8
greggles: Perhaps iconv is really bad or I'm just not doing something right in the patch, but when I used "Ä Ü ' - stuff --" as a test URL, the output was simply "stuff".
Also, the patch didn't apply for me - I'm not sure of the exact problem, but apologies in advance if it doesn't work for others. I've attached the entire module file as a workaround for now.
Comment #9
greggles: Another possible method is the one provided in the accents module, starting around line 20 of the .module file:
http://drupal.org/project/accents
http://cvs.drupal.org/viewcvs/drupal/contributions/modules/accents/accen...
However, I'm not sure if that removes all of the characters that people want to remove...
Comment #10
greggles: I'm postponing this because I couldn't get it to work and there are other techniques in this issue queue that will work better.
Comment #11
greggles: Given that http://drupal.org/node/61815 has been applied, I think we can mark this as won't fix. That feels like the best solution to me for now. If we need to revisit this idea, we can re-open this issue at that time.