When certain accented characters are converted to ASCII, extra dashes are added.

Some other characters are simply stripped.

E.g. Ăă Îî Ââ Şş Ţţ Iñtërnâţiônàlizætiønş is converted to ii-aa-in-te-rna-t-io-na-lizaetions.

I attached a file with a conversion "table" for all the accented characters in the Latin-1 and Latin Extended-A tables of the Unicode spec.

I would be glad to make a patch... if someone can show me how :-)

Thanks.

Comments

mikeryan’s picture

Assigned: Unassigned » mikeryan

I can reproduce the problem, but that table isn't in a useful form. Here's the code pathauto uses, borrowed from the user comments in the online PHP manual:

  // Convert accented characters to their ASCII counterparts.
  // utf8_decode() converts UTF-8 to ISO-8859-1 (Latin-1); characters outside
  // Latin-1 come back as "?", so they never reach the tables below.
  $output = strtr(utf8_decode($output),
    "\xA1\xAA\xBA\xBF" .
    "\xC0\xC1\xC2\xC3\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF" .
    "\xD0\xD1\xD2\xD3\xD4\xD5\xD8\xD9\xDA\xDB\xDD" .
    "\xE0\xE1\xE2\xE3\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF" .
    "\xF0\xF1\xF2\xF3\xF4\xF5\xF8\xF9\xFA\xFB\xFD\xFF",
    "!ao?AAAAACEEEEIIIIDNOOOOOUUUYaaaaaceeeeiiiidnooooouuuyy");
  // ...and ligatures too (one Latin-1 byte maps to two ASCII letters),
  // then convert the result back to UTF-8.
  $output = utf8_encode(strtr($output, array(
    "\xC4" => "Ae", "\xC6" => "AE", "\xD6" => "Oe",
    "\xDC" => "Ue", "\xDE" => "TH", "\xDF" => "ss",
    "\xE4" => "ae", "\xE6" => "ae",
    "\xF6" => "oe", "\xFC" => "ue", "\xFE" => "th")));

Any thoughts on modifying this to take care of the reported problem cases?
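
One possible direction, sketched here as an assumption rather than as pathauto's actual fix: Ă, Ş and Ţ belong to Latin Extended-A, which ISO-8859-1 cannot represent, so they are lost in utf8_decode() before either table is consulted. Mapping them while the string is still UTF-8 sidesteps that. The function name sketch_latin_to_ascii and the choice of replacements are hypothetical; the keys are the UTF-8 byte sequences of each character, written as escapes so the file itself stays pure ASCII.

```php
<?php
// Hypothetical helper, not part of pathauto. It would have to run BEFORE
// the utf8_decode() step, while $text is still UTF-8.
function sketch_latin_to_ascii($text) {
  // Keys are UTF-8 byte sequences; strtr() with an array replaces whole
  // substrings, so multi-byte keys are matched as a unit.
  static $map = array(
    "\xC4\x82" => 'A', // U+0102 A with breve
    "\xC4\x83" => 'a', // U+0103 a with breve
    "\xC5\x9E" => 'S', // U+015E S with cedilla
    "\xC5\x9F" => 's', // U+015F s with cedilla
    "\xC5\xA2" => 'T', // U+0162 T with cedilla
    "\xC5\xA3" => 't', // U+0163 t with cedilla
  );
  return strtr($text, $map);
}
```

Characters that are not in the map pass through unchanged, so the existing Latin-1 tables could still handle the rest afterwards.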

Thanks.

Gabriel R.’s picture

I found that bit of code on my own too, but I couldn't find a way to convert a text file into \xNN escape sequences. If anything comes up, I'll post it to this issue.
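
For what it's worth, the conversion to \xNN escapes can be scripted in a few lines. The following is only a minimal sketch; sketch_to_hex_escapes is a made-up name, and it assumes the input string is already UTF-8.

```php
<?php
// Hypothetical helper: turn each byte of a string into a \xNN escape,
// suitable for pasting into a double-quoted PHP string literal.
function sketch_to_hex_escapes($bytes) {
  $out = '';
  $length = strlen($bytes); // strlen() counts bytes, not characters
  for ($i = 0; $i < $length; $i++) {
    $out .= sprintf('\\x%02X', ord($bytes[$i]));
  }
  return $out;
}
```

Applied to the two-byte UTF-8 sequence for ş (U+015F), this yields the text \xC5\x9F. Running it over each row of a character table would give the keys for a strtr() array.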

restyler’s picture

Assigned: mikeryan » restyler
Attachment: status new, size 12.6 KB

I've edited pathauto.module a bit to create transliterated aliases for Russian node titles, but it doesn't use these \xNN codes; it uses literal Cyrillic letters. Because of the Cyrillic letters, I couldn't get a clean diff against the original pathauto.module with diff.exe from unxutils.sourceforge.net, though I've created a patch anyway.

restyler’s picture

Attachment: status new, size 1.8 KB

And the patch.

bilgehan’s picture

This bug should not be treated so narrowly as to fix only the accented characters. Many languages have a few special characters (e.g. Spanish: ñ, Catalan: ç, Turkish: ü, ö, ş, etc.) that are also converted to extra dashes or simply stripped. There must be a solution that covers all of these.

One solution is to keep the characters and use their percent-encoded UTF-8 form, as Wikipedia does. For example, the URL path of the term "españa" is http://es.wikipedia.org/wiki/Espa%C3%B1a
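
As a rough illustration of that approach, PHP's built-in rawurlencode() produces this percent-encoded form from a UTF-8 string. The wrapper name sketch_utf8_path_segment is hypothetical, and the sketch assumes the title is already UTF-8 and is encoded one path segment at a time.

```php
<?php
// Hypothetical wrapper: percent-encode one UTF-8 path segment instead of
// transliterating it to ASCII.
function sketch_utf8_path_segment($segment) {
  // rawurlencode() leaves letters, digits and "-_." untouched and encodes
  // every other byte as %XX, so a multi-byte character becomes several %XX.
  return rawurlencode($segment);
}
```

With "Espa\xC3\xB1a" (the UTF-8 bytes of "España") as input, the result is Espa%C3%B1a, matching the Wikipedia path above. The trade-off is that the alias keeps the original spelling but is less readable when shown in encoded form.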

restyler’s picture

Well, I think the easiest solution (the best?) is to add more entries to this array:

  $output = utf8_encode(strtr($output, array("\xC4"=>"Ae", "\xC6"=>"AE", "\xD6"=>"Oe",
    "\xDC"=>"Ue", "\xDE"=>"TH", "\xDF"=>"ss", "\xE4"=>"ae", "\xE6"=>"ae",
    "\xF6"=>"oe", "\xFC"=>"ue", "\xFE"=>"th")));

You can find all the character codes in charmap.exe (Win32): Cyrillic, accented, etc.

If I have time, I'll create the patch soon.
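
A caveat and a sketch, offered as an assumption rather than as the eventual patch: the array above is applied after utf8_decode(), and ISO-8859-1 has no Cyrillic letters, so Cyrillic entries would need UTF-8 keys applied before that step. Writing the keys as byte escapes also keeps the source file pure ASCII, which should avoid trouble with diff tools. The function name and the handful of letters below are illustrative only; a real table would cover the whole alphabet, and the romanization choices are hypothetical.

```php
<?php
// Hypothetical helper covering only a few Cyrillic letters for illustration.
function sketch_cyrillic_to_ascii($text) {
  static $map = array(
    "\xD0\xB2" => 'v',  // U+0432
    "\xD0\xB5" => 'e',  // U+0435
    "\xD0\xB6" => 'zh', // U+0436, one letter may map to several
    "\xD0\xB8" => 'i',  // U+0438
    "\xD0\xBF" => 'p',  // U+043F
    "\xD1\x80" => 'r',  // U+0440
    "\xD1\x82" => 't',  // U+0442
  );
  // Must run while $text is still UTF-8, i.e. before any utf8_decode().
  return strtr($text, $map);
}
```

Letters missing from the map are left as they are, so gaps in the table show up in the output instead of being silently dropped.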

Gabriel R.’s picture

bilgehan: My table contains all diacritics, not just accented characters.

_Troy_: I know the hard way to do it; I was looking for a converter to do it without wasting half a day and my love for Drupal :-)

I may end up paying for this, it's too important.

mikeryan’s picture

Partial fix - I've implemented a translation using the table you provided, which seems to improve the situation but doesn't cleanly translate all the characters from your original example (Ăă Îî Ââ Şş Ţţ Iñtërnâţiônàlizætiøns). Any advice on further refinement welcome...

I'll commit this to CVS later today.

greggles’s picture

Status: Active » Fixed

Since Mike committed most of this, I'm going to close it in favor of this issue, which discusses new ideas for the transliteration process: http://drupal.org/node/61815

Anonymous’s picture

Status: Fixed » Closed (fixed)