Hello, the Czech letter "š" is not converted to "s" in 4.7; it is converted to "-" instead. This causes problems when editing an existing node: Pathauto regenerates the node's URL alias and again drops the "š", undoing the correction I made by hand under the URL Alias menu option. As a result, the URL changes unintentionally after every edit, and the old address leads to "Page not found".

Example:

Node title:
Objednávka starších čísel

Pathauto URL:
objednavka-star-ich-cisel

Corrected URL:
objednavka-starsich-cisel

After editing an existing node:
objednavka-star-ich-cisel

Thanks for any help. Roman

Comments

intu.cz

Is this part of the code what makes letters with accents convert to pure a-z?

function pathauto_cleanstring($string) {
  static $translations = array(
    'À'=>'A','Á'=>'A','Â'=>'A','Ã'=>'A','Ä'=>'A','Å'=>'A','Ā'=>'A','Ą'=>'A','Ă'=>'A',
    'à'=>'a','á'=>'a','â'=>'a','ã'=>'a','ä'=>'a','å'=>'a','ā'=>'a','ą'=>'a','ă'=>'a',
    'Æ'=>'Ae',
    'æ'=>'ae',
    'Ç'=>'C','Ć'=>'C','Č'=>'C','Ĉ'=>'C','Ċ'=>'C',
    'ç'=>'c','ć'=>'c','č'=>'c','ĉ'=>'c','ċ'=>'c',
    'Ď'=>'D','Đ'=>'D','Ð'=>'D',
    'ď'=>'d','đ'=>'d','ð'=>'d',
    'È'=>'E','É'=>'E','Ê'=>'E','Ë'=>'E','Ē'=>'E','Ę'=>'E','Ě'=>'E','Ĕ'=>'E','Ė'=>'E',
    'è'=>'e','é'=>'e','ê'=>'e','ë'=>'e','ē'=>'e','ę'=>'e','ě'=>'e','ĕ'=>'e','ė'=>'e',
    'ƒ'=>'f',
    'Ĝ'=>'G','Ğ'=>'G','Ġ'=>'G','Ģ'=>'G',
    'ĝ'=>'g','ğ'=>'g','ġ'=>'g','ģ'=>'g',
    'Ĥ'=>'H','Ħ'=>'H',
    'ĥ'=>'h','ħ'=>'h',
    'Ì'=>'I','Í'=>'I','Î'=>'I','Ï'=>'I','Ī'=>'I','Ĩ'=>'I','Ĭ'=>'I','Į'=>'I','İ'=>'I',
    'ì'=>'i','í'=>'i','î'=>'i','ï'=>'i','ī'=>'i','ĩ'=>'i','ĭ'=>'i','į'=>'i','ı'=>'i',
    'Ĳ'=>'Ij',
    'ĳ'=>'ij',
    'Ĵ'=>'J',
    'ĵ'=>'j',
    'Ķ'=>'K',
    'ķ'=>'k','ĸ'=>'k',
    'Ł'=>'L','Ľ'=>'L','Ĺ'=>'L','Ļ'=>'L','Ŀ'=>'L',
    'ł'=>'l','ľ'=>'l','ĺ'=>'l','ļ'=>'l','ŀ'=>'l',
    'Ñ'=>'N','Ń'=>'N','Ň'=>'N','Ņ'=>'N','Ŋ'=>'N',
    'ñ'=>'n','ń'=>'n','ň'=>'n','ņ'=>'n','ŉ'=>'n','ŋ'=>'n',
    'Ò'=>'O','Ó'=>'O','Ô'=>'O','Õ'=>'O','Ö'=>'O','Ø'=>'O','Ō'=>'O','Ő'=>'O','Ŏ'=>'O',
    'ò'=>'o','ó'=>'o','ô'=>'o','õ'=>'o','ö'=>'o','ø'=>'o','ō'=>'o','ő'=>'o','ŏ'=>'o',
    'Œ'=>'Oe',
    'œ'=>'oe',
    'Ŕ'=>'R','Ř'=>'R','Ŗ'=>'R',
    'ŕ'=>'r','ř'=>'r','ŗ'=>'r',
    'Ś'=>'S','Š'=>'S','Ş'=>'S','Ŝ'=>'S','Ș'=>'S',
    'Ť'=>'T','Ţ'=>'T','Ŧ'=>'T','Ț'=>'T','Þ'=>'T',
    'þ'=>'t',
    'Ù'=>'U','Ú'=>'U','Û'=>'U','Ü'=>'U','Ū'=>'U','Ů'=>'U','Ű'=>'U','Ŭ'=>'U','Ũ'=>'U','Ų'=>'U',
    'ú'=>'u','û'=>'u','ü'=>'u','ū'=>'u','ů'=>'u','ű'=>'u','ŭ'=>'u','ũ'=>'u','ų'=>'u',
    'Ŵ'=>'W',
    'ŵ'=>'w',
    'Ý'=>'Y','Ŷ'=>'Y','Ÿ'=>'Y','Y'=>'Y',
    'ý'=>'y','ÿ'=>'y','ŷ'=>'y',
    'Ź'=>'Z','Ž'=>'Z','Ż'=>'Z',
    'ž'=>'z','ż'=>'z','ź'=>'z',
    'ß'=>'ss','ſ'=>'ss');

Because if it is, quite a lot of characters are missing, including "š", which has been a major pain on a website here. I'd like to help solve the problem, but it would be better if I could cooperate with somebody more knowledgeable.

Thanks

Roman

intu.cz

Version: 4.7.x-1.x-dev » 5.x-1.x-dev

The same problem applies as in http://drupal.org/node/106817

Roman

intu.cz

Version: 5.x-1.x-dev » 4.6.x-1.x-dev

The same problem applies to 4.6 as well. I haven't found a better way to express this, since the form has no option to select more than one version of the module.

My suggestion would be twofold:

In the short run:
1) The missing characters could be added manually. What I did was to change the lines so that I have:

    'Ś'=>'S','Š'=>'S','Ş'=>'S','Ŝ'=>'S','Ș'=>'S',
    'ś'=>'s','š'=>'s','ş'=>'s','ŝ'=>'s','ș'=>'s',
    'Ť'=>'T','Ţ'=>'T','Ŧ'=>'T','Ț'=>'T','Þ'=>'T',
    'ť'=>'t','ţ'=>'t','ŧ'=>'t','ț'=>'t','þ'=>'t',
    'Ù'=>'U','Ú'=>'U','Û'=>'U','Ü'=>'U','Ū'=>'U','Ů'=>'U','Ű'=>'U','Ŭ'=>'U','Ũ'=>'U','Ų'=>'U',
    'ù'=>'u','ú'=>'u','û'=>'u','ü'=>'u','ū'=>'u','ů'=>'u','ű'=>'u','ŭ'=>'u','ũ'=>'u','ų'=>'u',

which covers what our Czech website needs. Other languages may still be missing characters.

2) Some kind of transcription mechanism might be found to work in general for all languages. Why is the following conversion mechanism in pathauto.module commented out?

// Convert accented characters to their ASCII counterparts...
/*  $output = strtr(utf8_decode($output),
       "\xA1\xAA\xBA\xBF".
       "\xC0\xC1\xC2\xC3\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF".
       "\xD0\xD1\xD2\xD3\xD4\xD5\xD8\xD9\xDA\xDB\xDD".
       "\xE0\xE1\xE2\xE3\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF".
       "\xF0\xF1\xF2\xF3\xF4\xF5\xF8\xF9\xFA\xFB\xFD\xFF",
       "!ao?AAAAACEEEEIIIIDNOOOOOUUUYaaaaaceeeeiiiidnooooouuuyy");
  // ...and ligatures too
  $output = utf8_encode(strtr($output, array("\xC4"=>"Ae", "\xC6"=>"AE", "\xD6"=>"Oe",
    "\xDC"=>"Ue", "\xDE"=>"TH", "\xDF"=>"ss", "\xE4"=>"ae", "\xE6"=>"ae",
    "\xF6"=>"oe", "\xFC"=>"ue", "\xFE"=>"th")));*/
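To illustrate suggestion 1) above, here is a minimal sketch of a table-based cleanup applied to the node title from the original report. The function name sketch_clean_alias() and the two small tables are hypothetical and contain only the entries this example needs; this is a simplified stand-in, not Pathauto's actual code. It assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper, not Pathauto's actual code: transliterate with a
// lookup table, then collapse every run of characters outside a-z0-9
// into a single separator.
function sketch_clean_alias($title, array $table, $separator = '-') {
  $ascii = strtolower(strtr($title, $table));
  $alias = preg_replace('/[^a-z0-9]+/', $separator, $ascii);
  return trim($alias, $separator);
}

// Only the entries needed for this title. The first table has no 'š',
// which mimics the gap in the array quoted earlier in this issue.
$without_s = array('á' => 'a', 'í' => 'i', 'č' => 'c');
$with_s = $without_s + array('š' => 's');

$title = 'Objednávka starších čísel';
echo sketch_clean_alias($title, $without_s), "\n"; // objednavka-star-ich-cisel
echo sketch_clean_alias($title, $with_s), "\n";    // objednavka-starsich-cisel
```

With the 'š' entry missing, the two bytes of the unmapped character fall through to the separator step, which is how the "-" ends up in the alias.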

In the long run:
1) Change the approach to i18n and Drupal Translations to encompass:

  • translation of strings
  • typography
  • transliteration tables

Translation of strings works, but could be less problematic on multilingual websites (though I admit I last looked at the i18n module a long time ago).

Typography: nobody seems to pay attention to it apart from Smartypants, and that is English-centric. Here is what happens if you ignore it:
http://www.ahmadinejad.ir/en/merry-christmas-to-everyone/

Transliteration tables could be another package in a system covering all language aspects. Looking at the code above: where are Cyrillic letters, Arabic, Armenian, Klingon, etc.?

Roman

Láďa

I have to agree with this in 5.0-rc2. There should be another two lines

    'š' => 's',
    'ť' => 't',

each one after its uppercase counterpart. Is this enough, or should I make a patch?
All other Czech characters are OK. I tested with "příliš žluťoučký kůň úpěl ďábelské ódy" (uppercased and lowercased), which is the shortest known Czech sentence containing all accented characters.
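A check like this can be scripted. The sketch below lists the characters that a table leaves unconverted in the test sentence. The function name sketch_unmapped_chars() and the $shipped table are hypothetical: the table holds only the lowercase Czech entries, minus the two reported as missing, and is not a copy of the module's array. It assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical check, not part of Pathauto: return the distinct non-ASCII
// characters that remain after applying a transliteration table.
function sketch_unmapped_chars($text, array $table) {
  $converted = strtr($text, $table);
  // The /u modifier makes the pattern match whole UTF-8 characters.
  preg_match_all('/[^\x00-\x7F]/u', $converted, $matches);
  return array_values(array_unique($matches[0]));
}

// Lowercase Czech entries only, without 'š' and 'ť' (assumed for illustration).
$shipped = array(
  'á' => 'a', 'č' => 'c', 'ď' => 'd', 'é' => 'e', 'ě' => 'e', 'í' => 'i',
  'ň' => 'n', 'ó' => 'o', 'ř' => 'r', 'ú' => 'u', 'ů' => 'u', 'ý' => 'y',
  'ž' => 'z',
);
$patched = $shipped + array('š' => 's', 'ť' => 't');
$sentence = 'příliš žluťoučký kůň úpěl ďábelské ódy';

print_r(sketch_unmapped_chars($sentence, $shipped)); // lists 'š' and 'ť'
print_r(sketch_unmapped_chars($sentence, $patched)); // empty array
```

The same function could be run over an uppercased copy of the sentence once the uppercase entries are added to the table.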

intu.cz

Not only those two, but a few more should be included in the patch. I tried to suggest the additions in http://drupal.org/node/106817#comment-175985 under 1).

greggles

Version: 4.6.x-1.x-dev » 5.x-1.x-dev
Status: Active » Closed (duplicate)

First, I'm not making changes to the 4.6 branch, so I changed the version.

Second, I'm not making changes to the transliteration array as proposed. See the reasoning for this elsewhere in the issue queue.

Third, there is now a solution posted in another issue that needs testers (see http://drupal.org/node/61815), and I believe it is the best long-term, sustainable solution.

I would appreciate your help in testing the patch in 61815 so that we can get it committed and included in future releases. It is hard for me to test this because I only use ASCII-96...