Usually, when transliterating German umlauts to plain ASCII, the letter "e" is appended to the base letter a, o or u, like this:

ä => ae
ö => oe
ü => ue

However, pathauto just replaces ä with a, ö with o and ü with u.
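
To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is only an illustration, not Pathauto's code (Pathauto is a PHP module); the function names are hypothetical and the upper-case mappings are an assumption added for completeness.

```python
# Illustration only: hypothetical helpers, not part of Pathauto.

# What Pathauto is reported to do: drop the diacritic.
STRIP_TABLE = str.maketrans({
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",  # upper case: assumed by analogy
})

# The conventional German transliteration: append an "e".
GERMAN_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",  # upper case: assumed by analogy
})


def strip_umlauts(text: str) -> str:
    """Replace each umlaut with its bare base letter."""
    return text.translate(STRIP_TABLE)


def transliterate_german(text: str) -> str:
    """Replace each umlaut with base letter + 'e'."""
    return text.translate(GERMAN_TABLE)


print(strip_umlauts("bücher"))         # bucher
print(transliterate_german("bücher"))  # buecher
```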

I attached a patch that introduces the correct behavior.

Comments

gerd riesselmann’s picture

Attachment status: new, size: 1.44 KB

Hmm, looks like I clicked on the wrong file. Here's the patch.

greggles’s picture

Status: Needs review » Closed (won't fix)

Hi. I'm not going to change this for 4.7, though I do appreciate the fact that you've provided a patch.

See http://drupal.org/node/92900 for details on the way I have fixed this for 5.x.

asbdpl’s picture

Version: 4.7.x-1.x-dev » 5.x-1.1

How do I make pathauto.module convert umlauts as described above?

On my setup (Drupal 5.1), it just replaces umlauts with an underscore (_). That results in useless URLs like "...handbuch_der_s_ugetiere_europas".

Regards, -asb

greggles’s picture

My apologies, I provided the wrong link earlier - the release notes at http://drupal.org/node/129345 have more information on this. Basically read the INSTALL.txt and README.txt files and be sure to read/rename/edit the i18n-ascii.example.txt file.

asbdpl’s picture

Sorry for nagging about this, but I can't get pathauto to do useful conversions. If I copy i18n-ascii.example.txt to i18n-ascii.txt, I get conversions e.g. from "ä" to "a", not to "ae". As far as I can read i18n-ascii.example.txt, the replacement should be "ä" to "ae" etc. I read INSTALL.txt and README.txt several times, but I simply don't get it - it *does* a replacement, but not the one it is supposed to do according to i18n-ascii.example.txt.

Also, what I absolutely don't understand is why such a conversion is supposed to be necessary at all. URLs *can* contain umlauts, they always could; if I understand RFC 1738 [4] correctly, they simply have to be encoded properly. Examples:

* Ä = %C4 (196)
* ä = %E4 (228)
* Ö = %D6 (214)
* ö = %F6 (246)
* Ü = %DC (220)
* ü = %FC (252)
* ß = %DF (223)
* é = %E9 (233)
* á = %E1 (225)
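
A note on the list above: these are the single-byte ISO-8859-1 (Latin-1) codes. On a site that serves UTF-8, the same characters are percent-encoded as two bytes each. The following Python sketch shows both variants using the standard library; the helper name `percent_encode` is hypothetical and only wraps `urllib.parse.quote`.

```python
from urllib.parse import quote


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Percent-encode text for use in a URL path segment.

    The result depends on the byte encoding chosen for the characters.
    """
    return quote(text, safe="", encoding=encoding)


# Latin-1: one byte per umlaut, matching the list above.
print(percent_encode("ä", encoding="latin-1"))  # %E4

# UTF-8: two bytes per umlaut.
print(percent_encode("ä"))       # %C3%A4
print(percent_encode("bücher"))  # b%C3%BCcher
```

Which of the two forms a server accepts depends on the encoding it expects in request paths, so the same visible URL can work in one application and fail in another.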

Software like MediaWiki *does* support it, as can be seen in numerous projects like German Wikipedia [1], French and Chinese Wikisource [2], Russian and Greek Wikibooks [3] etc.; those projects also prove that almost any browser on the market is able to handle such URLs. Even Amazon *does* use umlauts in its URLs, e.g. ISBN-10 3492246885 at Amazon.de, so it can't be that bad at all. Since 2004, even umlauts in domain names are allowed (Punycode encoding, so-called "Internationalized Domain Names"). So why is the conversion necessary at all?

I did some experimenting with this and created a URL containing "bücher" in Drupal and MediaWiki; on the same server, I could access http://www.example.com/wiki/bücher, delivered from MediaWiki 1.10.1, but not http://www.example.com/drupal/bücher, delivered from Drupal 5.1. Is this a bug in Drupal core? Does it encode URLs at all, or does it simply deliver unencoded ("safe") 7-bit ASCII characters?

NB: A fine tool to construct valid encoded URLs is "The URLEncode and URLDecode Page" by Albion Research [5].

Regards, asb

[1] http://de.wikibooks.org/wiki/Hauptseite
[2] http://fr.wikisource.org/wiki/, http://zh.wikisource.org/wiki/
[3] http://el.wikibooks.org/wiki/, http://ru.wikibooks.org/wiki/
[4] http://www.w3.org/Addressing/rfc1738.txt
[5] http://www.albionresearch.com/misc/urlencode.php

greggles’s picture

Status: Closed (won't fix) » Active

Well, I'm not really sure about the ä to ae problem. I will try to confirm and test that out.

As to why this is necessary at all...well, it's not. Drupal 5 core now allows any characters, as I believe you determined. Pathauto 5.x-2.x-dev allows you to not transliterate at all if you so desire, but it has its own bugs as well. If you can help test that version and perhaps provide patches to make it work more reliably, then we could get to a place where that is available for all.

asbdpl’s picture

Version: 5.x-1.1 » 5.x-2.x-dev

> Well, I'm not really sure about the ä to ae problem. I will try to confirm and test that out.

It happens on five similarly configured sites, all running the German localisation. I *copied* i18n-ascii.example.txt to i18n-ascii.txt rather than renaming it, if that matters. Please let me know if I can do anything to help.

> As to why this is necessary at all...well, it's not. Drupal 5 core now allows any characters, as I believe you determined.
> Pathauto 5.x-2.x-dev allows you to not transliterate at all if you so desire, but it has its own bugs as well. If you can
> help test that version and perhaps provide patches to make it work more reliably, then we could get to a place where
> that is available for all.

I did a quick setup of pathauto 5.x-2.x-dev on top of Drupal 5.2 dev in a testing environment and tried regenerating the node paths for some 20k+ nodes after deleting i18n-ascii.txt to circumvent umlaut conversion (pathauto 5.x-1.2 was working as usual on Drupal 5.2 dev).

Now, accessing ./admin/settings/pathauto results in

Fatal error: Call to undefined function token_get_list() in /var/www/drupal/portal/modules/pathauto/pathauto_node.inc on line 18

Indeed, there is no function token_get_list() in pathauto_node.inc. Hm, I don't think I'm good enough in PHP to provide a patch. However, the testing environment remains available for a few days. Please drop me a note if I can be of any assistance.

Greetings, -asb

greggles’s picture

Title: German umlauts not processed correctly » transliteration ä to ae not working even with i18n-ascii.txt installed and configured properly.
Version: 5.x-2.x-dev » 5.x-1.2

You also need to install the token module to get Pathauto 5.x-2 to work. I do appreciate help testing 5.x-2 in general, but that shouldn't be the focus of this issue.

I'm changing the title and version to reflect what I think this issue is about - the a to ae problem (and perhaps others).

greggles’s picture

Status: Active » Postponed (maintainer needs more info)

I tested this in the 5.x-2.x branch and it worked fine, so I'm not sure what's going on. Try searching your whole i18n-ascii.txt file for ä, as there may be multiple entries which then overwrite each other...
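
A search like that can also be scripted. The following Python sketch reports keys that occur more than once; it assumes the file consists of lines of the form `key = "value"`, with lines starting with `;`, `#` or `[` treated as comments or section headers. That format is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_keys(text: str) -> dict:
    """Map each repeated key to the 1-based line numbers where it occurs.

    Assumes one `key = "value"` entry per line (an assumption about
    the file format, not something this sketch verifies).
    """
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blanks, comments, section headers and lines without "=".
        if not stripped or stripped[0] in ";#[" or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        seen[key].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


sample = 'ä = "ae"\nö = "oe"\nä = "a"\n'
print(find_duplicate_keys(sample))  # {'ä': [1, 3]}
```

To run it against a real file, read the file with `open(path, encoding="utf-8").read()` and pass the result in; if that read fails with a decoding error, the file is not valid UTF-8 in the first place.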

asbdpl’s picture

> I tested this in the 5.x-2.x branch and it worked fine, so I'm not sure what's going on. Try searching your whole
> i18n-ascii.txt file for ä, as there may be multiple entries which then overwrite each other...

I simply used the file provided with the module without editing anything and was hoping that this file would do the trick. Maybe someone could publish a working and up-to-date transliteration file on drupal.org?

To be honest, I'm pretty much stuck here. Working with Unicode files on a remote server is a bit too tough for me, since I can no longer figure out which applications in which versions are really UTF-8-clean (nano? vi?) and which ones in the processing chain modify the data.

E.g., if I open i18n-ascii.txt from a Windows machine with WinSCP and search for "ä" or "Ä", the editor doesn't find any matches. However, visually there appear to be matches like Ä  = "G", which the editor analyzes as character 160 (0xA0) and 34 (0x22).
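
One possible explanation for such phantom matches is that the editor displays a UTF-8 file as if it were Latin-1: the two UTF-8 bytes of "Ġ" (0xC4 0xA0) then show up as "Ä" followed by character 160. The Python sketch below reproduces that effect and adds a simple validity check; both function names are hypothetical, and whether this is what actually happened here is a guess.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def as_seen_in_latin1_editor(text: str) -> str:
    """Show how UTF-8 encoded text appears when misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# "Ġ" (U+0120) is stored as the bytes C4 A0 in UTF-8.
misread = as_seen_in_latin1_editor("Ġ")
print([hex(ord(ch)) for ch in misread])  # ['0xc4', '0xa0']

# A lone Latin-1 "ä" byte is not valid UTF-8.
print(looks_like_utf8("ä".encode("latin-1")))  # False
print(looks_like_utf8("ä".encode("utf-8")))    # True
```

Running `looks_like_utf8` on the raw bytes of the file (opened in binary mode) before and after a transfer would show whether a tool in the chain changed the encoding.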

Even if the server is configured with

# cat /etc/locale.gen
de_DE.UTF-8 UTF-8

all I get in a terminal window, connected with the SSH client UTF-8 TeraTerm Pro, are garbage characters. Again, I get different results when I download the file from the server to my local Windows machine with WinSCP and open it with UltraEdit, which is also supposed to be UTF-8-clean; maybe the file is modified somehow during the download (something like binary vs. text mode in FTP transfers?). Thus I think that editing the file could cause even more damage.

Any suggestions how I should proceed?

greggles’s picture

I can't really help you with file transfers, but perhaps substituting the latest version of the file from CVS will help: http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/pathauto/i1...

asbdpl’s picture

Status: Postponed (maintainer needs more info) » Closed (fixed)

> [...] perhaps substituting the latest version of the file from CVS will help [...]

Thank you for this idea; I downloaded the file from CVS and did a quick check. The transliteration now appears to be working correctly (issue closed).

Thanks a lot & greetings, -asb