Hi,

I am making a site which will have to cope with filenames and 'pathauto urls' in
English, Spanish, German and French.

However, I'm a bit confused. Half the information I read suggests that if your
website and database are using utf-8, non-ascii filenames and urls won't be a
problem. The other half says they should be ascii.

Personally, I would prefer to leave the filenames and urls with non-ascii characters
like ñ, ä and é.

Before I spend hours trying to break my site with exotic filenames and urls, can
anyone clarify the situation please?

Cheers.

Comments

And another quick question:

In the pathauto docs, it says that for D6.2, the i18n-ascii.txt is deprecated
and the transliteration module will be used instead. However, in the pathauto
settings, I see the i18n-ascii.txt option, but nothing about the transliteration
module (I expected a greyed checkbox). Is that simply because I don't have
transliteration installed yet, or did the stated integration not occur?

Cheers.

*Bump*
Thanks.

Hi Anti,

I also thought that if your file system, database and connections used the utf8 character set, then transliteration would not be needed.

Perhaps it's the legality of the urls rather than whether the system works.

Having said that ... I've just Googled .jp and .fr sites and found that none of the urls I checked contained any accented characters, Japanese characters, etc.

I also READ that the only really 'safe' characters were ascii, but that 'most western European characters would work'. Not sure if this is true or not.

This is a shame, as I would have thought that foreign words in urls would be good for ranking in the respective country's search engines. Not sure if that's important to you, but it will be for me.

Perhaps worth checking the wikipedia page about urls?

Did you find a good answer to this question, here or elsewhere?

> Perhaps it's the legality of the urls rather than whether the system works.

Yes. Exactly. It works, but should we be doing it? That is the question for which I cannot find a clear answer.

I don't know what actually needs to be done, or what the problem is. I suppose they need to change the url protocol and then software manufacturers need to incorporate the changes. If that's the case, then we could be looking at five or ten years before everyone can agree that non-ascii is safe. Maybe they're waiting for utf16 (whoever 'they' are)?

I'm going to use transliteration for urls until I see non-ascii characters in urls on a daily basis.

I'll try to clarify your questions:
1. UTF-8 characters in filenames (file uploads) were broken in Drupal 5. This has been fixed in D6. If you're on *nix there is no reason not to keep filenames in UTF-8. On Windoze, though, UTF-8 might still be problematic (not sure).
2. UTF-8 in URL aliases (pathauto or manually entered) was never a problem; transliterating or not is just a matter of personal taste. See the Russian Wikipedia for proof (if you need any).
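For anyone who wants to test point 1 on their own server before trusting it, here is a minimal sketch in Python (not Drupal code). It creates a file with a non-ascii name in a temporary directory and checks that the name comes back from a directory listing. The function name `filename_round_trips` and the sample filename are made up for illustration. Names are compared after NFC normalization, on the assumption that some filesystems may hand back a differently composed form of the same characters.

```python
import os
import tempfile
import unicodedata


def filename_round_trips(name):
    """Create a file called `name` in a temp dir and check that it lists back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("test")
        listed = os.listdir(tmp)
    # Compare in NFC so that a decomposed form (e.g. "n" + combining tilde)
    # still counts as the same name.
    wanted = unicodedata.normalize("NFC", name)
    return any(unicodedata.normalize("NFC", entry) == wanted for entry in listed)


if __name__ == "__main__":
    print(filename_round_trips("ñäé.txt"))
```

This only exercises the local filesystem; it says nothing about how uploads, backups or FTP clients in between treat the name.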

I should really put this on the project page or into the README...

#4: it is a matter of personal taste, but if you regularly send URLs by email, for example, it's a bit ugly to have this kind of thing:
http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%...
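To show where those %D0%92... sequences come from, here is a small Python sketch using the standard `urllib.parse` functions. Each non-ascii character is turned into its UTF-8 bytes, and each byte is written as `%XX`. The helper name `encode_path_segment` and the sample words are illustrative only; the Cyrillic sample is just the first four letters matching the start of the link above, not the full truncated address.

```python
from urllib.parse import quote, unquote


def encode_path_segment(segment):
    """Percent-encode one URL path segment the way it travels on the wire."""
    # quote() encodes the text as UTF-8 by default, then escapes every byte
    # outside the unreserved ASCII set; safe="" means "/" is escaped as well.
    return quote(segment, safe="")


if __name__ == "__main__":
    encoded = encode_path_segment("Вики")
    print(encoded)           # %D0%92%D0%B8%D0%BA%D0%B8
    print(unquote(encoded))  # Вики
    print(encode_path_segment("café"))  # caf%C3%A9
```

Many browsers display the readable form in the address bar, but a copied or emailed link often ends up in the escaped form, which is the ugliness described above.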

Anyway, it is different if your urls only contain a few accented letters like à and ï, rather than a whole alphabet like Russian. Reading aviaera instead of àvïaerà is really not a problem, and I think it is much cleaner to use.
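A rough sketch of that kind of accent stripping, in Python rather than Drupal code. This is not how the transliteration module or i18n-ascii.txt works; it is just one simple approach, assuming that decomposing each character (Unicode NFKD) and dropping the combining marks is good enough. The function name `strip_accents` is made up for the example.

```python
import unicodedata


def strip_accents(text):
    """Drop accents by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for accent marks such as U+0300.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("àvïaerà"))  # aviaera
    print(strip_accents("ñ, ä, é"))  # n, a, e
```

Letters that have no decomposition, such as the German ß, pass through unchanged with this approach, so a real transliteration table would still be needed for those.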

Status: Active » Closed (fixed)

(Finally) added a section on the project page. Thanks for your input!