See http://drupal.org/node/98293
Now that we can use all characters in the URL, it seems we should remove this restriction from pathauto.
The only question I have before doing so is whether transliteration should remain as an optional filter that an admin can enable if they want to.
So, the two paths are:
1. Remove it all
Node titles, taxonomy strings, usernames, etc. are all passed into path_set_alias in their original form.
2. Make it optional
The admin/settings/pathauto page would have a checkbox with text like "Attempt to transliterate local languages into Latin characters to provide legacy style URLs"
Please provide feedback on whether you prefer strategy #1, strategy #2, or a whole new one.
Also, if you like #2 please provide ideas on the UI format and message of the text.
| Comment | File | Size | Author |
|---|---|---|---|
| #9 | pathauto_optional_translit_0.patch | 3.9 KB | greggles |
| #3 | pathauto_optional_translit.patch | 3.5 KB | greggles |
Comments
Comment #1
kkaefer commented
I think it would be cool to have the choice. For example, I'd like to replace spaces with dashes, but leave accented characters as they are.
Comment #2
greggles
I should have been clearer: I'm asking specifically about the clean_strings function, not about how the ' character or spaces are translated, since those are well known and functioning.
Comment #3
greggles
OK, it's relatively simple, but here it is.
Comment #4
greggles
Also, here is the UI (so you don't have to dig in the patch):
"{checkbox} Transliterate prior to creating alias
When a pattern includes certain characters (such as those with accents) should Pathauto attempt to transliterate them into a Western-Latin alphabet?"
I have no idea if that's even close to what is happening, so feel free to provide better language.
Comment #5
intu.cz commented
Having just written http://drupal.org/node/106817 , I would think that transliteration still has a future. As I understand it, this issue suggests leaving the title in the URL un-transliterated, presumably because we now have IDN. But from what I see here in the Czech Republic, IDN will not be universal soon: it has been rejected twice by the local domain name regulator (nic.cz), and polls always indicate a general dislike of the IDN idea among domain owners (duplication of costs, confusion, speculation).
Perhaps a truly Drupal solution would be to allow pathauto to create both: a transliterated and non-transliterated alias URL simultaneously.
IDN seems very nice until you actually have to type a URL using characters not on your keyboard.
Comment #6
greggles
Well, I don't plan on providing the same content at two URLs on purpose. There is a strong trend away from that for usability and search engine purposes.
Beyond that, the feature is to allow users to decide whether to use the local character set values in the URL or if pathauto should try to transliterate them.
So, if you have a pattern for nodes like [title] and have it set to not transliterate and you enter a title like " ÀÁÂÃ" then you would get a url like www.example.com/ÀÁÂÃ (although it would probably actually be lower case...). If you have it set to transliterate then pathauto would give a url like www.example.com/aaaa.
Did you test the patch? Do you have any feedback on the text that I proposed in comment #4? I'd really appreciate it if you could test this feature.
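The behaviour described in this comment can be sketched as follows. This is a minimal illustration in Python rather than Pathauto's PHP; the `make_alias` function, the `transliterate` flag, and the four-entry table are assumptions made up for the example and are not taken from the patch.

```python
# Hypothetical four-entry table; the real table in the patch is larger.
TRANSLIT_TABLE = {"À": "A", "Á": "A", "Â": "A", "Ã": "A"}


def make_alias(title: str, transliterate: bool) -> str:
    """Build an alias from a title, optionally transliterating it first."""
    text = title.strip()
    if transliterate:
        # Replace every character found in the table; keep the rest as-is.
        text = "".join(TRANSLIT_TABLE.get(ch, ch) for ch in text)
    # Aliases are lower-cased whether or not transliteration ran.
    return text.lower()


print(make_alias(" ÀÁÂÃ", transliterate=True))
print(make_alias(" ÀÁÂÃ", transliterate=False))
```

With the setting on, the title collapses to plain ASCII; with it off, the local characters survive and are only lower-cased.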
Comment #7
fgm
Actually, if I'm not mistaken, the transliteration we have is not to Western Latin (ISO-8859-1[5]) but just to a subset of ASCII-96. I think this is what you are referring to when you say "Western Latin"?
I'd be in favor of a setting, just like the one for apostrophes merging/converting.
Comment #8
greggles
@fgm - yes, that is exactly what I was hoping for! Thanks, I'll commit this shortly.
Would that still be the correct language if we switched to using this transliteration table?
Comment #9
greggles
Rerolled.
Comment #10
fgm
I can't confirm whether the transliteration table is correct: after all, many letters in some languages don't even have an exact sound-alike equivalent in ASCII-96, so it may well be imperfect.
But FWIW, the right-hand letters are ASCII-96, and for most Western Latin languages, the left-to-right conversion seems correct.
One thing to keep in mind, although I don't think it has an impact on pathauto, is that this transformation is obviously not injective (several source characters map to the same ASCII character), meaning there's no way back from the transliterated version.
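A small sketch of why there is no way back: two different titles can end up as the same transliterated string. The three-entry table and the `to_ascii` helper below are hypothetical, chosen only to show the collision.

```python
# Hypothetical table: three accented forms all map to the same ASCII letter.
TABLE = {"é": "e", "è": "e", "ê": "e"}


def to_ascii(text: str) -> str:
    """Replace each character that has a table entry; leave the rest alone."""
    return "".join(TABLE.get(ch, ch) for ch in text)


first = to_ascii("pêche")
second = to_ascii("péché")

# Both titles produce the same string, so the original cannot be recovered.
print(first, second, first == second)
```

In practice this means two nodes with distinct titles can compete for the same alias once transliteration is applied.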
Comment #11
lennart commented
Looks good.
Comment #12
greggles
This is now committed. Thanks kkaefer, lennart, and especially fgm for the assistance!
Comment #13
(not verified) commented
Comment #14
m3avrck commented
Maybe I didn't look hard enough, but this patch looks for the file i18n-ascii.txt, while the included file is i18n-ascii.example.txt, with no mention of renaming it.
However, I think a far better option is to follow the translit module: http://drupal.org/project/file_translit -- this does the same thing for file names and is super easy to use. Including this PHP package with pathauto would offer the most robust support IMO and would be easy to implement.
Comment #15
greggles
Well, http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/pathauto/IN...
And aside from token, I don't really like the idea of making pathauto dependent on another module. If someone wants to work on that and provide a patch in a new issue, that would be fine, but I'd want to see some compelling reason to use that module (e.g. it's less code, works more reliably, isn't any slower, etc.)
The thing I really like about the current solution is that it is small and fast and can be tuned well for an individual site, e.g. if you just need Cyrillic characters you can use those and no other characters.
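To illustrate the "tune it per site" point, here is a minimal Python sketch that loads a small mapping from text and applies only those entries. The `source = "target"` line format, the `;` comment marker, and the names `load_table` and `apply_table` are assumptions for this example; they are not confirmed to match the i18n-ascii.txt file shipped with pathauto.

```python
# Hypothetical site-specific table containing only the Cyrillic letters needed.
SITE_TABLE = """
; only the characters this site actually uses
д = "d"
р = "r"
у = "u"
п = "p"
а = "a"
л = "l"
"""


def load_table(text: str) -> dict[str, str]:
    """Parse 'source = "target"' lines, skipping blanks and ';' comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        source, _, target = line.partition("=")
        table[source.strip()] = target.strip().strip('"')
    return table


def apply_table(table: dict[str, str], text: str) -> str:
    """Replace characters listed in the table; leave everything else alone."""
    return "".join(table.get(ch, ch) for ch in text)


table = load_table(SITE_TABLE)
print(apply_table(table, "друпал"))
```

Characters missing from the table pass through unchanged, which is what keeps a trimmed-down table small and predictable.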
Comment #16
m3avrck commented
Maybe I should have been clearer. That module cleans filenames like this issue cleans URLs.
It uses this open source package: http://sourceforge.net/projects/phputf8 -- this is far more robust than the i18n-ascii file in pathauto.
All pathauto would need to do is include that library (it's essentially the same as the one pathauto includes but has support for all languages) and then simply add this code:
What I'm suggesting is to remove the i18n-ascii file, add in this UTF-8 package, and then simply add that one line of code. No dependencies, and it's just as fast as, if not faster than, what is there now.
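As a rough illustration of the library-backed idea (this is not the snippet from the comment, nor code from phputf8 or file_translit), Python's standard unicodedata module can strip accents without any hand-maintained table. The `strip_accents` name is made up for this sketch.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("ÀÁÂÃ"))
print(strip_accents("crème brûlée"))
```

Decomposition only helps for letters that are a base letter plus an accent. Letters such as "ø" or "ß" have no such decomposition and pass through unchanged, so a general-purpose library still needs explicit tables for them.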
Comment #17
m3avrck commented
Oops, sorry: it would need to use the function in the project, but it's only 20 or so lines long. file_translit does this:
Comment #19
greggles