Several modules (pathauto, token, audio) need to take international strings and convert them into plain text.
Users would like the mappings to be done intelligently, but they vary by language: e.g. 'ä' => 'a' and 'ö' => 'o' in Swedish, but 'ä' => 'ae' and 'ö' => 'oe' in German. Using a separate text file allows users to modify the mappings to suit their needs. This patch is based on the solution greggles arrived at for the pathauto module.
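A minimal sketch of the idea described above (illustrative Python, not the PHP patch itself; the table contents and function names are assumptions for demonstration): a default transliteration map with per-language overrides, mirroring a user-editable mapping file.

```python
# Illustrative sketch only: a default map plus per-language overrides,
# standing in for a user-editable file like i18n-ascii.example.txt.
DEFAULT_MAP = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}
LANGUAGE_OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},  # German digraph conventions
    "sv": {},  # Swedish keeps the defaults: ä -> a, ö -> o
}

def transliterate(text: str, lang: str = "en") -> str:
    """Apply the default map, overridden by the language-specific table."""
    mapping = {**DEFAULT_MAP, **LANGUAGE_OVERRIDES.get(lang, {})}
    return "".join(mapping.get(ch, ch) for ch in text)
```

With such a layering, the same input yields language-appropriate results, e.g. a German site maps 'ö' to 'oe' while a Swedish one maps it to 'o'.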

Comment | File                   | Size      | Author
#1      | i18n-ascii.example.txt | 5.64 KB   | drewish
        | check_ascii.patch      | 876 bytes | drewish

Comments

drewish’s picture

Status: new · Size: 5.64 KB

Here's pathauto's example text file.

drewish’s picture

Status: Active » Needs review

forgot the status ;)

greggles’s picture

Just to give credit where it's due, this was actually from textpattern originally and the patch was started by someone else - I just cleaned it up and applied it.

@eaton, you had also requested another feature like "prepare to be used in URLs" which might include removing spaces or something else. That may deserve to be a separate issue, though.

Steven’s picture

Status: Needs review » Closed (won't fix)

Transliteration is not a solution to any problem. It destroys data and hence makes your site less accessible to search engines. We might as well add a function to core that removes all vowels in a string.

For pathauto, IRIs should be generated, with regular UTF-8 in them. Browsers will support this to varying degrees, and support is increasing every day. The most important consumers of clean URLs, namely search engines, support IRIs perfectly.
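The IRI point above can be illustrated concretely (a Python sketch, not anything from the patch: a non-ASCII path segment travels on the wire as percent-encoded UTF-8 bytes, while browsers and search engines can display and index the original characters):

```python
# Illustrative sketch: an IRI path segment is transmitted as percent-encoded
# UTF-8, and the round trip back to the original characters is lossless.
from urllib.parse import quote, unquote

segment = "čeština"              # a Czech word in a clean URL
wire = quote(segment)            # what actually goes on the wire
print(wire)                      # %C4%8De%C5%A1tina
assert unquote(wire) == segment  # no data is destroyed in transit
```

This is the sense in which an IRI retains the full data, in contrast to a one-way transliteration.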

Also, there is no such thing as "take international strings and convert them into plain text". What you probably mean is "convert Unicode strings to 7-bit ASCII".

drewish’s picture

Title: Add function to transliterate an international string to plain ASCII characters » Add function to convert UTF8 to 7-bit ASCII

alright then ;)

joshk’s picture

Title: Add function to convert UTF8 to 7-bit ASCII » Implement drupal-standard transliteration
Status: Closed (won't fix) » Active

Not to start a food fight here, but Steven's answer above ("Transliteration is not a solution to any problem") is incorrect.

Currently, support for full UTF-8 in browsers is spotty, and it will likely remain so for several years. Marking this "won't fix" is essentially saying that the internet is the problem, not Drupal. That's kind of the wrong way to approach things, IMHO.

Comparing transliteration -- standardizing strings while retaining semantic data -- with removing vowels is a flawed analogy. It's not "removing data" and it's not creating gibberish: it's creating a standard. We have coding standards, and we can all agree they're good to have and to follow. This is a similar situation. It's about standardization.

There are many cases where transliteration is advantageous, even necessary, from a usability perspective. Compressing a verbose, irregularly cased and possibly punctuated user-input string to something more manageable for internal system-string/path use comes in handy in lots of ways:

  • Pathauto is a great example, because even if some browser can support full UTF-8 text in paths, user-friendly URLs (or IRIs) don't contain punctuation, spaces, etc. In the end, the web is for people, not browsers.
  • It's very nice to have transliteration to help create semantically useful CSS class/id names (e.g. block-whos-online is preferable to block-user-2).
  • Views arguments: it would be good if one could have a term named "My Term" and be able to pass through an argument like "my_term" (e.g. /someview/my_term, rather than /someview/My Term).

There are many other examples where developers have these kinds of issues, and refusing to take on this challenge in core will almost certainly result in a proliferation of incompatible and inconsistent solutions in contrib-space. That's a fine choice for the core maintainers to make, but I think we should at least think it over.

Having a drupal-standard for doing this would make module development easier, drive convergence in coding practices, and increase usability. With respect, I would ask that this issue be re-considered.

joshk’s picture

Status: Active » Needs work

Just to be clear, I am actually hijacking this thread a bit: I'm suggesting that drupal core may want to implement a standard way of making system-friendly strings. This can be used to collapse functions like pathauto_cleanstring, zen_safe_id, etc.

As it stands, the patch that's been submitted here isn't that interesting to me. I'd like to see drupal tackle the more general problem of taking a hairy string like "That's why we call a Node, a Node" and standardizing it into something that can be easily used in paths, css classes, etc, which retains the semantic data.
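The kind of standardization described above can be sketched as follows (illustrative Python, not pathauto's actual `pathauto_cleanstring`; the function name and rules are assumptions): lowercase the string, replace runs of anything outside [a-z0-9] with a separator, and trim leading/trailing separators.

```python
import re

# Illustrative sketch of a "cleanstring" standardization, not Drupal code:
# lowercase, collapse runs of non-alphanumerics to a separator, trim ends.
def clean_string(text: str, sep: str = "_") -> str:
    text = re.sub(r"[^a-z0-9]+", sep, text.lower())
    return text.strip(sep)
```

Under these assumed rules, "My Term" becomes "my_term" for a Views argument, and with sep="-" the same function yields path- or CSS-friendly strings. Note that this sketch only handles ASCII punctuation and casing; it is exactly where transliteration of non-ASCII letters would have to plug in first, or characters like 'á' would collapse into separators.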

Steven’s picture

Status: Needs work » Closed (won't fix)

Pathauto is a great example, because even if some browser can support full UTF-8 text in paths, user-friendly URLs (or IRIs) don't contain punctuation, spaces, etc. In the end, the web is for people, not browsers.

I disagree... I rarely look at clean URLs myself. The main consumers are search engines, which all support IRIs perfectly and are much better off with real data than transliterations. Wikipedia is the perfect example: how else do you explain that Wikipedia's clean URLs with IRIs consistently rank the highest on the web for keyword searches in dozens of languages?

Having a 'catch all' safe-id-generating function in core is a bad idea. The requirements vary heavily with the context, and the last thing we need is sloppiness and over-zealous heavy string replacement in key parts of Drupal. We already have appropriate cleaning functions for certain contexts (e.g. form_clean_id) which should be used when appropriate. If you are coding for custom contexts, look up the relevant spec and code a function that is correct for your use case.

None of the examples you mention require transliteration. IRIs don't (pathauto), while CSS classes and IDs are restricted in a different, much stricter way, so transliteration is of no use there. As for user-generated content in paths, this is not a problem: there are no disallowed characters in paths. Look at search.module's clean URLs, for example. The idea of generic 'safe strings' is a fantasy that does not exist in real life.

As for destroying data: if you have Russian text and you transliterate it to Latin, the semantics are lost. Search engines do not reverse transliterations, and keyword searches as well as PageRank will suffer significantly. Adding customizability by supplying a text file is a hack which isn't done anywhere else in Drupal.

Transliteration in core: not going to happen.

joshk’s picture

I think we're talking at cross-purposes. Clearly I'm not communicating my idea very well, and it's likely that I should start a new thread so that what I'm talking about is not cluttered up with the extraneous issues at play here.

kkaefer’s picture

See http://drupal.org/node/153574 for further discussion on this topic.

JirkaRybka’s picture

Just to make something clear: I live in the Czech Republic and speak Czech. From the very first computers until now, the Czech character set has never been fully supported. Yes, with UTF-8 it improves a lot, but there are still lots of places where a full Czech string can't go, including CSS classes and HTML ids, and even more importantly filenames (sometimes it's even impossible to open such a file in Windows!), quoted text transferred across platforms and applications, mobile devices, WAP, SMS... Even Drupal.org shows Czech wrong when viewed under Windows/IE6, where a further conversion seems to occur internally. Every other day there's a place where the Czech charset breaks.

Because of this, people here have somehow accepted the 'plain-text' option, where 'á' becomes 'a' and 'š' becomes 's', for example. SMS always looks like that, and lots of people got used to writing all e-mails and even web posts this way, just to avoid garbage showing up if the recipient is on another platform. It's far from nice, but it's perfectly readable and no data is lost. Search engines seem perfectly capable of accepting such strings as a sort of "common typo".
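For the Czech case described above, where each accented letter simply loses its diacritic, a Unicode decomposition pass is often enough (an illustrative Python sketch, not part of the proposed patch): NFD normalization splits 'á' into 'a' plus a combining accent, which can then be dropped.

```python
import unicodedata

# Illustrative sketch: strip diacritics via Unicode NFD decomposition.
# 'á' decomposes into 'a' + a combining acute accent; we keep only the
# base characters and drop the combining marks.
def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that this only covers combining-mark cases like Czech; conventions such as German 'ä' -> 'ae' or 'ß' -> 'ss' are not decompositions and still need an explicit mapping table, which is why a user-editable mapping file was proposed in the first place.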

But if CCK, Pathauto, or some other part of Drupal replaces all the non-English characters with underscores wherever they're not legal or accessible for some reason - now THAT's data loss, making it all unreadable for both people and search engines. And if every module is required to implement transliteration independently - well, that's a lot of code duplication.

If it's agreed to mark this "won't fix" - OK. I just wanted to share my angle of view.