Transliteration

smk-ka - December 9, 2007 - 11:57
Transliteration of upload filenames

This module provides a central transliteration facility for other Drupal modules, as well as sanitizing of file names during uploads.

Generally speaking, it takes Unicode data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration, i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system.
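To make the idea concrete, here is a minimal Python sketch of table-based transliteration to US-ASCII. It is not this module's code (the module is written in PHP and uses data derived from Unidecode); the function name to_ascii and the tiny EXTRA_MAP table are hypothetical and exist only for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny replacement table for characters that
# have no Unicode decomposition. Real tables (such as Unidecode's) cover
# a far larger part of Unicode.
EXTRA_MAP = {
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u00f8": "o",   # LATIN SMALL LETTER O WITH STROKE
    "\u0434": "d",   # CYRILLIC SMALL LETTER DE
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
}


def to_ascii(text, unknown="?"):
    """Return a best-effort US-ASCII representation of text."""
    out = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for char in unicodedata.normalize("NFKD", text):
        if ord(char) < 0x80:
            out.append(char)  # already US-ASCII
        elif unicodedata.combining(char):
            continue  # drop accents split off by the decomposition
        elif char in EXTRA_MAP:
            out.append(EXTRA_MAP[char])
        else:
            out.append(unknown)  # no known replacement
    return "".join(out)


print(to_ascii("Caf\u00e9 Stra\u00dfe"))  # Cafe Strasse
print(to_ascii("\u0434\u0430"))           # da
print(to_ascii("\u65e5\u672c"))           # ??
```

The last call shows why the size of the table matters: any character without an entry falls back to the placeholder.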

According to Unidecode, from which most of the transliteration data has been derived, "Russian and Greek seem to work passably. But sometimes the output is very dirty: it works quite bad on Japanese and Thai."

If you would like to help make transliteration better, the following source might serve as a starting point: the Unicode Common Locale Data Repository (CLDR), especially its guidelines and available transliteration charts.

Should I use transliteration?

There is no general answer to this question, since it depends on what you want to do with user-submitted file uploads. There are two simple cases in which you might not need transliteration:

  1. You let users upload files to your site and offer these files for download without PHP processing, and you are on Drupal 6 or later; or
  2. you are sure your users won't upload files or images with non-ASCII filenames.

However, whenever you process uploaded files on the server, you most likely need transliteration, for example if you are using ImageCache to provide modified versions of uploaded images. The reason is that PHP 5 doesn't fully support Unicode in filenames and may not be able to read files with non-ASCII characters in their names.
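As a rough illustration of what sanitizing an upload filename can look like, the following Python sketch reduces a name to US-ASCII before it is stored. It is an assumption-laden stand-in, not the module's implementation: the names clean_filename and _ascii_only are hypothetical, and non-Latin characters are simply dropped here, whereas the module transliterates them.

```python
import os
import re
import unicodedata


def _ascii_only(text):
    """Strip accents via NFKD, then drop any remaining non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_filename(filename, fallback="file"):
    """Return an ASCII-only variant of an uploaded file's name."""
    stem, ext = os.path.splitext(filename)
    # Replace runs of whitespace and other risky characters with "_".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", _ascii_only(stem)).strip("._")
    # Keep only letters and digits in the extension.
    ext = re.sub(r"[^A-Za-z0-9]+", "", _ascii_only(ext))
    # Never return an empty name, e.g. when every character was dropped.
    name = stem or fallback
    return f"{name}.{ext}" if ext else name


print(clean_filename("\u00c9milie r\u00e9sum\u00e9.pdf"))  # Emilie_resume.pdf
print(clean_filename("\u65e5\u672c.png"))                  # file.png
```

The second call shows the limit of accent stripping alone: without transliteration data the whole name is lost, and only the fallback remains.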

Whether you use transliteration for URLs (requires Pathauto), however, is a matter of personal taste. For example, Wikipedia uses full Unicode in its URLs. On the other hand, one user noted that Unicode links in e-mails look quite ugly.

On Drupal 5, transliteration is required, since Unicode characters in generated URLs (for example, for file attachments) are not properly encoded in certain cases; see #191116: Make drupal_urlencode RFC 1738-compliant.
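For readers unfamiliar with the encoding problem, this short Python sketch shows what a properly percent-encoded URL path with a non-ASCII character looks like. It uses Python's standard urllib.parse.quote, not Drupal's drupal_urlencode, and the file path is a made-up example.

```python
from urllib.parse import quote, unquote

# Hypothetical path of an uploaded attachment; "\u00dc" is "U with diaeresis".
path = "files/\u00dcber uns.pdf"

# Encode each non-ASCII character as its UTF-8 bytes in %XX form,
# leaving the path separator "/" untouched.
encoded = quote(path, safe="/")
print(encoded)  # files/%C3%9Cber%20uns.pdf

# Decoding restores the original path.
assert unquote(encoded) == path
```

A transliterated filename avoids the issue altogether, because an ASCII-only name needs little or no escaping.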

Installation

Please see the current README.

Roadmap

Planned features for the next major version (3.x):

  • New developer-friendly transliteration data file layout
  • Lower memory footprint of the replacement function
  • Make filename cleaning optional
  • Move retroactive filename cleaning to the backend
  • #362006: Enable transliteration on specific pages (unsure)

Credits

Authors:

  • Stefan M. Kudwien (smk-ka)
  • Daniel F. Kudwien (sun)

UTF-8 normalization is based on UtfNormal.php from MediaWiki and transliteration uses data from Sean M. Burke's Text::Unidecode module.

Sponsor:

UNLEASHED MIND
Specialized in consulting and development of Drupal-powered sites, our services include installation, development, theming, customization, and hosting to get you started.

Releases

Official releases

  Release        Date          Size        Status
  6.x-2.1        2009-Jun-09   97.26 KB    Recommended for 6.x
  5.x-2.1        2009-Jun-09   97.16 KB    Recommended for 5.x

Development snapshots

  Release        Date          Size        Status
  6.x-3.x-dev    2009-Jul-07   100.16 KB   Development snapshot

Development snapshots are automatically regenerated and their contents can frequently change, so they are not recommended for production use.
