Problem/Motivation

In order to create automatic machine names, file names, and URL paths in Drupal, we need to transliterate Unicode characters into the ASCII range. This is currently handled by the contrib Transliteration module (http://drupal.org/project/transliteration).

Proposed resolution

Get transliteration functionality into Core.

The idea is to create a factory class, with a reference PHP implementation that does basic transliteration, with a hook allowing contrib modules to supply language-specific overrides of the basic transliteration tables.

There was a discussion of providing an implementation that wrapped the PECL intl's Transliterator class (http://php.net/manual/book.intl.php), which is based on the ICU project (http://userguide.icu-project.org/transforms/general), but as this class was poorly documented and not widely available in standard PHP installations, this idea was dropped.

Remaining tasks

a) [done] Finish the reference implementation.
b) [done] Finish the tests of the reference implementation.
c) [done] Commit this patch.
d) Follow-up tasks and issues:
- #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance
- #1842726: Transliteration component must not contain drupal_alter()
- #1842710: Transliteration: Make language-specific character overrides use YAML files
- #1842718: Use new Transliteration functionality in core for machine names
- #1858376: Provide tests and documentation for long Unicode characters

User interface changes

None. This patch does not provide a user interface.

API changes

No API changes -- this is an API addition that will facilitate getting PathAuto into core, but it doesn't change existing APIs.

Files: 
CommentFileSizeAuthor
#136 drupal8.translit-maintainers.136.patch578 bytessun
PASSED: [[SimpleTest]]: [MySQL] 48,184 pass(es).
[ View ]
#127 transliteration-567832-127-full.patch459.08 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 47,877 pass(es).
[ View ]
#127 transliteration-567832-127-interdiff.txt9.59 KBjhodgdon
#127 transliteration-567832-127-code-only-do-not-test.patch19.71 KBjhodgdon
#118 567832.patch458.92 KBDamien Tournoud
FAILED: [[SimpleTest]]: [MySQL] 47,804 pass(es), 1 fail(s), and 0 exception(s).
[ View ]
#118 567832-interdiff-do-not-test.patch10.91 KBDamien Tournoud
#114 567832.patch458.46 KBDamien Tournoud
FAILED: [[SimpleTest]]: [MySQL] 47,674 pass(es), 2 fail(s), and 0 exception(s).
[ View ]
#112 567832.patch459.54 KBDamien Tournoud
FAILED: [[SimpleTest]]: [MySQL] Repository checkout: failed to checkout from [git://git.drupal.org/project/drupal.git].
[ View ]
#105 transliterate-567832-105-code-only-do-not-test.patch21.86 KBjhodgdon
#105 transliterate-567832-105-full.patch461.23 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 46,864 pass(es).
[ View ]
#96 transliterate-567832-96-code-only-do-not-test.patch23.21 KBjhodgdon
#96 transliterate-567832-96-full.patch462.58 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 46,415 pass(es).
[ View ]
#83 translit-only-code-do-not-test.patch23.61 KBjhodgdon
#75 transliteration-full-74.patch462.98 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 46,234 pass(es).
[ View ]
#75 interdiff-translit-54-74.patch9.33 KBjhodgdon
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch interdiff-translit-54-74.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#54 transliteration-full-54.patch459.09 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 46,205 pass(es).
[ View ]
#50 transliteration-full-50.patch459.03 KBjhodgdon
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch transliteration-full-50.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#50 transliteration-interdiff-50-to-41.txt10.27 KBjhodgdon
#44 transliteration-full-41.patch456.87 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 42,808 pass(es).
[ View ]
#44 transliteration-interdiff-41.txt139.88 KBjhodgdon
#44 transliteration-interdiff-41-to-40.txt64.66 KBjhodgdon
#42 transliteration-full-40.patch450.66 KBjhodgdon
PASSED: [[SimpleTest]]: [MySQL] 42,814 pass(es).
[ View ]
#42 transliteration-interdiff-40.txt96.16 KBjhodgdon
#39 transliteration_no_data.patch14.34 KBamateescu
PASSED: [[SimpleTest]]: [MySQL] 41,108 pass(es).
[ View ]
#39 transliteration_full.patch437.32 KBamateescu
PASSED: [[SimpleTest]]: [MySQL] 41,112 pass(es).
[ View ]

Comments

We have to:

  1. Add transliteration code to core
  2. Write implementations with file and field modules (if I missed something comment please)
  3. Create settings page for default language code selection, if you have an idea where to append - existing settings page - please comment - do we have to overwrite global $language detection?
  4. Make the transliteration data pluggable (with this localizations could add language specific settings)
  5. Write tests???

Related older discussion (that I'm aware of):
#110972: Implement drupal-standard transliteration (resulted in won't fix, unfortunately)
#153574: Add transliteration to core
#43505: Files/URLs are not accessible if they contain reserved or non-ascii characters (and various duplicates linked from there)

Contrib module: Transliteration (edit: Has a bit of good summary on the project page, too.)

Overall, I'm highly +1 to transliteration in core, which is a must for every non-English site.

The main difference is that Transliteration is available now, which implements standardized transliteration.

For machine readable names it's ux issue

Category:feature» bug
Priority:Normal» Critical

Actually this is worse than it looks because completely breaks this functionality for languages like russian. Bumping up to critical with blessing from Dries :)

The transcoded version of эжду3 becomes _3.

Summarizing transliteration comments from other threads: overall transliteration its hard to do, because there are many different contexts for transliteration. Filenames, urls, inline html classes and ids, php functions can contain completely different elements. Most of the earlier issues were focused on filenames and their encoding, this has been progressing in #284899: Drupal url problem with clean urls.

There is a contrib solution: transliteration and a PHP extension.

There is also a good discussion on the subject in #43505: Files/URLs are not accessible if they contain reserved or non-ascii characters but it was marked as duplicate because the focus shifted to file name transliteration instead of a transliteration API in core.

A few more research on this:

  • Transliteration is using tables from the Unidecode perl extension [1]
  • It is not clear where the tables from the translit module come from. The module itself uses a slightly different strategy (it implements transformations like 'han_transliterate', 'greek_transliterate', 'lowercase_latin', and so on, and it's up to the caller to use the correct transformations depending on the context and the language.

[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

Quick grep on 7.x-dev core tarball shows about a dozen of preg_replace() occurences involving the 'A-Za-z' pattern, mostly replacing everything else with underscores or empty strings for various contexts (let alone another handful of preg_match() calls). All these are potential candidates for transliteration, and can't rely on contrib.

BTW, my site have a lot of broken transliterations already fixed in it's machine-readable names of various things (old CCK fields, propagating also to database tables, css classes, and Views fields, old URL aliases etc.), and had trouble with inaccessible filenames - contrary to what was said about %-encoded utf-8 being legal, some browsers seem to truncate URLs on first occurence of an unusual character, resulting in huge spikes of 404-not found traffic (I observed repeatedly). Although the history, implications, ways to fix, and current (often resolved somehow) state of these problems are very different, the mere fact of their amount, repeated occurences, and impossibility of a clean later fix on existing data, all indicate that transliteration as core feature and policy would be helpful to prevent further problems of this sort being introduced all over Drupal codebase, as well as code duplication.

Issue tags:+i18n, +l10n, +D7UX

Issue tags:-D7UX

Can you please not apply tags incorrectly? No reason this to be in my queue.

I'm sorry, I thought, that D7UX tag is for confusing screen elements (like machine readable names as _ds____c___ - another question is, why do we need to display machine readable names).

Version:7.x-dev» 8.x-dev
Component:other» base system
Category:bug» feature
Priority:Critical» Normal
Issue tags:-i18n, -l10n, -D7CX

Category:feature» bug

Excuse me, but why is this a feature request, when it breaks the UI for core drupal for non-latin languages? i understand that transliteration in core is a big thing in itself, and i also dont feel it can be done for d7, but then the auto-generated machine friendly names via JS should be removed.

Category:bug» feature

#576180: Make the machine-name generation pluggeable opened to revert this part of the UI, thus changing this issue to a feature request as sun proposed.

Hi!
I posted the same issue - #646466: Machine name generating not correctly for languages other than English
maybe we can make a check in JS file - if transliteration module is present - make AJAX call to it?

Category:feature» bug
Priority:Normal» Major

For those languages which have completely non-ascii characters like Russian, Arabic, Persian, Chinese, etc... The machine readable name is hidden from the user !
For example a content type named 'مقاله' will have an empty machine readable name with no option to define one!
I suggest that it's better to show the machine readable textfield in these situation where user has entered something, but this field is empty.

Category:bug» feature
Priority:Major» Normal

I think the bug you're talking about is #646466: Machine name generating not correctly for languages other than English. This is a feature request.

I don't really see a decent reason for keeping #646466: Machine name generating not correctly for languages other than English around as a bug report against D7 when this feature request here is its solution (or at the very least *a* solution). Well, unless of course this feature request here is not accepted/fixed or if the final solution is not backportable to D7.

Now since #576180: Make the machine-name generation pluggeable is in and while we wait for a solution in core, I believe that this could be addressed in contrib. Did anyone file an issue in the Transliteration project's queue?

Anybody wants to take on moving transliteration to core in D8?

Perhaps the best thing would be to talk to the respective project's maintainers?

This issue could be addressed by this implementation, although it is javascript-based:
#1517024: Transliterate accented vowels in machine name strings via JS instead of substituting them with underscores

Transliteration might be too much of an overhead for core if the gripe here is machine names. +1 for the patch proposed in #25.

I heard about this thread today chatting on IRC.

As @alexweber said on #26, transliteration seems to be an overhead just for machine names and uploaded file names.

As @Damien Tournoud said on #7, the module Transliteration has many transformations like 'han_transliterate', 'greek_transliterate', 'lowercase_latin'.

Why not separating these transformations per language (or language groups), and downloading them automatically if/when that language is added to the interface? It would help to have less lines of "useless" code.

The issue in #25 is interesting, but does not address non-Latin character sets.

Drupal 8 onward, we should rely on the intl PHP extension. Being part of PHP core since 5.3, this extension is going to be virtually available on all PHP 5.3 installations moving forward.

The library relies on ICU which has a very rich set of transformation capabilities.

The issue summary needs an update by someone that knows where we're headed here :/

The problem is that the Transliterator class from intl is only available in PHP 5.4, so we still need a fallback for 5.3. I started working on this but there is very little progress so nothing to show at the moment. I will assign this issue to me only when I'll have more time to work on it..

I wanted to agree with @Damien Tournoud, but @amateescu is right:

http://php.net/manual/transliterator.transliterate.php

Furthermore, there's a user comment stating that intl isn't enabled on some distros:
http://php.net/manual/en/intl.installation.php

Also, there seems to be a base requirement for the external ICU library itself, and it's not clear to me whether that is also bundled with PHP (doesn't sound so) or which OS/platforms have it installed:
http://php.net/manual/intl.requirements.php

Of course, what we could do is to introduce support for it in core, but which only works on PHP 5.4 (or lower with the PECL intl extension installed).


Lastly, regarding #7, Transliteration module's sources, the project page states:

UTF-8 normalization is based on UtfNormal.php from MediaWiki and transliteration uses data from Sean M. Burke's Text::Unidecode module.

IIRC, at the time when the project was started, there was no generic external user-space library for PHP. MediaWiki's implementation was incomplete, so additional data form Perl was incorporated.

Also relatively important: AFAIK, there have been attempts to contribute improvements back to upstream sources, but that didn't work so well. The transliteration data in the module diverged from the original sources in the meantime.

In general, I hope that @smk-ka can shed some more light on Transliteration module's history, data, and architecture, since he is the main author of the module and has incredible experience in that field.

Also, there seems to be a base requirement for the external ICU library itself, and it's not clear to me whether that is also bundled with PHP (doesn't sound so) or which OS/platforms have it installed:

I know for sure that the extension is bundled with the official build of PHP 5.4 for Windows.

As far as the data is concerned, I was planning to update Transliteration's tables with fresh data by running PHP's Transliterator on each UTF-8 char :)

As far as the data is concerned, I was planning to update Transliteration's tables with fresh data by running PHP's Transliterator on each UTF-8 char :)

That would actually be a nice approach. That would allow us to use the faster native intl when available and fallback to a pre-generated map when not. I assume the pre-generated map will only work for NFC (so it will be slightly less robust / secure then intl), but it doesn't really matter as we currently assume that all input is in a proper NFC form.

...any news here? As I said back in #30 it would be nice if the issue summary got updated so that everyone can tell what needs to be done here. Perhaps even help.

Any ideas about implementation? mostly in context of D8 plugins?
Also we really need a mention of @smk-ka

EDIT: summary changed a bit

Issue summary:View changes

Updated issue summary.

Issue summary:View changes

Updated issue summary.

Issue tags:+API addition

I've revamped the issue summary with the actual, current proposal.

Thanx ;)

Status:Active» Needs review
StatusFileSize
new437.32 KB
PASSED: [[SimpleTest]]: [MySQL] 41,112 pass(es).
[ View ]
new14.34 KB
PASSED: [[SimpleTest]]: [MySQL] 41,108 pass(es).
[ View ]

I had some time to work on this for a couple of days during Drupalcon Munich and was really enthusiastic about the result.. until I hit the last part of this patch: updating the transliteration tables with ICU data.

It was only then that I realized that the Transliterator class from PHP 5.4 (or the 5.3 pecl extension) is not a silver bullet for our US-ASCII transliteration needs :( Short story: the Latin-ASCII script doesn't work as expected, it leaves some (read: a lot of) non-ASCII characters untransliterated.

I'm out of energy for this patch so I'm posting what I have so far, hoping that someone who has more experience with ICU's implementation can either bring it home or at least shed some light on it.

The first patch contains only the review-able code and the second one is the full patch that also has the current data tables from the Transliteration contrib module.

Marking as NR to get some attention on the patch, but it's really NW.. I know there are some doxygen blocks missing, and I also didn't get to the part where I wanted to switch the data tables to JSON for better memory management and also for making a native Javascript implementation possible.

Our approach here should be to get something that works well enough into core. I am extremely hesitant to rely on intl, it's PHP documentation is nonexistent and hosting support is very spotty.
Here's a quote about transliteration itself

In general, transliteration is a hard problem, and the GNU iconv implementation is only sufficient for trivial cases. How a non-ASCII character gets transliterated is not a property of the character but of the language of the text and how it's being used. For instance, should "日" become "ri" or "ni" or something else entirely? Or if you want to stick with Latin-based languages, should "ö" become "o" or "oe"? Expanding to other non-Latin scripts, transliterating most Indic languages is fairly straightforward, but transliterating Thai requires some reordering of characters and transliterating Tibetan requires parsing whole syllables and identifying which letters are in root/prefix/suffix/etc. roles

.

So let's get something practical in which these data tables definitely look like into core and then we can get pathauto in which everyone would love.

Issue tags:+Needs tests, +D8MI

Awesome! icu + fallback covers mostly all
@todo in code could be resolved in follow-up

Also this require tests and review form current maintainers of translitiration

+++ b/core/lib/Drupal/Core/CoreBundle.phpundefined
@@ -51,6 +51,7 @@ class CoreBundle extends Bundle
+    $container->register('transliteration', 'Drupal\Component\Transliteration\TransliterationFactory');

Suppose transliteration makes no sense when site in English (no language module enabled)

StatusFileSize
new96.16 KB
new450.66 KB
PASSED: [[SimpleTest]]: [MySQL] 42,814 pass(es).
[ View ]

So, chx and I worked on this together in IRC today, and here's a new patch. It definitely still needs tests, and I may find some time to work on them sometime soon (in which case I'll assign the issue to myself temporarily)... this patch is totally untested and may even have PHP errors (I guess the bot will take care of at least verifying that maybe?)

What we discussed and I did:
- Added a bunch of documentation.
- Use the quick return value if string is all ASCII (the functions were ignoring it)
- Split some methods up into smaller methods
- Made some in-method static variables into class variables - they may or may not want to be class statics, depending on if memory usage is more/less important than page load time?
- Got rid of the "ICU" class, because we don't think the PECL class is reliable and it's definitely not documented well

Here's a patch and an interdiff... chx: your move!

Assigned:Unassigned» jhodgdon

OK, there are almost certainly some code problems in this patch. :) Anyway, I'm taking on the next step, which is making some tests, which should get it at least to the "this runs" stage.

Assigned:jhodgdon» Unassigned
StatusFileSize
new64.66 KB
new139.88 KB
new456.87 KB
PASSED: [[SimpleTest]]: [MySQL] 42,808 pass(es).
[ View ]

Here's a new patch. The code is essentially the same, except I fixed a few static variable problems. And I added a couple of tests for the basic functionality. And fixed some stray characters I put by mistake into the data files... It seems to work...

Now it's probably time for a real review and some more comprehensive tests and some attention to the data tables? maybe chx now?

(I think maybe the interdiff is going the right way now. One goes back to my last patch, and the other goes back to the patch in #39)

Some thoughts about the data arrays (language-specific overrides and generic tables for transliteration):
- There are language-specific overrides for Spanish and several other languages in language_overrides.php, but I am not sure they are really actually all different from the generic character overrides? We should see if they are and take the ones out that are not; we should also verify the ones that are in there.
- We might want to consider re-architecting this slightly. Currently, the language overrides for every language are read in from one file, but I can imagine we might have a lot for languages like Chinese and Japanese if we were to do this right? I think they use the same characters but would have different transliterations of them? That is likely true of Cantonese vs. Mandarin too... Probably doing them is really kind of impossible? Anyway, it could in principle be kind of inefficient to read in all language overrides in this constructor when the constructor is language-specific? Probably we should only read in the ones that we need, and they should go into a non-static variable too.
- In the tests, I only included a couple of examples for language overrides (which maybe will be in the set of ones taken out?) and of generic characters (only for European languages). There should probably be more test lines added (it's easy to add them, should be pretty obvious) -- in particular, some that go above the 0x00 set of Unicode characters, and some two-byte characters probably?

Thoughts?

I've done a transliteration library that we use in Midgard: https://github.com/bergie/midgardmvc_helper_urlize (available via Composer). Would be great to collaborate!

There is also a very nice library for Node.js that has better transliteration tables than what we have. They're available as YAML files, and so possibly also usable in PHP.

Note: Midgard has been using this for transliteration since 2007.

RE #46 - thanks henribergius! The classes here for transliteration will work well for Drupal's framework, I think... Our main need will be for the data (tables of "character x in language y should be transliterated to z"). Do you have some good data that has been vetted for various languages with linguists or native speakers or something that you can point me to in that repo?

Assigned:Unassigned» jhodgdon

RE #46 - chx and I took a look at the code on https://github.com/bergie/midgardmvc_helper_urlize -- it looks very familiar (obviously came from the same root source/algorithm some time back). Our code mainly differs in that it has the ability to do language overrides, so henribergius you might want to look at that (it's in class PHPTransliterate in the somewhat huge patch (huge because it includes the data files)).

Anyway, chx and I think we'll want to keep our code, which is also integrated into our container/factory stuff.

I'm working on this again today. Plan:
a) The language overrides are currently all having to be in one file. This should be split up by language instead of loading all languages when they aren't all needed (each instance is only one language).

b) We are never going to have full data for all world languages and dialects in core. So, we need to add a hook that would allow contrib to supply language overrides.

c) We now know of 3 sets of transliteration data:
1. The set in this patch.
2. The set at https://github.com/bergie/midgardmvc_helper_urlize
3. The set from node.js at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data
4. And there is probably some open-source data in the Pecl library that we aren't after all using?
(OK, that's probably 4 sets of data.) Anyway, it would be good to understand what the differences are between these data sets, which shoud be fairly easy to analyze.

Oh, and by the way the country/language code for Danish/Denmark is dk, not da as was in the patches above. Hmph.

StatusFileSize
new10.27 KB
new459.03 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch transliteration-full-50.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

Also, the Spanish language overrides, plus some of the German and Danish ones, did not actually override anything (they are standard accented characters done the standard way), so I got rid of those...

Anyway, here's a new patch with:
- hook_transliteration_overrides_alter() [with docs -- I added it to language.api.php for now at least]
- language overrides in separate files, without the extra characters that are not actually overrides, and with Danish having the right language code on the file name.
- A better test that tests language overrides specifically with three languages with different override files (including one without any)

Still to do: look at the differences in the data files (#48c).

Status:Needs review» Needs work

The last submitted patch, transliteration-full-50.patch, failed testing.

+++ b/core/lib/Drupal/Component/Transliteration/data/de.php
@@ -0,0 +1,15 @@
+$overrides['de'] = array(
+  0xC4 => 'Ae',
+  0xD6 => 'Oe',
+  0xDC => 'Ue',
+  0xE4 => 'ae',
+  0xF6 => 'oe',
+  0xFC => 'ue',
+);

I have not yet looked into how this data is generated, the different sources, etc., but one thing that I notice is that there is a missing replacement for the German language: The German "ß" letter should be transformed into "ss", i.e.

<?php
$overrides
['de'] = array(
  ...
 
0xDF => 'ss',
);
?>

Don't know if that helps, but thought I'd point it out.

If you check the x00.php table you can already see it's an ss because this is not language dependent. language overrides are necessary when it's language dependent say ö can be oe or o. Yes, it's german-language specific but if it's specific to one language it never needs an override, actually.

StatusFileSize
new459.09 KB
PASSED: [[SimpleTest]]: [MySQL] 46,205 pass(es).
[ View ]

Looks like that patch needs a reroll, surprise surprise (cough. views. cough) :)

Here's the reroll, straight reroll, no other changes.

Status:Needs work» Needs review

duh.

Status:Needs review» Needs work

Re #53: Thanks very much for the explanation, and sorry for the noise.

Status:Needs work» Needs review

Sorry, that was a x-post.

Some notes that I think we should add to the documentation of the hook and/or class in the next patch:

Transliteration is the process of translating individual non-US-ASCII characters into ASCII characters, which will always be inexact and language-dependent. For instance, the character Ö (O with an umlaut) is commonly transliterated as O, but in German text, the convention would be to transliterate it as Oe or OE, depending on the context (beginning of a capitalized word, or in an all-capital letter context).

The Drupal transliteration process uses a database of generic character transliterations and language-specific overrides. Drupal Core provides the generic transliteration character tables and overrides for a few common languages; modules can implement hook_transliteration_overrides_alter() to provide further language-specific overrides. Characters without default or overridden transliterations are transliterated to the defined "unknown character" in output, which is '?' by default. However, character context (such as all-capitals vs. initial capital letter only) is not taken into account, and in transliterations of capital letters that result in two or more letters, by convention only the first is capitalized in the Drupal transliteration result. So, the process has limitations; however, since the reason for transliteration is typically to create machine names or file names, this should not really be a problem.

Emphasizing the 'dumbness' of the converter IMO is important: The Drupal transliteration process works on one character at a time. The ASCII transliteration can come from generic character transliterations or language-specific overrides but even those are applied to one input Unicode character a time.

@jhodgdon on #58 you are concerned about capital letters. As machine-names are now always lowercase, why not just transliterate everything to lowercase?

Also this patch does not seem to fix the file misc/machine-name.js.
It seems that the best thing to do is to re-open #1517024: Transliterate accented vowels in machine name strings via JS instead of substituting them with underscores and include a bigger table, such as this one, but using only its lowercase part.

Let's take the JS issue separately/later, we should keep it in mind but it's really not the hardest part here :)

New proposed documentation (I think this should go in a @defgroup probably) to make some more things clearer:

Transliteration is the process of translating individual non-US-ASCII characters into ASCII characters, which specifically does not transform non-printable and punctuation characters in any way. This process will always be both inexact and language-dependent. For instance, the character Ö (O with an umlaut) is commonly transliterated as O, but in German text, the convention would be to transliterate it as Oe or OE, depending on the context (beginning of a capitalized word, or in an all-capital letter context).

The Drupal transliteration process transliterates text character by character using a database of generic character transliterations and language-specific overrides. Drupal Core provides the generic transliteration character tables and overrides for a few common languages; modules can implement hook_transliteration_overrides_alter() to provide further language-specific overrides. Characters without default or overridden transliterations are transliterated to the defined "unknown character" in output, which is '?' by default. However, character context (such as all-capitals vs. initial capital letter only) is not taken into account, and in transliterations of capital letters that result in two or more letters, by convention only the first is capitalized in the Drupal transliteration result. So, the process has limitations; however, since the reason for transliteration is typically to create machine names or file names, this should not really be a problem.

After transliteration, other transformation or validation may be necessary, such as converting spaces to another character, removing non-printable characters, lower-casing, etc.

I spent a while looking around for transliteration data today. What seems to be out there:

a) php.net/pecl extension is basically a wrapper on the ICU project's C++-based Transform code. Some links and info:
- overview http://userguide.icu-project.org/transforms/general
- info on their data http://userguide.icu-project.org/icudata
- data grabber tool http://userguide.icu-project.org/icudata -- but I downloaded the transliteration data and it is not in a format that is human-readable or PHP-readable in any obvious way (it's a binary file of some sort that is set up specifically for their classes).

So I don't think we can use their data in any obvious way.

b) our data -- no idea where that came from but maybe amateescu can illuminate us on that.

c) The clean URL generator for Midgard from https://github.com/bergie/midgardmvc_helper_urlize

d) The node.js stuff at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data

So it looks like it's down to comparing b, c, and d, unless someone has an idea on a.

Oh, I finally located the source code for the Pecl/ICU things: http://source.icu-project.org/repos/icu/icu/trunk/source/data/translit/
but they are in rule format so I don't think we can really use them still.

#54: transliteration-full-54.patch queued for re-testing.

Russian translit table should be checked as suggested (russian)

I did not closely follow this issue recently, so I'm not sure what exactly "our" data refers to, but the Transliteration module's original data sources are documented on its project page:

UTF-8 normalization is based on UtfNormal.php from MediaWiki and transliteration uses data from Sean M. Burke's Text::Unidecode module.

EDIT: UtfNormal source code
EDIT: Unidecode data source files

Those two sources seem to be missing in the list.

@amateescu told me in Munich that he wanted to try to auto-generate the mappings by passing every possible Unicode character through an ICU-enabled PHP 5.4 installation, but I'm not sure whether that happened, or whether the data is still the original from Transliteration module.

RE #67 - that idea of passing every unicode character through ICU is exactly what occurred to me during the long drive home yesterday. :) So, I plan to do that.

RE #66 - if someone could take the time to make a string with the Russian alphabet (both lower and upper case, and it can be all smooshed together like 'AaBbCc' or separated or whatever) and a second string with the transliterated Russian alphabet in it (with the same spacing), I'd be happy to add it to the tests. Note that if you have a two-letter transliteration, like ц -> ts, when it is capitalized it should be Ts not TS. I could make these strings myself but I don't have a Russian keyboard and besides that I have no idea what the capital letters are. Well, a small idea (I did learn some Russian a few years ago), but not much. :)

That said, if this Russian test doesn't get created, I am not too worried. We are simply not going to have tests for all of the thousands of characters in hundreds of languages.

Oh yeah, the other thing that needs to be done is a test for the overrides hook. So the current tasks are:
a) Check about 5 data sources (see comments above) and see what the differences are.
b) Probably we want to trust the ICU source, so if possible, make ours match theirs.
c) Maybe add tests for other languages, if people want to supply test strings containing alphabets.
d) Add a test for the overrides hook, using a test module.
e) Add the documentation from #62.

Two sentences in Turkish alphabet, containing non-ASCII characters.:

Abayı serdiler bize. Söyleyeceğim yüzlerine.

Sanırım hepimiz aynı şeyi düşünüyoruz.

Regards,

RE Turkish sentences in #70 - can you also please provide the correct transliterations of these two sentences? I could guess, but then it wouldn't be a very good test. :) Thanks!

Teşekkürler! Added this to https://github.com/bergie/midgardmvc_helper_urlize/commit/2e1de3971460a1... and the tests pass :-)

I think that the transliteration data can be split off into a separate issue -- we should be able to commit this patch with "good enough" data (the current data), and then patch later with better data. So I filed:
#1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance

Meanwhile, I will update the issue summary with a link to this new issue, and I am today taking on:
- adding to the docs
- a test of the alter hook
which I think will get this issue to "ready for commit" stage. Coming shortly...

Issue summary:View changes

Revamped issue summary.

I'm totally down with the proposal in #73. The transliteration data has to be maintained anyway.

Erm, now that I say this...

The actual "blocker" for committing this is "one or more maintainers" who are willing to take on that job of transliteration data maintenance and have at least some minimal experience with the art of transliteration.

Changes to the transliteration data always need to be reviewed and verified by "locals", so the actual job of maintaining the data boils down to 1) ensuring test coverage as needed and appropriate, 2) committing.

Commits are handled by core maintainers already (and I don't see why @jhodgdon couldn't also commit transliteration data changes).

The job of transliteration data maintainers would be to check + review + discuss related open issues for "basic transliteration data changes" compliance. That does not involve changes to the Transliteration PHP component at all, only the data.

StatusFileSize
new9.33 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch interdiff-translit-54-74.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
new462.98 KB
PASSED: [[SimpleTest]]: [MySQL] 46,234 pass(es).
[ View ]

OK, I've taken care of a test for the alter hook. There's a new module system/tests/modules/transliterate_test, and a new line added to the test that verifies it.

I also added a defgroup with the documentation in #62. This is in common.inc, and the classes/hook have @ingroup added so they are part of this topic/group.

I also had to change the test to be WebTestBase instead of UnitTestBase in order to test the alter hook (module_invoke type things do not work in UnitTestBase).

Other than that, this patch is the same as the last few. And I think it may actually be ready to go in now!

whoops, that interdiff file should have been named .txt... sorry testbot!

Status:Needs review» Needs work

The last submitted patch, interdiff-translit-54-74.patch, failed testing.

Status:Needs work» Needs review

RE #74 - I think we should adopt the ICU group's base data (see #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance) and call it What The Outside World Uses and Good Enough.

At that point, if we want to get language-specific overrides in core, that's a matter of adding a small xx.php file for the language code xx (if that language team agrees). Or they can also have a contrib module that provides the overrides, especially if it's a matter of "in this country, we do things a little differently".

Also, as sun says, that does not affect the functionality at all... Of course if the ICU updates their data, we should be able to patch core pretty easily (we could even provide a script to see if there are updates, which should be an output of that other issue anyway).

Ah, ok, now I understand the actual intentions of #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance -- updated that issue's summary accordingly.

However, just to clarify:

If the data in this patch is really the same as in Transliteration module, then it is actually derived (forked) data. That starts with the two original data sources. But the transliteration data has been adjusted and improved by users and contributors of Transliteration module over the years. (@see fixed issues)

That inherently means that this data does not equal any external/upstream resource. Which means we'd have to maintain it ourselves. It also means that this current data might contain improvements that may be missing in a new data source, which could be interpreted as a "regression."

We obviously, ideally, don't want to maintain this data if we don't have to. (Although it already crossed my mind that it would be crazy cool if http://localize.drupal.org would have a Transliteration UI ;)) [And if all fails, since this is a component now, we could as well just move the component onto github and make it a true go-to source for all PHP applications.] Anyway, let's defer that investigation to #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance.

Dear jhodgdon,

Correct literations of the Turkish sentences in #70 would be:

Abayı serdiler bize. Söyleyeceğim yüzlerine. ---> Abayi serdiler bize. Soyleyecegim yuzlerine.

and

Sanırım hepimiz aynı şeyi düşünüyoruz. ---> Sanirim hepimiz aynı seyi dusunuyoruz.

Thanks Kartagis! We could add this Turkish to the tests now... I would love it if someone could review the patch though?

Is it possible to provide a do-not-test.patch for review that does not contain the transliteration data?

StatusFileSize
new23.61 KB

Sure! Here is the patch without the data.

Issue tags:-Needs tests

Thanks!

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php
@@ -0,0 +1,241 @@
+class PHPTransliteration extends Transliteration implements TransliterationInterface {
...
+  public function transliterate($string) {

This code still seems to be largely based on the original UtfNormal.php (some lines are actually identical to ::quickIsNFCVerify()), so we should at least mention this is in the phpDoc of the class (UtfNormal is GPL2+), and actually, I'm not a licensing expert, but don't we have to take over the author/copyright statement, too?

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php
@@ -0,0 +1,241 @@
+    // Initialize the tail bytes table.
+    if (!isset(self::$tailBytes)) {
+      self::setupTailBytesTable();
+    }

Shouldn't this be moved into __construct()?

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php
@@ -0,0 +1,241 @@
+    // parts can be handled much more quickly. Don't chop up Unicode areas for
+    // punctuation, though, as that wastes energy.
...
+      // Counting down is faster. I'm *so* sorry.
...
+        if ($remaining = self::$tailBytes[$c]) {
+          // UTF-8 head byte!
+          $sequence = $head = $c;
...
+            // Look for the defined number of tail bytes...
...
+              // Legal tail bytes are nice.

I realize that the original code comments (mainly originating from UtfNormal) are not really proper, but some of inline comments here have been added by us. Quite a couple of them articulate emotional quirks/frustration (talking with your code, eh? ;)), but do not actually explain what is being done and why.

This is extremely complex code logic that only a handful of people are able to dissect and understand. We really need proper and appropriate code comments to allow new developers who are reading this code in the future to understand what is going on, and why the logic was implemented in *this* way and not another.

However, to clarify: If the comment is literally contained in the original UtfNormal, then we should keep it as is. (!)

Comparing the above outlined comments once again with UtfNormal, only the first two are ours.

That said.

Given all this, I truly wonder why we don't simply incorporate MediaWiki's original UtfNormal.php as is?

It seems to be actively maintained.

For now, we could just simply dump the class somewhere and trick it into our autoloader and/or make the Drupal/Component/Transliteration/PHPTransliteration class load it from within its namespace directory. Ideally, of course, we'd ask MediaWiki to convert it into a PSR-0 component (they require PHP 5.3 already).

That would mean we wouldn't maintain this UTF-8 normalization/transliteration code at all and only pull in updates from upstream. (Frankly, I don't know why @smk-ka decided to fork the code instead of keeping the class as-is when we originally invented the file_translit/transliteration modules — I can only assume it might not have been actively maintained back then.)

Wouldn't that make a tonne more sense?

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php
@@ -0,0 +1,241 @@
+          $n = ord($head);
+          if ($n <= 0xdf) {
+            $ord = ($n - 192) * 64 + (ord($sequence[1]) - 128);
+          }
+          elseif ($n <= 0xef) {
+            $ord = ($n - 224) * 4096 + (ord($sequence[1]) - 128) * 64 + (ord($sequence[2]) - 128);
+          }
+          elseif ($n <= 0xf7) {
+            $ord = ($n - 240) * 262144 + (ord($sequence[1]) - 128) * 4096 + (ord($sequence[2]) - 128) * 64 + (ord($sequence[3]) - 128);
+          }
+          elseif ($n <= 0xfb) {
+            $ord = ($n - 248) * 16777216 + (ord($sequence[1]) - 128) * 262144 + (ord($sequence[2]) - 128) * 4096 + (ord($sequence[3]) - 128) * 64 + (ord($sequence[4]) - 128);
+          }
+          elseif ($n <= 0xfd) {
+            $ord = ($n - 252) * 1073741824 + (ord($sequence[1]) - 128) * 16777216 + (ord($sequence[2]) - 128) * 262144 + (ord($sequence[3]) - 128) * 4096 + (ord($sequence[4]) - 128) * 64 + (ord($sequence[5]) - 128);
+          }
+          $result .= $this->replace($ord);
+          $head = '';

We need some comments here.

I'm not able to locate this snippet in UtfNormal. There are a couple of snippets that seemingly look similar, but are using different ranges/conditions.

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php
@@ -0,0 +1,241 @@
+  static protected function setupTailBytesTable() {
...
+    for ($n = 0; $n < 256; $n++) {
+      if ($n < 0xc0) {
+        $remaining = 0;
+      }
+      elseif ($n < 0xe0) {
+        $remaining = 1;
+      }
+      elseif ($n < 0xf0) {
+        $remaining = 2;
+      }

This chunk also looks identical to the original code snippet in UtfNormal::quickIsNFCVerify().

Any chance to add at least a few inline comments here?

+++ b/core/lib/Drupal/Component/Transliteration/Transliteration.php
@@ -0,0 +1,106 @@
+    $this->dataDirectory = dirname(__FILE__) . '/data';

We can use __DIR__ in Drupal 8.

+++ b/core/lib/Drupal/Component/Transliteration/TransliterationFactory.php
@@ -0,0 +1,31 @@
+class TransliterationFactory {
+++ b/core/lib/Drupal/Core/CoreBundle.php
@@ -126,6 +126,9 @@ public function build(ContainerBuilder $container) {
+    // Register a factory for transliteration.
+    $container->register('transliteration', 'Drupal\Component\Transliteration\TransliterationFactory');

I really don't understand why we need a factory for this. We have an interface. That should be more than sufficient.

If someone wants to swap out the implementation he/she can swap out the entire service, done.

I also don't see why we would need a new class instance for a different language and/or different $unknown_character parameter. The latter actually seems to be a bug, since there's a good chance for calling code to ask for a different $unknown_character, depending on what it wants to do, but still for a language that was requested before already. No need to reinstantiate and reload the entire thing for that.

So I really think we should just get rid of the factory.

Assigned:jhodgdon» Unassigned
Status:Needs review» Needs work

sun I think you forgot to change this to "needs work" based on your review.

At this point, I am bowing out of anything further on this patch. I was brought in to document and add tests to an existing patch (the one on #39), which was my starting point. I didn't create the code, and I made minimal changes to the code in the process beyond what was necessary to get it to pass the tests and a few other minor refactorings. I have no knowledge of where the code came from before I stepped in, and I have no opinion on whether we should adopt a 3rd-party component, whether we need a factory, etc. So I cannot address the review comments from #84.

I will just note that the reason for the factory was that in the original patch, it had a choice of using the PHPTransliterator class or one that directly used the ICU PECL class; we took that option out (see discussion above). If we want the ability to use that in the future, we'll need a factory; if we don't care, we can get rid of the factory.

Anyway. Good luck with this effort. I'll go ahead and un-follow at this point and leave it to you to hash out what should be done. Beyond my sphere...

Assigned:Unassigned» sun

Assigned:sun» Unassigned

I don't know why this is assigned to me. I provided an in-depth review and feedback already.

Assigned:Unassigned» sun

See #85 why.

Assigned:sun» Unassigned
Status:Needs work» Needs review

I do not know the answers to the questions, because I asked them. Asking me to answer them is a logical error.

I'd love to move forward here, but I do think that #84 raised a range of critical issues with the current implementation and possibly direction. Although I silently followed this issue, I do not know who effectively authored the current code, and in turn, who'd be able to address the issues and questions that have been raised.

Oh well, no pathauto for Drupal 8 then. One release can't have everything. Pity.

Was there a reason for adding that factory in the first place?

After reading the review from #84, I think we have two three actionable points here:
1) remove the factory, it is indeed not needed anymore
2) investigate if we can bring in UtfNormal.php
3) (minor) yep, use __DIR__, my bad :/

If the answer for 2) is no for whatever technical/legal reason, we should document that our code is based on it and clean up *our* comments. About putting more comments on *their* code, well.. I certainly don't feel up for the task, so feel free to suggest them.

I think sun raised valid questions in #84, but I don't think they are "critical". I'd categorize these as possible code clean-ups. Would be nice if we could do these before a commit, but based on what I read, I don't think they are showstoppers.

RE #92 - I took a quick look at the UtfNormal.php file. I don't think we want to incorporate it directly -- it's basically doing the factory thing (either using ICU classes or its own code) as well as the transliterate thing all bundled up in one, and as such it's overly convoluted IMO.

But if the code in the patches submitted above came from UtfNormal.php (and the Transliteration project page indicates that is the case), then we should acknowledge this in our file header and/or copyright.txt. It should be fine -- it's GPL -- but it wasn't previously acknowledged in this patch.

Regarding the factory, it still seems like a good idea to me to have the factory class, if only because I believe it makes it easier for a contrib module to swap out for an ICU-based class (or something else perhaps). At least, I think it would, and I think this is desirable. Thoughts?

That leaves swapping out _FILE_ for _DIR_, which is easy enough to do.

Meaning the To Dos would be:

a) Swap out _FILE_ for _DIR_ (quick/easy/not all that necessary anyway, but OK).
b) Add copyright notice for the code provided by UtfNormal.php.
c) Maybe add some in-code comments to further clarify what is being done (I did read/understand the code so can probably do that).

Thoughts -- is this a reasonable course of action? Anything else that needs to be done?

Assigned:Unassigned» jhodgdon
Status:Needs review» Needs work

chx and I chatted in core about the question of the factory.

We came to the conclusion that the main advantage of using a factory class is that the transliteration class can be transient, and once the transliterate call is done its memory can be freed. But the typical use of transliteration is to create machine names (typically once per page load and typically it's an AJAX call so this is the main/only purpose of the page load). So we probably do not need to worry about freeing up the memory and can instead get rid of the complexity of the factory and in CoreBundle just do:

container->register('transliteration', 'Drupal\Component\Transliteration\TransliterationClass');

instead of using the factory (with the corresponding changes to the transliterate() wrapper function in common.inc.

So... I will go ahead and make that change along with a-b-c in #94 and upload a new patch.

Status:Needs work» Needs review
StatusFileSize
new462.58 KB
PASSED: [[SimpleTest]]: [MySQL] 46,415 pass(es).
[ View ]
new23.21 KB

OK, here's a new patch... it's different enough that I do not think an interdiff is all that useful. Changes:
- No factory class. No base class. There is just the interface and the PHP implementation class.
- The class is now not language-specific, but the transliterate() method is.
- I separated the data table loading out into separate functions instead of mixing it into the replace() method.
- Added a test for Turkish (see above text provided by Kartagis; note that the transliteration supplied still had one non-ascii character in it but I fixed that).
- Updated the documentation.
- Use __DIR__ instead of __FILE__
- Add copyright attribution for the MediaWiki class.
- Updated/added some in-code comments (still some of the code is a bit obscure and not sure how to comment it).

I've included one patch with all the data, and another with just the code... hopefully got everything!

Status:Needs review» Reviewed & tested by the community

That looks much better. Thanks.

There are some minor details that could be improved (such as clarifying the differences to UtfNormal in the class' phpDoc), but I think those can be addressed in a follow-up issue/patch.

The character-by-character approach is way insufficient to result in useful transliteration for most languages, so we should focus on building a intl-based transliterator in a follow-up issue.

Status:Reviewed & tested by the community» Needs review

RE #98 - I don't think the PECL intl classes are going to be a good approach. As noted above in more detail:
- The classes have very little documentation
- You have to know which transforms to use for each language, which comes from the ICU project and/or the Unicode.org project, and this also has very little documentation (see #1823454-5: Verify transliteration data sources and their quality, and potentially eliminate maintenance for more on this).
- This extension is not bundled with PHP in general, and is not widely installed.

The approach here, while not a perfect solution, should be good enough:
- It has the ability for language overrides built in, so we can add those in Core if we know what they are.
- Contrib modules can also use a hook to provide additional language overrides or to override Core's overrides.
- And keep in mind that this is not meant to be a perfect solution anyway -- the idea is to provide usable defaults for machine names and URL aliases, which users can always override if they are less than perfect -- we are not claiming to be a good way to transliterate prose.

Status:Needs review» Reviewed & tested by the community

@jhodgdon: I'm not talking about removing the fallback implementation. intl is going to be the best source of transliteration data that we can get our hands on. The extension is not very well documented, but it's just a thin wrapper around ICU. As I explained in the other issue, it's going to be possible (but will require a little work) to use it when available.

In the meantime, I am perfectly happy with this issue getting in, and the intl-base implementation handled as a follow-up issue.

Sorry about the earlier status change from RTBC to CNR (unintentional).

Yes - for now, this is good to go. The implementation is simple enough and swappable. The actual transliteration data needs to be re-evaluated anyway.

That said, even if that won't happen in time for D8, then that would only mean we'd ship with the same transliteration data quality as the contrib Transliteration module does right now, and apparently, that served at least certain users and languages well. So even "if all fails", this state is a considerable improvement over status quo.

+++ b/core/includes/common.incundefined
@@ -7234,3 +7234,65 @@ function queue($name, $reliable = FALSE) {
+function transliterate($string, $langcode = NULL, $unknown_character = '?') {
+  if (empty($langcode)) {
+    $langcode = language(LANGUAGE_TYPE_INTERFACE)->langcode;
+  }
+
+  return drupal_container()->get('transliteration')->transliterate($string, $langcode, $unknown_character);

Not sure we need this helper function, especially since it's not being used anywhere yet, could just use the container directly?

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.phpundefined
@@ -0,0 +1,336 @@
+  public function __construct($data_directory = NULL) {
+    // Set up data directory and tail bytes table.
+    $this->dataDirectory = (is_string($data_directory)) ? $data_directory : __DIR__ . '/data';
+    self::setupTailBytesTable();

Why is_string() instead of isset()?

+++ b/core/lib/Drupal/Component/Transliteration/PHPTransliteration.phpundefined
@@ -0,0 +1,336 @@
+  /**
+   * Implements TransliterationInterface::transliterate().
+   */
+  public function transliterate($string, $langcode = 'en', $unknown_character = '?') {
+    // See if $string is pure US-ASCII, and if so, just return it.
+    if (!preg_match('/[\x80-\xff]/', $string)) {
+      return $string;

This is frightening code but presumably 1. it'll rarely change 2. if we do eventually use the original library again or subclass it we won't have to directly maintain it

@@ -0,0 +1,15 @@
+<?php
+
+/**
+ * @file
+ * German transliteration data for the PHPTransliteration class.
+ */
+
+$overrides['de'] = array(
+  0xC4 => 'Ae',
+  0xD6 => 'Oe',
+  0xDC => 'Ue',
+  0xE4 => 'ae',
+  0xF6 => 'oe',
+  0xFC => 'ue',

It's a little bit odd that core language support is in these raw PHP files and contrib has to implement hooks. Is this something we could maybe move to YAML files at some point? Then contrib could provide those and they could be merged in? There's not currently a nice generic way to grab YAML files so definitely a follow-up - not many modules are going to be implementing this stuff and the code doesn't run in the critical path.

Status:Reviewed & tested by the community» Needs work

Thanks for the review catch ==> just talked in person (woohoo, BADCamp!)... I am going to:
- See if the above are bugs or have reasons ==> fix or document why, depending
- Remove the common.inc wrapper, move the @defgroup to language.api.php, and put the one-liner from the common.inc wrapper into that @defgroup documentation.

Will get to this shortly...

Status:Needs work» Needs review
StatusFileSize
new461.23 KB
PASSED: [[SimpleTest]]: [MySQL] 46,864 pass(es).
[ View ]
new21.86 KB

RE #103 - thanks again for the review! Addressing the comments in order:

- I removed the common.inc wrapper function, and moved the @defgroup to language.api.php. I rearranged the defgroup documentation slightly, and included basically the body of that wrapper function in it to demonstrate how to instantiate the class and transliterate text. Check the language.php segment of the patch to read the revised docs.

- I made the suggested isset change (I was just trying to be careful since I need it to be a string, but it's documented so it will just throw an error if a non-string is passed in, but people in the sprint room here assure me we aren't normally that careful. :) ).

- I'm not sure if you were commenting on the overall scariness of the code or just these specific lines:

    // See if $string is pure US-ASCII, and if so, just return it.
    if (!preg_match('/[\x80-\xff]/', $string)) {
      return $string;
    }

but I realized that these particular lines were actually wrong. This regexp is checking specifically to see if there are any characters in the string in the range of 128-255, which is to say, things like a-with-umlaut. But it doesn't check for things like "some two-byte Japanese character" being in the string... So I just took this particular early-return out -- I'm pretty sure it's just not the right check to be making in the general case of a string of possibly multi-byte characters. It's just a small "let's try to be efficient" thing anyway -- probably best to sacrifice a bit of efficiency for certainty that the code will work correctly.

Regarding the overall scariness of those weird loops, I totally agree. That is the part of the code lifted from Mediawiki and it is definitely hard to follow... I tried to comment most of it but it is definitely dense and yes, unlikely to need to change very often. Some more tests with some Arabic or Japanese or something might help us become more confident in it, but we would need some test language samples in order to do that.

- That's an interesting point on language files (core) vs. alter hook (contrib)... and you might also (fairly) note that the format of the language files is not much the same as the format of the generic files... YML for language overrides could be OK... Follow-up? Is it crucial? If you think so, it could be done. In d7 idiom, an alter hook seems like definitely the way to go; in D8 I am not too sure so would defer to your judgement. Thoughts?

Anyway, here's a new patch. The only changes are:
- common.inc has no wrapper function [had to take this out of the test too, obviously]
- defgroup has been slightly rewritten and moved to language.api.php
- isset instead of is_string as suggested in #103
- took out the above mentioned early return check.

Status:Needs review» Needs work
Issue tags:-Needs architectural review, -API addition, -D8MI

The last submitted patch, transliterate-567832-105-full.patch, failed testing.

Status:Needs work» Needs review
Issue tags:+Needs architectural review, +API addition, +D8MI

#105: transliterate-567832-105-full.patch queued for re-testing.

Status:Needs review» Reviewed & tested by the community

I think we're good to go here. The code looks nice and works well, has tests, and catch's feedback has been addressed.

<?php
   
// See if $string is pure US-ASCII, and if so, just return it.
   
if (!preg_match('/[\x80-\xff]/', $string)) {
      return
$string;
    }
?>

^ Site note: this was perfectly correct. By definition if you only have 00-7F bytes in a UTF8 string, it doesn't have any multi-bytes characters.

RE #109 - really? OK... well we could put that back in. I definitely wasn't sure about that.

Status:Reviewed & tested by the community» Needs work

Darn, sounds like this still needs a little bit of work. Will be thrilled to see it RTBC again! :)

Status:Needs work» Needs review
StatusFileSize
new459.54 KB
FAILED: [[SimpleTest]]: [MySQL] Repository checkout: failed to checkout from [git://git.drupal.org/project/drupal.git].
[ View ]

Here, I rewrote your patch. You are welcome. #goingsun

Status:Needs review» Needs work

The last submitted patch, 567832.patch, failed testing.

Status:Needs work» Needs review
StatusFileSize
new458.46 KB
FAILED: [[SimpleTest]]: [MySQL] 47,674 pass(es), 2 fail(s), and 0 exception(s).
[ View ]

Removed more junk.

Status:Needs review» Needs work

The last submitted patch, 567832.patch, failed testing.

This is failing on PHP 5.3, because of an implementation detail:

Fatal error: Using $this when not in object context in /var/lib/drupaltestbot/sites/default/files/checkout/core/lib/Drupal/Component/Transliteration/PHPTransliteration.php on line 95
FATAL Drupal\system\Tests\Transliteration\TransliterationTest: test runner returned a non-zero error code (255).

I'm not sure how to proceed. We should be able to use ($this), or do we need to alias $this to another object?

The tests I had written all passed with my version of this patch. Can you make an interdiff so we can see what has changed at least? Or post the code on line 95?

Status:Needs work» Needs review
StatusFileSize
new10.91 KB
new458.92 KB
FAILED: [[SimpleTest]]: [MySQL] 47,804 pass(es), 1 fail(s), and 0 exception(s).
[ View ]

This one should pass on PHP 5.3. I don't know why anyone would use anonymous functions on PHP 5.3, the handling of $this make them useless in an object context.

Assigned:jhodgdon» Damien Tournoud
Status:Needs review» Needs work

Ummm...

/**
-   * Table of tail bytes for Unicode characters.
+   * Language code of the current transliteration operation.

I stopped reading the interdiff here... Can you explain in outline what you've done in your changes? In the previous version, there was basically one transliteration object being created per page request, and it handled all languages. Apparently you're doing something totally different... can you explain what's going on? You'll also need to update the @defgroup documentation (at least the code sample in it) and the documentation for the classes, since you have apparently changed them quite a bit.

Anyway, this is apparently not my issue any more (which is good, since I really was only going to document and test it, which I did). But the documentation is now not in synch with the code... So, re-assigning it and marking it "needs work" for documentation at leats.

Assigned:Damien Tournoud» Unassigned
Status:Needs work» Needs review

Nothing changed in the behavior of the class, and the code comments are 100% accurate. Nothing I need to do here.

Status:Needs review» Needs work

The last submitted patch, 567832.patch, failed testing.

Status:Needs work» Needs review

OK, I went back and looked through the whole patch. It looks like Damien has rewritten a lot of the Mediawiki code, which is fine.... But since I didn't fully understand all of that code before, I also don't understand Damien's rewrite of it, and I'm unable to review the patch. Some in-code comments would really help in a few places in the rewritten PHPTransliteration class to explain the following... or at least put some comments here so someone else can update the code documentation:

1.

+  /**
+   * Implements TransliterationInterface::transliterate().
+   */
+  public function transliterate($string, $langcode = 'en', $unknown_character = '?') {
+    $this->currentLangcode = $langcode;
+    $this->currentUnknownCharacter = $unknown_character;
+    return preg_replace_callback('/([\x{0080}-\x{ffff}])/u', array($this, 'transliterateCallback'), $string);
+  }

What is this preg_replace_callback actually doing? An in-code comment would be helpful.

2. Apparently there is no Mediawiki code left here (is that true?) in which case we can probably take out the attribution at the top?

3.

+  /**
+   * Convert a single UTF-8 character into its code point.
+   *
+   * @param string $character
+   *   A single UTF-8 character.
+   *
+   * @return int
+   *   The code point of the UTF-8 character.
+   */
+  protected function convertUTF8ToCodePoint($character) {
+    $first_byte = ord($character[0]);

I had to ask Damien in IRC what "code point" means... and the doc should start with "Converts" anyway... How about changing the first line to say "Returns the numeric character code of a UTF-8 character." and not using the industry-specific "code point" term that may not be familiar to non-experts? That's assuming that this is what the function is really doing...

I guess I don't really get this anyway. Isn't the character code just the same bytes as the character string? Why is all of that munging necessary? Again, some documentation would be useful.

(the test failure above is due to HEAD being broken on Views, not from this patch).

Damien asked in IRC for a review of the approach. I think it's probably a good approach -- I like the idea of a preg_replace_callback, although it does complicate things a bit in that you have to basically have a class-global variable there to keep track of the parameters of the "current" transliteration... This would not work if the same class was being used from multiple places in a multi-threaded environment (the DIC will only instantiate it once)... Is that even possible in PHP? Probably not, but such things always scare me from my past as a Java programmer.

So... basically the "current state" variables are scary to me, and I think this patch needs some extra documentation, but I guess the approach is OK. I would feel a lot more comfortable without the need for the "current state" variables.

@jhodgdon the regex '/([\x{0080}-\x{ffff}])/u' can be replaced by something like this in PHP:

function is_ascii($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        if (ord($str[$i]) > 127) {
          return false; // not ASCII
        }
    }
    return true; // is ASCII
}

Not sure if it really helps.

Status:Needs review» Needs work
Issue tags:+Needs tests

So... a few more thoughts on the current patch:

- If you used preg_split (or mb_split?) instead of preg_replace_callback in the transliterate() method, I think you could avoid having those scary (to me) state variables on the class. I'm assuming in D8 we can count on the mb* functions being available, right?

- I was looking at the comments on the php.net page for the ord() function, which only works for ASCII or at least only for 1-byte characters, and there someone suggested that to convert a UTF-character into its integer code you can just do hexdec(bin2hex($char)) -- which is a lot simpler than what this class is currently doing in convertUTF8ToCodePoint() (and a lot simpler than the MediaWiki code too). Someone else suggested unpack("N", $char). I think either of these might work... is there a reason we need all of that logic?

- We probably ought to put in at least a few tests that transliterate to multi-byte characters, especially since you've rewritten the code extensively from the presumably working version that has been in the Transliteration contrib module (and MediaWiki) for a while. We can pull the expected transliterations out of the current data files (verifying the goodness of the data files is a separate issue anyway) -- what we want to test is that all the unpacking and file reading stuff is still working. Probably at least one two-byte, one three-byte, and one four-byte character would be a good idea, plus maybe a test involving the characters in the 'eo' and 'kz' language override files. I'm unclear myself on which characters are two-byte, three-byte, and four-byte, so maybe you could pick some out of the appropriate xNN.php files and add them to the tests?

Given this, and #122's request for more in-code comments, and #123's concern about the state variables (I think we can do without them, see above in this comment), I think the correct status is "needs work".

I'm actually willing to take this on again if you don't want to work on it, so let me know Damien.... Some guidance from someone who knows more than I do about multi-byte characters on which characters to pick out for the tests would be helpful though!

Assigned:Unassigned» jhodgdon

No response from Damien (possibly due to time zones) so I am going to take a stab at another patch since it does not appear that will be a duplicated effort. :)

Status:Needs work» Needs review
StatusFileSize
new19.71 KB
new9.59 KB
new459.08 KB
PASSED: [[SimpleTest]]: [MySQL] 47,877 pass(es).
[ View ]

OK, here is a new patch, without the state variables -- I used preg_split() instead.

It turns out we need the somewhat complex logic used to construct the Unicode character code from the character. That's apparently just how they work.

I also added a couple more tests for 3 and 4-byte characters. It's all working and I think it's ready to go (again)... the code is actually making more sense and getting cleaner as we go...

Attachments:
- full patch (hopefully committable)
- interdiff from #118 (from damien's final patch)
- code-only patch without the data files (easier to review)

Status:Needs review» Reviewed & tested by the community

Wow, this looks pretty awesome. Could very well imagine this to become a leading implementation.

There are a couple of mistakes contained though, primarily regarding the class being in Drupal\Component, but the documentation/comments has Drupalisms and Drupal-specific things all over the place; including a drupal_alter() in readLanguageOverrides().

However, those issues are extremely minor, and we can happily fix them post-commit (most likely, by subclassing the PHPTransliteration class into Drupal\Core\Transliteration, just to extend the generic class and plug in the drupal_alter()) — no need to hold up this wonderful patch for that.

The issue with preg_split() is that we are decoding the full string in memory. We can use preg_replace_callback() and avoid storing state variables in the main class by creating a PHPTransliterationOperation class just for this, and move ->ordUTF8 and ->replace in there.

RE #129 - I don't see why this is a problem. The string is already in memory, of course, so it's not all that much overhead to make it into an array of characters. The typical use case for this type of machine-provided lookup-table transliteration is to make a URL or a machine name for something like that, so it's typically only a few characters. I imagine that in most cases, the overhead involved in instantiating a new class (which would likely involve making a copy of the string to be transliterated anyway) would be far greater than the overhead involved in splitting it into an array of characters?

Also, if you were really trying to transliterate long pieces of text (e.g., prose rather than making machine names), this class is unlikely to be adequate, due to issues discussed above in this issue (and mentioned in the @defgroup text in this patch, and in the documentation for the class).

Assigned:jhodgdon» Unassigned

I realized maybe I should un-assign this to myself, since lately "assigned to jhodgdon and RTBC" in the core queue usually means "jhodgdon is going to commit this". :)

Priority:Normal» Major

This seems worthy of calling a "major" feature request. Unfortunately we're all stuffed up behind major bugs, so I can't commit it right now. :(

Title:Transliteration in coreChange notice: Transliteration in core
Category:feature» task
Priority:Major» Critical
Status:Reviewed & tested by the community» Active

OK this has been RTBC for several days now, and we're about right with thresholds (101 major bugs as I type this), so committed/pushed to 8.x, thanks all!

Let's open a follow-up for some of these issues before closing this one:

- moving the Drupal-specific stuff out of the component to a sub-class in Core, but leaving most of it in component, per sun in #128. No drupal_alter() allowed in \Component

- whether to find another format for the raw PHP files per me in #103.

- can we use this for uploaded file names as the contrib project does at all?

- could do with a MAINTAINERS.txt entry

RE #133:

- No drupal_alter() in \Component -- to explore this, I filed:
#1842726: Transliteration component must not contain drupal_alter()

- Regarding the file format, I think this is regarding only the language overrides... assuming so, I've filed:
#1842710: Transliteration: Make language-specific character overrides use YAML files
which hopefully captures this idea.

- Regarding using this in core for uploaded file names, I am not exactly sure what we want to use Transliteration for in core, but I've filed this issue to explore and/or do that:
#1842718: Use new Transliteration functionality in core for machine names

- Regarding MAINTAINERS, are you asking me to sign on as maintaining this? I guess I could do that if you think this one class needs a MAINTAINER. :) But I didn't make a patch yet... wasn't sure whether to put it into Base System or Language System, and I didn't want to sign on as a maintainer of either of those systems (but I guess we can included it as a specific item within one of those). Thoughts?

I'll add these three follow-up issues to the issue summary and create a change notice after I save this comment.

Issue summary:View changes

totally rewrite the summary

Title:Change notice: Transliteration in coreTransliteration in core
Category:task» feature
Priority:Critical» Major
Status:Active» Fixed

I've added an issue summary that I think captures the current state of the transliteration class. As some of those follow-ups get acted upon, we can iterate on the change notice as needed to keep up with the changes.
http://drupal.org/node/1842748

Since the issue summary has been updated, the change notice created, and all the follow-ups filed (and noted in the issue summary), I think this base issue can now be put back to "fixed", right?

Status:Fixed» Needs review
StatusFileSize
new578 bytes
PASSED: [[SimpleTest]]: [MySQL] 48,184 pass(es).
[ View ]

Regarding MAINTAINERS.txt, I think we definitely need some names there, since this is a very complex topic.

How about this?

I'm fine with that :)

So you think this needs to be its own Component? OK... and I'm OK with being included as a maintainer.

Guess we need the +1 of Damien, so leaving to him to RTBC.

Status:Needs review» Reviewed & tested by the community

+1 here, too.

Status:Reviewed & tested by the community» Fixed

Committed and pushed to 8.x. Thanks!

For anyone following this, there's a new follow-up issue:
#1858376: Provide tests and documentation for long Unicode characters

Do we need to have an issue tag or issue Component for Transliteration by the way?

Issue summary:View changes

Added follow-up issues to the summary

Status:Fixed» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Status:Closed (fixed)» Needs work

Hi,

I git clone'd d8 today, and immediately tried the new transliteration thingie. I created a content type "Fotoğrafçılık" and its machine name was converted to "fotograf__l_k" when it should have been "fotografcilik" instead.

Regards,

Status:Needs work» Closed (fixed)

@Kartagis: while the base function was added, it should still be applied to Drupal core at various places, such as content types. See #1842718: Use new Transliteration functionality in core for machine names. This issue was about the base functionality.

Component:base system» transliteration system

Component update for posterity.

Issue summary:View changes

add another follow-up issue to the list