Updated: as of comment #56

Follow-up to #567832: Transliteration in core:

Problem

It is unclear whether our transliteration data is correct, and whether it is the best source long-term.

Goal

  1. Ideally, find a reliable upstream data source that completely eliminates downstream maintenance in Drupal core.
  2. Alternatively, ensure that we're using the best of available data sources.

Possible Data sources

Here is a list of possible data sources. I've given each one a nickname for reference below.

"core current"
The data that is currently in the Transliteration component in Drupal core. This came from the contrib Transliteration module, which had the following notes on the data source:

UTF-8 normalization is based on UtfNormal.php from MediaWiki and transliteration uses data from Sean M. Burke's Text::Unidecode module.

"cpan"
The Text::Unidecode CPAN module mentioned as the source for the Transliteration contrib module.
"intl"
The data in the php.net/pecl intl extension's Transliterator class, which wraps the ICU project's C++-based Transform code. Some links and info:
  • PECL classes: http://php.net/manual/book.intl.php
  • ICU overview: http://userguide.icu-project.org/transforms/general
  • Info on ICU data: http://userguide.icu-project.org/icudata
  • To transliterate with this extension, you have to define a set of transformations to apply, and those transformations are really meant for transliterating bodies of text rather than single characters. Within the scope of this issue that is probably OK: we are trying to build better character-by-character data tables for our existing Transliteration component, which is used for things like making machine names for files, not for producing really good transliterations of, for instance, Russian prose. This topic is discussed in comments #13 and #16-#19. A minimal usage sketch follows this list.
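
As a rough illustration, here is a minimal sketch of single-character transliteration via the intl extension (this assumes the extension is installed; 'Cyrillic-Latin' and 'Latin-ASCII' are standard ICU transform IDs, not the exact set used later in this issue):

<?php
// Sketch: character-by-character transliteration with the intl extension.
$transliterator = Transliterator::create('Cyrillic-Latin; Latin-ASCII');
print $transliterator->transliterate('Ж') . "\n"; // Expected: something like "Z".
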
"unicode.org"
The Unicode.org data that the ICU project apparently bases their transform code on. This is available from http://cldr.unicode.org/index/downloads
"midgard"
The data in the clean URL generator for Midgard from https://github.com/bergie/midgardmvc_helper_urlize
"node.js"
The node.js stuff at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data -- this is in YAML format.
"JUnicode"
The JUnicode project, which claims to be based on the Perl Text::Unidecode module (see above) with some additional data. http://www.ippatsuman.com/projects/junidecode/index.html

Tasks

  1. Read in these various data sources, and output each one into the format of our current data files (a sketch of that format follows this list).
  2. Note what the differences are between the data from each source.
  3. Come up with a patch that replaces our current data.
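
For reference, the per-character data files in core/lib/Drupal/Component/Transliteration/data look roughly like the sketch below (written from memory of the component's format rather than copied from core; the values shown are illustrative):

<?php
// Sketch of the shape of a data file such as x01.php.
// Each file covers one bank of 256 code points; index 0x00 here is U+0100.
$base = array(
  0x00 => 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c',
  // ... 248 more entries; NULL marks characters with no known transliteration.
);
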

Notes on these tasks and the data sources

  • jhodgdon made a script called "readdata.php" that reads in various data sources and outputs them in a standard format (the format used in the Drupal Transliteration component) for comparison. The latest version of this script is attached to comment #56 at this time.
  • The data sources that can be compared with this script are: "core current", "midgard", "cpan", "node.js", "JUnidecode", and "intl". For the "intl" data source, a hopefully canonical list of transformations was chosen that covers many Unicode scripts.
  • In comment #5, an output file was generated that highlights differences between these data sets (except "intl"). In comment #11, these differences were analyzed and summarized. The conclusion there was that:
    - Case differences between our current data set and other data sets were generally correct in ours.
    - There were also many other differences (not just case) between our data set and others; these were not systematically analyzed.
    - Some of the other data sets have data where we don't, and we should probably import this data if we can so at least we have something to work with in those character ranges.
  • In comment #20, the first viable pass at the "intl" data set was done (after choosing a set of hopefully canonical transformations). After spot checking, the conclusion was that we should use the intl output in place of ours where they differ, and as additions where ours is missing, because it is more accurate. The other conclusion was that the data from the other data sets was not trustworthy enough to adopt where ours was missing.
  • After comment #20, there was a side excursion into 5-byte characters that ended up being put on a different issue.
  • Then there were several iterations on making a patch that fixes up our data set using the hopefully canonical set of "intl" transformations, and the latest patch at this point is in comment #56, along with the script used to generate it.

Comments

jhodgdon’s picture

As a note to whoever takes this on: Please attach any scripts you used to do your testing/comparison to this issue! We will probably want to be able to run the "take the ICU data out of PECL and put it into a format we can use" process again later on.

jhodgdon’s picture

Issue summary: View changes

add link to pecl class

sun’s picture

Title: Test data sources and have better data for transliteration » Verify transliteration data sources and their quality, and potentially eliminate maintenance
Issue tags: +transliteration

I revamped the summary and also adjusted the title to reflect the actual goal here.

sun’s picture

Quoting myself from #567832-79: Transliteration in core:

if all fails, since this is a component now, we could as well just move the component onto github and make it a true go-to source for all PHP applications.

I hope that this investigation will result in a solution that makes that idea unnecessary. However, in case it does not, it still seems like a pretty good fallback to me.

jhodgdon’s picture

Assigned: Unassigned » jhodgdon

I am working on this.

jhodgdon’s picture

Status: Active » Needs review
FileSize
14.56 KB
2.16 MB
1.59 MB

So. I have created a MONSTER!! Well, actually it's a script that will read in all these data sources and compare them.

I ended up not doing the 'intl' PECL extension. The reason is that its transform operations are based on knowing exactly which scripts you want to transliterate, which is fine for some use cases but not this one. I don't want to become a Unicode/language expert and figure out which of their 257 possible transformations would be good to apply, and I am not even sure that their list of transformations is complete.

I also found a couple more data sources:
a) unicode.org has XML-based transliteration data, which seems to be what the 257 transformations in the ICU project are based on, so again not useful unless you want to figure out which specific transforms to use... These can be downloaded at http://cldr.unicode.org/index/downloads if you are interested, and there is also some documentation there.

b) The JUnidecode project, which claims to be based on the Perl Text::Unidecode module (see above) with some additional data. http://www.ippatsuman.com/projects/junidecode/index.html

Anyway... With that said [and I'll update the issue summary], I have done an analysis of all of this data. Attached is the script I used and the CSV output of the differences. There is one differences file that has all differences; the other omits the lines where our data set had a value and one or more of the other data sets was missing a value.

I'm not sure what to do with this information though... It looks like there are some real differences between our data set and the others (like 30,000 of them!), and a few cases where one of the other data sets provides data that is missing from ours. But it is difficult to evaluate whether our data or the other data sets are better.

So I'd be inclined to mark this as "won't fix", after all... thoughts?

jhodgdon’s picture

Assigned: jhodgdon » Unassigned

This is about all I can do...

jhodgdon’s picture

Issue summary: View changes

Revamped issue summary.

jhodgdon’s picture

Oh, and I just want to note about the script in #5 -- I'm sure it could be made cleaner and more efficient. My goal was to get this done, not to make beautiful, efficient code. It runs in a few seconds, so I think it's "good enough" for now (plus it is well documented and, I believe, correct).

Damien Tournoud’s picture

I ended up not doing the 'intl' PECL extension. The reason is that its transform operations are based on knowing exactly which scripts you want to transliterate, which is fine for some use cases but not this one. I don't want to become a Unicode/language expert and figure out which of their 257 possible transformations would be good to apply, and I am not even sure that their list of transformations is complete.

Believe me, it *is* possible to use the intl extension, and this is the *best* source of transliteration data that we can get. It will require some work, for sure, but it is possible.

The main (possibly only) issue is that the intl extension (and possibly ICU too) does not implement the proper inheritance rules between language tags that are defined in the CLDR repository.

We are going to have to implement those ourselves, and to do that we need a (limited) number of data tables:

  • The likely-subtags table (supplemental/likelySubtags.xml from CLDR), which gives us the most likely expanded tag of a language (i.e. it will transform "tr" into "tr_Latn_TR"; see the sketch after this list);
  • Optionally, the supplemental language data table (supplemental/supplementalData.xml from CLDR), which gives us the other scripts for the language (because some languages have several scripts, or it would not be fun).
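
As it happens, the likely-subtags expansion is already exposed by the intl extension's Locale class, so something along these lines may cover the first table without parsing likelySubtags.xml ourselves (a sketch only; worth verifying against the CLDR data):

<?php
// Sketch: expand a bare language code to its most likely full locale ID.
// Locale::addLikelySubtags() is backed by the CLDR likely-subtags data.
print Locale::addLikelySubtags('tr') . "\n"; // Expected: "tr_Latn_TR"
print Locale::addLikelySubtags('ja') . "\n"; // Expected: "ja_Jpan_JP"
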
Damien Tournoud’s picture

As an additional note: we need the intl extension because character-by-character transliteration is not going to be useful for many languages around the world. On the other hand, implementing a full PHP-native transliteration engine based on the raw CLDR data is not going to be a lot of fun (and it will require the data tables from the Unicode Character Database, which I just found can be nicely packed into a SQLite database).

jhodgdon’s picture

RE #8 - What I meant was that, for this exercise, I was not able to generate tables from the 'intl' extension to compare to the other data sources in the time I had available.

And #9 is not relevant to this issue, which is about improving (maybe) the character-by-character tables, not about doing transliteration in a totally different way. Please file a different issue to discuss using 'intl' and doing transliteration better if you want to discuss that.

By the way, what is this "CLDR" data you are talking about -- is that unicode.org under a different name or a different data source not listed above? And it looks like that sqlite database is a database of unicode character bitmaps, not transliteration (or maybe it includes both?)... if so, not relevant for this purpose...

jhodgdon’s picture

FileSize
1.51 MB
14.85 KB

I am returning to this issue... and did a few checks on this data in #5 today:

a) I checked over the places where we had case differences, by going to unicode.org and looking at the characters. There were actually only a few, and our data turned out to be correct, while Midgard, CPAN, and Node.js had incorrect data:
0x014a,"NG" vs. "ng" (upper case is correct)
0x014b,"ng" vs. "NG" (upper case is correct)
0x1e83,"w" vs. "W" (lower case is correct)

There were a couple of other inconsequential case differences, like whether the TM symbol should be transliterated as (tm) or (TM), but I do not think these matter much if at all. So we do not need to patch our data for this, and the only other data set that had the correct values was JUnidecode.

b) There are several character ranges where other data sets have data and ours has none:

- 0x0a01 - 0x0a7f - Gurmukhi (a South Asian script) - Midgard has data
- 0x0a80 - 0x0aff - Gujarati (South Asian) - Midgard
- 0x0b00 - 0x0b7f - Oriya (South Asian) - Midgard
- 0x0b80 - 0x0bff - Tamil (South Asian) - Midgard
- 0x20b0 - 0x20ba - currency symbols - JUnidecode
- 0x2100 - 0x214f - letter-like symbols - JUnidecode
etc.

I think we should probably go ahead and adopt the data from these other sources where we can, and I can make a patch... it seems to me that if Midgard or one of the other projects has some way to transliterate the characters, it would be better to adopt those than to have nothing? I checked on licensing:
- Mediawiki is GPL
- The CPAN module is GPL
- I cannot find a license statement about Midgard or the node.js stuff, so we would need to ask them I guess? Oh, they're by the same person (Henri Bergius, henribergius, http://drupal.org/user/1885584) and he commented on the original Transliteration issue and wants to collaborate. :) I'll send off a query.
- JUnidecode is BSD (is that compatible?).

c) There are also a ton of places where we have differences between our choices and other data sets' choices... Those will be much harder to sort out!

Meanwhile, I updated my script a bit and have generated a new "differences" file (with fewer false positives), so attaching those.

sun’s picture

Wow, thanks a lot for doing that detailed analysis, @jhodgdon!

I agree that if there's any additional data that we can incorporate, then it cannot hurt to simply do it. :)

In turn, doing so could mean that Midgard might be able to fully replace its own implementation + data with our Transliteration component.

Regarding differences though, I'm rather skeptical... it's perfectly possible that some or most of them are caused by earlier transliteration data patches/contributions to the contrib Transliteration module. I guess we'd have to check the git history/log to be sure, and I don't really think that's worth the effort.

jhodgdon’s picture

FileSize
1.44 KB

So... Regarding #8 above commenting about using the intl package...

The issue at hand here is whether we can generate a better set of tables for the simple lookup-based transliteration that we are currently doing in Core (and which is similar to what many other projects are doing, see issue summary). Using the intl class to do transliteration better (thinking about context etc.) is a separate issue, beyond the scope of this issue.

So for this issue... What I'm wondering is whether we can mine the intl data somehow to get a better set of "the generic way to transliterate character X to US-ASCII", and also if we can mine it somehow to get a better set of "language-specific overrides of the transliteration of character X to US-ASCII" -- both of these are data sets we need in Core for the PHPTransliteration class(es) that we have.

The problem I had in trying to do that is that intl doesn't have a concept of "generic" transliteration. It has instead a concept of "transform text from one representation to another", including upper to lower case, etc. So what I don't know is which of their 257 supported transformation rules (of which 52 of them are something -> Latin), plus other transformations you can do (such as "separate out accents"), would be appropriate to apply to generate generic and/or language-specific tables for our classes.

What I've done so far is to use these operations:
NFD
Cyrillic-Latin
Greek-Latin
Hangul-Latin
Hiragana-Latin
Katakana-Latin
NFD
[:Nonspacing Mark:] Remove
NFC

Note: NFD decomposes characters, separating accents from their base characters. Later on, the non-spacing mark step removes the accents completely, and then NFC re-composes everything that remains.
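
Composed into a single intl call, that chain looks roughly like this (a sketch; the full chain used by readdata.php is in the attached script):

<?php
// Sketch: build one compound transliterator from the operations listed above.
$transliterator = Transliterator::create(
  'NFD; Cyrillic-Latin; Greek-Latin; Hangul-Latin; Hiragana-Latin; Katakana-Latin; NFD; [:Nonspacing Mark:] Remove; NFC'
);
print $transliterator->transliterate('Déjà vu Ж') . "\n"; // Roughly "Deja vu Z".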

Does anyone have any suggestions about the other 47 rules -- which ones should be applied to get as complete a set of characters as possible? And are there other rules I don't even know about? Attached is a list of the (something) -> Latin rules that I generated from print_r(Transliterator::listIDs()); followed by a grep... I can't really find any documentation on these rules though... any thoughts?

jhodgdon’s picture

RE #12 - I am inclined to agree on that -- leave the differences alone. The main objective here, I think, is to get *something* for as many characters as possible; if there are errors in the data and people who use the transliteration classes find them, then we can patch Core to fix the errors.

Damien Tournoud’s picture

Regarding #13, Unicode doesn't have any specific transformations for "transliterate to ASCII". This is actually rather different than "transliterate to Latin, then transliterate to ASCII". In particular, there is nothing in the Unicode tables that encompasses the concept that "ö should be transliterated as oe for Germanic languages, but o for most other languages".

So, this is rather a dead end. What the Unicode data could be useful for is for generating transliteration tables for scripts we don't have transliteration for, or for scripts we are not sure about the quality of the transliteration data.

jhodgdon’s picture

RE #15 - that is a subtle distinction. Obviously we want to transliterate to Latin and then from there to ASCII, right? And of course there are language differences (that is why we have language overrides)...

But what I'm getting at is this: even if we actually knew the language of some particular text (which normally we do, after all, in Drupal), how would we generate the tables for that language in intl? Which of the 52 something-to-Latin rules would be used, and which other transformations would also need to be applied? There doesn't seem to be any documentation for which rules to use for which language... not that I found anyway. Do you know of some?

Damien Tournoud’s picture

But what I'm getting at is this: even if we actually knew the language of some particular text (which normally we do, after all, in Drupal), how would we generate the tables for that language in intl? Which of the 52 something-to-Latin rules would be used, and which other transformations would also need to be applied?

It's a process described in the LDML documentation (see the "Inheritance" section), which is the specification of the format of the transliteration data (among other things):

  • First, we need to get the maximized locale ID of the language. That means transforming "ja" into "ja_Jpan_JP";
  • Second, we need to find the other scripts for the language; we can use the supplementary data in CLDR for that, and we also need to know that Jpan is an alias for Han + Hiragana + Katakana (but I have no idea where we find this information);
  • Third, we need to look for an appropriate transformation following the inheritance rules. So for example, if we are looking for a "Japanese to Latin" transliteration, we would likely end up selecting any (or all?) of Han-Latin, Hiragana-Latin, Katakana-Latin/BGN as the transformation(s).

jhodgdon’s picture

RE #17 -- That is overwhelming... And the page you referenced doesn't tell me how to figure out which of the transforms in the intl package should be used for each language or each range of characters -- it seems to be more about how these transforms are constructed? Anyway, I guess we could probably go to unicode.org and systematically go through their tables of character codes, figure out what each script is called, and see if there is a corresponding transform in intl that would transform it.

I guess we could also just ask our specific-language experts from localize.drupal.org to provide transliteration tables for their scripts and skip intl. That actually might be a good idea... which would also involve having a list from unicode.org anyway.

I'll work on this a bit...

jhodgdon’s picture

I finally found this "roadmap" document:
http://www.unicode.org/roadmaps/bmp/
which at least gives some clue as to which characters are in which range of Unicode (as opposed to starting at the Chart pages http://www.unicode.org/charts/ and clicking through each script to figure out the character range).

jhodgdon’s picture

Status: Needs review » Active
FileSize
15.67 KB
1.84 MB

So, I went through the "roadmap" Unicode table, and picked out transforms from intl that seemed to correspond to the scripts listed there. A new readdata.php script is attached, and a new differences file.

Some notes on differences:

a) There are more case differences. Aside from conventions (our data mostly follows the convention that if the result is a capital AE we present it as Ae, while other data sets might use AE or Ae depending), it looks like the intl data set is more accurate than ours (at least judging by the actual Unicode characters for some specific code points). So we should patch our data set to fix the case problems. There are about 20 that should be fixed.

b) The transforms do provide some additional transform data that we don't have, such as Greek/Coptic characters in the 0x03xx range, Latin phonetic extensions in the 0x1dxx range, etc.

c) I also noticed that in some cases where both Midgard and intl provide data we don't have, the intl data is more accurate. For instance, there are some "enclosed alphanumerics" in the 0x24xx range:
http://www.unicode.org/charts/PDF/U2460.pdf
and for example, Midgard says that 0x24a0 should be transliterated as "A", but intl has a more accurate (at least so I would gather from the bitmap chart) "(e)". This makes me suspicious of using the Midgard data, unless we can figure out where it came from.

d) I would also be suspicious of any data for the CJK range of 0x33xx - 0x9fxx, because presumably various Chinese languages, Japanese, and possibly Korean would have different transliterations of the same character. That would seem to be a place where we could generate language-specific transliterations (possibly from intl) and should probably leave the generic data tables blank. The Midgard data has a bunch of results in that range that appear to be Chinese (just guessing) -- I think we shouldn't adopt them.

e) We should also not have anything in our tables from the UTF-16 surrogate and "private use" ranges between 0xd800 and 0xf8ff.

I'll see if I can make a patch for some of this stuff. And here are the updated script and output, if anyone is interested. If you have suggestions for additional intl rules that should be included to get a more complete data set, I'm receptive!

Damien Tournoud’s picture

A side note on code point range: UTF-8 allows all the characters of the Basic Multilingual Plane (the range U+0 to U+FFFF) to be represented. Higher code points can be represented using surrogate characters, which basically means that they are represented by two UTF-8 characters paired.

I have *no idea* if our current implementation (using PCRE to split the characters in the UTF-8 stream) deals with those characters higher than U+FFFF correctly.
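
For anyone who wants to poke at this locally, here is a minimal sketch (not from the component itself) showing how preg_split() with the u modifier divides a UTF-8 string that contains a character outside the BMP ("\xF0\x90\x8C\xB0" is the 4-byte UTF-8 encoding of U+10330, a Gothic letter):

<?php
// Sketch: check how PCRE's //u split handles a supplementary-plane character.
$string = "a\xC3\xA9\xF0\x90\x8C\xB0";
$characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
foreach ($characters as $character) {
  // strlen() counts bytes; the last element should report 4 bytes.
  print strlen($character) . "\n";
}
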

jhodgdon’s picture

I am reasonably certain that we are not doing anything sane for those characters that are more than 4 bytes at this time.

Damien Tournoud’s picture

How bad it is depends on how PCRE splits those combined characters. If PCRE does the right thing, it should get us both characters in one match, and our UTF-8 decoder is going to try to parse only the first one. Not sure what the result of that is going to be...

Those supplemental planes do not seem to have a lot of practical applications right now, but we should still make sure we support them in a decent way.

jhodgdon’s picture

Assigned: Unassigned » jhodgdon

I'll look into this... and I realized I forgot to assign this issue to myself, but I'm obviously working on it. :)

jhodgdon’s picture

Regarding characters outside of 4-byte Unicode... It looks like some good test cases would be, for instance, the Gothic alphabet:
http://en.wikipedia.org/wiki/Gothic_alphabet
So I tried adding these characters to the Transliteration test, and I got a database error in the testing system:

General error: 1366 Incorrect string value: '\xF0\x90\x8C\xB0\xF0\x90...' for column 'message' at row 1: INSERT INTO {simpletest} (test_id, test_class, status, message, message_group, function, line, file) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6, :db_insert_placeholder_7); Array ( [:db_insert_placeholder_0] => 2 [:db_insert_placeholder_1] => Drupal\system\Tests\Transliteration\TransliterationTest [:db_insert_placeholder_2] => fail [:db_insert_placeholder_3] => [:db_insert_placeholder_3] =>

jhodgdon’s picture

Ah, it looks like MySQL's utf8 character set does not support characters outside the Basic Multilingual Plane (it stores at most three bytes per character):
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

I'll do something else for the test then.

jhodgdon’s picture

Status: Active » Needs review
FileSize
2.72 KB

I got the test to work. And it passes!

Which is to say that 5-byte characters are working correctly for transliteration (correctly == returning "unknown character").

Here's a patch for the test for the 5-byte character, which should perhaps be on a different issue, but I'm attaching it here so it doesn't get lost, and it should be committed.

After this it's back to improving the tables for the 4-and-fewer byte characters.

Damien Tournoud’s picture

@jhodgdon: if I read the patch correctly, you are not actually testing that 5-byte characters are correctly transliterated, just that they are replaced by ?. Could we have a test that checks whether a transliteration rule added for one of those characters is actually applied?

jhodgdon’s picture

RE #28 - @Damien Tournoud - I am not sure if we can even add rules for 5-byte characters, except, I guess, language overrides, which we could test. Good idea if that is what you are suggesting. (Is that what you were suggesting?)

Damien Tournoud’s picture

@jhodgdon: yes, this is what I'm suggesting, because I'm pretty sure it doesn't work :)

jhodgdon’s picture

Status: Needs review » Needs work

OK, the patch above needs work then. I'll probably have time to make a new patch tomorrow.

jhodgdon’s picture

Status: Needs work » Active

Language overrides for multi-byte characters do actually work. I'm moving this to another issue though:
#1858376: Provide tests and documentation for long Unicode characters
and setting this issue back to "active" since we still need patches for the actual transliteration data sources. Damien: thanks very much for bringing this up!

jhodgdon’s picture

FileSize
17.94 KB
307.78 KB

So, back to where I left off at #20.

Regarding our current data set:
- It doesn't use capitalization consistently in cases where a single upper-case character is transliterated into two ASCII characters -- sometimes it uses AA and sometimes Aa.
- Sometimes it has NULL in there and sometimes '' to indicate an untransliterated character.
- As I mentioned in #20, the intl data set appears to be more accurate about capitalization and characters in general than our data set where they differ.

So here's a first patch, which substitutes the output of intl where it differs from our data. And the PHP script I used to make the patch (though note that the x00 file came out wrong so I deleted that one from the patch). The patch definitely needs to be checked over...

jhodgdon’s picture

Status: Active » Needs review

Status: Needs review » Needs work

The last submitted patch, 1823454-intldata.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
300.91 KB

Dang. Scratch that patch, some of the differences are just lower-case vs. upper-case in the array keys. Let's try this one...

Status: Needs review » Needs work

The last submitted patch, 1823454-intldata-35.patch, failed testing.

jhodgdon’s picture

Sorry, that patch has a problem too. Back to the drawing board (tomorrow probably).

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
300.2 KB

One more try.

Status: Needs review » Needs work

The last submitted patch, 1823454-intldata-37.patch, failed testing.

jhodgdon’s picture

Ugh. Well this will need some more work:
- Some outputs are printing out with invalid PHP syntax (quote escaping problems)
- Some files that have no data in certain areas are not printing out the remaining entries as ''.

jhodgdon’s picture

More work to be done: The missing transliterations need to be output as NULL and not ''. The transliteration code treats these differently: if the entry is '', that empty string gets returned, but if it is NULL, the "unknown character" (which defaults to "?") is returned. I'm not sure whether this distinction is followed consistently in our current data files -- there seems to be a mix of '' and NULL in them, and I'm not sure whether the '' entries really mean "this character should be transliterated to an empty string" or "we don't know".
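
Here is a rough sketch of the distinction described above (this is not the actual Transliteration component code; the function and variable names are illustrative):

<?php
// Sketch: how a NULL entry differs from an empty-string entry on lookup.
function lookup_replacement(array $table, $code, $unknown_character = '?') {
  // A missing or NULL entry means "we do not know": use the unknown marker.
  // (isset() treats a NULL value the same as a missing key.)
  if (!isset($table[$code])) {
    return $unknown_character;
  }
  // An empty string is a deliberate "transliterate to nothing" entry.
  return $table[$code];
}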

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
254.34 KB
18.41 KB

OK! I think I have a viable patch this time. I reviewed the changes (I suggest applying the patch and using
git diff --color-words --unified=0
to review) and they all look like reasonable substitutions and additions to me. This will at least get us in sync with a lot of the intl data set. See the attached script (function read_intl_data()) for details of how the intl data set was generated, and for how the updated data files were generated.

Someone please review this? :)

Status: Needs review » Needs work

The last submitted patch, 1823454-intldata-43.patch, failed testing.

jhodgdon’s picture

Unrelated test failure:
Two entities loaded by uid without caring about property translatability. EntityTranslationTest.php Drupal\system\Tests\Entity\EntityTranslationTest->testMultilingualProperties()

jhodgdon’s picture

Status: Needs work » Needs review
Issue tags: -transliteration

#43: 1823454-intldata-43.patch queued for re-testing.

jhodgdon’s picture

Issue tags: +transliteration

#43: 1823454-intldata-43.patch queued for re-testing.

jhodgdon’s picture

sun, amateescu, and Damien: You are the co-maintainers of the Transliteration system... Can one of you please review this patch so we can get our data set for Transliteration updated? Thanks!

There's also a patch on
#1858376: Provide tests and documentation for long Unicode characters
that needs a review. That one is simpler -- it adds a test and updates documentation on how to transliterate long Unicode characters.

jhodgdon’s picture

Component: base system » transliteration system

New issue component!

jhodgdon’s picture

#43: 1823454-intldata-43.patch queued for re-testing.

jhodgdon’s picture

This patch still needs a review by the other maintainers of the Transliteration system, or at least someone. Please?

jhodgdon’s picture

#43: 1823454-intldata-43.patch queued for re-testing.

jhodgdon’s picture

Assigned: jhodgdon » Damien Tournoud

I'm temporarily assigning this issue to Damien Tournoud... Could you review the patch? It would be good to get this into Core sometime. If you do not wish to review it, maybe you can assign it to one of the other people who volunteered to maintain the transliteration system (amateescu or sun)?

Damien Tournoud’s picture

Assigned: Damien Tournoud » jhodgdon
Status: Needs review » Needs work

@jhodgdon, sorry for the delay.

Could you fix the patch? It is full of console characters and as a consequence was not really applied by the test bot:

[07:19:29] Command [git apply --check -p1 /var/lib/drupaltestbot/sites/default/files/review/1823454-intldata-43.patch 2>&1] succeeded.
[07:19:29] Command [git apply -v -p1 /var/lib/drupaltestbot/sites/default/files/review/1823454-intldata-43.patch 2>&1] failed
  Duration 0 seconds
  Directory [/var/lib/drupaltestbot/sites/default/files/checkout]
  Status [1]
 Output: [error: No changes].
[07:19:29] Applied [1823454-intldata-43.patch] to:
 > no files

Which is apparently a condition that PIFR is happily ignoring.

jhodgdon’s picture

Weird. I never noticed that... or else it got corrupted at some point after I uploaded it... who knows. Anyway, I'll see about making a new patch. I'll have to start over with the script, as it's been 7 months now since I originally posted the patch. :(

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
93.79 KB
18.47 KB
285.71 KB

OK, here's a new patch and a new script (the script only adds one comment compared to the previous version). The patch appears to be fine at this time, but I'll check after I upload it and see what it looks like... that was weird, and I'm sure the patch didn't have that problem when I uploaded it in December.

Anyway. Some review notes:

a) I suggest applying the patch and using

git diff --color-words --unified=0

to review the differences.

b) There are notes in the script about how it was generated. A summary:
- The script reads in all the existing Drupal transliteration data from core/lib/Drupal/Component/Transliteration/data to form the "drupal" data set.
- The script uses the PECL "intl" package to generate a (hopefully) canonical set of transliteration data.
- To use "intl", you have to specify a set of transliterations to do. The ones that were applied are listed in the body of the read_intl_data() function in the script.
- All characters of interest are run through this transliteration individually, to create the "intl" data set.
- Then the script goes through character by character and checks for differences. If the "intl" data set has data where the "drupal" set doesn't, or if "intl" has different data from "drupal", the "intl" data is taken; otherwise we keep the "drupal" data. This logic is in the patch_drupal() function (a rough sketch of the rule follows this list).
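
Here is a rough sketch of that merge rule (the real logic lives in patch_drupal() in the attached readdata.php; the function below is only an illustration):

<?php
// Sketch: prefer the intl value wherever it exists and differs; otherwise
// keep the existing Drupal value.
function patch_drupal_sketch(array $drupal, array $intl) {
  $patched = $drupal;
  foreach ($intl as $code => $replacement) {
    if ($replacement !== NULL && (!isset($drupal[$code]) || $drupal[$code] !== $replacement)) {
      $patched[$code] = $replacement;
    }
  }
  return $patched;
}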

After using the script to make data files, I just copied them to the core/lib/Drupal/Component/Transliteration/data directory and used git diff to make a patch. Just to prove I'm not insane, I made a screen shot of the beginning of the uploaded patch file after I uploaded it. It definitely doesn't have all those weird characters in it currently.

I'll update the issue summary because it's out of date. Meanwhile, here's the patch and script.

jhodgdon’s picture

Issue summary: View changes

update issue summary with additional data sets and notes

jhodgdon’s picture

Issue summary is updated!

amateescu’s picture

I wonder if there's a specific reason not to include the generic 'Any-Latin' transform at the end of the list in read_intl_data().

jhodgdon’s picture

RE #58 - thanks for finally offering some suggestions. I made up that list of which transformations to do myself and didn't know about (or notice?) 'Any-Latin'. Are there any others I have missed? And if we do Any-Latin, do we need all the others? The documentation for the existing transforms is really not all that illuminating...

I am assuming this should go at the end of the list of foo->Latin transforms, right before the Latin->ASCII transform... I can try it out and see if it makes much of a difference.
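
For clarity, the tail of the compound ID would then look something like this (a sketch only; the full chain is in the attached script):

<?php
// Sketch: the generic Any-Latin fallback slots in after the specific
// script-to-Latin transforms and right before the final Latin-ASCII step.
$transliterator = Transliterator::create(
  'NFD; Greek-Latin; Cyrillic-Latin; Any-Latin; Latin-ASCII; [:Nonspacing Mark:] Remove; NFC'
);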

amateescu’s picture

Yes, I think we do need all the others first, and you're correct about where to place it :) It would then be interesting to have an interdiff between the patch from #56 and the new one, to see if it was actually useful or not.

jhodgdon’s picture

OK, here's a patch and interdiff with the Any-Latin transformation added (new script attached also; the only difference is the addition of that Any-Latin line).

It added many more characters to the data set, so on the theory that providing transliterations for characters is better than not providing transliterations, this seems to be an improvement.

amateescu’s picture

Status: Needs review » Reviewed & tested by the community

Reviewed the patch with git diff --color-words and it seems we do have a lot of useful additions for characters that weren't previously transliterated. Since our data will be gradually improved throughout the 8.x cycle by native speakers submitting patches (at least this is what happens in the contrib module), I think this is good to go. Thank you, Jennifer, for not giving up on this patch after such a long time :)

klonos’s picture

#62 +1. Let's commit and improve based on feedback.

jhodgdon’s picture

Just as a note on #62 - We need to be careful about native speakers submitting patches to the core character-by-character data tables. If the additions are specific to one language (such as conventions for o-with-umlaut becoming just o or oe depending on the language), then they should go into language data files, not character data files. We'll have to watch for that as patches are submitted -- we can support both types of data files.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed 9c15199 and pushed to 8.x. Nice work @jhodgdon!

Do you think it's worth opening a followup to add the script to core/scripts?

jhodgdon’s picture

I am not sure about adding the script... Are there other scripts in there that are meant for component maintainers?

jhodgdon’s picture

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Update the issue summary as it was out of date and incomplete

klonos’s picture

Issue summary: View changes
Parent issue: » #567832: Transliteration in core

...testing issue relationships ;)