Problem/Motivation
In order to create automatic machine names, file names, and URL paths in Drupal, we need to transliterate Unicode characters into the ASCII range. This is currently handled by the contrib Transliteration module (http://drupal.org/project/transliteration).
Proposed resolution
Get transliteration functionality into Core.
The idea is to create a factory class, with a reference PHP implementation that does basic transliteration, with a hook allowing contrib modules to supply language-specific overrides of the basic transliteration tables.
There was a discussion of providing an implementation that wrapped the PECL intl's Transliterator class (http://php.net/manual/book.intl.php), which is based on the ICU project (http://userguide.icu-project.org/transforms/general), but as this class was poorly documented and not widely available in standard PHP installations, this idea was dropped.
Remaining tasks
a) [done] Finish the reference implementation.
b) [done] Finish the tests of the reference implementation.
c) [done] Commit this patch.
d) Follow-up tasks and issues:
- #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance
- #1842726: Transliteration component must not contain drupal_alter()
- #1842710: Transliteration: Make language-specific character overrides use YAML files
- #1842718: Use new Transliteration functionality in core for machine names
- #1858376: Provide tests and documentation for long Unicode characters
User interface changes
None. This patch does not provide a user interface.
API changes
No API changes -- this is an API addition that will facilitate getting PathAuto into core, but it doesn't change existing APIs.
Comment | File | Size | Author |
---|---|---|---|
#136 | drupal8.translit-maintainers.136.patch | 578 bytes | sun |
#127 | transliteration-567832-127-full.patch | 459.08 KB | jhodgdon |
#127 | transliteration-567832-127-interdiff.txt | 9.59 KB | jhodgdon |
#127 | transliteration-567832-127-code-only-do-not-test.patch | 19.71 KB | jhodgdon |
#118 | 567832.patch | 458.92 KB | Damien Tournoud |
Comments
Comment #1
snufkin CreditAttribution: snufkin commentedsubscribe
Comment #2
alippai CreditAttribution: alippai commentedWe have to:
Comment #3
JirkaRybka CreditAttribution: JirkaRybka commentedRelated older discussion (that I'm aware of):
#110972: Implement drupal-standard transliteration (resulted in won't fix, unfortunately)
#153574: Add transliteration to core
#43505: Files/URLs are not accessible if they contain reserved or non-ascii characters (and various duplicates linked from there)
Contrib module: Transliteration (edit: Has a bit of good summary on the project page, too.)
Overall, I'm highly +1 to transliteration in core, which is a must for every non-English site.
Comment #4
alippai CreditAttribution: alippai commentedThe main difference is that Transliteration is available now, which implements standardized transliteration.
Comment #5
alippai CreditAttribution: alippai commentedFor machine readable names it's ux issue
Comment #6
snufkin CreditAttribution: snufkin commentedActually this is worse than it looks because completely breaks this functionality for languages like russian. Bumping up to critical with blessing from Dries :)
The transcoded version of эжду3 becomes _3.
Summarizing transliteration comments from other threads: overall transliteration its hard to do, because there are many different contexts for transliteration. Filenames, urls, inline html classes and ids, php functions can contain completely different elements. Most of the earlier issues were focused on filenames and their encoding, this has been progressing in #284899: Drupal url problem with clean urls.
There is a contrib solution: transliteration and a PHP extension.
There is also a good discussion on the subject in #43505: Files/URLs are not accessible if they contain reserved or non-ascii characters but it was marked as duplicate because the focus shifted to file name transliteration instead of a transliteration API in core.
Comment #7
Damien Tournoud CreditAttribution: Damien Tournoud commentedA few more research on this:
[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
Comment #9
JirkaRybka CreditAttribution: JirkaRybka commentedQuick grep on 7.x-dev core tarball shows about a dozen of preg_replace() occurences involving the 'A-Za-z' pattern, mostly replacing everything else with underscores or empty strings for various contexts (let alone another handful of preg_match() calls). All these are potential candidates for transliteration, and can't rely on contrib.
BTW, my site have a lot of broken transliterations already fixed in it's machine-readable names of various things (old CCK fields, propagating also to database tables, css classes, and Views fields, old URL aliases etc.), and had trouble with inaccessible filenames - contrary to what was said about %-encoded utf-8 being legal, some browsers seem to truncate URLs on first occurence of an unusual character, resulting in huge spikes of 404-not found traffic (I observed repeatedly). Although the history, implications, ways to fix, and current (often resolved somehow) state of these problems are very different, the mere fact of their amount, repeated occurences, and impossibility of a clean later fix on existing data, all indicate that transliteration as core feature and policy would be helpful to prevent further problems of this sort being introduced all over Drupal codebase, as well as code duplication.
Comment #10
alippai CreditAttribution: alippai commentedComment #11
Bojhan.core CreditAttribution: Bojhan.core commentedCan you please not apply tags incorrectly? No reason this to be in my queue.
Comment #12
alippai CreditAttribution: alippai commentedI'm sorry, I thought, that D7UX tag is for confusing screen elements (like machine readable names as _ds____c___ - another question is, why do we need to display machine readable names).
Comment #13
sunComment #14
snufkin CreditAttribution: snufkin commentedExcuse me, but why is this a feature request, when it breaks the UI for core drupal for non-latin languages? i understand that transliteration in core is a big thing in itself, and i also dont feel it can be done for d7, but then the auto-generated machine friendly names via JS should be removed.
Comment #15
snufkin CreditAttribution: snufkin commented#576180: Make the machine-name generation pluggeable opened to revert this part of the UI, thus changing this issue to a feature request as sun proposed.
Comment #16
xslim CreditAttribution: xslim commentedHi!
I posted the same issue - #646466: Machine name generating not correctly for languages other than English
maybe we can make a check in JS file - if transliteration module is present - make AJAX call to it?
Comment #19
hejazee CreditAttribution: hejazee commentedFor those languages which have completely non-ascii characters like Russian, Arabic, Persian, Chinese, etc... The machine readable name is hidden from the user !
For example a content type named 'مقاله' will have an empty machine readable name with no option to define one!
I suggest that it's better to show the machine readable textfield in these situation where user has entered something, but this field is empty.
Comment #20
webchickI think the bug you're talking about is #646466: Machine name generating not correctly for languages other than English. This is a feature request.
Comment #21
klonosI don't really see a decent reason for keeping #646466: Machine name generating not correctly for languages other than English around as a bug report against D7 when this feature request here is its solution (or at the very least *a* solution). Well, unless of course this feature request here is not accepted/fixed or if the final solution is not backportable to D7.
Now since #576180: Make the machine-name generation pluggeable is in and while we wait for a solution in core, I believe that this could be addressed in contrib. Did anyone file an issue in the Transliteration project's queue?
Comment #22
klonos...never mind. I just did: #1436596: Support transliteration of machine names
Comment #23
Gábor HojtsyAnybody wants to take on moving transliteration to core in D8?
Comment #24
klonosPerhaps the best thing would be to talk to the respective project's maintainers?
Comment #25
Fidelix CreditAttribution: Fidelix commentedThis issue could be addressed by this implementation, although it is javascript-based:
#1517024: Transliterate accented vowels in machine name strings via JS instead of substituting them with underscores
Comment #26
alexweber CreditAttribution: alexweber commentedTransliteration might be too much of an overhead for core if the gripe here is machine names. +1 for the patch proposed in #25.
Comment #27
Mac_Weber CreditAttribution: Mac_Weber commentedI heard about this thread today chatting on IRC.
As @alexweber said on #26, transliteration seems to be an overhead just for machine names and uploaded file names.
As @Damien Tournoud said on #7, the module Transliteration has many transformations like 'han_transliterate', 'greek_transliterate', 'lowercase_latin'.
Why not separating these transformations per language (or language groups), and downloading them automatically if/when that language is added to the interface? It would help to have less lines of "useless" code.
Comment #28
xjmThe issue in #25 is interesting, but does not address non-Latin character sets.
Comment #29
Damien Tournoud CreditAttribution: Damien Tournoud commentedDrupal 8 onward, we should rely on the
intl
PHP extension. Being part of PHP core since 5.3, this extension is going to be virtually available on all PHP 5.3 installations moving forward.The library relies on ICU which has a very rich set of transformation capabilities.
Comment #30
klonosThe issue summary needs an update by someone that knows where we're headed here :/
Comment #31
amateescu CreditAttribution: amateescu commentedThe problem is that the Transliterator class from
intl
is only available in PHP 5.4, so we still need a fallback for 5.3. I started working on this but there is very little progress so nothing to show at the moment. I will assign this issue to me only when I'll have more time to work on it..Comment #32
sunI wanted to agree with @Damien Tournoud, but @amateescu is right:
http://php.net/manual/transliterator.transliterate.php
Furthermore, there's a user comment stating that intl isn't enabled on some distros:
http://php.net/manual/en/intl.installation.php
Also, there seems to be a base requirement for the external ICU library itself, and it's not clear to me whether that is also bundled with PHP (doesn't sound so) or which OS/platforms have it installed:
http://php.net/manual/intl.requirements.php
Of course, what we could do is to introduce support for it in core, but which only works on PHP 5.4 (or lower with the PECL intl extension installed).
Lastly, regarding #7, Transliteration module's sources, the project page states:
IIRC, at the time when the project was started, there was no generic external user-space library for PHP. MediaWiki's implementation was incomplete, so additional data form Perl was incorporated.
Also relatively important: AFAIK, there have been attempts to contribute improvements back to upstream sources, but that didn't work so well. The transliteration data in the module diverged from the original sources in the meantime.
In general, I hope that @smk-ka can shed some more light on Transliteration module's history, data, and architecture, since he is the main author of the module and has incredible experience in that field.
Comment #33
amateescu CreditAttribution: amateescu commentedI know for sure that the extension is bundled with the official build of PHP 5.4 for Windows.
As far as the data is concerned, I was planning to update Transliteration's tables with fresh data by running PHP's Transliterator on each UTF-8 char :)
Comment #34
Damien Tournoud CreditAttribution: Damien Tournoud commentedThat would actually be a nice approach. That would allow us to use the faster native
intl
when available and fallback to a pre-generated map when not. I assume the pre-generated map will only work for NFC (so it will be slightly less robust / secure thenintl
), but it doesn't really matter as we currently assume that all input is in a proper NFC form.Comment #35
klonos...any news here? As I said back in #30 it would be nice if the issue summary got updated so that everyone can tell what needs to be done here. Perhaps even help.
Comment #36
andypostAny ideas about implementation? mostly in context of D8 plugins?
Also we really need a mention of @smk-ka
EDIT: summary changed a bit
Comment #36.0
andypostUpdated issue summary.
Comment #36.1
andypostUpdated issue summary.
Comment #37
sunI've revamped the issue summary with the actual, current proposal.
Comment #38
klonosThanx ;)
Comment #39
amateescu CreditAttribution: amateescu commentedI had some time to work on this for a couple of days during Drupalcon Munich and was really enthusiastic about the result.. until I hit the last part of this patch: updating the transliteration tables with ICU data.
It was only then that I realized that the Transliterator class from PHP 5.4 (or the 5.3 pecl extension) is not a silver bullet for our US-ASCII transliteration needs :( Short story: the Latin-ASCII script doesn't work as expected, it leaves some (read: a lot of) non-ASCII characters untransliterated.
I'm out of energy for this patch so I'm posting what I have so far, hoping that someone who has more experience with ICU's implementation can either bring it home or at least shed some light on it.
The first patch contains only the review-able code and the second one is the full patch that also has the current data tables from the Transliteration contrib module.
Marking as NR to get some attention on the patch, but it's really NW.. I know there are some doxygen blocks missing, and I also didn't get to the part where I wanted to switch the data tables to JSON for better memory management and also for making a native Javascript implementation possible.
Comment #40
chx CreditAttribution: chx commentedOur approach here should be to get something that works well enough into core. I am extremely hesitant to rely on intl, it's PHP documentation is nonexistent and hosting support is very spotty.
Here's a quote about transliteration itself
.
So let's get something practical in which these data tables definitely look like into core and then we can get pathauto in which everyone would love.
Comment #41
andypostAwesome! icu + fallback covers mostly all
@todo in code could be resolved in follow-up
Also this require tests and review form current maintainers of translitiration
Suppose transliteration makes no sense when site in English (no language module enabled)
Comment #42
jhodgdonSo, chx and I worked on this together in IRC today, and here's a new patch. It definitely still needs tests, and I may find some time to work on them sometime soon (in which case I'll assign the issue to myself temporarily)... this patch is totally untested and may even have PHP errors (I guess the bot will take care of at least verifying that maybe?)
What we discussed and I did:
- Added a bunch of documentation.
- Use the quick return value if string is all ASCII (the functions were ignoring it)
- Split some methods up into smaller methods
- Made some in-method static variables into class variables - they may or may not want to be class statics, depending on if memory usage is more/less important than page load time?
- Got rid of the "ICU" class, because we don't think the PECL class is reliable and it's definitely not documented well
Here's a patch and an interdiff... chx: your move!
Comment #43
jhodgdonOK, there are almost certainly some code problems in this patch. :) Anyway, I'm taking on the next step, which is making some tests, which should get it at least to the "this runs" stage.
Comment #44
jhodgdonHere's a new patch. The code is essentially the same, except I fixed a few static variable problems. And I added a couple of tests for the basic functionality. And fixed some stray characters I put by mistake into the data files... It seems to work...
Now it's probably time for a real review and some more comprehensive tests and some attention to the data tables? maybe chx now?
(I think maybe the interdiff is going the right way now. One goes back to my last patch, and the other goes back to the patch in #39)
Comment #45
jhodgdonSome thoughts about the data arrays (language-specific overrides and generic tables for transliteration):
- There are language-specific overrides for Spanish and several other languages in language_overrides.php, but I am not sure they are really actually all different from the generic character overrides? We should see if they are and take the ones out that are not; we should also verify the ones that are in there.
- We might want to consider re-architecting this slightly. Currently, the language overrides for every language are read in from one file, but I can imagine we might have a lot for languages like Chinese and Japanese if we were to do this right? I think they use the same characters but would have different transliterations of them? That is likely true of Cantonese vs. Mandarin too... Probably doing them is really kind of impossible? Anyway, it could in principle be kind of inefficient to read in all language overrides in this constructor when the constructor is language-specific? Probably we should only read in the ones that we need, and they should go into a non-static variable too.
- In the tests, I only included a couple of examples for language overrides (which maybe will be in the set of ones taken out?) and of generic characters (only for European languages). There should probably be more test lines added (it's easy to add them, should be pretty obvious) -- in particular, some that go above the 0x00 set of Unicode characters, and some two-byte characters probably?
Thoughts?
Comment #46
henribergius CreditAttribution: henribergius commentedI've done a transliteration library that we use in Midgard: https://github.com/bergie/midgardmvc_helper_urlize (available via Composer). Would be great to collaborate!
There is also a very nice library for Node.js that has better transliteration tables than what we have. They're available as YAML files, and so possibly also usable in PHP.
Note: Midgard has been using this for transliteration since 2007.
Comment #47
jhodgdonRE #46 - thanks henribergius! The classes here for transliteration will work well for Drupal's framework, I think... Our main need will be for the data (tables of "character x in language y should be transliterated to z"). Do you have some good data that has been vetted for various languages with linguists or native speakers or something that you can point me to in that repo?
Comment #48
jhodgdonRE #46 - chx and I took a look at the code on https://github.com/bergie/midgardmvc_helper_urlize -- it looks very familiar (obviously came from the same root source/algorithm some time back). Our code mainly differs in that it has the ability to do language overrides, so henribergius you might want to look at that (it's in class PHPTransliterate in the somewhat huge patch (huge because it includes the data files)).
Anyway, chx and I think we'll want to keep our code, which is also integrated into our container/factory stuff.
I'm working on this again today. Plan:
a) The language overrides are currently all having to be in one file. This should be split up by language instead of loading all languages when they aren't all needed (each instance is only one language).
b) We are never going to have full data for all world languages and dialects in core. So, we need to add a hook that would allow contrib to supply language overrides.
c) We now know of 3 sets of transliteration data:
1. The set in this patch.
2. The set at https://github.com/bergie/midgardmvc_helper_urlize
3. The set from node.js at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data
4. And there is probably some open-source data in the Pecl library that we aren't after all using?
(OK, that's probably 4 sets of data.) Anyway, it would be good to understand what the differences are between these data sets, which shoud be fairly easy to analyze.
Comment #49
jhodgdonOh, and by the way the country/language code for Danish/Denmark is dk, not da as was in the patches above. Hmph.
Comment #50
jhodgdonAlso, the Spanish language overrides, plus some of the German and Danish ones, did not actually override anything (they are standard accented characters done the standard way), so I got rid of those...
Anyway, here's a new patch with:
- hook_transliteration_overrides_alter() [with docs -- I added it to language.api.php for now at least]
- language overrides in separate files, without the extra characters that are not actually overrides, and with Danish having the right language code on the file name.
- A better test that tests language overrides specifically with three languages with different override files (including one without any)
Still to do: look at the differences in the data files (#48c).
Comment #52
tstoecklerI have not yet looked into how this data is generated, the different sources, etc., but one thing that I notice is that there is a missing replacement for the German language: The German "ß" letter should be transformed into "ss", i.e.
Don't know if that helps, but thought I'd point it out.
Comment #53
chx CreditAttribution: chx commentedIf you check the x00.php table you can already see it's an ss because this is not language dependent. language overrides are necessary when it's language dependent say ö can be oe or o. Yes, it's german-language specific but if it's specific to one language it never needs an override, actually.
Comment #54
jhodgdonLooks like that patch needs a reroll, surprise surprise (cough. views. cough) :)
Here's the reroll, straight reroll, no other changes.
Comment #55
jhodgdonduh.
Comment #56
tstoecklerRe #53: Thanks very much for the explanation, and sorry for the noise.
Comment #57
tstoecklerSorry, that was a x-post.
Comment #58
jhodgdonSome notes that I think we should add to the documentation of the hook and/or class in the next patch:
Transliteration is the process of translating individual non-US-ASCII characters into ASCII characters, which will always be inexact and language-dependent. For instance, the character Ö (O with an umlaut) is commonly transliterated as O, but in German text, the convention would be to transliterate it as Oe or OE, depending on the context (beginning of a capitalized word, or in an all-capital letter context).
The Drupal transliteration process uses a database of generic character transliterations and language-specific overrides. Drupal Core provides the generic transliteration character tables and overrides for a few common languages; modules can implement hook_transliteration_overrides_alter() to provide further language-specific overrides. Characters without default or overridden transliterations are transliterated to the defined "unknown character" in output, which is '?' by default. However, character context (such as all-capitals vs. initial capital letter only) is not taken into account, and in transliterations of capital letters that result in two or more letters, by convention only the first is capitalized in the Drupal transliteration result. So, the process has limitations; however, since the reason for transliteration is typically to create machine names or file names, this should not really be a problem.
Comment #59
chx CreditAttribution: chx commentedEmphasizing the 'dumbness' of the converter IMO is important: The Drupal transliteration process works on one character at a time. The ASCII transliteration can come from generic character transliterations or language-specific overrides but even those are applied to one input Unicode character a time.
Comment #60
Mac_Weber CreditAttribution: Mac_Weber commented@jhodgdon on #58 you are concerned about capital letters. As machine-names are now always lowercase, why not just transliterate everything to lowercase?
Also this patch does not seem to fix the file
misc/machine-name.js
.It seems that the best thing to do is to re-open #1517024: Transliterate accented vowels in machine name strings via JS instead of substituting them with underscores and include a bigger table, such as this one, but using only its lowercase part.
Comment #61
nod_Let's take the JS issue separately/later, we should keep it in mind but it's really not the hardest part here :)
Comment #62
jhodgdonNew proposed documentation (I think this should go in a @defgroup probably) to make some more things clearer:
Transliteration is the process of translating individual non-US-ASCII characters into ASCII characters, which specifically does not transform non-printable and punctuation characters in any way. This process will always be both inexact and language-dependent. For instance, the character Ö (O with an umlaut) is commonly transliterated as O, but in German text, the convention would be to transliterate it as Oe or OE, depending on the context (beginning of a capitalized word, or in an all-capital letter context).
The Drupal transliteration process transliterates text character by character using a database of generic character transliterations and language-specific overrides. Drupal Core provides the generic transliteration character tables and overrides for a few common languages; modules can implement hook_transliteration_overrides_alter() to provide further language-specific overrides. Characters without default or overridden transliterations are transliterated to the defined "unknown character" in output, which is '?' by default. However, character context (such as all-capitals vs. initial capital letter only) is not taken into account, and in transliterations of capital letters that result in two or more letters, by convention only the first is capitalized in the Drupal transliteration result. So, the process has limitations; however, since the reason for transliteration is typically to create machine names or file names, this should not really be a problem.
After transliteration, other transformation or validation may be necessary, such as converting spaces to another character, removing non-printable characters, lower-casing, etc.
Comment #63
jhodgdonI spent a while looking around for transliteration data today. What seems to be out there:
a) php.net/pecl extension is basically a wrapper on the ICU project's C++-based Transform code. Some links and info:
- overview http://userguide.icu-project.org/transforms/general
- info on their data http://userguide.icu-project.org/icudata
- data grabber tool http://userguide.icu-project.org/icudata -- but I downloaded the transliteration data and it is not in a format that is human-readable or PHP-readable in any obvious way (it's a binary file of some sort that is set up specifically for their classes).
So I don't think we can use their data in any obvious way.
b) our data -- no idea where that came from but maybe amateescu can illuminate us on that.
c) The clean URL generator for Midgard from https://github.com/bergie/midgardmvc_helper_urlize
d) The node.js stuff at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data
So it looks like it's down to comparing b, c, and d, unless someone has an idea on a.
Comment #64
jhodgdonOh, I finally located the source code for the Pecl/ICU things: http://source.icu-project.org/repos/icu/icu/trunk/source/data/translit/
but they are in rule format so I don't think we can really use them still.
Comment #65
andyceo CreditAttribution: andyceo commented#54: transliteration-full-54.patch queued for re-testing.
Comment #66
andypostRussian translit table should be checked as suggested (russian)
Comment #67
sunI did not closely follow this issue recently, so I'm not sure what exactly "our" data refers to, but the Transliteration module's original data sources are documented on its project page:
EDIT: UtfNormal source code
EDIT: Unidecode data source files
Those two sources seem to be missing in the list.
@amateescu told me in Munich that he wanted to try to auto-generate the mappings by passing every possible Unicode character through an ICU-enabled PHP 5.4 installation, but I'm not sure whether that happened, or whether the data is still the original from Transliteration module.
Comment #68
jhodgdonRE #67 - that idea of passing every unicode character through ICU is exactly what occurred to me during the long drive home yesterday. :) So, I plan to do that.
RE #66 - if someone could take the time to make a string with the Russian alphabet (both lower and upper case, and it can be all smooshed together like 'AaBbCc' or separated or whatever) and a second string with the transliterated Russian alphabet in it (with the same spacing), I'd be happy to add it to the tests. Note that if you have a two-letter transliteration, like ц -> ts, when it is capitalized it should be Ts not TS. I could make these strings myself but I don't have a Russian keyboard and besides that I have no idea what the capital letters are. Well, a small idea (I did learn some Russian a few years ago), but not much. :)
That said, if this Russian test doesn't get created, I am not too worried. We are simply not going to have tests for all of the thousands of characters in hundreds of languages.
Comment #69
jhodgdonOh yeah, the other thing that needs to be done is a test for the overrides hook. So the current tasks are:
a) Check about 5 data sources (see comments above) and see what the differences are.
b) Probably we want to trust the ICU source, so if possible, make ours match theirs.
c) Maybe add tests for other languages, if people want to supply test strings containing alphabets.
d) Add a test for the overrides hook, using a test module.
e) Add the documentation from #62.
Comment #70
KartagisTwo sentences in Turkish alphabet, containing non-ASCII characters.:
Abayı serdiler bize. Söyleyeceğim yüzlerine.
Sanırım hepimiz aynı şeyi düşünüyoruz.
Regards,
Comment #71
jhodgdonRE Turkish sentences in #70 - can you also please provide the correct transliterations of these two sentences? I could guess, but then it wouldn't be a very good test. :) Thanks!
Comment #72
henribergius CreditAttribution: henribergius commentedTeşekkürler! Added this to https://github.com/bergie/midgardmvc_helper_urlize/commit/2e1de3971460a1... and the tests pass :-)
Comment #73
jhodgdonI think that the transliteration data can be split off into a separate issue -- we should be able to commit this patch with "good enough" data (the current data), and then patch later with better data. So I filed:
#1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance
Meanwhile, I will update the issue summary with a link to this new issue, and I am today taking on:
- adding to the docs
- a test of the alter hook
which I think will get this issue to "ready for commit" stage. Coming shortly...
Comment #73.0
jhodgdonRevamped issue summary.
Comment #74
sunI'm totally down with the proposal in #73. The transliteration data has to be maintained anyway.
Erm, now that I say this...
The actual "blocker" for committing this is "one or more maintainers" who are willing to take on that job of transliteration data maintenance and have at least some minimal experience with the art of transliteration.
Changes to the transliteration data always need to be reviewed and verified by "locals", so the actual job of maintaining the data boils down to 1) ensuring test coverage as needed and appropriate, 2) committing.
Commits are handled by core maintainers already (and I don't see why @jhodgdon couldn't also commit transliteration data changes).
The job of transliteration data maintainers would be to check + review + discuss related open issues for "basic transliteration data changes" compliance. That does not involve changes to the Transliteration PHP component at all, only the data.
Comment #75
jhodgdonOK, I've taken care of a test for the alter hook. There's a new module system/tests/modules/transliterate_test, and a new line added to the test that verifies it.
I also added a defgroup with the documentation in #62. This is in common.inc, and the classes/hook have @ingroup added so they are part of this topic/group.
I also had to change the test to be WebTestBase instead of UnitTestBase in order to test the alter hook (module_invoke type things do not work in UnitTestBase).
Other than that, this patch is the same as the last few. And I think it may actually be ready to go in now!
Comment #76
jhodgdonwhoops, that interdiff file should have been named .txt... sorry testbot!
Comment #78
jhodgdonRE #74 - I think we should adopt the ICU group's base data (see #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance) and call it What The Outside World Uses and Good Enough.
At that point, if we want to get language-specific overrides in core, that's a matter of adding a small xx.php file for the language code xx (if that language team agrees). Or they can also have a contrib module that provides the overrides, especially if it's a matter of "in this country, we do things a little differently".
Also, as sun says, that does not affect the functionality at all... Of course if the ICU updates their data, we should be able to patch core pretty easily (we could even provide a script to see if there are updates, which should be an output of that other issue anyway).
Comment #79
sunAh, ok, now I understand the actual intentions of #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance -- updated that issue's summary accordingly.
However, just to clarify:
If the data in this patch is really the same as in Transliteration module, then it is actually derived (forked) data. That starts with the two original data sources. But the transliteration data has been adjusted and improved by users and contributors of Transliteration module over the years. (@see fixed issues)
That inherently means that this data does not equal any external/upstream resource. Which means we'd have to maintain it ourselves. It also means that this current data might contain improvements that may be missing in a new data source, which could be interpreted as a "regression."
We obviously, ideally, don't want to maintain this data if we don't have to. (Although it already crossed my mind that it would be crazy cool if http://localize.drupal.org would have a Transliteration UI ;)) [And if all fails, since this is a component now, we could as well just move the component onto github and make it a true go-to source for all PHP applications.] Anyway, let's defer that investigation to #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance.
Comment #80
KartagisDear jhodgdon,
Correct literations of the Turkish sentences in #70 would be:
Abayı serdiler bize. Söyleyeceğim yüzlerine. ---> Abayi serdiler bize. Soyleyecegim yuzlerine.
and
Sanırım hepimiz aynı şeyi düşünüyoruz. ---> Sanirim hepimiz aynı seyi dusunuyoruz.
Comment #81
jhodgdonThanks Kartagis! We could add this Turkish to the tests now... I would love it if someone could review the patch though?
Comment #82
sunIs it possible to provide a do-not-test.patch for review that does not contain the transliteration data?
Comment #83
jhodgdonSure! Here is the patch without the data.
Comment #84
sunThanks!
This code still seems to be largely based on the original UtfNormal.php (some lines are actually identical to ::quickIsNFCVerify()), so we should at least mention this is in the phpDoc of the class (UtfNormal is GPL2+), and actually, I'm not a licensing expert, but don't we have to take over the author/copyright statement, too?
Shouldn't this be moved into __construct()?
I realize that the original code comments (mainly originating from UtfNormal) are not really proper, but some of inline comments here have been added by us. Quite a couple of them articulate emotional quirks/frustration (talking with your code, eh? ;)), but do not actually explain what is being done and why.
This is extremely complex code logic that only a handful of people are able to dissect and understand. We really need proper and appropriate code comments to allow new developers who are reading this code in the future to understand what is going on, and why the logic was implemented in *this* way and not another.
However, to clarify: If the comment is literally contained in the original UtfNormal, then we should keep it as is. (!)
Comparing the above outlined comments once again with UtfNormal, only the first two are ours.
That said.
Given all this, I truly wonder why we don't simply incorporate MediaWiki's original UtfNormal.php as is?
It seems to be actively maintained.
For now, we could just simply dump the class somewhere and trick it into our autoloader and/or make the Drupal/Component/Transliteration/PHPTransliteration class load it from within its namespace directory. Ideally, of course, we'd ask MediaWiki to convert it into a PSR-0 component (they require PHP 5.3 already).
That would mean we wouldn't maintain this UTF-8 normalization/transliteration code at all and only pull in updates from upstream. (Frankly, I don't know why @smk-ka decided to fork the code instead of keeping the class as-is when we originally invented the file_translit/transliteration modules — I can only assume it might not have been actively maintained back then.)
Wouldn't that make a tonne more sense?
We need some comments here.
I'm not able to locate this snippet in UtfNormal. There are a couple of snippets that seemingly look similar, but are using different ranges/conditions.
This chunk also looks identical to the original code snippet in UtfNormal::quickIsNFCVerify().
Any chance to add at least a few inline comments here?
We can use __DIR__ in Drupal 8.
I really don't understand why we need a factory for this. We have an interface. That should be more than sufficient.
If someone wants to swap out the implementation he/she can swap out the entire service, done.
I also don't see why we would need a new class instance for a different language and/or different $unknown_character parameter. The latter actually seems to be a bug, since there's a good chance for calling code to ask for a different $unknown_character, depending on what it wants to do, but still for a language that was requested before already. No need to reinstantiate and reload the entire thing for that.
So I really think we should just get rid of the factory.
Comment #85
jhodgdonsun I think you forgot to change this to "needs work" based on your review.
At this point, I am bowing out of anything further on this patch. I was brought in to document and add tests to an existing patch (the one on #39), which was my starting point. I didn't create the code, and I made minimal changes to the code in the process beyond what was necessary to get it to pass the tests and a few other minor refactorings. I have no knowledge of where the code came from before I stepped in, and I have no opinion on whether we should adopt a 3rd-party component, whether we need a factory, etc. So I cannot address the review comments from #84.
I will just note that the reason for the factory was that in the original patch, it had a choice of using the PHPTransliterator class or one that directly used the ICU PECL class; we took that option out (see discussion above). If we want the ability to use that in the future, we'll need a factory; if we don't care, we can get rid of the factory.
Anyway. Good luck with this effort. I'll go ahead and un-follow at this point and leave it to you to hash out what should be done. Beyond my sphere...
Comment #86
chx CreditAttribution: chx commentedComment #87
sunI don't know why this is assigned to me. I provided an in-depth review and feedback already.
Comment #88
chx CreditAttribution: chx commentedSee #85 why.
Comment #89
sunI do not know the answers to the questions, because I asked them. Asking me to answer them is a logical error.
I'd love to move forward here, but I do think that #84 raised a range of critical issues with the current implementation and possibly direction. Although I silently followed this issue, I do not know who effectively authored the current code, and in turn, who'd be able to address the issues and questions that have been raised.
Comment #90
chx CreditAttribution: chx commentedOh well, no pathauto for Drupal 8 then. One release can't have everything. Pity.
Comment #91
cweagansWas there a reason for adding that factory in the first place?
Comment #92
amateescu CreditAttribution: amateescu commentedAfter reading the review from #84, I think we have
twothree actionable points here:1) remove the factory, it is indeed not needed anymore
2) investigate if we can bring in UtfNormal.php
3) (minor) yep, use __DIR__, my bad :/
If the answer for 2) is no for whatever technical/legal reason, we should document that our code is based on it and clean up *our* comments. About putting more comments on *their* code, well.. I certainly don't feel up for the task, so feel free to suggest them.
Comment #93
Dries CreditAttribution: Dries commentedI think sun raised valid questions in #84, but I don't think they are "critical". I'd categorize these as possible code clean-ups. Would be nice if we could do these before a commit, but based on what I read, I don't think they are showstoppers.
Comment #94
jhodgdonRE #92 - I took a quick look at the UtfNormal.php file. I don't think we want to incorporate it directly -- it's basically doing the factory thing (either using ICU classes or its own code) as well as the transliterate thing all bundled up in one, and as such it's overly convoluted IMO.
But if the code in the patches submitted above came from UtfNormal.php (and the Transliteration project page indicates that is the case), then we should acknowledge this in our file header and/or copyright.txt. It should be fine -- it's GPL -- but it wasn't previously acknowledged in this patch.
Regarding the factory, it still seems like a good idea to me to have the factory class, if only because I believe it makes it easier for a contrib module to swap out for an ICU-based class (or something else perhaps). At least, I think it would, and I think this is desirable. Thoughts?
That leaves swapping out _FILE_ for _DIR_, which is easy enough to do.
Meaning the To Dos would be:
a) Swap out _FILE_ for _DIR_ (quick/easy/not all that necessary anyway, but OK).
b) Add copyright notice for the code provided by UtfNormal.php.
c) Maybe add some in-code comments to further clarify what is being done (I did read/understand the code so can probably do that).
Thoughts -- is this a reasonable course of action? Anything else that needs to be done?
Comment #95
jhodgdonchx and I chatted in core about the question of the factory.
We came to the conclusion that the main advantage of using a factory class is that the transliteration class can be transient, and once the transliterate call is done its memory can be freed. But the typical use of transliteration is to create machine names (typically once per page load and typically it's an AJAX call so this is the main/only purpose of the page load). So we probably do not need to worry about freeing up the memory and can instead get rid of the complexity of the factory and in CoreBundle just do:
container->register('transliteration', 'Drupal\Component\Transliteration\TransliterationClass');
instead of using the factory (with the corresponding changes to the transliterate() wrapper function in common.inc.
So... I will go ahead and make that change along with a-b-c in #94 and upload a new patch.
Comment #96
jhodgdonOK, here's a new patch... it's different enough that I do not think an interdiff is all that useful. Changes:
- No factory class. No base class. There is just the interface and the PHP implementation class.
- The class is now not language-specific, but the transliterate() method is.
- I separated the data table loading out into separate functions instead of mixing it into the replace() method.
- Added a test for Turkish (see above text provided by Kartagis; note that the transliteration supplied still had one non-ascii character in it but I fixed that).
- Updated the documentation.
- Use __DIR__ instead of __FILE__
- Add copyright attribution for the MediaWiki class.
- Updated/added some in-code comments (still some of the code is a bit obscure and not sure how to comment it).
I've included one patch with all the data, and another with just the code... hopefully got everything!
Comment #97
sunThat looks much better. Thanks.
There are some minor details that could be improved (such as clarifying the differences to UtfNormal in the class' phpDoc), but I think those can be addressed in a follow-up issue/patch.
Comment #98
Damien Tournoud CreditAttribution: Damien Tournoud commentedThe character-by-character approach is way insufficient to result in useful transliteration for most languages, so we should focus on building a intl-based transliterator in a follow-up issue.
Comment #99
jhodgdonRE #98 - I don't think the PECL intl classes are going to be a good approach. As noted above in more detail:
- The classes have very little documentation
- You have to know which transforms to use for each language, which comes from the ICU project and/or the Unicode.org project, and this also has very little documentation (see #1823454-5: Verify transliteration data sources and their quality, and potentially eliminate maintenance for more on this).
- This extension is not bundled with PHP in general, and is not widely installed.
The approach here, while not a perfect solution, should be good enough:
- It has the ability for language overrides built in, so we can add those in Core if we know what they are.
- Contrib modules can also use a hook to provide additional language overrides or to override Core's overrides.
- And keep in mind that this is not meant to be a perfect solution anyway -- the idea is to provide usable defaults for machine names and URL aliases, which users can always override if they are less than perfect -- we are not claiming to be a good way to transliterate prose.
Comment #100
Damien Tournoud CreditAttribution: Damien Tournoud commented@jhodgdon: I'm not talking about removing the fallback implementation.
intl
is going to be the best source of transliteration data that we can get our hands on. The extension is not very well documented, but it's just a thin wrapper around ICU. As I explained in the other issue, it's going to be possible (but will require a little work) to use it when available.In the meantime, I am perfectly happy with this issue getting in, and the
intl
-base implementation handled as a follow-up issue.Comment #101
jhodgdonSorry about the earlier status change from RTBC to CNR (unintentional).
Comment #102
sunYes - for now, this is good to go. The implementation is simple enough and swappable. The actual transliteration data needs to be re-evaluated anyway.
That said, even if that won't happen in time for D8, then that would only mean we'd ship with the same transliteration data quality as the contrib Transliteration module does right now, and apparently, that served at least certain users and languages well. So even "if all fails", this state is a considerable improvement over status quo.
Comment #103
catchNot sure we need this helper function, especially since it's not being used anywhere yet, could just use the container directly?
Why is_string() instead of isset()?
This is frightening code but presumably 1. it'll rarely change 2. if we do eventually use the original library again or subclass it we won't have to directly maintain it
It's a little bit odd that core language support is in these raw PHP files and contrib has to implement hooks. Is this something we could maybe move to YAML files at some point? Then contrib could provide those and they could be merged in? There's not currently a nice generic way to grab YAML files so definitely a follow-up - not many modules are going to be implementing this stuff and the code doesn't run in the critical path.
Comment #104
jhodgdonThanks for the review catch ==> just talked in person (woohoo, BADCamp!)... I am going to:
- See if the above are bugs or have reasons ==> fix or document why, depending
- Remove the common.inc wrapper, move the @defgroup to language.api.php, and put the one-liner from the common.inc wrapper into that @defgroup documentation.
Will get to this shortly...
Comment #105
jhodgdonRE #103 - thanks again for the review! Addressing the comments in order:
- I removed the common.inc wrapper function, and moved the @defgroup to language.api.php. I rearranged the defgroup documentation slightly, and included basically the body of that wrapper function in it to demonstrate how to instantiate the class and transliterate text. Check the language.php segment of the patch to read the revised docs.
- I made the suggested isset change (I was just trying to be careful since I need it to be a string, but it's documented so it will just throw an error if a non-string is passed in, but people in the sprint room here assure me we aren't normally that careful. :) ).
- I'm not sure if you were commenting on the overall scariness of the code or just these specific lines:
but I realized that these particular lines were actually wrong. This regexp is checking specifically to see if there are any characters in the string in the range of 128-255, which is to say, things like a-with-umlaut. But it doesn't check for things like "some two-byte Japanese character" being in the string... So I just took this particular early-return out -- I'm pretty sure it's just not the right check to be making in the general case of a string of possibly multi-byte characters. It's just a small "let's try to be efficient" thing anyway -- probably best to sacrifice a bit of efficiency for certainty that the code will work correctly.
Regarding the overall scariness of those weird loops, I totally agree. That is the part of the code lifted from Mediawiki and it is definitely hard to follow... I tried to comment most of it but it is definitely dense and yes, unlikely to need to change very often. Some more tests with some Arabic or Japanese or something might help us become more confident in it, but we would need some test language samples in order to do that.
- That's an interesting point on language files (core) vs. alter hook (contrib)... and you might also (fairly) note that the format of the language files is not much the same as the format of the generic files... YML for language overrides could be OK... Follow-up? Is it crucial? If you think so, it could be done. In d7 idiom, an alter hook seems like definitely the way to go; in D8 I am not too sure so would defer to your judgement. Thoughts?
Anyway, here's a new patch. The only changes are:
- common.inc has no wrapper function [had to take this out of the test too, obviously]
- defgroup has been slightly rewritten and moved to language.api.php
- isset instead of is_string as suggested in #103
- took out the above mentioned early return check.
Comment #107
cweagans#105: transliterate-567832-105-full.patch queued for re-testing.
Comment #108
cweagansI think we're good to go here. The code looks nice and works well, has tests, and catch's feedback has been addressed.
Comment #109
Damien Tournoud CreditAttribution: Damien Tournoud commented^ Site note: this was perfectly correct. By definition if you only have
00-7F
bytes in a UTF8 string, it doesn't have any multi-bytes characters.Comment #110
jhodgdonRE #109 - really? OK... well we could put that back in. I definitely wasn't sure about that.
Comment #111
webchickDarn, sounds like this still needs a little bit of work. Will be thrilled to see it RTBC again! :)
Comment #112
Damien Tournoud CreditAttribution: Damien Tournoud commentedHere, I rewrote your patch. You are welcome. #goingsun
Comment #114
Damien Tournoud CreditAttribution: Damien Tournoud commentedRemoved more junk.
Comment #116
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is failing on PHP 5.3, because of an implementation detail:
I'm not sure how to proceed. We should be able to
use ($this)
, or do we need to alias$this
to another object?Comment #117
jhodgdonThe tests I had written all passed with my version of this patch. Can you make an interdiff so we can see what has changed at least? Or post the code on line 95?
Comment #118
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis one should pass on PHP 5.3. I don't know why anyone would use anonymous functions on PHP 5.3, the handling of
$this
make them useless in an object context.Comment #119
jhodgdonUmmm...
I stopped reading the interdiff here... Can you explain in outline what you've done in your changes? In the previous version, there was basically one transliteration object being created per page request, and it handled all languages. Apparently you're doing something totally different... can you explain what's going on? You'll also need to update the @defgroup documentation (at least the code sample in it) and the documentation for the classes, since you have apparently changed them quite a bit.
Anyway, this is apparently not my issue any more (which is good, since I really was only going to document and test it, which I did). But the documentation is now not in synch with the code... So, re-assigning it and marking it "needs work" for documentation at leats.
Comment #120
Damien Tournoud CreditAttribution: Damien Tournoud commentedNothing changed in the behavior of the class, and the code comments are 100% accurate. Nothing I need to do here.
Comment #122
jhodgdonOK, I went back and looked through the whole patch. It looks like Damien has rewritten a lot of the Mediawiki code, which is fine.... But since I didn't fully understand all of that code before, I also don't understand Damien's rewrite of it, and I'm unable to review the patch. Some in-code comments would really help in a few places in the rewritten PHPTransliteration class to explain the following... or at least put some comments here so someone else can update the code documentation:
1.
What is this preg_replace_callback actually doing? An in-code comment would be helpful.
2. Apparently there is no Mediawiki code left here (is that true?) in which case we can probably take out the attribution at the top?
3.
I had to ask Damien in IRC what "code point" means... and the doc should start with "Converts" anyway... How about changing the first line to say "Returns the numeric character code of a UTF-8 character." and not using the industry-specific "code point" term that may not be familiar to non-experts? That's assuming that this is what the function is really doing...
I guess I don't really get this anyway. Isn't the character code just the same bytes as the character string? Why is all of that munging necessary? Again, some documentation would be useful.
Comment #123
jhodgdon(the test failure above is due to HEAD being broken on Views, not from this patch).
Damien asked in IRC for a review of the approach. I think it's probably a good approach -- I like the idea of a preg_replace_callback, although it does complicate things a bit in that you have to basically have a class-global variable there to keep track of the parameters of the "current" transliteration... This would not work if the same class was being used from multiple places in a multi-threaded environment (the DIC will only instantiate it once)... Is that even possible in PHP? Probably not, but such things always scare me from my past as a Java programmer.
So... basically the "current state" variables are scary to me, and I think this patch needs some extra documentation, but I guess the approach is OK. I would feel a lot more comfortable without the need for the "current state" variables.
Comment #124
Mac_Weber CreditAttribution: Mac_Weber commented@jhodgdon the regex
'/([\x{0080}-\x{ffff}])/u'
can be replaced by something like this in PHP:Not sure if it really helps.
Comment #125
jhodgdonSo... a few more thoughts on the current patch:
- If you used preg_split (or mb_split?) instead of preg_replace_callback in the transliterate() method, I think you could avoid having those scary (to me) state variables on the class. I'm assuming in D8 we can count on the mb* functions being available, right?
- I was looking at the comments on the php.net page for the ord() function, which only works for ASCII or at least only for 1-byte characters, and there someone suggested that to convert a UTF-character into its integer code you can just do hexdec(bin2hex($char)) -- which is a lot simpler than what this class is currently doing in convertUTF8ToCodePoint() (and a lot simpler than the MediaWiki code too). Someone else suggested unpack("N", $char). I think either of these might work... is there a reason we need all of that logic?
- We probably ought to put in at least a few tests that transliterate to multi-byte characters, especially since you've rewritten the code extensively from the presumably working version that has been in the Transliteration contrib module (and MediaWiki) for a while. We can pull the expected transliterations out of the current data files (verifying the goodness of the data files is a separate issue anyway) -- what we want to test is that all the unpacking and file reading stuff is still working. Probably at least one two-byte, one three-byte, and one four-byte character would be a good idea, plus maybe a test involving the characters in the 'eo' and 'kz' language override files. I'm unclear myself on which characters are two-byte, three-byte, and four-byte, so maybe you could pick some out of the appropriate xNN.php files and add them to the tests?
Given this, and #122's request for more in-code comments, and #123's concern about the state variables (I think we can do without them, see above in this comment), I think the correct status is "needs work".
I'm actually willing to take this on again if you don't want to work on it, so let me know Damien.... Some guidance from someone who knows more than I do about multi-byte characters on which characters to pick out for the tests would be helpful though!
Comment #126
jhodgdonNo response from Damien (possibly due to time zones) so I am going to take a stab at another patch since it does not appear that will be a duplicated effort. :)
Comment #127
jhodgdonOK, here is a new patch, without the state variables -- I used preg_split() instead.
It turns out we need the somewhat complex logic used to construct the Unicode character code from the character. That's apparently just how they work.
I also added a couple more tests for 3 and 4-byte characters. It's all working and I think it's ready to go (again)... the code is actually making more sense and getting cleaner as we go...
Attachments:
- full patch (hopefully committable)
- interdiff from #118 (from damien's final patch)
- code-only patch without the data files (easier to review)
Comment #128
sunWow, this looks pretty awesome. Could very well imagine this to become a leading implementation.
There are a couple of mistakes contained though, primarily regarding the class being in Drupal\Component, but the documentation/comments has Drupalisms and Drupal-specific things all over the place; including a
drupal_alter()
inreadLanguageOverrides()
.However, those issues are extremely minor, and we can happily fix them post-commit (most likely, by subclassing the PHPTransliteration class into Drupal\Core\Transliteration, just to extend the generic class and plug in the
drupal_alter()
) — no need to hold up this wonderful patch for that.Comment #129
Damien Tournoud CreditAttribution: Damien Tournoud commentedThe issue with
preg_split()
is that we are decoding the full string in memory. We can usepreg_replace_callback()
and avoid storing state variables in the main class by creating aPHPTransliterationOperation
class just for this, and move->ordUTF8
and->replace
in there.Comment #130
jhodgdonRE #129 - I don't see why this is a problem. The string is already in memory, of course, so it's not all that much overhead to make it into an array of characters. The typical use case for this type of machine-provided lookup-table transliteration is to make a URL or a machine name for something like that, so it's typically only a few characters. I imagine that in most cases, the overhead involved in instantiating a new class (which would likely involve making a copy of the string to be transliterated anyway) would be far greater than the overhead involved in splitting it into an array of characters?
Also, if you were really trying to transliterate long pieces of text (e.g., prose rather than making machine names), this class is unlikely to be adequate, due to issues discussed above in this issue (and mentioned in the @defgroup text in this patch, and in the documentation for the class).
Comment #131
jhodgdonI realized maybe I should un-assign this to myself, since lately "assigned to jhodgdon and RTBC" in the core queue usually means "jhodgdon is going to commit this". :)
Comment #132
webchickThis seems worthy of calling a "major" feature request. Unfortunately we're all stuffed up behind major bugs, so I can't commit it right now. :(
Comment #133
catchOK this has been RTBC for several days now, and we're about right with thresholds (101 major bugs as I type this), so committed/pushed to 8.x, thanks all!
Let's open a follow-up for some of these issues before closing this one:
- moving the Drupal-specific stuff out of the component to a sub-class in Core, but leaving most of it in component, per sun in #128. No drupal_alter() allowed in \Component
- whether to find another format for the raw PHP files per me in #103.
- can we use this for uploaded file names as the contrib project does at all?
- could do with a MAINTAINERS.txt entry
Comment #134
jhodgdonRE #133:
- No drupal_alter() in \Component -- to explore this, I filed:
#1842726: Transliteration component must not contain drupal_alter()
- Regarding the file format, I think this is regarding only the language overrides... assuming so, I've filed:
#1842710: Transliteration: Make language-specific character overrides use YAML files
which hopefully captures this idea.
- Regarding using this in core for uploaded file names, I am not exactly sure what we want to use Transliteration for in core, but I've filed this issue to explore and/or do that:
#1842718: Use new Transliteration functionality in core for machine names
- Regarding MAINTAINERS, are you asking me to sign on as maintaining this? I guess I could do that if you think this one class needs a MAINTAINER. :) But I didn't make a patch yet... wasn't sure whether to put it into Base System or Language System, and I didn't want to sign on as a maintainer of either of those systems (but I guess we can included it as a specific item within one of those). Thoughts?
I'll add these three follow-up issues to the issue summary and create a change notice after I save this comment.
Comment #134.0
jhodgdontotally rewrite the summary
Comment #135
jhodgdonI've added an issue summary that I think captures the current state of the transliteration class. As some of those follow-ups get acted upon, we can iterate on the change notice as needed to keep up with the changes.
http://drupal.org/node/1842748
Since the issue summary has been updated, the change notice created, and all the follow-ups filed (and noted in the issue summary), I think this base issue can now be put back to "fixed", right?
Comment #136
sunRegarding MAINTAINERS.txt, I think we definitely need some names there, since this is a very complex topic.
How about this?
Comment #137
amateescu CreditAttribution: amateescu commentedI'm fine with that :)
Comment #138
jhodgdonSo you think this needs to be its own Component? OK... and I'm OK with being included as a maintainer.
Guess we need the +1 of Damien, so leaving to him to RTBC.
Comment #139
Damien Tournoud CreditAttribution: Damien Tournoud commented+1 here, too.
Comment #140
webchickCommitted and pushed to 8.x. Thanks!
Comment #141
jhodgdonFor anyone following this, there's a new follow-up issue:
#1858376: Provide tests and documentation for long Unicode characters
Do we need to have an issue tag or issue Component for Transliteration by the way?
Comment #141.0
jhodgdonAdded follow-up issues to the summary
Comment #143
KartagisHi,
I git clone'd d8 today, and immediately tried the new transliteration thingie. I created a content type "Fotoğrafçılık" and its machine name was converted to "fotograf__l_k" when it should have been "fotografcilik" instead.
Regards,
Comment #144
Gábor Hojtsy@Kartagis: while the base function was added, it should still be applied to Drupal core at various places, such as content types. See #1842718: Use new Transliteration functionality in core for machine names. This issue was about the base functionality.
Comment #145
batigolixNote that there is now a docs page at http://drupal.org/documentation/modules/transliteration
See also #1994404: Create documentation page for the module Transliteration
Comment #146
jhodgdonComponent update for posterity.
Comment #146.0
jhodgdonadd another follow-up issue to the list