Translation memories are very good at helping you figure out how to translate strings, especially when only small things have changed in the source text since a previous version. Ideally we'd have a third-party translation memory running which we could feed with all the translations and ask for suggestions on concrete strings when users request that.
This needs:
1. Selection of a tool we could use (preferably open source, but hopefully at least free so we can skip licensing costs)
2. Integration for feeding it with our data continually (feeding with .po files can be considered a standard feature)
3. Integration for asking for info on certain strings when the user initiates this (via AJAX/AHAH, I'd assume)
This will most probably not fit into one issue, but we need to start someplace. Let's keep this for point 1 and a meta-issue for the rest.
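As an illustration of point 2, feeding a memory from .po files boils down to extracting (source, translation) pairs. A minimal Python sketch, deliberately simplified: real .po files have plural forms, multi-line strings and escapes, so a proper parser such as polib would be needed in practice.

```python
import re

# Simplified sketch: extract (msgid, msgstr) pairs from a .po file.
# Handles only the basic single-line case; real .po files (plural
# forms, multi-line strings, escapes) need a full parser like polib.
PAIR_RE = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')

def po_pairs(po_text):
    # Skip the header entry (empty msgid) and untranslated strings.
    return [(s, t) for s, t in PAIR_RE.findall(po_text) if s and t]

sample = 'msgid ""\nmsgstr "header"\n\nmsgid "Save"\nmsgstr "Speichern"\n'
print(po_pairs(sample))  # [('Save', 'Speichern')]
```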
Comments
Comment #1
gábor hojtsy commented: #194141: Implement msgmerge type fuzzy matching (which was previously marked duplicate of #563228: Suggestion for similar strings) marked as duplicate.
Comment #2
gábor hojtsy commented: A nice summary with a list of available tools is at http://en.wikipedia.org/wiki/Computer-assisted_translation#Comparison_of...
Our basic criterion is that it needs to be integrable through some API, ideally by running as a daemon we can communicate with.
Comment #3
hass commented: +
Comment #4
gábor hojtsy commented: PoEdit's translation memory algorithm is remarkably simple; it is explained and implemented (in C) here: http://poedit.svn.sourceforge.net/viewvc/poedit/poedit/trunk/src/transme...
Comment #5
Thomas_Zahreddin commented: See http://drupal.org/node/608288#comment-3532364
Comment #6
webchick commented: Subscribing. Sounds interesting.
Comment #7
balagan commented: Virtaal (a CAT tool written in Python) searches open-tran.eu too, and it is quite fast. Don't forget that we need to solve segmentation issues as well. What I mean is that there are large translatable segments consisting of many sentences, and these are not really reusable. http://translate.sourceforge.net/wiki/toolkit/posegment does this job if the number of sentences is the same in the source and target segments. One possible solution is to offer pre-segmented text to the translator, so they would never have to deal with large chunks of sentences; it would be handled transparently. The translator should keep the ability to expand segments in order to correct wrongly detected sentence boundaries.
Some ideas about fuzzy searching algorithms: http://en.wikipedia.org/wiki/Fuzzy_string_searching
I think it might be quite a challenging task, although it would be a wonderful feature. Implementing a wordlist (http://drupal.org/node/263794) sounds like a much easier job which also has a nice reward as a consequence; I think that should be implemented first.
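To make the fuzzy-lookup idea concrete, here is a tiny sketch using Python's stdlib difflib as the similarity measure; the memory contents, the threshold, and the function name are all made up for illustration:

```python
import difflib

def tm_suggest(source, memory, threshold=0.75):
    """Rank translation-memory entries by similarity to `source`."""
    matches = []
    for old_source, translation in memory.items():
        score = difflib.SequenceMatcher(None, source, old_source).ratio()
        if score >= threshold:
            matches.append((score, old_source, translation))
    return sorted(matches, reverse=True)

memory = {
    "Save configuration": "Konfiguration speichern",
    "Save settings": "Einstellungen speichern",
}
# The near-identical entry is suggested; the other falls below the threshold.
print(tm_suggest("Save the configuration", memory))
```

A real server would index the memory instead of scanning it linearly, but the ranking idea is the same.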
Comment #8
Thomas_Zahreddin commented: I found a nice collection of tools:
http://www.uibk.ac.at/tuxtrans/software.html
- all open source (I think).
Comment #9
balagan commented: By the way, Pootle, the other product of Virtaal's creators, has some built-in translation memory management: http://en.wikipedia.org/wiki/Pootle, http://translate.sourceforge.net/wiki/pootle/updatetm
It is also worth checking: http://open-tran.eu/dev.html
Comment #10
balagan commented: Thomas, there is no real translation memory server included in the collection, though the alignment tools might be useful for proper segmentation.
Comment #11
balagan commented: I like open-tran.eu more and more.
It is open source and written in Python; you can find its SVN repo here: http://code.google.com/p/open-tran/
Wow, our problem is partially solved already, check it out: http://babelwiki.babelzilla.org/index.php?title=Open-Tran
(I still haven't checked it; I will when I have more time.)
If we upload all Drupal translations to open-tran.eu, for example on a daily basis, and query the JSON API for suggestions (or use the above Firefox extension), we still might want to use the just-translated strings that are not yet in the open-tran.eu memory. These strings can be very relevant when working on one project (module). It might be worth installing open-tran locally, doing the updates immediately, and doing a daily transfer of translation memories to open-tran.eu, so other translation tools (like Virtaal) or other communities can use them too.
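A rough sketch of what the suggestion query could look like from our side. Note the hedge: the base URL, path layout, and response shape below are assumptions for illustration and would have to be checked against the actual open-tran.eu API docs; only the URL-building mechanics are concrete.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; the real open-tran.eu endpoint layout must be
# looked up in its API documentation (see http://open-tran.eu/dev.html).
TM_BASE = "http://example-tm-server"

def build_suggest_url(phrase, lang):
    # URL-encode the phrase so spaces etc. survive as a path segment.
    return "%s/json/suggest/%s/%s" % (TM_BASE, lang, urllib.parse.quote(phrase))

def fetch_suggestions(phrase, lang="hu"):
    # Returns whatever JSON the server sends back; the shape is
    # server-specific and not specified here.
    with urllib.request.urlopen(build_suggest_url(phrase, lang), timeout=5) as resp:
        return json.load(resp)
```

On the Drupal side this would sit behind the AJAX callback from point 3 of the issue summary.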
Comment #12
balagan commented: Unfortunately, open-tran.eu no longer works.
I did some research again, and found the following:
http://amagama.translatehouse.org/
https://github.com/wikimedia/mediawiki-extensions-Translate/tree/master/...
Comment #13
balagan commented: I have also found the following: http://sourceforge.net/p/open-tms/wiki/Home/
There is not much info, though. This project uses the Araya server, which is a product of Heartsome GmbH: http://www.heartsome.de/en/araya.php
There is also an interesting project that supports .po files: https://translate.zanata.org/zanata/
And open-tran.eu is working again; it was just down for a few days.
Comment #14
fabianx commented: To solve the simple use case where module A has the string 'Foo it great!' that contains a typo and is then changed to 'Foo is great!', we can do the following:
On import do a diff of old to new strings, note down all the differences and put them in two lists:
- removed strings
- added strings
Now do a fuzzy match from all removed strings to the added strings.
In this case we get a mapping of:
'Foo it great!' => 'Foo is great!'
Find the unique identifier of the old string in the system (e.g. 42) and store a reference to it as a suggestion on the new string (4711).
So now you have record:
4711, which references 42
There are now two possibilities:
- 1. Automatically re-create old translations for 42 in 4711, but mark them 'NEEDS REVIEW'
OR
- 2. At run time when the string 4711 is to be translated find the old translation based on how 42 was translated.
I think 1. is likely simpler, but depends on there being a workflow state.
This is the algorithm I think would be simplest to get some very basic, but still powerful fuzzy matching.
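The removed-to-added matching step can be sketched with Python's stdlib difflib; the cutoff value and the example strings are arbitrary illustration choices:

```python
import difflib

def map_removed_to_added(removed, added, cutoff=0.8):
    """For each removed source string, find the closest added string."""
    mapping = {}
    for old in removed:
        # get_close_matches returns the best candidates scoring above `cutoff`.
        close = difflib.get_close_matches(old, added, n=1, cutoff=cutoff)
        if close:
            mapping[old] = close[0]
    return mapping

removed = ["Foo it great!"]
added = ["Foo is great!", "Something unrelated"]
print(map_removed_to_added(removed, added))
# {'Foo it great!': 'Foo is great!'}
```

Each pair in the resulting mapping can then be stored as a suggestion (42 referenced from 4711) for either of the two workflows above.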
Comment #15
miro_dietiker commented: Accidentally found my way to this issue... :-)
In TMGMT, we currently have a project that adds a translation memory. By translation memory we mean a very basic implementation of our own: a storage for segments (which will also be used as an integration point for third-party software). But the key point is: TMGMT introduces a segmenter service.
Once the current projects are complete, TMGMT will try to segment every translatable when it is captured, annotate the segments (if there are multiple) with custom HTML5 tags, and otherwise keep the original structure.
The translation UI also interacts with the segmenter and displays the segments separately. (We are doing a CKEditor plugin, but it could also simply result in multiple fields being output.)
The memory stores both a version with HTML and a stripped version.
Without fuzzy matching, we will be able to reuse full segment matches at that point.
For fuzzy matching later, we will look into integrating existing libraries for the (nontrivial) matching, and our editor will be able to query our memory for that.
While this is not our first priority, it is likely that some of our pieces will provide great value for reuse in the context of l.d.o.
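The segment-and-annotate flow described above could look roughly like this; the <tmgmt-segment> tag name and the naive regex sentence splitter are assumptions for illustration, not TMGMT's actual segmenter service:

```python
import re

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
# A real segmenter needs abbreviation handling, locale rules, etc.
SENTENCE_RE = re.compile(r'(?<=[.!?])\s+')

def segment(text):
    return SENTENCE_RE.split(text.strip())

def annotate(text):
    """Wrap each segment in a custom tag; single segments stay untouched."""
    parts = segment(text)
    if len(parts) <= 1:
        return text  # keep the original structure
    return "".join('<tmgmt-segment id="%d">%s</tmgmt-segment>' % (i, p)
                   for i, p in enumerate(parts))

print(annotate("First sentence. Second sentence!"))
```

The memory would then store each segment both with its markup and in a stripped plain-text form, as the comment describes.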