Translation memories are very good at helping you figure out how to translate strings, especially when only small things have changed in the source text since a previous version. Ideally we'd have a third-party translation memory running which we could feed with all the translations and ask for suggestions on concrete strings when users request that.
This needs:
1. Selection of a tool we could use (preferably open source, but hopefully at least free so we can skip licensing costs)
2. Integration for feeding it with our data continually (feeding with .po files can be considered a standard feature)
3. Integration for asking for info on certain strings when the user initiates this (via AJAX/AHAH, I'd assume)
This will most probably not fit into one issue, but we need to start someplace. Let's keep this for point 1 and a meta-issue for the rest.
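As an illustration of point 2, feeding a memory from .po files boils down to extracting (source, translation) pairs. A minimal Python sketch, deliberately simplified: real .po files have plural forms, multi-line strings and escapes, so a proper parser such as polib would be needed in practice.

```python
import re

# Simplified sketch: extract (msgid, msgstr) pairs from a .po file.
# Handles only the basic single-line case; real .po files (plural
# forms, multi-line strings, escapes) need a full parser like polib.
PAIR_RE = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')

def po_pairs(po_text):
    # Skip the header entry (empty msgid) and untranslated strings.
    return [(s, t) for s, t in PAIR_RE.findall(po_text) if s and t]

sample = 'msgid ""\nmsgstr "header"\n\nmsgid "Save"\nmsgstr "Speichern"\n'
print(po_pairs(sample))  # [('Save', 'Speichern')]
```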
Comments
Comment #1
gábor hojtsy commented: #194141: Implement msgmerge type fuzzy matching (which was previously marked duplicate of #563228: Suggestion for similar strings) marked as duplicate.
Comment #2
gábor hojtsy commented: A nice summary with a list of available tools is at http://en.wikipedia.org/wiki/Computer-assisted_translation#Comparison_of...
Our basic criterion is that it needs to be integrable through some API, ideally by running as a daemon we can communicate with.
Comment #3
hass commented: +
Comment #4
gábor hojtsy commented: PoEdit's translation memory algorithm is remarkably simple; it is explained and implemented (in C) here: http://poedit.svn.sourceforge.net/viewvc/poedit/poedit/trunk/src/transme...
Comment #5
Thomas_Zahreddin commented: See http://drupal.org/node/608288#comment-3532364
Comment #6
webchick commented: Subscribing. Sounds interesting.
Comment #7
balagan commented: Virtaal (a CAT tool written in Python) searches open-tran.eu too, and it is quite fast. Don't forget that we need to solve segmentation issues as well. What I mean is that there are large translatable segments consisting of many sentences, and these are not really reusable. http://translate.sourceforge.net/wiki/toolkit/posegment does this job if the number of sentences is the same in the source and target segments. One possible solution is to offer pre-segmented text to the translator, so they would never have to deal with large chunks of sentences; it would be handled transparently. The translator should keep the ability to expand segments in order to correct wrongly detected sentence boundaries.
Some ideas about fuzzy searching algorithms: http://en.wikipedia.org/wiki/Fuzzy_string_searching
I think it might be quite a challenging task, although it would be a wonderful feature. Implementing a wordlist (http://drupal.org/node/263794) sounds like a much easier job which also has a nice reward as a consequence; I think that should be implemented first.
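To make the fuzzy-lookup idea concrete, here is a tiny sketch using Python's stdlib difflib as the similarity measure; the memory contents, the threshold, and the function name are all made up for illustration:

```python
import difflib

def tm_suggest(source, memory, threshold=0.75):
    """Rank translation-memory entries by similarity to `source`."""
    matches = []
    for old_source, translation in memory.items():
        score = difflib.SequenceMatcher(None, source, old_source).ratio()
        if score >= threshold:
            matches.append((score, old_source, translation))
    return sorted(matches, reverse=True)

memory = {
    "Save configuration": "Konfiguration speichern",
    "Save settings": "Einstellungen speichern",
}
# The near-identical entry is suggested; the other falls below the threshold.
print(tm_suggest("Save the configuration", memory))
```

A real server would index the memory instead of scanning it linearly, but the ranking idea is the same.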
Comment #8
Thomas_Zahreddin commented: I found a nice collection of tools:
http://www.uibk.ac.at/tuxtrans/software.html
- all open source (I think).
Comment #9
balagan commented: By the way, Pootle, the other product of Virtaal's creators, has some built-in translation memory management: http://en.wikipedia.org/wiki/Pootle, http://translate.sourceforge.net/wiki/pootle/updatetm
It is also worth checking: http://open-tran.eu/dev.html
Comment #10
balagan commented: Thomas, there is no real translation memory server included in the collection, though the alignment tools might be useful for proper segmentation.
Comment #11
balagan commented: I like open-tran.eu more and more.
It is open source and written in Python; you can find its SVN repo here: http://code.google.com/p/open-tran/
Wow, our problem is partially solved already, check it out: http://babelwiki.babelzilla.org/index.php?title=Open-Tran
(I still haven't checked it; I will when I have more time.)
If we upload all Drupal translations to open-tran.eu, for example on a daily basis, and query the JSON API for suggestions (or use the above Firefox extension), we still might want to use the just-translated strings that are not yet in the open-tran.eu memory. These strings can be very relevant when working on one project (module). It might be worth installing open-tran locally, doing the updates immediately, and doing a daily transfer of translation memories to open-tran.eu, so other translation tools (like Virtaal) or other communities can use them too.
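A rough sketch of what the suggestion query could look like from our side. Note the hedge: the base URL, path layout, and response shape below are assumptions for illustration and would have to be checked against the actual open-tran.eu API docs; only the URL-building mechanics are concrete.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; the real open-tran.eu endpoint layout must be
# looked up in its API documentation (see http://open-tran.eu/dev.html).
TM_BASE = "http://example-tm-server"

def build_suggest_url(phrase, lang):
    # URL-encode the phrase so spaces etc. survive as a path segment.
    return "%s/json/suggest/%s/%s" % (TM_BASE, lang, urllib.parse.quote(phrase))

def fetch_suggestions(phrase, lang="hu"):
    # Returns whatever JSON the server sends back; the shape is
    # server-specific and not specified here.
    with urllib.request.urlopen(build_suggest_url(phrase, lang), timeout=5) as resp:
        return json.load(resp)
```

On the Drupal side this would sit behind the AJAX callback from point 3 of the issue summary.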
Comment #12
balagan commented: Unfortunately, open-tran.eu no longer works.
I did some research again, and found the following:
http://amagama.translatehouse.org/
https://github.com/wikimedia/mediawiki-extensions-Translate/tree/master/...
Comment #13
balagan commented: I have also found the following: http://sourceforge.net/p/open-tms/wiki/Home/
There is not much info, though. This project uses the Araya server, which is a product of Heartsome GmbH: http://www.heartsome.de/en/araya.php
There is also an interesting project that supports .po files: https://translate.zanata.org/zanata/
And open-tran.eu is working again; it was just down for a few days.
Comment #14
fabianx commented: To solve the simple use case where module A has the string 'Foo it great!' that contains a typo and is then changed to 'Foo is great!', we can do the following:
On import do a diff of old to new strings, note down all the differences and put them in two lists:
- removed strings
- added strings
Now do a fuzzy match from all removed strings to the added strings.
In this case we get a mapping of:
'Foo it great!' => 'Foo is great!'
Find the unique identifier of the old string in the system (e.g. 42) and store a reference to it as a suggestion on the new string (4711).
So now you have record:
4711, which references 42
There are now two possibilities:
- 1. Automatically re-create old translations for 42 in 4711, but mark them 'NEEDS REVIEW'
OR
- 2. At run time when the string 4711 is to be translated find the old translation based on how 42 was translated.
I think 1. is likely simpler, but depends on there being a workflow state.
This is the algorithm I think would be simplest to get some very basic, but still powerful fuzzy matching.
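The removed-to-added matching step can be sketched with Python's stdlib difflib; the cutoff value and the example strings are arbitrary illustration choices:

```python
import difflib

def map_removed_to_added(removed, added, cutoff=0.8):
    """For each removed source string, find the closest added string."""
    mapping = {}
    for old in removed:
        # get_close_matches returns the best candidates scoring above `cutoff`.
        close = difflib.get_close_matches(old, added, n=1, cutoff=cutoff)
        if close:
            mapping[old] = close[0]
    return mapping

removed = ["Foo it great!"]
added = ["Foo is great!", "Something unrelated"]
print(map_removed_to_added(removed, added))
# {'Foo it great!': 'Foo is great!'}
```

Each pair in the resulting mapping can then be stored as a suggestion (42 referenced from 4711) for either of the two workflows above.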
Comment #15
miro_dietiker commented: Accidentally found my way to this issue... :-)
In TMGMT, we currently have a project that adds a translation memory. By translation memory we mean a very basic implementation of our own: a storage for segments (which will also be used as an integration point for third-party software). But the key point is: TMGMT introduces a segmenter service.
Once the current projects are complete, TMGMT will try to segment every translatable when it is captured, annotate the segments (if there are multiple) with custom HTML5 tags, and otherwise keep the original structure.
The translation UI also interacts with the segmenter and displays the segments separately. (We are doing a CKEditor plugin, but it could also simply result in multiple fields being output.)
The memory stores both a version with HTML and a stripped version.
Without fuzzy matching, we will be able to reuse full segment matches at that point.
For fuzzy matching later, we will look into integrating existing libraries for the (nontrivial) matching, and our editor will be able to query our memory for that.
While this is not our first priority, it is likely that some of our pieces will provide great value for reuse in the context of l.d.o.
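The segment-and-annotate flow described above could look roughly like this; the <tmgmt-segment> tag name and the naive regex sentence splitter are assumptions for illustration, not TMGMT's actual segmenter service:

```python
import re

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
# A real segmenter needs abbreviation handling, locale rules, etc.
SENTENCE_RE = re.compile(r'(?<=[.!?])\s+')

def segment(text):
    return SENTENCE_RE.split(text.strip())

def annotate(text):
    """Wrap each segment in a custom tag; single segments stay untouched."""
    parts = segment(text)
    if len(parts) <= 1:
        return text  # keep the original structure
    return "".join('<tmgmt-segment id="%d">%s</tmgmt-segment>' % (i, p)
                   for i, p in enumerate(parts))

print(annotate("First sentence. Second sentence!"))
```

The memory would then store each segment both with its markup and in a stripped plain-text form, as the comment describes.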