XLIFF defines masking techniques for dealing with markup such as HTML inside content. Masked items can thus be protected from accidental manipulation by the translator.
http://www.maxprograms.com/articles/xliff.html
Check these quotes:
<bpt> Begin paired tag: The <bpt> element is used to delimit the beginning of a paired sequence of native codes. Each <bpt> has a corresponding <ept> element within the translation unit.
<ept> End paired tag: The <ept> element is used to delimit the end of a paired sequence of native codes. Each <ept> has a corresponding <bpt> element within the translation unit.
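For illustration, here is a hypothetical trans-unit (content made up) showing how a pair of HTML bold tags could be masked with <bpt>/<ept>; the escaped original codes travel inside the masks, linked by matching ids:

```xml
<trans-unit id="1">
  <source>Click <bpt id="1">&lt;b&gt;</bpt>here<ept id="1">&lt;/b&gt;</ept> to continue.</source>
  <target>Klicken Sie <bpt id="1">&lt;b&gt;</bpt>hier<ept id="1">&lt;/b&gt;</ept>, um fortzufahren.</target>
</trans-unit>
```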
With these masking features, we could make sure that HTML is not destroyed by translators.
Since I don't know the implementation status of this feature in XLIFF-compatible software, we'll need to check its adoption.
Note that most translation software providers don't implement raw XLIFF, but only subsets or even dialects.
Comments
Comment #1
miro_dietiker
Related: #1415234: Deal with text format permissions and security
Related: #1891840: Treat text formats differently such as HTML / markdown
First we should deal with text formats cleanly and only apply HTML-specific tools where HTML applies.
However, as a workaround, we can for now try to detect HTML in all cases. This should possibly be a translator setting.
It will help Trados display things cleanly, handle translation memory correctly, and finally prevent editors from destroying HTML.
Comment #2
ysavourel commented
> Since i don't know the proper implementation status of this feature in XLIFF compatible software, we'll need to check the adaption.
Here are some notes that may help:
There are only a few rare cases of XLIFF-aware tools that may not understand the inline code elements such as <bpt>/<ept>/<ph>/etc. The wide majority handle them properly.
Some tend to put all original codes in <ph>. But if you can make the distinction between opening/closing/placeholder types (e.g. <b>, </b>, <br/>), I would strongly recommend using the proper XLIFF element for each type.
You can also use <g>...</g> and <x/> if you don't want to include the original code in the XLIFF document. But your merging tool will then need to know how to put the original codes back.
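To sketch the difference (hypothetical content): with <bpt>/<ept>/<ph> the original codes travel inside the XLIFF, while <g>/<x/> only reference them and a skeleton is needed to restore them on merge:

```xml
<!-- Original codes carried inline: -->
<source>Line one<ph id="1">&lt;br/&gt;</ph><bpt id="2">&lt;b&gt;</bpt>bold<ept id="2">&lt;/b&gt;</ept></source>

<!-- Original codes kept outside the XLIFF document: -->
<source>Line one<x id="1"/><g id="2">bold</g></source>
```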
I also strongly recommend to handle segmented entries properly when merging back: by stripping <mrk mtype='seg'> elements' tags (not their content). Other <mrk> elements may also be present that you may want to handle to either feed them back into the content (e.g. a comment) or strip them out.
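For example, a segmented target might look like this (made-up content); on merge, only the <mrk mtype="seg"> tags are stripped while their content is kept:

```xml
<target>
  <mrk mtype="seg" mid="1">First sentence.</mrk>
  <mrk mtype="seg" mid="2">Second sentence.</mrk>
</target>
<!-- merged back: First sentence. Second sentence. -->
```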
Comment #3
Rob1106 commented
Quick input from our experience:
Our translator uses SDL Trados Studio 2011, which is not able to leave the HTML tags out of the translation tasks by default. This means that the translator cannot do a proper word count, which makes it difficult for them to make a quote that matches the work that has to be done.
Another thing is that the HTML tags are counted as words for a translation job inside the module itself. So whenever the translator has a system that does leave out the HTML tags, you don't have a correct word count of the translation job to double-check the translator, which is pretty important as well.
So this would be a nice feature to have inside the tmgmt module itself as far as we are concerned. A nice 'should have' so to say.
Comment #4
blueminds commented
- Provides a way to add "processors". A processor can take action upon translation request and XLIFF import.
- The patch also includes an HTML masking processor that masks and unmasks the HTML using <bpt> and <ept> tags.
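As a rough sketch of what such a masking processor does (this is not the patch's code, just a simplified Python illustration assuming well-formed input): paired tags get matching ids via a stack, standalone tags such as <br/> and <img> become <ph> placeholders, and the original tag travels escaped inside the mask.

```python
import html
import re

def mask_html(text):
    """Mask HTML tags as XLIFF <bpt>/<ept>/<ph> elements (simplified sketch)."""
    counter = 0
    stack = []   # (tag name, assigned id) for currently open tags
    out = []
    pos = 0
    for m in re.finditer(r'<(/?)(\w+)([^>]*?)(/?)>', text):
        # Escape the plain text between tags.
        out.append(html.escape(text[pos:m.start()]))
        closing, name, _attrs, selfclose = m.groups()
        escaped = html.escape(m.group(0))
        if selfclose or name in ('br', 'img'):
            # Standalone tag: placeholder element.
            counter += 1
            out.append('<ph id="%d">%s</ph>' % (counter, escaped))
        elif not closing:
            # Opening tag: begin-paired-tag mask, remember its id.
            counter += 1
            stack.append((name, counter))
            out.append('<bpt id="%d">%s</bpt>' % (counter, escaped))
        else:
            # Closing tag: reuse the id of the matching opening tag.
            tag_id = next(i for n, i in reversed(stack) if n == name)
            stack.remove((name, tag_id))
            out.append('<ept id="%d">%s</ept>' % (tag_id, escaped))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return ''.join(out)
```

Unmasking on import is the reverse: drop each <bpt>/<ept>/<ph> wrapper and unescape its content back into the translated text.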
Comment #5
cgalli commented
Installed the patch and tested.
When the 'Escape HTML' box is checked in the translator settings, the job is masked with bpt and ept markup. Unfortunately, I do not have a program available to import the result into, so I cannot analyse the result.
Shouldn't the 'Escape HTML' setting also be available on the job level?
When the 'Escape HTML' box is NOT checked, the job does not come back after 'Submit to translator' (blank page).
BTW: It is super stupid that the preview for issues shows only the original body (which never changes) and not the comment (which is the only thing that changes)....progress!
Comment #6
Rob1106 commented
That sounds pretty good. Can't wait to test it in a new release...
Comment #7
cgalli commented
@Rob
Can you please test the masking in Trados? You can test the patched version by clicking on this link:
http://simplytest.me/project/tmgmt/7.x-1.x?patch[]=https://drupal.org/fi...
Then choose 'Launch Sandbox', which will set up a tmgmt site with the patch applied.
Or you can apply the patch to your own system.
Please do not forget to check the 'Escape Html' box in the translator's settings.
Comment #8
miro_dietiker
Rob, you should test it in -dev way before the release (or even by applying the patch manually once it's committed), as we need feedback before we can release such a major feature.
cgalli: No, this is nothing that makes sense per job.
The tool / workflow in place at the translator company defines whether we should mask or not.
This rule is a perfect fit for per-translator configuration.
All jobs created for a translator should follow the same encoding patterns, which also makes sure we do not accidentally forget to enable masking.
Masking should be the default as it will help to have clean translation memories, counters and a good translator experience in the tools.
Comment #9
blueminds commented
@Rob1106 will you be able to test the patch? I do not have a tool to test it in real life; I just followed the specs. It needs real-life testing before we push it forward.
Comment #10
blueminds commented
Tested the attached patch with Trados Studio. The basic workflow works; it still needs improvements and fixed test coverage, but it is ready for a trial ride and feedback.
Comment #13
miro_dietiker
This sounds AWESOME! :-)
Let us test this tomorrow together.
We should discuss on what we consider plain XLIFF and what is a Trados specific implementation.
Thus, we might need to introduce vendor specific XLIFF dialects from the beginning.
Comment #14
blueminds commented
Attaching a patch that hopefully solves all issues. Ready for review.
Comment #15
miro_dietiker
No commercial URLs and offers in test content please! Use example.com URLs and some neutral text like Wikipedia...
Also we talked about adding some validation of the received XLIFF markup / masks which is related to:
#2042861: Validate tags
Comment #16
blueminds commented
Added documentation and an import integrity check; removed the two testing HTML files and moved the testing HTML into a single sample file.
The current status is that it works quite nicely with Trados. It would probably make sense to try it out with a different tool as well.
There is still one issue though, and that is img attributes: they are added as attributes of the placeholder element without the x-html: prefix, because with the prefix the file fails validation in Trados. Tried it with the xml: prefix, and that worked, but I am not sure whether that is correct.
Comment #17
miro_dietiker
Nice status here. Let me do some more extensive review.
I'm missing HTML entities like © and other more cryptic ones here. I don't know if they work cleanly through the stack.
What?
I see way too much business logic inside this single _submit function. Why not create a separate clean method that processes the overall XLIFF import? Am I wrong?
Only proceed if still valid here. I even thought of an early-exit strategy for the abort cases. Anyway, why is this in the _submit function and not in _validate? Dunno. :-)
I guess you should summarize here exactly what you are checking in the implementation. The inherited docs don't state this.
Comment #18
blueminds commented
Comments implemented.
Comment #19
miro_dietiker
Those are still plain UTF-8 characters, not HTML entities that are already encoded, such as &copy;.
We need to check what happens if (correctly) encoded entities are already part of the original data.
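A small Python sketch of the round-trip concern (not the module's code): an entity that is already encoded in the original data must be escaped once more for the XML payload, so that a single unescape on import restores exactly the original string.

```python
import html

source = 'Copyright &copy; 2013'   # entity already encoded in the original data

# Escaping for the XML payload turns '&' into '&amp;', keeping the entity literal:
xliff_text = html.escape(source)    # 'Copyright &amp;copy; 2013'

# A single unescape on import restores the original entity, not the © character:
restored = html.unescape(xliff_text)
assert restored == source
```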
Comment #20
blueminds commented
Here we go.
Comment #22
berdir
First review.
Haven't reviewed everything yet. The code looks quite good, but I'm a bit worried about complexity and the overall design: how this new plugin type plays together with the different file formats. It's kind of XLIFF-specific atm, but then again not, and it also kind of overlaps with the new escape feature that we introduced recently for translatable string placeholder escaping...
typo: fle
First word should be third person.
Updates translation text.
Should be enough, no need for will?
Shouldn't this call itself, not the other method?
Makes me wonder why the tests aren't failing ;)
I don't remember why I added the source also as translation, does that actually make sense? :)
This looks quite inefficient, can we at least re-use the same class?
Might be useful to have tests for the validation failure part here, haven't seen the tests yet but the test implementation is empty.
You have both if/elseif/else *and* return FALSE.
I think it would be easier to read if you just do return FALSE and have if () return FALSE; if (return FALSE) ... less nesting.
Probably copied, but that's not true according to how it's used? there it's a string that's used as new $bla.
Isn't processing tightly coupled with the file format? XLIFF-transformation for HTML export isn't very useful :)
I know, you just aren't supposed to select it, but that's not exactly great UX :)
DOMDocument is a class, not a function, so no ().
position small, . at the end and missing empty line before @var
t-to many t-t's :)
int/bool, missing descriptions.
tmgmt_test is always enabled in the base class.
Missing docblock.
"Will" also unnecessary here.
I don't really understand the last sentence, persist?
Closed the review window accidentally in the middle, will continue later.
Comment #23
berdir
About the escape problem: the idea of the API there was that knowing what should be escaped and knowing how it needs to be escaped are two different things, and we tried to separate them. But then again, the way HTML is escaped in XLIFF connects the two, so that might not work well.
Don't have a good solution atm...
Comment #24
berdir
What exactly is source and translation content here? My understanding is that source is the actual node and fields, and translation content is both the source text that we initially process and the translation that we get back, as both go through DOMDocument?
Wherever we make the split, we end up with a mismatch as we can't fix the actual source, be it our review form or the actual sources.
We can't change this, but maybe we should make this a bit more explicit, e.g. in the form of a description.
More tomorrow
Comment #25
berdir
Not sure if this can really be considered a public property and should be set like this; it happens to be public because the storage controller needs it.
We have updateData(), but that's rather slow, not sure.
@inheritdoc can't be mixed with additional comments. api.module doesn't support that.
Instead of the second paragraph, you could move that inline and also add some more comments about what you're doing there. The first seems unnecessary. Also more unnecessary "will"s :)
Would be important to test this with multiple job items and data items I think, to make sure that this logic here works correctly.
Also again the mix of SimpleXML and XMLReader. We should probably at least re-use the reader (if that is actually faster, not sure); the more performant way would probably be to use only one API. Not sure how feasible that is.
All this is doing is counting opening tags, right (comments! :))
Other tests that require special data usually just override the default; that is IMHO easier to read than an implicit thing through the type. But we're doing it elsewhere too, so not sure. Doing it here would make it easier to test multiple different structures. And I know it contradicts what I'm saying below (setting data directly being problematic), but it's a test here :))
Testing a bit more specific would make sense I think, e.g a whole string from opening to closing tag, not just a small bit of it.
How could it end in the else case?
Do we need to worry about escaping/encoding here?
"Helper function to" is IMHO equally useless as "Will", "protected function" tells me that as well, what about "Masks HTML tags" or someting like that, that actually describes what it does?
Also, either the two loop functions should have a default implementation with abstract functions for this processSource() thing, or we give it a better name, not "process".
Are you sure this is the right place to set elementsCount? Won't this set it back to 0 for every data item?
Also, the trick with $this->elementsCount that's then used in the parent function seems... rather weird, I'd either return it (a bit more code but I think easier to understand) or pass down the job so that it can be immediately updated. The second is what I'd do if we have a default implementation for that loop.
Variable names are usually not shortened, can we name this $iterator? Or $dom_iterator?
id in comments should be ID
Same here, do we need to handle escaping of attribute values? AFAIK, we do parse this again when we import it, no?
Comment characters need to be indented correctly as well, if this will be committed like this, which I understand it will.
Ok, so this is a setting on the translator, so it can't be configured per job but you can configure the format per job...
That's it for now with specific feedback, going to discuss the whole thing with Miro asap.
Comment #26
blueminds commented
Here is a patch that deals with characters that break the XML structure. I have not yet addressed anything from the reviews.
Comment #27
berdir
Ok, to summarize, here is my main problem with the way the plugin system works right now: there's a disconnect between the actual implementation, the way it's called, and how it works:
- It works on the job, giving the impression that it's a generic thing; this also makes the implementations rather complicated (that foreach, which *could* be solved with a base implementation). The exception is the validate thing, which receives a file name (without knowing anything about it).
- It's called from the file translator, no matter what's configured
- The actual implementation however blindly assumes that the source is HTML and the target will be XLIFF.
The result is that it's a one-off solution for this specific problem, masked as a more or less generic pluggable system.
I think there are 3 options as to what we could do:
1. Write a fancy and complex processing system that would actually be generic. That would very likely involve introducing an intermediate markup language that the source is converted into, which the translator then processes back into whatever is needed. That would probably keep us busy for a year and is overkill for this issue :)
2. Change the plugin system so that it's specific to xliff, only execute those plugins if xliff is selected and update documentation/possibly methods.
3. Implement this as an actual one-off solution, we can keep the class but just call it directly.
I actually prefer 3, because 2 will just get in the way in case we actually implement something like 1 at some point, and 3 would then be easier to incorporate into that new plugin system, as it wouldn't have been a public API before. We can also make arguments/methods match this specific use case and don't have to think about what we need to do to support every possible use case.
Then there's the problem with saving the data in those methods in the class. That is problematic because, sooner or later, we will introduce things like checking whether the source still matches when you review a translation, and at least warn if not. That will happen unless we apply the first part of the processing *before* that check, which we can't unless we'd implement 1. Given that we don't need the fixed source, as we have the tag count it's supposed to have (afaik), would it be easier to just run $data through the processor without persisting it?
Comment #28
blueminds commented
Did not implement all the comments yet. Focused on your option 3 and implemented it. I agree with you here, but the original specification said that it should be pluggable and that there should be an option to turn off masking.
Re that setting: I think it does not make sense, as the choice would effectively be between "export XLIFF" and "export an even more XLIFF-ed file".
Tested with Trados Studio, works fine.
Please see the patch and tell me whether you like the conceptual changes; I will finish the other comments later on.
Comment #30
blueminds commented
Here is the final thing.
Comment #31
miro_dietiker
Some feedback :-) (strange, my dreditor seems broken...)
+<p>and here we have some link <a href="http://example.com">break line</a></p>
I guess we should additionally test links with alt, title and both.
+ * Contains TMGMTTestTextProcessor.
I'm strongly missing links to the relevant standard documents about which recommendation we exactly implement, such as http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-h...
+ 'label' => t('Test html source'),
HTML, not html. And 100+ times more. ;-)
+ * Validates imported xliff file.
XLIFF, not xliff. And 10+ times more.
I would be fine with defining that the HTML needs to be valid. However, TMGMT would then require a source validation phase, and an error might stop a job.
+ $non_pair_tags = array('br', 'img');
Usually the parser knows what is paired and what is present as a single tag. This is only required for validation, and there are quite a few other possible non-paired tags.
+ return $reader->readInnerXml();
Correct: readInnerXML()
Comment #32
blueminds commented
- Implemented comments
- Ensured element IDs are unique in the whole XLIFF document
Comment #33
blueminds commented
Made IDs "nicer".
Comment #34
miro_dietiker
So we realized that Trados modifies/drops CDATA, and while it throws a fatal error when the import does not contain 100% of the tag masks on both the source and target side, it makes no guarantee that all masked tags are placed.
Thus, failing hard in case the validation shows mismatch is a bad idea.
Instead we should provide a message about the issue, continue and provide more messages in detail during processing.
Later, TMGMT items / data items might get flags about their integrity that would block a user from accepting it.
Still, as initially specified, the whole feature needs to be optional and enabled through the translator. If disabled, there shouldn't be any change.
For the future, we should persist documents of old exports and make sure the import still works with the new code, using automatic test coverage.
EDIT: Sometimes HTML is still written in lowercase.
Comment #35
miro_dietiker
The status of #33 is currently being tested in production, meaning we have files created with that defined file format in production.
For the import, we need compatibility and graceful fail within the next few days.
Comment #36
blueminds commented
- Here comes the setting to enable/disable XLIFF processing
- An HTML semantics integrity check that only triggers error messages; the job still gets imported
Comment #39
blueminds commented
36: 1882108-mask_html-12.patch queued for re-testing.
Comment #40
blueminds commented
Comment #41
berdir
Can you explain why this is done with a by-ref assignment, instead of setting it again at the end? This would be more complicated to port to 8.x (where we will likely have more methods to work with properties/fields), and you don't really save code, as the explanation is as long as the code would be :)
The comment also doesn't make much sense right now, as we don't set anything, we just save it. array() should also be just array.
Left-over debug.
Comment #42
blueminds commented
Fixed.
Comment #43
miro_dietiker
Masking import needs to be fault tolerant.
We will accept imports with wrong data... but should throw a message for every item.
In case it's possible, add it to the (data) item comment.
Comment #44
blueminds commented
This is already implemented.
Comment #45
blueminds commented
Reroll.
Comment #46
miro_dietiker
OK, tested this.
Works cleanly now.
However, error messages should be assigned to the item, not only to the job.
Also, the validation error message is not that friendly; it could be improved by e.g. outputting the problematic tag and using less "bla" text.
And as a follow-up, the new mrk tag that Trados now uses is a good thing we should support in the future for our own processing as well.
Comment #47
blueminds commented
Here we go.
Comment #48
berdir
Ok, committed and pushed!
Given that this is a pretty major change, let's document this change as a change notice: https://drupal.org/list-changes/tmgmt
Comment #49
anybody
Sorry, but I have a very important question and I can't find an answer in the documentation yet: I get the result back from the translator as escaped HTML and import it, but it doesn't seem to be converted back to HTML at any point. How can I solve that? What am I doing wrong? :(
Comment #50
megadesk3000 commented
Hello miro and berdir
I have the exact same problem as "Anybody" ;) The XLIFF contains HTML-entity-encoded markup, and I want it converted back to the original HTML markup on import. Can you give me a hint how this can be achieved?
Many thanks for your answer
Jan
P.S. Here is an example:
Comment #51
anybody
Any progress or even an idea here?
Comment #52
dragonfire353 commented
Here's a quick fix for new translation imports for 7.x-1.0-rc1.
Add this to entity/tmgmt.entity.job_item.inc at line 630, or, if that isn't right, in the function addTranslatedDataRecursive near the end, before $this->updateData:
Again, quick fix so use at your own risk.
Comment #53
anybody
Thanks a lot @dragonfire353 for your quickfix.
The problem with it is that there are several further HTML entities and other ASCII characters that may be contained in the text and should be decoded as well.
I've created and attached a patch that uses html_entity_decode() to do that in the last step. You can use it as a quick fix.
Anyway, I think we should discuss:
- adding a setting or permission to enable this HTML encoding
- adding a function to handle the text securely, at least filter_xss_admin (see patch) or a custom filter_xss (see: https://www.drupal.org/node/28984)
- further security implications, because code injection will definitely be possible if we're not careful here.
Anyway, this has to be fixed, because otherwise a proper HTML import is not possible.
Could a tmgmt maintainer and/or a member of the Drupal security team have a look at this perhaps finally?
And finally for clarification: You need the patch from #42 plus this one to have the full export and import functionality for HTML translations!
Comment #54
anybody
Comment #55
anybody
Removed unrelated code style fixes from the patch.
Comment #56
a.milkovsky
@Anybody, thank you for the patch in #55. It was a good hint for me for the Drupal 8 version as well.
I vote +1 for RTBC.
Comment #57
kristen pol
Thanks for the patch! It is working for us, but a couple of nitpicks below. Cheers.
Need a space after if.
Comment should wrap at 80 characters (right after HTML without leaving a dangling space).