XLIFF defines masking techniques to deal with e.g. HTML markup in content. These items thus can be protected from wrong manipulation by the translator.

http://www.maxprograms.com/articles/xliff.html

Check these quotes:

<bpt>	Begin paired tag: The <bpt> element is used to delimit the beginning of a paired sequence of native codes. Each <bpt> has a corresponding <ept> element within the translation unit.
<ept>	End paired tag: The <ept> element is used to delimit the end of a paired sequence of native codes. Each <ept> has a corresponding <bpt> element within the translation unit.

So with these proper masking features, we could make sure, HTML is not destructed by translators.
Since i don't know the proper implementation status of this feature in XLIFF compatible software, we'll need to check the adaption.

Note that most translation software provider don't implement raw XLIFF, but only subsets or even dialects.

CommentFileSizeAuthor
#55 html-entity-decode_1882108_55.patch1.43 KBanybody
#53 html-entity-decode_1882108_53.patch3.03 KBanybody
#47 1882108-mask_html-15.interdiff.txt2.14 KBblueminds
#47 1882108-mask_html-15.patch34.56 KBblueminds
#45 1882108-mask_html-14.patch34.41 KBblueminds
#42 1882108-mask_html-13.interdiff.txt1.74 KBblueminds
#42 1882108-mask_html-13.patch34.41 KBblueminds
#36 1882108-mask_html-12.interdiff-setting-flag.patch9.49 KBblueminds
#36 1882108-mask_html-12.interdiff-validation.patch7.31 KBblueminds
#36 1882108-mask_html-12.patch34.45 KBblueminds
#33 1882108-mask_html-11.patch30.16 KBblueminds
#33 1882108-mask_html-11.interdiff.txt1.47 KBblueminds
#32 1882108-mask_html-10.interdiff.txt7.85 KBblueminds
#32 1882108-mask_html-10.patch30.14 KBblueminds
#30 1882108-mask_html-9.interdiff.txt39.1 KBblueminds
#30 1882108-mask_html-9.patch29.54 KBblueminds
#28 1882108-mask_html-8.interdiff.txt31.9 KBblueminds
#28 1882108-mask_html-8.patch26.78 KBblueminds
#26 1882108-mask_html-7.interdiff.txt4.35 KBblueminds
#26 1882108-mask_html-7.patch42.67 KBblueminds
#20 1882108-mask_html-6.interdiff.txt1.65 KBblueminds
#20 1882108-mask_html-6.patch42.33 KBblueminds
#18 1882108-mask_html-5.interdiff.txt8.42 KBblueminds
#18 1882108-mask_html-5.patch42.11 KBblueminds
#16 1882108-mask_html-4.patch40.68 KBblueminds
#14 1882108-mask_html-3.patch46.73 KBblueminds
#10 1882108-mask_html-2.patch40.16 KBblueminds
#10 1882108-mask_html-2.interdiff.txt4.95 KBblueminds
#4 1882108-mask_html-1.patch39.57 KBblueminds

Comments

miro_dietiker’s picture

Related: #1415234: Deal with text format permissions and security

Related: #1891840: Treat text formats differently such as HTML / markdown

First we should deal with text formats cleanly and only apply HTML specific tools if HTML applies.

However as a workaround we can for now try to detect html in all cases. This possibly should be a translator setting.
It will help trados display things cleanly, deal with translation memory right and finally prevent editors from destroying html.

ysavourel’s picture

> Since i don't know the proper implementation status of
> this feature in XLIFF compatible software, we'll need to check the adaption.

Here are some notes that may help:

There are only in a few rare cases of XLIFF-aware tools that may not understand the inline codes elements such as <bpt>/<ept>/<ph>/etc. The wide majority handle them properly.

Some tend to put all original codes in <ph>. But if you can make the distinction between opening/closing/placeholder types (e.g. <b>, </b>, <br/>) I would strongly recommend to use the proper XLIFF element for each type.

You can also use <g>...</g> and <x/> if you don't want to include the original code in the XLIFF document. But your merging tool will then need to know how to put the original codes back.

I also strongly recommend to handle segmented entries properly when merging back: by stripping <mrk mtype='seg'> elements' tags (not their content). Other <mrk> elements may also be present that you may want to handle to either feed them back into the content (e.g. a comment) or strip them out.

Rob1106’s picture

Quick input from our experience:
Our translator uses SDL Trados Studio 2011 which is not able to leave the html tags out of the translation tasks by default. This means that the translator can not do a proper word count and that makes it difficult for them to make a quote that matches the work that is done or has to be done.

Another thing is that the html tags are counted as words for a translation job inside the module itself. So this means that whenever the translator has a system that does leave out the html-tags you don't have a correct wordcount of the translation job to double-check the translator, which is pretty important as well.

So this would be a nice feature to have inside the tmgmt module itself as far as we are concerned. A nice 'should have' so to say.

blueminds’s picture

Issue summary: View changes
Status: Active » Needs review
StatusFileSize
new39.57 KB

- Provides a way to add "processors". A processor can take an action upon translation request and xliff import.
- The patch also includes html masking processor that will mask and unmask the html using and tags.

cgalli’s picture

Status: Needs review » Needs work

Installed the patch and tested.

When the 'Escape Html' box is checked in the translator settings, the job is masked with bpt and ept markup. Unfortunately, I do not have a program available to import the result to. So I cannot analyse the result.

Shouldn't the 'Escape HTML' setting not be available on the job level?

When the 'Escape HTML' box is NOT checked, the job does not come back after 'Submit to translator' (blank page)

BTW: It is super stupid that the preview for issues shows only the original body (which never changes) and not the comment (which is the only thing that changes)....progress!

Rob1106’s picture

That sounds pretty good. Can't wait to test it in a new release...

cgalli’s picture

@Rob

Can you please test the masking in Trados? You can test the patched version by clicking on this link:

http://simplytest.me/project/tmgmt/7.x-1.x?patch[]=https://drupal.org/fi...

Then choose 'Launch Sandbox' which will setup a tmgmt Site with the patch applied.

Or you apply the patch onto you own system.

Please do not forget to check the box 'Escape Html' in the translator's settings.

miro_dietiker’s picture

Rob, you should test it in -dev way before release (or even with applying the patch manually) once committed.
As we need feedback before we can release such a major feature.

cgalli: No, this is nothing that makes sense per job.
The tool / workflow in place at the translator company defines if we should mask or not.
This rule is a perfect fit for configuration per translator.
All jobs created for this translator should follow the same encoding patterns and thus make sure we do not forget to enable masking accidentally.

Masking should be the default as it will help to have clean translation memories, counters and a good translator experience in the tools.

blueminds’s picture

Status: Needs work » Needs review

@Rob1106 will you be able to test the patch? I do not have a tool to test it in real life, just followed specs. It needs to be tested in real life before we push it forward.

blueminds’s picture

StatusFileSize
new4.95 KB
new40.16 KB

Tested the attached patch with Trados studio. The basic workflow works, still needs improvements and fixing test coverage, however ready for trial ride and feedback.

Status: Needs review » Needs work

The last submitted patch, 10: 1882108-mask_html-2.patch, failed testing.

The last submitted patch, 10: 1882108-mask_html-2.patch, failed testing.

miro_dietiker’s picture

This sounds AWESOME! :-)
Let us test this tomorrow together.

We should discuss on what we consider plain XLIFF and what is a Trados specific implementation.
Thus, we might need to introduce vendor specific XLIFF dialects from the beginning.

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new46.73 KB

Attaching patch hopefully solving all issues. Ready for review.

miro_dietiker’s picture

+++ b/tests/testing_html/broken.html
@@ -0,0 +1,160 @@
+    <strong>Artikeldetails des Herstellers: </strong> <a href="

No commercial URLs and offers in test content please! Take example.com urls and some neutral text like Wikipedia...

Also we talked about adding some validation of the received XLIFF markup / masks which is related to:
#2042861: Validate tags

blueminds’s picture

StatusFileSize
new40.68 KB

Added documentation, import integrity check and removed two testing html files and instead moved testing HTML into single sample file.

The current status is that it works with Trados quite nice. Probably would make sense to try it out with different tool as well.

There is still one issue though, and that is img attributes - they are added as attributes of the placeholder element
without x-html: prefix. This is due to fact it fails validation in Trados then. Tried it out with xml: prefix, and that worked, but not sure if it is correct.

miro_dietiker’s picture

Status: Needs review » Needs work

Nice status here. Let me do some more extensive review.

  1. +++ b/tests/testing_html/sample.html
    @@ -0,0 +1,5 @@
    +<p>one paragraph with special characters: äöüľščťžýáíéäňú</p>
    

    I'm missing HTML entities like   © and other more crypto ones here. I don't know if it cleanly works through the stack.

  2. +++ b/translators/file/tmgmt_file.format.html.inc
    @@ -1,4 +1,4 @@
    -<?php
    +  <?php
    

    What?

  3. +++ b/translators/file/tmgmt_file.module
    @@ -36,14 +36,34 @@ function tmgmt_file_theme() {
     function tmgmt_file_import_form_submit($form, &$form_state) {
    

    I see way too much business logic inside this single _submit function. Why not create a separate clean method that processes the import overall XLIFF? Am i wrong?

  4. +++ b/translators/file/tmgmt_file.module
    @@ -36,14 +36,34 @@ function tmgmt_file_theme() {
    +      $validated_job = $importer_controller->validateImport($file->uri);
    +      // If we have processors do the validation for each of them.
    

    Only proceed if still valid here. I even thought of an early exit strategy for the abort cases. Anyway, why is this in the _submit function and not in the _validate? Dunno. :-)

  5. +++ b/translators/file/tmgmt_file.module
    @@ -36,14 +36,34 @@ function tmgmt_file_theme() {
    +      // If we have processors do the validation for each of them.
    
    +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,376 @@
    +   * {@inheritdoc}
    

    I guess, here you should exactly summarize what you are checking in the implementation. The inherited docs don't state this.

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new42.11 KB
new8.42 KB

comments implemented

miro_dietiker’s picture

Status: Needs review » Needs work
+++ b/tests/testing_html/sample.html
@@ -1,4 +1,4 @@
+<p>one paragraph with special characters: äöüľščťžýáíéäňú©«®™»</p>

Those are still special UTF-8 characters and no HTML entities that are already encoded like &copy;
We need to check what happens if (correctly) encoded entities are already part of the original data.

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new42.33 KB
new1.65 KB

here we go

The last submitted patch, 20: 1882108-mask_html-6.patch, failed testing.

berdir’s picture

First review.

Haven't reviewed everything yet, code looks quite good, but I'm a bit worried about complexity/the overall thing, how this new plugin type plays together with the different file formats and it's kind of XLIFF specific atm but then again not and it also kind of overlaps with the new escape feature that we introduced recently for translatable string placeholder escaping...

  1. +++ b/tests/tmgmt_test.module
    @@ -51,3 +51,15 @@ function tmgmt_test_tmgmt_source_suggestions(array $items, TMGMTJob $job) {
    +/**
    + * Implements hook_tmgmt_fle_text_processor_plugin_info().
    

    typo: fle

  2. +++ b/tests/tmgmt_test.text_processor.inc
    @@ -0,0 +1,80 @@
    +   * Will update translation text.
    

    first word should be third tense.

    Updates translation text.

    Should be enough, no need for will?

  3. +++ b/tests/tmgmt_test.text_processor.inc
    @@ -0,0 +1,80 @@
    +  protected function updateTranslation(&$data) {
    +    if (isset($data['#translation'])) {
    +      $data['#translation']['#text'] = str_replace('{processed}', '', $data['#translation']['#text']);
    +    }
    +    else {
    +      foreach (element_children($data) as $key) {
    +        $this->updateSourceText($data[$key]);
    +      }
    +    }
    +  }
    

    Shouldn't this call itself not the other method?

    Makes me wonder why the tests aren't failing ;)

  4. +++ b/translators/file/tmgmt_file.format.xliff.inc
    @@ -50,11 +50,11 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
         $this->writeAttribute('xml:lang', $this->job->getTranslator()->mapToRemoteLanguage($this->job->target_language));
    -    $this->text($element['#text']);
    +    $this->writeRaw(!empty($element['#translation']['#text']) ? $element['#translation']['#text'] : $element['#text']);
         $this->endElement();
    

    I don't remember why I added the source also as translation, does that actually make sense? :)

  5. +++ b/translators/file/tmgmt_file.format.xliff.inc
    @@ -135,7 +135,12 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
         $data = array();
         foreach ($xml->xpath('//xliff:trans-unit') as $unit) {
    -      $data[(string) $unit['id']]['#text'] = (string) $unit->target;
    +      // Load the result with XMLReader so that we are able to get the inner
    +      // xml.
    +      $reader = new XMLReader();
    +      $reader->XML($unit->target->asXML());
    +      $reader->read();
    +      $data[(string) $unit['id']]['#text'] = $reader->readInnerXml();
         }
         return tmgmt_unflatten_data($data);
    

    This looks quite inefficient, can we at least re-use the same class?

  6. +++ b/translators/file/tmgmt_file.module
    @@ -32,37 +32,92 @@ function tmgmt_file_theme() {
    +      try {
    +        $controller_class = tmgmt_file_text_processor_controller($name);
    +        /** @var TMGMTFileTextProcessorInterface $processor_controller */
    +        $processor_controller = new $controller_class;
    +        $processor_controller->validateJobTranslationUponImport($job, $file->uri);
    +      }
    +      catch (TMGMTException $e) {
    +        $job->addMessage($e->getMessage(), array(), 'error');
    +        $validated_job = FALSE;
    +        break;
    +      }
    

    Might be useful to have tests for the validation failure part here, haven't seen the tests yet but the test implementation is empty.

  7. +++ b/translators/file/tmgmt_file.module
    @@ -32,37 +32,92 @@ function tmgmt_file_theme() {
    +  if (!$validated_job) {
    +    $job->addMessage('Failed to validate file, import aborted.', array(), 'error');
    +    return FALSE;
    +  }
    +  elseif ($validated_job->tjid != $job->tjid) {
    +    $uri = $validated_job->uri();
    +    $label = $validated_job->label();
    +    $job->addMessage('Import file is from job <a href="@url">@label</a>, import aborted.', array('@url' => url($uri['path']), '@label' => $label));
    +    return FALSE;
    +  }
    +  else {
    

    You have both if/elseif/else *and* return FALSE.

    I think it would be easier to read if you just do return FALSE and have if () return FALSE; if (return FALSE) ... less nesting.

  8. +++ b/translators/file/tmgmt_file.module
    @@ -109,6 +167,43 @@ function tmgmt_file_format_controller($plugin = NULL) {
    + *
    + * @return TMGMTFileTextProcessorInterface
    + *   Either a specific text processor plugin controller instance or an array of
    + *   available controllers.
    + */
    +function tmgmt_file_text_processor_controller($plugin = NULL) {
    +  return _tmgmt_plugin_controller('file_text_processor', $plugin);
    +}
    

    Probably copied, but that's not true according to how it's used? there it's a string that's used as new $bla.

  9. +++ b/translators/file/tmgmt_file.plugin.inc
    @@ -29,6 +29,17 @@ class TMGMTFileTranslatorPluginController extends TMGMTDefaultTranslatorPluginCo
    +      // If we have processors do the source text processing.
    +      if ($text_processors = $job->getTranslator()->getSetting('text_processor_plugins')) {
    +        foreach ($text_processors as $name) {
    +          $controller_class = tmgmt_file_text_processor_controller($name);
    +          /** @var TMGMTFileTextProcessorInterface $controller */
    +          $controller = new $controller_class;
    +          $controller->processJobSourceUponRequest($job);
    +        }
    +      }
    +
    

    Isn't processing tightly coupled with the file format? XLIFF-transformation for HTML export isn't very useful :)

    I know, you just aren't supposed to select it, but that's not exactly great UX :)

  10. +++ b/translators/file/tmgmt_file.recursive_iterator.inc
    @@ -0,0 +1,93 @@
    + * Class used to iterate through DOMDocument().
    

    DOMDocument is a class, not a function, so no ().

  11. +++ b/translators/file/tmgmt_file.recursive_iterator.inc
    @@ -0,0 +1,93 @@
    +  /**
    +   * Current Position in DOMNodeList
    +   * @var Integer
    +   */
    

    position small, . at the end and missing empty line before @var

  12. +++ b/translators/file/tmgmt_file.recursive_iterator.inc
    @@ -0,0 +1,93 @@
    +   *   DOMNode to itterate over.
    

    t-to many t-t's :)

  13. +++ b/translators/file/tmgmt_file.recursive_iterator.inc
    @@ -0,0 +1,93 @@
    +   * @return DOMNode
    ...
    +   * @return RecursiveDOMIterator
    ...
    +   * @return Boolean
    ...
    +   * @return Integer
    

    int/bool, missing descriptions.

  14. +++ b/translators/file/tmgmt_file.test
    @@ -19,11 +19,185 @@ class TMGMTFileTestCase extends TMGMTBaseTestCase {
       function setUp() {
    -    parent::setUp(array('tmgmt_file', 'tmgmt_ui'));
    +    parent::setUp(array('tmgmt_file', 'tmgmt_ui', 'tmgmt_test'));
    

    tmgmt_test is always enabled in the base class.

  15. +++ b/translators/file/tmgmt_file.test
    @@ -19,11 +19,185 @@ class TMGMTFileTestCase extends TMGMTBaseTestCase {
       }
     
    +  function testTextProcessorAPI() {
    

    Missing docblock.

  16. +++ b/translators/file/tmgmt_file.text_processor.interface.inc
    @@ -0,0 +1,52 @@
    +  /**
    +   * Will process job source text.
    +   *
    +   * The implementation should do whatever necessary to process the source text.
    +   * Upon performing needed changes the implementation should take care of
    +   * persisting these changes.
    

    "Will" also unnecessary here.

    I don't really understand the last sentence, persist?

Closed review window accidently in the middle, will continue later

berdir’s picture

About the escape problem, the idea of the API there was that knowing what should be escaped and knowing how it needs to be escaped are two different things and we tried to separate it there, but then again the way html is escaped in xliff is connected, so that might not work well.

Don't have a good solution atm...

berdir’s picture

+++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
@@ -0,0 +1,381 @@
+ * To process the content DOMDocument object is used due to its ability to
+ * read broken HTML. This also implies that if broken HTML is in the source
+ * content the translation content will be fixed into the extend of DOMDocument
+ * abilities.

What exactly is source and translation content here? My understanding is that source is the actual node and fields and translation content is both the source text that we initially process and the translation that we get back, as both go through DOMDocument?

Wherever we make the split, we end up with a mismatch as we can't fix the actual source, be it our review form or the actual sources.

We can't change this, but maybe we should make this a bit more explicit, e.g. in the form of a description.

More tomorrow

berdir’s picture

  1. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +      $job_item->data = $data;
    

    Not sure if this can really be considered a public property and should be set like this, it happens to be public because the storage controller needs it.

    We have updateData(), but that's rather slow, not sure.

  2. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +   */
    ...
    +  /**
    +   * Will check the import integrity.
    +   *
    +   * It will count all sub elements of the trans-unit/target element and compare
    +   * it with the value stored during export.
    +   *
    +   * {@inheritdoc}
    

    @inheritdoc can't be mixed with additional comments. api.module doesn't support that.

    Instead of the second paragraph, you could move that inline and also add some more comments what you're doing there. The first seems unecessary. Also more uneccessary "will"'s :)

  3. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +
    +    foreach ($xml->xpath('//xliff:trans-unit') as $unit) {
    +      $count = 0;
    +      $reader = new XMLReader();
    ...
    +      }
    

    Would be important to test this with multiple job items and data items I think, to make sure that this logic here works correctly.

    Also again the mix of simplexml and XMLReader. Again should probably at least re-use the reader (if that is actually faster, not sure), the more performant way would probably be to only use one API. Not sure how feasable that is.

    All this is doing is counting opening tags, right (comments! :))

  4. +++ b/translators/file/tmgmt_file.test
    @@ -19,11 +19,185 @@ class TMGMTFileTestCase extends TMGMTBaseTestCase {
    +    $job->addItem('test_source', 'test_html', '1');
    ...
    +    // Note that here we use only simple HTML structure so that we can do the
    +    // source and translation comparison. If more complex content including also
    +    // html entities or other non ASCII chars is used the comparison might not
    +    // work due to XML processors manipulating with such chars.
    

    Other tests that require special data usually just override the default, that is IMHO easier to read than an implicit thing through the type. But we're doing it elsewere too, so not sure. Doing it here would make it easier to test multiple different structures. And I know it contradicts with what I'm saying below (setting data directly being problematic), but it's a test here :))

  5. +++ b/translators/file/tmgmt_file.test
    @@ -19,11 +19,185 @@ class TMGMTFileTestCase extends TMGMTBaseTestCase {
    +    $job->requestTranslation();
    +    // Make sure the masking process happened.
    +    $text = $this->findSourceText($job->getData());
    +    $this->assertTrue(strpos($text, '<bpt') !== FALSE);
    +    $this->assertTrue(strpos($text, '<ept') !== FALSE);
    

    Testing a bit more specific would make sense I think, e.g a whole string from opening to closing tag, not just a small bit of it.

  6. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +        if (isset($job_items[$tjiid])) {
    +          $label = $job_items[$tjiid]->label();
    +        }
    +        else {
    +          $label = t('unknown');
    +        }
    +        throw new TMGMTException(t('Integrity check for job item @label failed.', array('@label' => $label)));
    

    How could it end in the else case?

  7. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +          if ($reader->getAttribute('ctype') == 'image') {
    +            $text .= '<img';
    +            while ($reader->moveToNextAttribute()) {
    +              // @todo - we have to use x-html: prefixes for attributes.
    +              if ($reader->name != 'ctype' && $reader->name != 'id') {
    +                $text .= " {$reader->name}=\"{$reader->value}\"";
    +              }
    +            }
    

    Do we need to worry about escaping/encoding here?

  8. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +
    +  /**
    +   * Helper function to process the source text.
    +   *
    +   * @param array $data
    +   *   Job data array.
    +   */
    +  protected function processSource(&$data) {
    

    "Helper function to" is IMHO equally useless as "Will", "protected function" tells me that as well, what about "Masks HTML tags" or someting like that, that actually describes what it does?

    Also, either the two loop functions should have a default implementation with abstract functions for this processSource() thing, or we give it a better name, not "process".

  9. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +      $this->elementsCount = 0;
    

    Are you sure this is the right place to set elementsCount? Won't this set it back to 0 for every data item?

    Also, the trick with $this->elementsCount that's then used in the parent function seems... rather weird, I'd either return it (a bit more code but I think easier to understand) or pass down the job so that it can be immediately updated. The second is what I'd do if we have a default implementation for that loop.

  10. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +      $dit = new RecursiveIteratorIterator(
    +        new RecursiveDOMIterator($dom),
    +        RecursiveIteratorIterator::SELF_FIRST);
    

    Variable names are usually not shortened, can we name this $iterator? Or $dom_iterator?

  11. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +
    +        if ($node->nodeType === XML_ELEMENT_NODE) {
    +          // We have a new tag, increment the id.
    +          $id++;
    +          // Currently the incremental id and the elementCount has the same
    

    id in comments should be ID

  12. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +      }
    ...
    +      foreach ($node->attributes as $attribute) {
    +        $attributes[] = $attribute->name . '="' . $attribute->value . '"';
    ...
    +    $beginning_tag = '<' . $node->nodeName;
    +    if ($node->hasAttributes()) {
    +      $attributes = array();
    +      /** @var DOMAttr $attribute */
    ...
    +
    +      $beginning_tag .= ' '. implode(' ', $attributes);
    +    }
    +    $beginning_tag .= '>';
    +    $writer->startElement('bpt');
    

    Same here, do we need to handle escaping of attribute values? AFAIK, we do parse this again when we import it, no?

  13. +++ b/translators/file/tmgmt_file.text_processor.xliff_mask_html.inc
    @@ -0,0 +1,381 @@
    +    $writer->endElement();
    ...
    +    $writer->writeAttribute('ctype', 'image');
    +    foreach ($node->attributes as $attribute) {
    +      // @todo - uncomment when issue with Trados/sub elements fixed.
    +//      if (in_array($attribute->name, array('title', 'alt'))) {
    +//        continue;
    +//      }
    +      $writer->writeAttribute($attribute->name, $attribute->value);
    +    }
    +//    if ($alt_attribute = $node->getAttribute('alt')) {
    +//      $writer->startElement('sub');
    +//      $writer->writeAttribute('id', $id . '-img-alt');
    +//      $writer->writeAttribute('ctype', 'x-img-alt');
    +//      $writer->text($alt_attribute);
    +//      $writer->endElement();
    +//      $this->elementsCount++;
    +//    }
    +//    if ($title_attribute = $node->getAttribute('title')) {
    +//      $writer->startElement('sub');
    +//      $writer->writeAttribute('id', $id . '-img-title');
    +//      $writer->writeAttribute('ctype', 'x-img-title');
    +//      $writer->text($title_attribute);
    +//      $writer->endElement();
    +//      $this->elementsCount++;
    +//    }
    

    Comment characters need to be indendeted correctly as well, if this is will be committed like this which I understand it will.

  14. +++ b/translators/file/tmgmt_file.ui.inc
    @@ -44,6 +44,15 @@ class TMGMTFileTranslatorUIController extends TMGMTDefaultTranslatorUIController
     
    +    $text_processors = $translator->getSetting('text_processor_plugins');
    +    $form['text_processor_plugins'] = array(
    +      '#type' => 'checkboxes',
    +      '#title' => t('Text processors'),
    +      '#description' => t('Identify which text processors should be used during the import/export operations.'),
    +      '#options' => tmgmt_file_text_processor_plugin_labels(),
    +      '#default_value' => is_array($text_processors) ? $text_processors : array(),
    

    Ok, so this is a setting on the translator, so it can't be configured per job but you can configure the format per job...

That's it for now with specific feedback, going to discuss the whole thing with Miro asap.

blueminds’s picture

StatusFileSize
new42.67 KB
new4.35 KB

Here is patch that deals with characters that break xml structure. Have not yet done anything from reviews.

berdir’s picture

Status: Needs review » Needs work

Ok, to summarize, here is my main problem with the way the plugin system works right now: There's a disconnect between the actual implementation, the way it's called and how it works:

- It works on the job, giving the assumption that it's a generic thing, this also makes the implementations rather complicated (that foreach, which *could* be solved with a base implementation). With the exception of the validate thing, which receives a file name (without knowing anything about it)
- It's called from the file translator, no matter what's configured
- The actual implementation however blindly assumes that the source is HTML and the target will be XLIFF.

The result is that it's a one-off solution for this specific problem, masked as a more or less generic pluggable system.

I think there are 3 options as to what we could do:
1 Write a fancy and complex processing system, that would actually be generic. That would very likely involve introducing a intermediate markup language that the source is converted into and then the translator processes it back into whatever he needs. That would probably keep us busy for a year and is overkill for this issue :)
2. Change the plugin system so that it's specific to xliff, only execute those plugins if xliff is selected and update documentation/possibly methods.
3. Implement this as an actual one-off solution, we can keep the class but just call it directly.

I actual prefer 3, because 2 will just get in the way in case we actually implement something like 1 at some point and 3 would then be easier to incorporate into that new plugin system as it wouldn't be a public API before. We can also make arguments/methods match that specific use case and don't have to think what we need to do to support every possible use case.

Then there's the the problem with saving the data in those methods in the class. That is problematic, because sooner or later, we will introduce things like checking if the source still matches when you review a translation and at least warn if not. That will happen unless we apply the first part of the processing *before* we do that, which we can't unless we'd implement 1. Given that we don't need the fixed source as we have the tag count it's supposed to have (afaik), would it be easier to just run $data through the processor, without persisting it?

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new26.78 KB
new31.9 KB

Did not implement all the comments yet. Focused on your 3. and implemented it. I agree with you here, but the original specification said that it should be pluginable and that there should be an option to turn off masking.

Re that setting - i think it does not have a sense as that setting would be about "Export xliff" and "Export even more xliffed file".

Tested with Trados studio, works fine.

Please see the patch if you like the conceptual changes and will finish other comments later on.

The last submitted patch, 28: 1882108-mask_html-8.patch, failed testing.

blueminds’s picture

StatusFileSize
new29.54 KB
new39.1 KB

here is the final thing

miro_dietiker’s picture

Some feedback :-) (strange, my dreditor seems broken...)

+<p>and here we have some link <a href="http://example.com">break line</a></p>
I guess we should additionally test links with alt, title and both.

+ * Contains TMGMTTestTextProcessor.
I'm strongly missing links to the relevant standard documents about what recommendation we exactly implement. Such as http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-h...

+ 'label' => t('Test html source'),
HTML not html. And 100+ times more. ;-)

+ * Validates imported xliff file.
XLIFF, not xliff. And 10+ times more.

+    // The reason why we use DOMDocument object here and not just XMLReader
+    // is the DOMDocument's ability to deal with broken HTML.

I would be fine to define that HTML needs to be valid. However, TMGMT would thus require a source validation phase and an error might stop a job.

+ $non_pair_tags = array('br', 'img');
Usually, the parser knows what is paired and what is present as single tag. This is only required for validation - and there are quite some other possible non-pair-tags.

+ return $reader->readInnerXml();
Correct: readInnerXML()

blueminds’s picture

StatusFileSize
new30.14 KB
new7.85 KB

- Implemented comments
- Ensured element ids are unique in the whole xliff document

blueminds’s picture

StatusFileSize
new1.47 KB
new30.16 KB

make ids "nicer"

miro_dietiker’s picture

Status: Needs review » Needs work

So we realized that Trados modifies/drops CDATA and while is throws a fatal error when the import does not 100% contain tag masks in source and target side, it makes no guarantee all masked tags are placed.

Thus, failing hard in case the validation shows mismatch is a bad idea.
Instead we should provide a message about the issue, continue and provide more messages in detail during processing.

Later, TMGMT items / data items might get flags about their integrity that would block a user from accepting it.

Still, as initially specified, the whole feature needs to be optional and enabled through the translator. If disabled, there shouldn't be any change.
For future, we should persist documents of old exports and make sure the import stilll works with the new code using automatic test coverage.

EDIT: Sometimes HTML is still written small caps.

miro_dietiker’s picture

The status of #33 is currently being tested in production, meaning we have files created with that defined file format in production.
For the import, we need compatibility and graceful fail within the next few days.

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new34.45 KB
new7.31 KB
new9.49 KB

- Here comes the setting to enable/disable xliff processing
- HTML semantics integrity check that triggers only error messages but the job gets imported

Status: Needs review » Needs work

The last submitted patch, 36: 1882108-mask_html-12.interdiff-setting-flag.patch, failed testing.

The last submitted patch, 36: 1882108-mask_html-12.interdiff-validation.patch, failed testing.

blueminds’s picture

36: 1882108-mask_html-12.patch queued for re-testing.

blueminds’s picture

Status: Needs work » Needs review
berdir’s picture

  1. +++ b/translators/file/tmgmt_file.format.xliff.inc
    @@ -347,12 +346,14 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
    @@ -371,7 +372,11 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
    
    @@ -371,7 +372,11 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
     
         $tray = array();
         $non_pair_tags = array('br', 'img');
    -    $elements_count = 0;
    +
    +    if (!isset($this->job->settings['xliff_validation'])) {
    +      $this->job->settings['xliff_validation'] = array();
    +    }
    +    $xliff_validation = &$this->job->settings['xliff_validation'];
    

    Can you explain why doing this with a by-ref assignment, instead of setting it again at the end? This would be more complicated to port to 8.x (where we will likely have more methods to work with properties/fields) and you don't really save code as the explanation is as long as the code would be :)

  2. +++ b/translators/file/tmgmt_file.format.xliff.inc
    @@ -432,8 +440,8 @@ class TMGMTFileformatXLIFF extends XMLWriter implements TMGMTFileFormatInterface
     
    -    // Set the elements count for the current item into job settings.
    -    $this->job->settings['elements_counts'][$tjiid] = $elements_count;
    +    // Set the xliff_validation data that were set into the
    +    // $this->job->settings['xliff_validation'] array().
         $this->job->save();
    

    The comment also doesn't make too much sense right now, as we don't set anything, we just save it. array() should also be just array.

  3. +++ b/translators/file/tmgmt_file.test
    @@ -120,31 +120,25 @@ class TMGMTFileTestCase extends TMGMTBaseTestCase {
    -      if ($message->getMessage() == t('Failed to validate file, import aborted.')) {
    +      debug($message->getMessage());
    +      if ($message->getMessage() == t('Failed to validate semantic integrity of %key element. Please check also the HTML code of the element in the review process.', array('%key' => '2][dummy][deep_nesting'))) {
    

    Left-over debug.

blueminds’s picture

StatusFileSize
new34.41 KB
new1.74 KB

fixed

miro_dietiker’s picture

Status: Needs review » Needs work

Masking import needs to be fault tolerant.
We will accept imports with wrong data... but should throw a message for every item.
In case it's possible, add it to the (data) item comment.

blueminds’s picture

Status: Needs work » Needs review

this is already implemented

blueminds’s picture

StatusFileSize
new34.41 KB

reroll

miro_dietiker’s picture

Status: Needs review » Needs work

OK, tested this.

Works cleanly now.
However, error messages should be assigned to the item. Not to the job only.

Also validation error message is not that friendly and could be improved by e.g. outputting the tag that is the problem and less "bla" text.
And as a followup, the new mrk tag Trados now uses is a good thing we should support in future (followup) also for our own processing.

blueminds’s picture

Status: Needs work » Needs review
StatusFileSize
new34.56 KB
new2.14 KB

here we go

berdir’s picture

Title: mask HTML markup in XLIFF » Change record: mask HTML markup in XLIFF
Category: Feature request » Task
Status: Needs review » Active
Issue tags: +Needs change record

Ok, committed and pushed!

Given that this is a pretty major change, let's document this change as a change notice: https://drupal.org/list-changes/tmgmt

anybody’s picture

Sorry, but I have a very important question and I can't find an answer in the
documentation yet: I get the result back from the translator as escaped HTML
and import it, but it doesn't seem to be converted back to HTML at any point
of time. How can I solve that? What am I doing wrong? :(

megadesk3000’s picture

Hello miro and berdir

I have the exact same problem like "Anybody" ;) In the xlif is htmlentites-encoded markup and i want this to converted back to the original HTML-Markup on import. Can you give me a hint how this can be achieved ?

many thanks for your answer
Jan

P.S. Here is an example:

<trans-unit id="53][field_text][0][value" resname="53][field_text][0][value">
	<source xml:lang="fr">&lt;strong&gt;This is text in france&lt;/strong&gt;</source>
	<target xml:lang="nl">&lt;strong&gt;This is text in dutch&lt;/strong&gt;</target>
	<note>Answer</note>
</trans-unit>
anybody’s picture

Any progress or even an idea here?

dragonfire353’s picture

Here's a quick fix for new translation imports for 7.x1.0-rc1.

Add this to entity/tmgmt.entity.job_item.inc on line 630 or if that isn't right in function addTranslatedDataRecursive near the end before $this->updateData:

$values['#translation']['#text'] = str_replace('&lt;', '<', $values['#translation']['#text']);
$values['#translation']['#text'] = str_replace('&gt;', '>', $values['#translation']['#text']);

Again, quick fix so use at your own risk.

anybody’s picture

StatusFileSize
new3.03 KB

Thanks a lot @dragonfire535 for your quickfix.
The problem with it is that there are several further HTML entities that should be decoded like &nbsp; and other ascii characters that may be contained in the text.

I've created and attached a patch that uses html_entity_decode() to do that in the last step. You can use it as quick fix.

Anyway I think we should discuss
to add a setting or permission to enable this HTML encoding
add a function to handle the text securely, at least add filter_xss_admin (see patch) or custom filter_xss (See: https://www.drupal.org/node/28984)
Further security implications because code injection will definitely be possible if we're not careful here.

Anyway this has to be fixed because otherwise a proper HTML import is not possible otherwise.

Could a tmgmt maintainer and/or a member of the Drupal security team have a look at this perhaps finally?

And finally for clarification: You need the patch from #42 plus this one to have the full export and import functionality for HTML translations!

anybody’s picture

Status: Active » Needs review
anybody’s picture

StatusFileSize
new1.43 KB

Removed unrelated code style fixes from the patch.

a.milkovsky’s picture

@Anybody, thank you for the path #55. It was a good hint for me in the Drupal 8 version as well.
I vote +1 for RTBC.

kristen pol’s picture

Status: Needs review » Needs work

Thanks for the patch! It is working for us but a couple nitpicks below. Cheers.

  1. +++ b/entity/tmgmt.entity.job_item.inc
    @@ -627,6 +627,10 @@ class TMGMTJobItem extends Entity {
    +        if(!empty($values['#translation']['#text'])) {
    

    Need a space after if.

  2. +++ b/entity/tmgmt.entity.job_item.inc
    @@ -627,6 +627,10 @@ class TMGMTJobItem extends Entity {
    +          // If the text contains HTML entities, decode them to have proper HTML back in the content.
    

    Comment should wrap at 80 characters (right after HTML without leaving a dangling space).