XLIFF defines masking techniques for dealing with markup such as HTML inside content. Masked items can thus be protected from accidental manipulation by the translator.
http://www.maxprograms.com/articles/xliff.html
Check these quotes:
<bpt> Begin paired tag: The <bpt> element is used to delimit the beginning of a paired sequence of native codes. Each <bpt> has a corresponding <ept> element within the translation unit.
<ept> End paired tag: The <ept> element is used to delimit the end of a paired sequence of native codes. Each <ept> has a corresponding <bpt> element within the translation unit.
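For illustration, here is a hypothetical trans-unit (content made up) showing how a pair of HTML bold tags could be masked with <bpt>/<ept>; the escaped original codes travel inside the masks, linked by matching ids:

```xml
<trans-unit id="1">
  <source>Click <bpt id="1">&lt;b&gt;</bpt>here<ept id="1">&lt;/b&gt;</ept> to continue.</source>
  <target>Klicken Sie <bpt id="1">&lt;b&gt;</bpt>hier<ept id="1">&lt;/b&gt;</ept>, um fortzufahren.</target>
</trans-unit>
```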
With these masking features, we could make sure that HTML is not destroyed by translators.
Since I don't know the implementation status of this feature in XLIFF-compatible software, we'll need to check its adoption.
Note that most translation software providers don't implement raw XLIFF, but only subsets or even dialects.
Comments
Comment #1
miro_dietiker
Related: #1415234: Deal with text format permissions and security
Related: #1891840: Treat text formats differently such as HTML / markdown
First we should deal with text formats cleanly and only apply HTML-specific tools where HTML applies.
However, as a workaround, we can for now try to detect HTML in all cases. This should possibly be a translator setting.
It will help Trados display things cleanly, handle translation memory correctly, and finally prevent editors from destroying HTML.
Comment #2
ysavourel commented
> Since i don't know the proper implementation status of this feature in XLIFF compatible software, we'll need to check the adaption.
Here are some notes that may help:
There are only a few rare cases of XLIFF-aware tools that may not understand the inline code elements such as <bpt>/<ept>/<ph>/etc. The wide majority handle them properly.
Some tend to put all original codes in <ph>. But if you can make the distinction between opening/closing/placeholder types (e.g. <b>, </b>, <br/>), I would strongly recommend using the proper XLIFF element for each type.
You can also use <g>...</g> and <x/> if you don't want to include the original code in the XLIFF document. But your merging tool will then need to know how to put the original codes back.
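To sketch the difference (hypothetical content): with <bpt>/<ept>/<ph> the original codes travel inside the XLIFF, while <g>/<x/> only reference them and a skeleton is needed to restore them on merge:

```xml
<!-- Original codes carried inline: -->
<source>Line one<ph id="1">&lt;br/&gt;</ph><bpt id="2">&lt;b&gt;</bpt>bold<ept id="2">&lt;/b&gt;</ept></source>

<!-- Original codes kept outside the XLIFF document: -->
<source>Line one<x id="1"/><g id="2">bold</g></source>
```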
I also strongly recommend to handle segmented entries properly when merging back: by stripping <mrk mtype='seg'> elements' tags (not their content). Other <mrk> elements may also be present that you may want to handle to either feed them back into the content (e.g. a comment) or strip them out.
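For example, a segmented target might look like this (made-up content); on merge, only the <mrk mtype="seg"> tags are stripped while their content is kept:

```xml
<target>
  <mrk mtype="seg" mid="1">First sentence.</mrk>
  <mrk mtype="seg" mid="2">Second sentence.</mrk>
</target>
<!-- merged back: First sentence. Second sentence. -->
```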
Comment #3
Rob1106 commented
Quick input from our experience:
Our translator uses SDL Trados Studio 2011, which is not able to leave the HTML tags out of the translation tasks by default. This means that the translator cannot do a proper word count, which makes it difficult for them to make a quote that matches the work that has to be done.
Another thing is that the HTML tags are counted as words for a translation job inside the module itself. So whenever the translator has a system that does leave out the HTML tags, you don't have a correct word count of the translation job to double-check the translator, which is pretty important as well.
So this would be a nice feature to have inside the tmgmt module itself as far as we are concerned. A nice 'should have' so to say.
Comment #4
blueminds commented
- Provides a way to add "processors". A processor can take action upon translation request and XLIFF import.
- The patch also includes an HTML masking processor that masks and unmasks the HTML using <bpt> and <ept> tags.
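As a rough sketch of what such a masking processor does (this is not the patch's code, just a simplified Python illustration assuming well-formed input): paired tags get matching ids via a stack, standalone tags such as <br/> and <img> become <ph> placeholders, and the original tag travels escaped inside the mask.

```python
import html
import re

def mask_html(text):
    """Mask HTML tags as XLIFF <bpt>/<ept>/<ph> elements (simplified sketch)."""
    counter = 0
    stack = []   # (tag name, assigned id) for currently open tags
    out = []
    pos = 0
    for m in re.finditer(r'<(/?)(\w+)([^>]*?)(/?)>', text):
        # Escape the plain text between tags.
        out.append(html.escape(text[pos:m.start()]))
        closing, name, _attrs, selfclose = m.groups()
        escaped = html.escape(m.group(0))
        if selfclose or name in ('br', 'img'):
            # Standalone tag: placeholder element.
            counter += 1
            out.append('<ph id="%d">%s</ph>' % (counter, escaped))
        elif not closing:
            # Opening tag: begin-paired-tag mask, remember its id.
            counter += 1
            stack.append((name, counter))
            out.append('<bpt id="%d">%s</bpt>' % (counter, escaped))
        else:
            # Closing tag: reuse the id of the matching opening tag.
            tag_id = next(i for n, i in reversed(stack) if n == name)
            stack.remove((name, tag_id))
            out.append('<ept id="%d">%s</ept>' % (tag_id, escaped))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return ''.join(out)
```

Unmasking on import is the reverse: drop each <bpt>/<ept>/<ph> wrapper and unescape its content back into the translated text.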
Comment #5
cgalli commented
Installed the patch and tested.
When the 'Escape HTML' box is checked in the translator settings, the job is masked with bpt and ept markup. Unfortunately, I do not have a program available to import the result into, so I cannot analyse the result.
Shouldn't the 'Escape HTML' setting also be available on the job level?
When the 'Escape HTML' box is NOT checked, the job does not come back after 'Submit to translator' (blank page).
BTW: It is super stupid that the preview for issues shows only the original body (which never changes) and not the comment (which is the only thing that changes)....progress!
Comment #6
Rob1106 commented
That sounds pretty good. Can't wait to test it in a new release...
Comment #7
cgalli commented
@Rob
Can you please test the masking in Trados? You can test the patched version by clicking on this link:
http://simplytest.me/project/tmgmt/7.x-1.x?patch[]=https://drupal.org/fi...
Then choose 'Launch Sandbox', which will set up a tmgmt site with the patch applied.
Or you can apply the patch to your own system.
Please do not forget to check the 'Escape Html' box in the translator's settings.
Comment #8
miro_dietiker
Rob, you should test it in -dev way before the release (or even by applying the patch manually once it's committed), as we need feedback before we can release such a major feature.
cgalli: No, this is nothing that makes sense per job.
The tool / workflow in place at the translator company defines whether we should mask or not.
This rule is a perfect fit for per-translator configuration.
All jobs created for a translator should follow the same encoding patterns, which also makes sure we do not accidentally forget to enable masking.
Masking should be the default as it will help to have clean translation memories, counters and a good translator experience in the tools.
Comment #9
blueminds commented
@Rob1106 will you be able to test the patch? I do not have a tool to test it in real life; I just followed the specs. It needs real-life testing before we push it forward.
Comment #10
blueminds commented
Tested the attached patch with Trados Studio. The basic workflow works; it still needs improvements and fixed test coverage, but it is ready for a trial ride and feedback.
Comment #13
miro_dietiker
This sounds AWESOME! :-)
Let us test this tomorrow together.
We should discuss on what we consider plain XLIFF and what is a Trados specific implementation.
Thus, we might need to introduce vendor specific XLIFF dialects from the beginning.
Comment #14
blueminds commented
Attaching a patch that hopefully solves all issues. Ready for review.
Comment #15
miro_dietiker
No commercial URLs and offers in test content please! Use example.com URLs and some neutral text like Wikipedia...
Also we talked about adding some validation of the received XLIFF markup / masks which is related to:
#2042861: Validate tags
Comment #16
blueminds commented
Added documentation and an import integrity check; removed the two testing HTML files and moved the testing HTML into a single sample file.
The current status is that it works quite nicely with Trados. It would probably make sense to try it out with a different tool as well.
There is still one issue though, and that is img attributes: they are added as attributes of the placeholder element without the x-html: prefix, because with the prefix the file fails validation in Trados. Tried it with the xml: prefix, and that worked, but I am not sure whether that is correct.
Comment #17
miro_dietiker
Nice status here. Let me do some more extensive review.
I'm missing HTML entities like © and other more cryptic ones here. I don't know if they work cleanly through the stack.
What?
I see way too much business logic inside this single _submit function. Why not create a separate clean method that processes the overall XLIFF import? Am I wrong?
Only proceed if still valid here. I even thought of an early-exit strategy for the abort cases. Anyway, why is this in the _submit function and not in _validate? Dunno. :-)
I guess you should summarize here exactly what you are checking in the implementation. The inherited docs don't state this.
Comment #18
blueminds commented
Comments implemented.
Comment #19
miro_dietiker
Those are still plain UTF-8 characters, not HTML entities that are already encoded, such as &copy;.
We need to check what happens if (correctly) encoded entities are already part of the original data.
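A small Python sketch of the round-trip concern (not the module's code): an entity that is already encoded in the original data must be escaped once more for the XML payload, so that a single unescape on import restores exactly the original string.

```python
import html

source = 'Copyright &copy; 2013'   # entity already encoded in the original data

# Escaping for the XML payload turns '&' into '&amp;', keeping the entity literal:
xliff_text = html.escape(source)    # 'Copyright &amp;copy; 2013'

# A single unescape on import restores the original entity, not the © character:
restored = html.unescape(xliff_text)
assert restored == source
```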
Comment #20
blueminds commented
Here we go.
Comment #22
berdir
First review.
Haven't reviewed everything yet. The code looks quite good, but I'm a bit worried about complexity and the overall design: how this new plugin type plays together with the different file formats. It's kind of XLIFF-specific atm, but then again not, and it also kind of overlaps with the new escape feature that we introduced recently for translatable string placeholder escaping...
typo: fle
First word should be third person.
Updates translation text.
Should be enough, no need for will?
Shouldn't this call itself, not the other method?
Makes me wonder why the tests aren't failing ;)
I don't remember why I added the source also as translation, does that actually make sense? :)
This looks quite inefficient, can we at least re-use the same class?
Might be useful to have tests for the validation failure part here, haven't seen the tests yet but the test implementation is empty.
You have both if/elseif/else *and* return FALSE.
I think it would be easier to read if you just do return FALSE and have if () return FALSE; if (return FALSE) ... less nesting.
Probably copied, but that's not true according to how it's used? there it's a string that's used as new $bla.
Isn't processing tightly coupled with the file format? XLIFF-transformation for HTML export isn't very useful :)
I know, you just aren't supposed to select it, but that's not exactly great UX :)
DOMDocument is a class, not a function, so no ().
position small, . at the end and missing empty line before @var
t-to many t-t's :)
int/bool, missing descriptions.
tmgmt_test is always enabled in the base class.
Missing docblock.
"Will" also unnecessary here.
I don't really understand the last sentence, persist?
Closed the review window accidentally in the middle, will continue later.
Comment #23
berdir
About the escape problem: the idea of the API there was that knowing what should be escaped and knowing how it needs to be escaped are two different things, and we tried to separate them. But then again, the way HTML is escaped in XLIFF connects the two, so that might not work well.
Don't have a good solution atm...
Comment #24
berdir
What exactly is source and translation content here? My understanding is that source is the actual node and fields, and translation content is both the source text that we initially process and the translation that we get back, as both go through DOMDocument?
Wherever we make the split, we end up with a mismatch as we can't fix the actual source, be it our review form or the actual sources.
We can't change this, but maybe we should make this a bit more explicit, e.g. in the form of a description.
More tomorrow
Comment #25
berdir
Not sure if this can really be considered a public property and should be set like this; it happens to be public because the storage controller needs it.
We have updateData(), but that's rather slow, not sure.
@inheritdoc can't be mixed with additional comments. api.module doesn't support that.
Instead of the second paragraph, you could move that inline and also add some more comments about what you're doing there. The first seems unnecessary. Also more unnecessary "will"s :)
Would be important to test this with multiple job items and data items I think, to make sure that this logic here works correctly.
Also again the mix of SimpleXML and XMLReader. We should probably at least re-use the reader (if that is actually faster, not sure); the more performant way would probably be to use only one API. Not sure how feasible that is.
All this is doing is counting opening tags, right (comments! :))
Other tests that require special data usually just override the default; that is IMHO easier to read than an implicit thing through the type. But we're doing it elsewhere too, so not sure. Doing it here would make it easier to test multiple different structures. And I know it contradicts what I'm saying below (setting data directly being problematic), but it's a test here :))
Testing a bit more specific would make sense I think, e.g a whole string from opening to closing tag, not just a small bit of it.
How could it end in the else case?
Do we need to worry about escaping/encoding here?
"Helper function to" is IMHO equally useless as "Will", "protected function" tells me that as well, what about "Masks HTML tags" or someting like that, that actually describes what it does?
Also, either the two loop functions should have a default implementation with abstract functions for this processSource() thing, or we give it a better name, not "process".
Are you sure this is the right place to set elementsCount? Won't this set it back to 0 for every data item?
Also, the trick with $this->elementsCount that's then used in the parent function seems... rather weird, I'd either return it (a bit more code but I think easier to understand) or pass down the job so that it can be immediately updated. The second is what I'd do if we have a default implementation for that loop.
Variable names are usually not shortened, can we name this $iterator? Or $dom_iterator?
id in comments should be ID
Same here, do we need to handle escaping of attribute values? AFAIK, we do parse this again when we import it, no?
Comment characters need to be indented correctly as well, if this will be committed like this, which I understand it will.
Ok, so this is a setting on the translator, so it can't be configured per job but you can configure the format per job...
That's it for now with specific feedback, going to discuss the whole thing with Miro asap.
Comment #26
blueminds commented
Here is a patch that deals with characters that break the XML structure. I have not yet addressed anything from the reviews.
Comment #27
berdir
Ok, to summarize, here is my main problem with the way the plugin system works right now: there's a disconnect between the actual implementation, the way it's called, and how it works:
- It works on the job, giving the impression that it's a generic thing; this also makes the implementations rather complicated (that foreach, which *could* be solved with a base implementation). The exception is the validate thing, which receives a file name (without knowing anything about it).
- It's called from the file translator, no matter what's configured
- The actual implementation however blindly assumes that the source is HTML and the target will be XLIFF.
The result is that it's a one-off solution for this specific problem, masked as a more or less generic pluggable system.
I think there are 3 options as to what we could do:
1. Write a fancy and complex processing system that would actually be generic. That would very likely involve introducing an intermediate markup language that the source is converted into, which the translator then processes back into whatever is needed. That would probably keep us busy for a year and is overkill for this issue :)
2. Change the plugin system so that it's specific to xliff, only execute those plugins if xliff is selected and update documentation/possibly methods.
3. Implement this as an actual one-off solution, we can keep the class but just call it directly.
I actually prefer 3, because 2 will just get in the way in case we actually implement something like 1 at some point, and 3 would then be easier to incorporate into that new plugin system, as it wouldn't have been a public API before. We can also make arguments/methods match this specific use case and don't have to think about what we need to do to support every possible use case.
Then there's the problem with saving the data in those methods in the class. That is problematic because, sooner or later, we will introduce things like checking whether the source still matches when you review a translation, and at least warn if not. That will happen unless we apply the first part of the processing *before* that check, which we can't unless we'd implement 1. Given that we don't need the fixed source, as we have the tag count it's supposed to have (afaik), would it be easier to just run $data through the processor without persisting it?
Comment #28
blueminds commented
Did not implement all the comments yet. Focused on your option 3 and implemented it. I agree with you here, but the original specification said that it should be pluggable and that there should be an option to turn off masking.
Re that setting: I think it does not make sense, as the choice would effectively be between "export XLIFF" and "export an even more XLIFF-ed file".
Tested with Trados Studio, works fine.
Please see the patch and tell me whether you like the conceptual changes; I will finish the other comments later on.
Comment #30
blueminds commented
Here is the final thing.
Comment #31
miro_dietiker
Some feedback :-) (strange, my dreditor seems broken...)
+<p>and here we have some link <a href="http://example.com">break line</a></p>
I guess we should additionally test links with alt, title and both.
+ * Contains TMGMTTestTextProcessor.
I'm strongly missing links to the relevant standard documents about which recommendation we exactly implement, such as http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-h...
+ 'label' => t('Test html source'),
HTML, not html. And 100+ times more. ;-)
+ * Validates imported xliff file.
XLIFF, not xliff. And 10+ times more.
I would be fine with defining that the HTML needs to be valid. However, TMGMT would then require a source validation phase, and an error might stop a job.
+ $non_pair_tags = array('br', 'img');
Usually the parser knows what is paired and what is present as a single tag. This is only required for validation, and there are quite a few other possible non-paired tags.
+ return $reader->readInnerXml();
Correct: readInnerXML()
Comment #32
blueminds commented
- Implemented comments
- Ensured element IDs are unique in the whole XLIFF document
Comment #33
blueminds commented
Made IDs "nicer".
Comment #34
miro_dietiker
So we realized that Trados modifies/drops CDATA, and while it throws a fatal error when the import does not contain 100% of the tag masks on both the source and target side, it makes no guarantee that all masked tags are placed.
Thus, failing hard in case the validation shows mismatch is a bad idea.
Instead we should provide a message about the issue, continue and provide more messages in detail during processing.
Later, TMGMT items / data items might get flags about their integrity that would block a user from accepting it.
Still, as initially specified, the whole feature needs to be optional and enabled through the translator. If disabled, there shouldn't be any change.
For the future, we should persist documents of old exports and make sure the import still works with the new code, using automatic test coverage.
EDIT: Sometimes HTML is still written in lowercase.
Comment #35
miro_dietiker
The status of #33 is currently being tested in production, meaning we have files created with that defined file format in production.
For the import, we need compatibility and graceful fail within the next few days.
Comment #36
blueminds commented
- Here comes the setting to enable/disable XLIFF processing
- An HTML semantics integrity check that only triggers error messages; the job still gets imported
Comment #39
blueminds commented
36: 1882108-mask_html-12.patch queued for re-testing.
Comment #40
blueminds commented
Comment #41
berdir
Can you explain why this is done with a by-ref assignment, instead of setting it again at the end? This would be more complicated to port to 8.x (where we will likely have more methods to work with properties/fields), and you don't really save code, as the explanation is as long as the code would be :)
The comment also doesn't make much sense right now, as we don't set anything, we just save it. array() should also be just array.
Left-over debug.
Comment #42
blueminds commented
Fixed.
Comment #43
miro_dietiker
Masking import needs to be fault tolerant.
We will accept imports with wrong data... but should throw a message for every item.
In case it's possible, add it to the (data) item comment.
Comment #44
blueminds commented
This is already implemented.
Comment #45
blueminds commented
Reroll.
Comment #46
miro_dietiker
OK, tested this.
Works cleanly now.
However, error messages should be assigned to the item, not only to the job.
Also, the validation error message is not that friendly; it could be improved by e.g. outputting the problematic tag and using less "bla" text.
And as a follow-up, the new mrk tag that Trados now uses is a good thing we should support in the future for our own processing as well.
Comment #47
blueminds commented
Here we go.
Comment #48
berdir
Ok, committed and pushed!
Given that this is a pretty major change, let's document this change as a change notice: https://drupal.org/list-changes/tmgmt
Comment #49
anybody
Sorry, but I have a very important question and I can't find an answer in the documentation yet: I get the result back from the translator as escaped HTML and import it, but it doesn't seem to be converted back to HTML at any point. How can I solve that? What am I doing wrong? :(
Comment #50
megadesk3000 commented
Hello miro and berdir
I have the exact same problem as "Anybody" ;) The XLIFF contains HTML-entity-encoded markup, and I want it converted back to the original HTML markup on import. Can you give me a hint how this can be achieved?
Many thanks for your answer
Jan
P.S. Here is an example:
Comment #51
anybody
Any progress or even an idea here?
Comment #52
dragonfire353 commented
Here's a quick fix for new translation imports for 7.x-1.0-rc1.
Add this to entity/tmgmt.entity.job_item.inc at line 630, or, if that isn't right, in the function addTranslatedDataRecursive near the end, before $this->updateData:
Again, quick fix so use at your own risk.
Comment #53
anybody
Thanks a lot @dragonfire353 for your quickfix.
The problem with it is that there are several further HTML entities and other ASCII characters that may be contained in the text and should be decoded as well.
I've created and attached a patch that uses html_entity_decode() to do that in the last step. You can use it as a quick fix.
Anyway, I think we should discuss:
- adding a setting or permission to enable this HTML encoding
- adding a function to handle the text securely, at least filter_xss_admin (see patch) or a custom filter_xss (see: https://www.drupal.org/node/28984)
- further security implications, because code injection will definitely be possible if we're not careful here.
Anyway, this has to be fixed, because otherwise a proper HTML import is not possible.
Could a tmgmt maintainer and/or a member of the Drupal security team have a look at this perhaps finally?
And finally for clarification: You need the patch from #42 plus this one to have the full export and import functionality for HTML translations!
Comment #54
anybody
Comment #55
anybody
Removed unrelated code style fixes from the patch.
Comment #56
a.milkovsky
@Anybody, thank you for the patch in #55. It was a good hint for me for the Drupal 8 version as well.
I vote +1 for RTBC.
Comment #57
kristen pol
Thanks for the patch! It is working for us, but a couple of nitpicks below. Cheers.
Need a space after if.
Comment should wrap at 80 characters (right after HTML without leaving a dangling space).