Guys,

I did some changes on locale module.

  1. i redesigned the PO import/export system. Its now more generic. New types of translation files, like llXML, TMX or XLIFF, will now be possible, using its new a small "API".
  2. changed the PO files creation that was completely wrong. I was not producing the correct PO translations to languages that has more than one plural.
  3. as i posted on http://groups.drupal.org/node/4258, i created a new module (called Live Translation) that allows users to automatically update all strings from a server. Drupal.org can host a online translation interface and all community will be allowed to contribute. It breaks the each-developer-is-reponsible-for-your-module-translation paradigm. The user front-end is also implemented in this patch. As the Live Translation module (the server-side parts) is getting more close to 1.0, i will make few adjustments on this later.

regards,

massa

CommentFileSizeAuthor
#10 locale_22.patch35.21 KBbrmassa
#4 locale_21.patch31.64 KBbrmassa
locale_20.patch38.58 KBbrmassa

Comments

recidive’s picture

Status: Active » Needs review

Wow! That's a big patch. Needs review though ;)

gábor hojtsy’s picture

Status: Needs review » Needs work

OK, for us to be able to review this, *do not* include three different changes in one patch! Either have them in order, or *at most* the first and second one together, which have an immediate likeliness to get into core, while the third one will be likely target for debates.

gábor hojtsy’s picture

By the way, a few remarks to help you quickly get to better results:

- we have implemented the PO importer with a stream oriented parser with callbacks for each string, because parsing a big PO file into memory (into a big array), and then passing that array around is *very memory intensive* See this issue with explanation and even a graph: http://drupal.org/node/47610 ... so for this reason, this is rather implemented with not one big array but a SAX-like parser. I worked so much on this more then a year ago... It would be a pity to have this memory footprint issue back in again...
- this patch seems to be quite rough, you have stuff commented out, you have calls to _locale_import_message() removed, but not the function itself, and so on...
- you also removed the file name from most errors as the data source might not be a real file, *but* when importing mutliple files in Drupal 6 with the built in batch features, how would a user tell what file to look into/replace/fix/complain about, if he had an error in a file? some creative solution here would be required.
- there is lot of code which breaks Drupal coding standard, like this one does on nearly every occassion:

+      if ($last) {$info["project"] = $last;}
+      else {continue;}
brmassa’s picture

Status: Needs work » Needs review
StatusFileSize
new31.64 KB

Gábor,

your right. here the "Import/Export Translation API"

regards,

massa

brmassa’s picture

Gábor,

sorry, while i was writing, you posted another comment:
1* im not familiar with SAX, but using an array, my intention is to provide a atomic operation: or EVERYTHING or NOTHING. PO files doesnt have many problems, but on XML files, if an error is discovered on the middle of the operation, everything can be defected.
2* _locale_import_message() was deleted since the API is not based on files, but contents. Even not implemented on this patch, its possible to import a translation only by copying the PO content on a textearea. its useful when implementing some kinda validator or other operations.
3* the same as before. i dont know how to proceed.
4* i totally agree about the Code standards and i used to adopt it. i used the D5 locale.module, so this error was there too. im gonna fix it. (BTW: was the coder module ported to D6?)

cheers,

massa

gábor hojtsy’s picture

Title: Live Translation and Import/Export Translation » Refactor locale import/export API to support non-local files and non-gettext formats

Changed title to reflect the issue. Awaiting fixes for my remarks.

brmassa’s picture

Gárbor,

there are two trade offs against performance: the operation will be not atomic and creates a too concatenated process.

  1. I read the post about performance (http://drupal.org/node/47610) ... well... if dries and derek agrees about errors on the middle or broken files... well ok so.
  2. with concatenated process like this, it will be impossible to acess only a few parts of the code. i divided the import/export process in three logical parts: file reading / converting / saving. on importing, the file reading process can be skipped if you have the content on a string variable or from internet; on exporting, you can skip the file creation if you need to show the PO file on screen; and so on... and if i integrate the converting and the saving phase, it will be not possible to allow other modules work with the strings. in other words, its will transform an API into a one-purpose process.

And generally language imports/exports are not oftem and doesnt use big files.

if you really prefer i do this, it you be easy. so... you decide.

cheers,

massa

PS: your last post was on the same time as mine. take a look there

gábor hojtsy’s picture

1* im not familiar with SAX, but using an array, my intention is to provide a atomic operation: or EVERYTHING or NOTHING. PO files doesnt have many problems, but on XML files, if an error is discovered on the middle of the operation, everything can be defected.

There is no difference between finding an error in the middle of an XML file and in the middle on a PO file. With stream based parsing, the error only shows up in the middle, regardless of the format. The SAX XML parsing works so that you register a few callbacks to invoke when certain "events" happen throughout parsing. In PO parsing, you could register a function to be called when a string gets parsed. Now this is always the "_import_one_string" function, but as far as I see, you would like to have that replaced and otherwise leave the algorithm in place (you only replaced the import_one_string calls with an array buildup). If SAX is not known for you, a solution for this can be derived from design patterns such as the "template method" pattern. (Just that we don't have classes to inherit from and define our own processing functions, so we can pass on the function to call back). Sematic code:

function import_from_po_string($data, $callback = 'import_one_string', ...) {
  // whenever a string is found callback $callback($string);
}

Instead of building up a big array, the callback gets called for each array. If you have a contrib module which needs a big array, it can build it up in that callback, but it does not fit the constraints a Drupal site could run at shared hosting servers.

2* _locale_import_message() was deleted since the API is not based on files, but contents. Even not implemented on this patch, its possible to import a translation only by copying the PO content on a textearea. its useful when implementing some kinda validator or other operations.

So this is a reason to have unclear and confusing error messages to users? We had bug reports before from guys with error messages which did not spell out the file name, so the file name involved in the error was unknown. As I have pointed out earlier, if you have a batch process, which imports 16-20 files (this will be so for *any* foreign language Drupal site with Drupal 6), you will have a very hard time debug the error if you have no file name. When you use textareas or a single HTTP source, maybe it is clear that it is the source of the error, but not for the Drupal core use case. It is not that hard to get an extended error message though. Either "componentize" the string (which is not 100% translation friendly), or have a callback provide the error message with a message code or rather a message key.

// componentized error, optional filename, because we might only have a stream
function import_from_po_string($data, ...., $filename = NULL) {
  _locale_import_message('Translation %name broken at ...', $lineno, $filename);
}

function _locale_import_message($string, $lineno, $filename = NULL) {
  $name = isset($filename) ? t('file %filename', array('%filename' => $filename)) : '';
  drupal_set_message($string, array('%name' => $name ...));
}

This clearly omits the filename if there is no file known, and clearly includes it if we know the file. It is not really language friendly though, so it might make sense to instead of passing the $string along, pass an $error_key and build up the message for that properly:

// abstracted error message, optional filename, because we might only have a stream
function import_from_po_string($data, ...., $filename = NULL) {
  _locale_import_message('broken', $lineno, $filename);
}

function _locale_import_message($error_key, $lineno, $filename = NULL) {
  if (isset($filename)) {
    switch ($error_key) {
      case 'broken:'
        $message = t('Translation file %filename broken at ...', array(....));
    }
  }
  drupal_set_message($message, 'error');
}

Or you can use a named callback for this too, just like the string parsing if you wish (again, look at SAX :).

4* i totally agree about the Code standards and i used to adopt it. i used the D5 locale.module, so this error was there too. im gonna fix it. (BTW: was the coder module ported to D6?)

That was an excerpt from your code (an excerpt from locale_get_modules()), so I would not say it comes from the existing code.

And generally language imports/exports are not oftem and doesnt use big files.

Well, we try to push people to use smaller PO files, but nothing stops them from exporting a complete translation from a site and import that on another one. That could be a PO file half a megabyte or bigger in size. Your methods would use twice as much memory for that as the current algorithm if I extrapolate well from the early 2006 findings we had.

gábor hojtsy’s picture

BTW this is the relevant issue where a filename was/is missing, so debugging is hard: http://drupal.org/node/129709

I also forgot to note that you would actually need "readers" and "writers" as parameters.

function import_parse_po_file($reader, $writer, ...) {
  while ($line = $reader()) {
    // parse
    // when required:
    $write(...);
  }
}

The reader would return a line, the writer would write out one string when found. This is slightly more general then the example I provided above (but it does not suggest an *actual* variable name or parameter order set, it is just a simple example to show you the idea).

brmassa’s picture

StatusFileSize
new35.21 KB

Gárbor,

i did some small changes. its now much more like you asked.

cheers,

massa

gábor hojtsy’s picture

Looked through the code. Some comments:

- There is already a hook_locale() in core, maybe you can reuse it more creatively, instead of introducing a possibly confusing hook_localeapi()
- In locale_translate_import_form_submit(), you pass on a file resource id to _locale_import(), then on _locale_import(), you indicate it is $data, which is supposed to be the whole file
- We discussed passing around the whole file data and/or a big array of the parsed strings is a no-go for core, unless you can enlighten us with some benchmarks
- Code comments are unclear and contain spelling errors: eg. "Coordenates all exporting process" is too general to help anything
- We don't start the variable documentation with the type name, and definitely not lowercase
- Receiving a function name to use, and accepting it as-is from a request parameter is part of a disaster recipe... You don't want people to run arbitrary PHP functions on your server
- ' %filename(line %lineno): %line' is not localizable... I would also refrain from printing the actual line contents there as it can easily be gibberish nonsense to the actual Drupal user (many Drupal users seeing these errors are not aware of how PO files are constructed, neither they care about it)
- There is really no value of the call_user_func_array() function here, if you know the exact variable number and order... "call_user_func_array($tables, array("db-store", $current, $language, $mode, $group))", in fact you might only need call_user...() stuff to easily tell that your $tables is a function name in fact (which should be more clear from the variable name from the start, so $tables does not look attractive)
- We did have quite a few bug reports about problematic plural formulas, so removing the file name from that error message does not seem to be a good idea.
- "$strings = _locale_export_get_strings($language, $group);" this is again not going to work nicely, when you export two or three thousand strings... I see Drupal currently does build an array of strings when exporting, we did not look into optimizing that a year ago, as importing was the immediate target for the work. Seems like now is the time.
- We don't have function ending comments like "} // _locale_export_po()"
- There is no such thing as a "Portable Gettext Object"... There is although "Gettext Portable Object"

brmassa’s picture

Gárbor,

1* Performance: i agree. I agree that we should look after performance and tunning, but a billion strings is basicly a 1mb file, and most servers can afford it. Im gonna try to fix it.
2* Comments: can be easily changed. ok!
3* Since im quite busy on my business in Brazil (i sell computers), and taking care of Ecommerce module and some other modules i maintain is a lot time demanding, im going to work on this patch later. Its not critical anyway.
4* Im going to work on Live Translations module. I just did several important changes on the module and i will launch a testing site with it. i just customized the locale import/export feature to help me on this module.

best regards,

massa

gábor hojtsy’s picture

1* Performance: i agree. I agree that we should look after performance and tunning, but a billion strings is basicly a 1mb file, and most servers can afford it. Im gonna try to fix it.

The Hungarian Drupal 5.x core translation is a half a megabyte PO file, and contains 2409 strings. So a 1mb file contains around 5000 strings, not a billion. Most people install more modules, and will not use the Drupal core modules only, so these 2409 strings is just the start for a foreign language site. A year ago we found most shared hosting services (for which your automated updates are especially useful, not the control freak professionals) cannot even afford to import this half a megabyte file at once. Look, if you have this whole file read at once, then taken apart to small pieces and stuffed into a big array structure, it gets to *megabytes* of data for this data to be in memory at once. Similarly for exports. Did you actually look at the benchmarks I have pointed you to? Why do I repeat the same arguments over and over again, just to find you think we are in the billion strings dimension?

gábor hojtsy’s picture

BTW just to reiterate: I very much care about this feature getting into core, or I would not come back and review new patches all over again. I think it should be done right, for the satisfaction of our users. We worked really hard to handle PO imports better in Drupal 6 with the new batch API and lot's of automated tools, and there is no room for stepping back to a performance-wise disastrous solution.

brmassa’s picture

Gárbor,

I think you got me wrong: i really agrees about performance.

Because i dont have your expertise with locale module, i dont know exactly how to do this. This patch doesnt change that much, so its probably better if you take a look and modify it. The file are still loaded at once, but the lastest patch implements the callback function per string, as you recommended. I did only a "beta" code.

Core + contribs PO file may have about 5000 strings, or 1mb. its probably not a that big deal to load it, but compared to your solution that rounds about 0mb, its completly ridiculous. i really believe that it needs the optimizations you pointed. i frankly dont know much how to do and dont have the free time to study.

This feature isnt critical anyway. There is no other than PO files containing drupal translations and im developing a module to, hopefully, turn this a little bit useless.

sorry for any misundestand.

best regards,

massa

PS: how did you do those benchmarks? since im developing the Live Translation module, its going to be quite useful.

gábor hojtsy’s picture

Status: Needs review » Needs work

This feature isnt critical anyway. There is no other than PO files containing drupal translations and im developing a module to, hopefully, turn this a little bit useless.

As far as I have last tracked, you are going to use PO files to share translations to keep compatibility and hackability (in a good sense :), so this part of Drupal is going to be used more with your module not less.

As far as memory benchmarks go http://php.net/memory_get_usage helps when placed to the places you want to have your memory consumption measured. That shows pretty nicely where does your memory get used.

gábor hojtsy’s picture

OK, so to move this issue forward, let's get the plural fixes in core at least. They are bug fixes, if I understand this right. If you could describe the bug and have that code to commit, I would test and commit the patch directly. I would love to refactor the locale code to be more modular for things such as live translation, and it would be much better to concentrate on that once we have the bugs fixed.

Thanks for your contributions!

gábor hojtsy’s picture

BTW I included part of the patch (to refactor DB operations from inser_one_string() into another function) in this cleanup effort: http://drupal.org/node/150521

brmassa’s picture

Gárbor,

plural bug: I dont remember much what was the bug, but i think its this: PO files should be formatted like

msgid "original string"
msgid_plural "original plural"
msgstr[0] "Translated string"
msgstr[1] "Translated plural"

It works fine on languages that have only 1 plural. Many languages, however, has more plural forms. studing the code i noted that it ignores this fact, considering only the first.

I believe the entire system has the same design. I dont know much about PO files and localization systems. My "Live Translation" module probably suffers the same problems.

massa

PS: since you are a member of Hungarian translation team also, i invite you to help on LT. im hosting the server on drupal.titanatlas.com

gábor hojtsy’s picture

I debugged the code a great deal today, and found out that there is only a simple bug in there, which exposes the internal @count[2] type of index to the PO file. I did include a fix and other improvements in the rework of the "export" part of your patch, submitted here: http://drupal.org/node/152670

Unfortunately due to the nature of how plurals are handled, I did not find a way to generate PO files with a "stream" type of flow, as you do in your modified potx.inc and as I suggested we keep doing with the PO import. Unfortunately nothing ensures that we have the plural versions in order from the DB, so we cannot just work with a stream type of parsing as it seems.

If you have time, I would appreciate if you could test the Drupal 6 patch I submitted at http://drupal.org/node/152670 so I can get on to the import part of your refactoring ideas, if that gets committed.

brmassa’s picture

Status: Needs work » Fixed

Gabor,

is it already fixed right?

massa

gábor hojtsy’s picture

Version: 6.x-dev » 7.x-dev
Status: Fixed » Postponed

The export refactoring is done in Drupal 6, because there were no interest shown from people for the import refactoring (look at the export refactoring issue mentioned above for an example), I concentrated on other issue, so the import refactoring is not done unfortunately.

brmassa’s picture

Angie,

XLIFF is more-and-more popular format for translations. Gábor started himself and old module called XLIFF Tools. It might be the time to revisit it, since its MUCH more advanced and complete than GetText (pot and po files) format. Otherwise, i suggest to close this issue.

regards,

massa

gábor hojtsy’s picture

Well, for Drupal interface translations, the Gettext format is great, because it actually has nice tools for people to work with: translate, merge, etc. XLIFF is still somewhat of a new kid on the block in terms of readily available tool support. Also, it bring a bigger baggage in terms of bandwidth usage for simple string translation such as with Drupal's interface strings.

brmassa’s picture

Gábor,

your right. I only would recommend XLIFF for 2 reasons: 1* Place Drupal 7 in the vanguard tech (along PHP 5+ only, Web Services and other bleeding edge technologies) and 2* it has a features that is not available for PO files: context. Its hard to translate "View" because it can be a verb or an object. We could support it on t() function, just like Qt does. It also provides some other features, so we might consider cost/benefit carefully.

However, if you believe its really not yet the moment (it should be a Drupal 8+ matter or not at all), i suggest to close this issue. I might reopen it in the future.

regards,

massa

gábor hojtsy’s picture

Well, it is a far cry if only the transport format supports context. Supporting context in the locale API would be a requirement first. And there was no progress there lately as far as I have seen.

gábor hojtsy’s picture

Also, discussing that would be highly offtopic for this issue :)

damien tournoud’s picture

XLIFF is nothing more than an transport format. The fact that it "supports" context means nothing to us: that would be easily implemented using the gettext format using a simple convention (for example: prefixing source strings with the context).

We really need to add context to t(), but that's another issue *entirely*.

brmassa’s picture

agreed.

damien tournoud’s picture

Some thoughts about add context to t(): #334283: Add msgctxt-type context to t().

plach’s picture

Component: language system » locale.module

Cleaning-up the "language system" issue queue as per http://cyrve.com/criticals.

gábor hojtsy’s picture

Related #1189184: OOP & PSR-0-ify gettext .po file parsing and generation but no proposal there to support other formats, just decouple the .po file handling. I think it might pave the way for this effort, but I'm not convinced about the special need for supporting more formats. A contrib module will be much easier to do after that patch, since the API should be much more open about it.

gábor hojtsy’s picture

Status: Postponed » Closed (duplicate)

#1189184: OOP & PSR-0-ify gettext .po file parsing and generation make the API so open to do this that any contrib module should be able to do that. Finally :)