Problem/motivation

Currently when importing .po files, we read a file from start to finish in one sitting. This can take longer then allowed on a shared hosting environments, and might time out the PHP process. To fix this, we should make all .po file reading support reading only a section of the file and then continue reading from the end of that in another HTTP request and so on and so forth. This is solved in Drupal 7 with the gettextapi module.

Proposed solution

Introduce the possibility to save the position of the file when reading so we can save that state and return to that seek position in the next HTTP request. Therefore we add getSeekPos() and setSeekPos() to the stream reader.

Then using this, we can apply progressive batches to .po file reading, so we store the seek state of the reader in the batch and restore it until we reach the end of file. Then we move on to the next batch operation to read the next file.

To unify this, we also apply this progressive batch to importing single uploaded files coming from the locale import .po UI.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

attiks’s picture

Issue tags: +D8MI, +sprint, +language-ui

tags

attiks’s picture

Issue summary: View changes

summ updated

steinmb’s picture

Status: Postponed » Active

Patch committed :), changing status.

Gábor Hojtsy’s picture

Title: Batch import all files in chunks » Batch import all Gettext .po files in chunks

Anybody wanna work on this one now? :) @steinmb?

Gábor Hojtsy’s picture

The D8MI sandbox at http://drupal.org/sandbox/goba/1624820 has a branch with lots of code to use for this. It was set aside to help focus on the core gettext functionality that got committed.

vasi1186’s picture

Assigned: Unassigned » vasi1186
vasi1186’s picture

Status: Active » Needs review
FileSize
14.11 KB

A fist version of the patch that needs review (and most probably some work still).

Gábor Hojtsy’s picture

Status: Needs review » Needs work

A quick review of the current patch. Mostly code style / docs stuff:

+++ b/core/lib/Drupal/Component/Gettext/BatchStateInterface.phpundefined
@@ -0,0 +1,50 @@
+<?php
+

Add @file level comments.

+++ b/core/lib/Drupal/Component/Gettext/BatchStateInterface.phpundefined
@@ -0,0 +1,50 @@
+ * Add state to an object to manage continue after a previous batch call.

"managed continue" does not sound like proper English :) "manage continuation of a reader" or something like that maybe?

+++ b/core/lib/Drupal/Component/Gettext/BatchStateInterface.phpundefined
@@ -0,0 +1,50 @@
+   * The Class implemeting this interface must have an empty constructor.

implementing

+++ b/core/lib/Drupal/Component/Gettext/BatchStateInterface.phpundefined
@@ -0,0 +1,50 @@
+   * @return array
+   *   key/value pairs of which one must be 'class' => __CLASS__

- Key/value (start with uppercase)
- of which one => out of which one?

+++ b/core/lib/Drupal/Component/Gettext/PoStreamReader.phpundefined
@@ -205,6 +206,36 @@ class PoStreamReader implements PoStreamInterface, PoReaderInterface {
+    // Make sure to (re)read the PoHeader

// Make sure to re-read the header first to have metadata for plural forms.

Or something along those lines IMHO.

+++ b/core/modules/locale/lib/Drupal/locale/Gettext.phpundefined
@@ -66,7 +66,7 @@ class Gettext {
    * @see Drupal\locale\PoDatabaseWriter
    */
-  static function fileToDatabase($file, $langcode, $overwrite_options, $customized = LOCALE_NOT_CUSTOMIZED) {
+  static function fileToDatabase($file, $langcode, $overwrite_options, $customized = LOCALE_NOT_CUSTOMIZED, $seekpos = NULL, $items = -1) {

Document $seekpos and $items.

+++ b/core/modules/locale/locale.bulk.incundefined
@@ -108,45 +108,8 @@ function locale_translate_import_form_submit($form, &$form_state) {
-
-    // Now import strings into the language
-    try {
-      // Try to allocate enough time to parse and import the data.
-      drupal_set_time_limit(240);
-
-      $report = GetText::fileToDatabase($file, $language->langcode, $form_state['values']['overwrite_options'], $customized);
-      $additions = $report['additions'];
-      $updates = $report['updates'];
-      $deletes = $report['deletes'];
-      $skips = $report['skips'];
-
-      menu_router_rebuild();
-      // Clear cache and force refresh of JavaScript translations.
-      _locale_invalidate_js($language->langcode);
-      cache()->deletePrefix('locale:');
-
-      drupal_set_message(t('The translation was successfully imported. There are %number newly created translated strings, %update strings were updated and %delete strings were removed.', array('%number' => $additions, '%update' => $updates, '%delete' => $deletes)));
-      watchdog('locale', 'Imported %file into %locale: %number new strings added, %update updated and %delete removed.', array('%file' => $file->filename, '%locale' => $language->langcode, '%number' => $additions, '%update' => $updates, '%delete' => $deletes));
-      if ($skips) {
-        if (module_exists('dblog')) {
-          $skip_message = format_plural($skips, 'A translation string was skipped because of disallowed or malformed HTML. <a href="@url">See the log</a> for details.', '@count translation strings were skipped because of disallowed or malformed HTML. <a href="@url">See the log</a> for details.', array('@url' => url('admin/reports/dblog')));
-        }
-        else {
-          $skip_message = format_plural($skips, 'A translation string was skipped because of disallowed or malformed HTML. See the log for details.', '@count translation strings were skipped because of disallowed or malformed HTML. See the log for details.');
-        }
-        drupal_set_message($skip_message, 'error');
-        watchdog('locale', '@count disallowed HTML string(s) in %file', array('@count' => $skips, '%file' => $file->uri), WATCHDOG_WARNING);
-      }
-      $variables = array('%filename' => $file->filename);
-      drupal_set_message(t('The translation import of %filename is done.', $variables));
-      watchdog('locale', 'The translation import of %filename is done.', $variables);
-
-    } catch (Exception $exc) {
-      drupal_set_message(print_r($exc, TRUE));
-      $variables = array('%filename' => $file->filename);
-      drupal_set_message(t('The translation import of %filename failed.', $variables), 'error');
-      watchdog('locale', 'The translation import of %filename failed.', $variables, WATCHDOG_ERROR);
-    }
+    $batch = locale_translate_batch_build(array($file->uri => $file), $form_state['values']['overwrite_options'], $customized, $form_state['values']['langcode'], TRUE);
+    batch_set($batch);

Amazing for standardizing this code. Yay, yay!

+++ b/core/modules/locale/locale.bulk.incundefined
@@ -285,10 +248,10 @@ function locale_translate_export_form_submit($form, &$form_state) {
-function locale_translate_add_language_set_batch($langcode) {
+function locale_translate_add_language_set_batch($langcode, $overwrite_options = array(), $customized = LOCALE_NOT_CUSTOMIZED) {

@@ -308,7 +271,7 @@ function locale_translate_add_language_set_batch($langcode) {
-function locale_translate_batch_import_files($langcode = NULL, $finish_feedback = FALSE, $force = FALSE) {
+function locale_translate_batch_import_files($langcode = NULL, $overwrite_options = array(), $customized = LOCALE_NOT_CUSTOMIZED, $finish_feedback = FALSE, $force = FALSE) {

@@ -364,13 +327,13 @@ function locale_translate_get_interface_translation_files($langcode = NULL) {
-function locale_translate_batch_build($files, $finish_feedback = FALSE) {
+function locale_translate_batch_build($files, $overwrite_options = array(), $customized = LOCALE_NOT_CUSTOMIZED, $langcode = NULL, $finish_feedback = FALSE) {

@@ -398,21 +361,47 @@ function locale_translate_batch_build($files, $finish_feedback = FALSE) {
-function locale_translate_batch_import($filepath, &$context) {
+function locale_translate_batch_import($filepath, $overwrite_options, $customized, $langcode = NULL, &$context) {

Document new common arguments.

+++ b/core/modules/locale/locale.bulk.incundefined
@@ -448,6 +437,11 @@ function locale_translate_batch_finished($success, $results) {
+    menu_router_rebuild();
+

I've seen this was moved from the file import code, but I don't think we need to rebuild menus anymore, they should not contain locale specific data.

clemens.tolboom’s picture

+++ b/core/modules/locale/locale.bulk.inc
@@ -364,13 +327,13 @@ function locale_translate_get_interface_translation_files($langcode = NULL) {
-function locale_translate_batch_build($files, $finish_feedback = FALSE) {
+function locale_translate_batch_build($files, $overwrite_options = array(), $customized = LOCALE_NOT_CUSTOMIZED, $langcode = NULL, $finish_feedback = FALSE) {

I think an array of options would be better like

array(
  'langcode' =>
  'customized' =>
  'finish_feedback' =>
  etc ...
);

with default. These options can then easily be transferred to other functions?

+++ b/core/modules/locale/locale.bulk.inc
@@ -398,21 +361,47 @@ function locale_translate_batch_build($files, $finish_feedback = FALSE) {
+  if ($langcode || preg_match('!(/|\.)([^\./]+)\.po$!', $filepath, $langcode)) {

Maybe naming the last $langcode onto $matches would help better understand this code?

+++ b/core/modules/locale/locale.bulk.inc
@@ -398,21 +361,47 @@ function locale_translate_batch_build($files, $finish_feedback = FALSE) {
+    if (is_array($langcode)) {

Then we can change this in to

if (empty($langcode) {
  $langcode = array_pop($matches);
}
clemens.tolboom’s picture

Issue summary: View changes

summary updated

Gábor Hojtsy’s picture

Title: Batch import all Gettext .po files in chunks » Import Gettext .po files in progressive batches to avoid time limits

Updated the issue summary with a much cleaner version and giving a cleaner title (hopefully) as well.

Gábor Hojtsy’s picture

Checked for standard methods to do hibernation of PHP objects, since we are doing a very thin layer to do it ourselves. Found these two:

- __sleep() and __wakeup() magic methods, where __sleep() takes keys of properties it will serialize; http://de2.php.net/manual/en/language.oop5.magic.php#object.sleep (this makes some requirements on the constructor and is not an interface that can be ensured/checked for)
- The Serializable interface where we return the serialized data ourselves and get the serialized data, this is very similar to the batchstateinterface and is a PHP standard, so why not use this? http://de2.php.net/serializable

vasi1186’s picture

Status: Needs work » Needs review
FileSize
20.23 KB
23.21 KB

Attached a patch that should apply now on the latest 8.x branch, and should solve the issues pointed out in #7 and #8
#10 not yet done.

clemens.tolboom’s picture

Status: Needs review » Needs work

@Gábor Hojtsy you're probably right in that the interface now is similar with the serializable one. I don't get the constructor part remark though. Our constructor is empty by design afaik?

Reading http://de2.php.net/manual/en/language.oop5.magic.php#object.sleep they refer to http://de2.php.net/manual/en/class.serializable.php which is imho better and more OOP oriented.

So if @vasi1186 like to try use http://de2.php.net/manual/en/class.serializable.php instead.

Gábor Hojtsy’s picture

@clemens.tolboom: cool, we are proposing the same thing then :)

sxnc’s picture

Just tested the patch from #11 and it worked fine locally, saw the progress-bar and everything (used file sizes were around 600-900kb).

vasi1186’s picture

Status: Needs work » Needs review
FileSize
4.86 KB
20.23 KB

After a closer look with Gábor, we actually got to the conclusion that we do not really need the BatchStateInterface at the moment. We used this only to set the pointer position in the file, and then get it back. So, in this case we decided to reduce the code and use just two functions to set and get the pointer in the stream. This position is stored in the context of the batch operation (as it was also before), so we actually never used the "hibernation" of an object in a true way, the seek position was set from outside all the time.

Attached a patch and an interdiff with the changes.

clemens.tolboom’s picture

The intent of an interface is to abstract stuff.

The purpose of getState | setState was to hide away filename, langcode, seekpos etc for the batch. That makes continuing a Batch step easier 'just grab the state' and is more OOP.

Is this because most of the state data is needed for the reporting too?

I'm indifferent to any good solution as Batch API is not OOP yet which then would probably require Serializable.

:-)

Gábor Hojtsy’s picture

Yes, it would need to required serializable then, which is a whole bunch of data, that we do not keep in the batch otherwise. Or in other words, we have the file name and langcode in the batch operation setup already, way before any reader is instantiated, to even instantiate the reader at first, so we'd be duplicating that. When looking at the code, we realized, all we really need on top of what we have anyway already is to keep track of the seek position. We cannot really encapsulate the file name and langcode because the batch API needs it to even instantiate readers and designate batch operations + we cannot really encapsulate the writing options, because those are not even in the reader, they are on the writer. So after a serious rearchitecture of the batch API to work with more OOP-y code, we might be able to return to this, but we are not taking that on.

lazysoundsystem’s picture

Just to say that I've tested the patch from #15 and it works well.

+    $report = $writer->getReport();
+
+    // Add the seek position to the report. This is usefull for the batch
+    // operation.

Just one too many 'l's in 'useful' :)

vasi1186’s picture

:) solved that.

penyaskito’s picture

Status: Needs review » Needs work
+++ b/core/lib/Drupal/Component/Gettext/PoStreamReader.phpundefined
@@ -205,6 +205,23 @@ class PoStreamReader implements PoStreamInterface, PoReaderInterface {
   /**
+   * Sets the seek possition for the current PO stream.

s/possition/position/g

Otherwise, this patch is awesome, vasi1186++.

vasi1186’s picture

Status: Needs work » Needs review
FileSize
557 bytes
20.22 KB

thx!, updated the patch.

vasi1186’s picture

Issue summary: View changes

Update summary with more meaningful and up to date info.

penyaskito’s picture

Status: Needs review » Reviewed & tested by the community

This is RTBC for me.

attiks’s picture

nice work

vasi1186’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
4.4 KB
20.17 KB

Small change: seekpos -> seek, getSeekPost() -> getSeek(), setSeekPos() -> setSeek().

Gábor Hojtsy’s picture

Status: Needs review » Reviewed & tested by the community

That change was in fact suggested by @webchick to avoid abbreviating names. Looks like should be still back to RTBC then :)

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Ok, spent a bunch of time reviewing this yesterday; sorry I didn't get as far as follow-up/commit.

As Gábor said, one nit-picky thing that stood out was the naming of the getSeekPos() function, but that's been remedied now, so thanks!

The other less nit-picky thing is that anytime I see stuff like this:

+++ b/core/modules/locale/locale.bulk.incundefined
@@ -283,12 +250,29 @@ function locale_translate_export_form_submit($form, &$form_state) {
+ * Sets a batch for a newly added language.
+ *
+ * @param array $options
+ *   An array with options that can have the following elements:
+ *   - 'langcode': The language code, required.
+ *   - 'overwrite_options': Overwrite options array as defined in
+ *     Drupal\locale\PoDatabaseWriter. Optional, defaults to an empty array.
+ *   - 'customized': Flag indicating whether the strings imported from $file
+ *     are customized translations or come from a community source. Use
+ *     LOCALE_CUSTOMIZED or LOCALE_NOT_CUSTOMIZED. Optional, defaults to
+ *     LOCALE_NOT_CUSTOMIZED.
+ *   - 'finish_feedback': Whether or not to give feedback to the user when the
+ *     batch is finished. Optional, defaults to TRUE.
...
+function locale_translate_add_language_set_batch($options) {

...it's usually an indicator that we're overloading a function into doing way too much. It's akin to when we used to have hook_node($op, $node, $a3, $a4) in D6 and below; these were split into hook_node_load(), hook_node_save(), etc. in D7+ and as a result the functions are much more focused and are much more grokkable. It feels like we probably need something similar here.

I spoke to Gábor about this, and he acknowledged this concern, but also said that fixing it would require far-reaching changes to the underlying Batch API as well. I actually think that'd make a great follow-up for anyone who worked on this issue, because this capability of processing items in a "series of seeks" (I'm using the wrong words here, but), rather than a big stack of things to do, seems like really useful functionality to have as part of the general Batch API. And if we did that, it's likely we could clean up some of the code here.

However! None of that is worth holding up this patch, which looks like a great solution to the problem, and these implementation concerns aside, looks good. Therefore..!

Committed and pushed to 8.x. Thanks! If we could get a follow-up issue to introduce this functionality into Batch API "proper" though, that would be awesome. (Even more awesome if some folks here who had to understand the existing batch API to write this could help push that issue along with at least some pseudocode/description of how it could work, before you all forget everything you learned. :D)

Gábor Hojtsy’s picture

Status: Fixed » Needs review
FileSize
685 bytes

Yay, thanks. Here is a CHANGELOG.txt followup :) Not sure the text is ok, but this should be quick to review.

webchick’s picture

Status: Needs review » Fixed

Text looks good to me!

Committed and pushed to 8.x.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Gábor Hojtsy’s picture

Issue tags: -sprint

Off the sprint then, thanks all!

Gábor Hojtsy’s picture

Issue summary: View changes

More up to date solution explanation.