I was asked to review how PO file import is done in Drupal, and I was shocked to see how inefficient the process was. The main problems were:

  • The whole PO file was read into memory at once.
  • A big array of strings was generated by parsing the file.
  • This really big array was passed around and processed elsewhere.

Reading the whole file at once had no advantage whatsoever, so my first optimization pass was to eliminate that and read the PO file line by line, which matched the parsing approach anyway. That reduced the memory requirements a great deal. The second pass was a bit harder: processing the tokenized file string by string, without building the big string array. It needed some code to be moved around so that the big array is never built up. As a consequence, we cannot prevalidate the file, so broken files will be imported halfway before the process signals an error. This trade-off was accepted on the development list by Dries, Gerhard et al.
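The shift from array-building to streaming can be sketched as follows (a minimal illustration in Python for brevity — the actual patch is PHP against Drupal's locale module, and the function names here are invented for the example):

```python
def _unquote(s):
    # Strip the surrounding quotes; a real PO parser also has to
    # handle backslash escapes, which this sketch omits.
    return s.strip()[1:-1]

def parse_po(lines):
    """Yield (msgid, msgstr) pairs one at a time, so memory use stays
    constant instead of growing with the size of the PO file."""
    msgid = msgstr = None
    target = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue                      # skip blanks and comments
        if line.startswith('msgid '):
            if msgid is not None:
                yield msgid, msgstr       # emit the previous pair
            msgid, msgstr, target = _unquote(line[6:]), '', 'id'
        elif line.startswith('msgstr '):
            msgstr, target = _unquote(line[7:]), 'str'
        elif line.startswith('"'):
            # Continuation line: append to whichever field is open.
            if target == 'id':
                msgid += _unquote(line)
            else:
                msgstr += _unquote(line)
    if msgid is not None:
        yield msgid, msgstr
```

Because each pair can be imported as soon as it is parsed, a syntax error halfway through a file leaves the already-yielded strings imported — exactly the prevalidation trade-off described above.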

The attached patch mostly contains original code, it is just moved around, so I retained Jacobo's credit information.

Comments

gábor hojtsy’s picture

Attachment: new, 20.02 KB

Here is a chart showing the original memory requirement, which nears the 8 MB default PHP memory limit with an out-of-the-box HEAD Drupal, nothing additional installed. It also shows the first pass results (getting rid of reading the whole file at once) and the second pass results, which is what the patch reflects.

Memory chart

gábor hojtsy’s picture

Title: Dramatically decrease locale performance » Dramatically decrease locale memory requirement
gábor hojtsy’s picture

Erm, the chart actually shows bytes on the vertical axis, not megabytes as the text says, but otherwise it is correct...

dries’s picture

Haven't tried it, but the code looks good. I guess this just needs a decent amount of testing before it can be committed. There are only so many translations one can import to test the code, so if one PHP 4 user and one PHP 5 user can import a number of translations, we should be OK.

wulff’s picture

Works with PHP 5 (Apache 2.0.54, PHP 5.1.1 on Windows).

dries’s picture

Some PHP4 testing anyone?

wulff’s picture

Works with PHP 4 (Apache 2.0.55, PHP 4.4.1 on Linux).

Cvbge’s picture

While testing this patch I discovered that something is corrupting UTF-8 strings (at least the Chinese translation). The problem occurs both with and without the patch.
I've checked that the files contain valid UTF-8 characters, so it's either PHP's, Drupal's, or my setup's fault.

When importing the Chinese Simplified translation:

Warning: pg_query(): Query failed: ERROR: invalid byte sequence for encoding "UNICODE": 0xe4bf20 in /var/www/dt/d/includes/database.pgsql.inc on line 84

Warning: ERROR: invalid byte sequence for encoding "UNICODE": 0xe4bf20 query: INSERT INTO test_watchdog (uid, type, message, severity, link, location, referer, hostname, timestamp) VALUES (1, 'php', ' query: INSERT INTO test_locales_target (lid, locale, translation) VALUES (755, 'zh-hans', ' <h4>使用自定义PHP代码</h4> <p>如果你懂得用PHPb编写代码,Drupal给你嵌入任何代码的机会。当页面被查看时,PHP代码将被载入并运行。这将给你有趣的体验和动力,当然,如果你编码技术不过关,也会带来危险。如果你不熟悉PHP,SQL和站点引擎,避免实验PHP,因为你可能会损坏你的数据库或使你的网站不安全甚至无法使用。如果你不像你的内容非常独特,最好直接使用HTML。</p> <p>记住每个PHP标签内的代码都必须是合法的PHP代码——包括正确的终止符:分号。我们强烈 in /var/www/dt/d/includes/database.pgsql.inc on line 101
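The reported byte sequence itself hints at what went wrong. In UTF-8, 0xE4 opens a three-byte sequence that needs two continuation bytes in the 0x80–0xBF range; 0xBF qualifies, but the third byte here is 0x20 (an ASCII space), so a multibyte character was apparently cut in half. A quick illustration (Python, not part of the original thread):

```python
# 0xE4 starts a 3-byte UTF-8 sequence; 0xBF is a valid continuation
# byte, but 0x20 (space) is not -- the character is truncated.
try:
    b'\xe4\xbf\x20'.decode('utf-8')
    print('valid')
except UnicodeDecodeError as e:
    print(e.reason)  # reports an invalid/truncated sequence
```

A sequence like this is typical of a string truncated at a byte offset rather than at a character boundary, which would fit a multibyte-unaware substring operation somewhere in the pipeline.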
Cvbge’s picture

I've tested on a different machine and also got errors (with Chinese Traditional; I didn't get errors with Chinese Simplified this time. Weird.)

postgresql 7.4.7-6sarge1
php4 4.3.10-16

I'll probably move this to a different issue...

Cvbge’s picture

Just a side note: if this is a PHP/Drupal fault, it probably happens with MySQL too; it's just that MySQL silently ignores (or removes, or something) the illegal characters.
I suppose testing with MySQL 5 in strict mode on the latest Drupal CVS could help check this.

gábor hojtsy’s picture

Piotr, I would suggest first checking the translation file outside of Drupal, to see whether it contains any unicode-invalid characters. I bet it does.
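One quick way to do such a check outside of Drupal (an illustrative Python sketch, not something from the thread — the function name is invented) is a strict decode that reports the offset of the first bad byte:

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the whole buffer decodes cleanly."""
    try:
        data.decode('utf-8')   # strict decoding raises on any bad byte
        return None
    except UnicodeDecodeError as e:
        return e.start         # offset where the invalid sequence begins

# e.g. first_invalid_utf8(open('zh-hans.po', 'rb').read())
print(first_invalid_utf8(b'valid \xc3\xa9 text'))    # None
print(first_invalid_utf8(b'bad \xe4\xbf\x20 tail'))  # 4
```

This is the same check `iconv -f UTF-8 -t UTF-8` performs, but with a concrete byte offset to inspect in the file.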

killes@www.drop.org’s picture

Could this be due to the fact that translated strings are stored as a blob in MySQL and as text in PostgreSQL?

gábor hojtsy’s picture

Status: Needs review » Reviewed & tested by the community

To stay on topic, and to be sure that this patch fits, I tested the import again with and without the patch applied. I did a mysqldump ... localet locales_source locales_target after each run and diffed the two dumps: no difference. I used the upcoming Drupal 4.7.0 Hungarian translation as the test case. PHP version 4.3.10-10ubuntu4.3.

Cvbge’s picture

Status: Reviewed & tested by the community » Needs review

@Goba: I did. I used a simple Perl script to convert the file into a one-line file, then used COPY FROM file (http://www.postgresql.org/docs/current/static/sql-copy.html) to insert the data into a table in PostgreSQL. I also ran iconv -f UTF-8 -t UTF-8 -c zh-hans.po > zh-hans.po.iconv, which should drop invalid characters from the input. I found no errors, so I assume the file is correct.
If you know any other way to check the file (a web-based validator would be best), I'll check it with that too.

The script:

# Join every input line into one long line, then append the "\."
# end-of-data marker that COPY FROM expects.
open(FH, '>one_line');
while (<>) { chomp; print FH $_; }
print FH "\n\\.\n";
close(FH);

@killes: hmm, right, I didn't know database.mysql uses blobs for that. That makes it work for any data...
I vaguely remember bug reports about the locale tables, and IIRC the column type was changed... It was before Steven's UTF-8 conversion patch. Maybe we should turn them back into text and fix the source of the problem instead...
I'll try to find the bug reports later...

Cvbge’s picture

bug reports about locale tables and IIRC column type was changed...

Hmm, it seems it was always a blob... I wasn't able to find the bug reports either... Maybe I was mistaken :/

Cvbge’s picture

Status: Needs review » Reviewed & tested by the community

I've changed the status by mistake.

dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to HEAD. Thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)