When we import .po files, each string is compared against the database. Strings are stored in the database as TEXT, which cannot be indexed efficiently, so I added a hashkey field to the database.

My test:
importing a drupal.po

Before:
65s

After:
6s

(l10n_server status: 5xxxx strings to translate)

Here is what I did:

1. Add a HashKey field [type: binary(32)] to the l10n_community_string table, populated with md5(value), and add an index on this field (important).
2. Change the following code:

// l10n_server\l10n_community\import.inc
    if ($sid = db_result(db_query("SELECT sid FROM {l10n_community_string} WHERE BINARY value = '%s' AND context = '%s'", $value['msgid'], $value['msgctxt']))) {

to

//  l10n_server\l10n_community\import.inc
  $hashkey = md5($value['msgid']);
  if ($sid = db_result(db_query("SELECT sid FROM {l10n_community_string} WHERE HashKey = '%s' AND context = '%s'", $hashkey, $value['msgctxt']))) {

This could also be applied to the other l10n_server sub-modules and the Locale module if there are no side effects.
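A minimal, framework-free sketch of the idea (names and data here are illustrative, not the actual l10n_server code): hash the source string once and look it up by the fixed-width key instead of comparing long TEXT values.

```php
<?php
// Sketch only: simulates the HashKey lookup without a database.
// In the real patch the hash lives in an indexed binary(32) column
// and the lookup is a db_query() hitting that index.

// Build an index keyed by md5(value), as the HashKey column would be.
$strings = array(
  1 => 'Save configuration',
  2 => 'Delete',
);
$by_hash = array();
foreach ($strings as $sid => $value) {
  $by_hash[md5($value)] = $sid;
}

// Lookup via the hash: a short fixed-width key comparison instead of
// scanning and comparing arbitrarily long TEXT values.
function lookup_sid(array $by_hash, $msgid) {
  $key = md5($msgid);
  return isset($by_hash[$key]) ? $by_hash[$key] : FALSE;
}
```

The database version wins for the same reason: the index is built over 32-byte hashes, so the comparison cost no longer depends on string length.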

I'd appreciate your comments.

Thanks.

(** Is there any way I can get all of the l.d.o server strings? I need a large data set for testing and a special use case.)


Comments

Gábor Hojtsy’s picture

Title: simple trick, big increase in performance » Add hashkey to strings and translations for quicker comparison
Category: feature » task

I believe this might be a very good idea and am going to look into testing it. It could be useful both for source string matching and translation lookup, so both the _translation and _string tables could get a hashkey to replace string comparison.

Gábor Hojtsy’s picture

Status: Active » Needs review
FileSize
6.91 KB

This is a total jaw-dropping life-saver. Of course it is easy to improve performance when we were doing something this unscalable so far... Ah :)

So I made the above suggestion into a patch with an update path, and used this code in l10n_remote (for string submissions from l10n_client), in extractor.inc (for newly discovered strings in new releases) and in import.inc (for .po file imports).

I've run our test suite with this, which still works nicely. I've also tested two of the critical paths for performance improvements, and the results are, as said, jaw-dropping.

Parsing new release: Used a dev release of Drupal 7. With drupal.org database dump, the parsing of that code took almost 7 minutes(!) on my laptop. With the attached patch in place, it took around 15-20 seconds(!).

Importing complete core .po: Used an l.d.o all releases export of Drupal core in Hungarian. This is a 745KB .po file. Without the patch, it took 5-6 minutes(!) to parse and import (just to realize there were no strings to import, but only duplicates since I did this on a dump of l.d.o). With the patch applied, it took 10-15 seconds(!) for it to complete its work.

I believe this might give us a solution for both issues with people importing stuff and timing out as well as maybe new releases parsed but not completed (maybe due to DB load). The new queries are way faster (obviously, they do not do a full table scan lookup anymore, eh).

The only issue that I found here is that the l.d.o database (dump) seems to have some bad data, which makes the md5 non-unique. I'm not sure how this data got in at all, but we should fix it before we commit / deploy this patch.

The offending strings were:

mysql> select value, hashkey, count(*) c from l10n_community_string group by hashkey having c > 1;
+-------------+----------------------------------+---+
| value       | hashkey                          | c |
+-------------+----------------------------------+---+
| Administer  | c387a19913546b9e08f8dd891f282f69 | 2 | 
| !datetime   | fae4c295b21b23bf20ea4121d7ca01c2 | 4 | 
+-------------+----------------------------------+---+

mysql> select * from l10n_community_string where hashkey = 'fae4c295b21b23bf20ea4121d7ca01c2';
+-------+------------+---------+----------------------------------+
| sid   | value      | context | hashkey                          |
+-------+------------+---------+----------------------------------+
| 57570 | !datetime  |         | fae4c295b21b23bf20ea4121d7ca01c2 | 
| 57572 | !datetime  |         | fae4c295b21b23bf20ea4121d7ca01c2 | 
| 78234 | !datetime  |         | fae4c295b21b23bf20ea4121d7ca01c2 | 
| 78236 | !datetime  |         | fae4c295b21b23bf20ea4121d7ca01c2 | 
+-------+------------+---------+----------------------------------+

mysql> select * from l10n_community_string where hashkey = 'c387a19913546b9e08f8dd891f282f69';
+--------+-------------+---------+----------------------------------+
| sid    | value       | context | hashkey                          |
+--------+-------------+---------+----------------------------------+
|  31212 | Administer  |         | c387a19913546b9e08f8dd891f282f69 | 
| 110408 | Administer  |         | c387a19913546b9e08f8dd891f282f69 | 
+--------+-------------+---------+----------------------------------+

Also note that currently the hashkeys will not be unique, but (hashkey, context) will be unique. We can either append the context to the value before computing the md5, so we can put a unique index on hashkey, or consider it fine that (hashkey, context) combined is unique.
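A sketch of the first option, folding the context into the hash so a plain unique index on hashkey suffices (illustrative only; the committed patch may concatenate differently):

```php
<?php
// Sketch: derive a single key from (value, context) by hashing their
// concatenation. A separator byte guards against ambiguous joins,
// e.g. ('ab', 'c') colliding with ('a', 'bc').
function l10n_hashkey($value, $context = '') {
  return md5($value . "\0" . $context);
}
```

With this, the same string in two different contexts gets two different hashkeys, so hashkey alone can carry a unique index.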

On the first table, none of those sids seem to have a corresponding l10n_community_line entry, but many of them have translations to Catalan. Interesting... On the second, sid 110408 does not appear on any lines and does not have any translations. So it looks like we can just get rid of these and be happy :)

Gábor Hojtsy’s picture

Good thing l10n_server already has an orphan string counter/detector, and it does identify these strings as orphans as well. There are a few more; some seemingly non-English strings crept into module source code:

mysql> SELECT s.sid, s.value FROM l10n_community_string s LEFT JOIN l10n_community_line l ON s.sid = l.sid WHERE l.sid IS NULL; 
+--------+----------------------------------------------------------------------------------+
| sid    | value                                                                            |
+--------+----------------------------------------------------------------------------------+
|  25024 | Reservierung wurde gel                                                           | 
|  25032 | Reservierung l                                                                   | 
|  57570 | !datetime                                                                        | 
|  57572 | !datetime                                                                        | 
|  70136 | Term message installed.  Go to <a href="/admin/content/term_message">Administer  | 
|  78234 | !datetime                                                                        | 
|  78236 | !datetime                                                                        | 
| 110408 | Administer                                                                       | 
| 228180 | Invalid amount format. Use the format: ?#####.##                                 | 
+--------+----------------------------------------------------------------------------------+
9 rows in set (2.94 sec)

Hit the orphan cleanup routine on localize.drupal.org, so these are not an issue anymore, yay!

So let's consider the above patch. A downside of a context-inclusive hashkey currently is that l10n_client will not tell us the context, so we would be shooting into the air and picking the first match for the string even if it was in a different context than the client intended. Well, what we can do here is stick to an empty context for now, which would allow us to fold the context into the hashkey, making it unique in itself.

Gábor Hojtsy’s picture

So I'm adding the empty context check to l10n_remote as said above, while we do not support remote passing of context yet. This completes the use of "hashkey = MD5('%s') AND context = '%s'" all around.
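A pure-PHP sketch of that lookup pattern (the real code runs this through db_query(); the function name and return shape here are invented for illustration):

```php
<?php
// Sketch: build the string lookup described above. With no remote
// context support yet, l10n_remote pins the context to ''.
// Placeholders follow Drupal 6's db_query('%s') convention.
function string_lookup_query($msgid, $context = '') {
  return array(
    'sql'  => "SELECT sid FROM {l10n_community_string} WHERE hashkey = MD5('%s') AND context = '%s'",
    'args' => array($msgid, $context),
  );
}
```

Letting MySQL compute MD5('%s') server-side keeps the PHP callers unchanged aside from the column name.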

Gábor Hojtsy’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
6.92 KB

Gerhard says we should focus on having a unique key instead, so folding the context into the source data for the MD5 sounds better. Indeed. Another round of speed testing confirmed that with this, we speed up by another factor of 2-3x on the l.d.o data set running on my computer.

I believe this is ready to go. I also ran the full automated test suite, which includes extensive testing of package parsing and .po imports, the two main areas affected by this patch. The test suite ran fine and was happy.

Will commit and deploy this later in the week.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

This is now committed and deployed to localize.drupal.org. Unfortunately, .po imports are still not happening instantly; to me it looks like it is more an issue of the file not getting to the import code fast enough (I was waiting for minutes with my browser spinning and no suggestions were imported yet, so it was not stuck in the middle of the import, rather before it). However, we can consider this fixed and look for other issues :)

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.