[DBTNG + BLOB + hash]: Revamp locales_source.source with BLOB and MD5 hashing [#313999]

This patch add a new field locales_source.hash as MD5 hash of locales_source.source, also update locales_source.source and locales_target.translation as BLOB.

Existing implementation coming with following drawback and limitation:

We are having LIKE operation with MySQL TEXT, or PostgreSQL BLOB type directly (locales_source.source). Even we indexing with first 30 characters, for sure there may have performance degrade when working with huge content.
Indexing with first 30 characters only may not be accurate enough.
Both locales_source.source and locales_target.translation are now handling with hybrid schema definition (mysql_type as special MySQL handling). It is not necessary complicated based on cross database concern.
Storing UTF8 data with TEXT in case of PostgreSQL or other database may be risky.

Benefit of this patch:

LIKE operation will only operate with a MD5 hashing result (32 characters), which is well indexed. Overall performance should be boosted.
Compare with MD5 hashing result is more accurate.
Replace fields as BLOB can simplify maintenance of DB schema.
UTF8 + BLOB = perfect storage.

Technical detail:

Simpletest: patched version pass in both MySQL and PostgreSQL (Locate and Translation).
Update: MySQL pass; PostgreSQL's update.php fail in CVS HEAD so not able to test.
Syntax: Most change are revamp with TNG DB syntax, e.g. db_insert(), db_update(), fetch() and so on.

P.S. Some logic is cloning from aggregator_feed.hash :)

Comment	File	Size	Author
#17	blob-locales_source_hash-1227760668.patch	10.54 KB	hswong3i
#16	dbtng-locales_source_hash-1227555790.patch	10.52 KB	hswong3i
#14	dbtng-locales_source_hash-1224005025.patch	10.49 KB	hswong3i
#10	locales_source_hash-1223487464.patch	10.39 KB	hswong3i
#2	locales_source_hash-1222605790.patch	10.68 KB	hswong3i
#2	313999_benchmarking-0.1.zip	8.49 KB	hswong3i
	locales_source_hash-1222454789.patch	10.68 KB	hswong3i

Comments

Comment #1

gábor hojtsy

he/him

Hungarian

Hungary

commented 27 September 2008 at 07:45

- The code is full of inserts and updates, where you md5(). It might be better to abstract these out.
- Although matching on md5 instead of the string sounds more performance, a performance benchmark would be mandatory for this to be accepted.

Comment #2

hswong3i commented 28 September 2008 at 12:59

Status	File	Size
new	313999_benchmarking-0.1.zip	8.49 KB
new	locales_source_hash-1222605790.patch	10.68 KB

Performance is boosted base on benchmarking result, detail:

ab result (best record among 5 tests):
-------------------------
CVS HEAD + MySQL: 346 +/- 10.8ms, max 368 +/- 87.1ms
with patch + MySQL: 340 +/- 11.7ms, max 349 +/- 19.6ms

Testbed setup:
-------------------------
Host: Debian sid (KVM virtual client)
CPU: 2.2GHz AMD single core
Ram: 2GB
PHP: 5.2.6
MySQL: 5.0

Drupal setup:
-------------------------
users: 2000
categories: 15
terms: 250
nodes: 5000, with term and path alias
comments: 10000
{locales_source}: 2,959
{locales_target}: 19,668 (after import 8 different translations from D6)

Locale imported:
-------------------------
de-6.x-1.0-rc1.tar.gz
el-6.x-1.0-beta1.tar.gz
es-6.x-1.0.tar.gz
ja-6.x-1.2.tar.gz
pt-br-6.x-1.0-rc5.tar.gz
pt-pt-6.x-1.x-dev.tar.gz
zh-hans-6.x-1.0.tar.gz
zh-hant-6.x-1.1.tar.gz

Drupal setup procedure:
-------------------------
1. install with CVS HEAD, normal profile.
2. activate locale and install those 8 translations. keep UI as english.
3. activate devel and generate dummy data.
4. setup patched version, point to CVS HEAD DB
5. upgrade schema as new version. so they share same schema and record.
6. change UI as traditional chinese before benchmarking.
7. run benchmarking.

I keep the change as minimal as possible, which clone most legacy logic but only revamp with TNG DB syntax :D

Comment #3

damien tournoud commented 28 September 2008 at 14:31

Status:

Needs review

» Closed (won't fix)

@hswong3i: thanks for the benchmark. Based on your figures, I guess we can conclude that this was not a good idea. The partial index on the first 30 characters seems to perform very well.

Comment #4

hswong3i commented 2 October 2008 at 16:44

Assigned:	Unassigned	» hswong3i
Status:	Closed (won't fix)	» Needs review

@Damien: Thanks for your comment, too. Anyway, this seems to be over simplify the needs of the patch.

It is hard for me to believe performance as the one and only one requirement of this patch. For sure that we won't accept changes if it degrade performance without meaningful enhancement. As a counter prove by benchmarking, this patch should pass.

On the other hand, besides point 1 (performance boost) we still have 3 more benefits:

# Compare with MD5 hashing result is more accurate.
# Replace fields as BLOB can simplify maintenance of DB schema.
# UTF8 + BLOB = perfect storage.

Maybe we can have some more reconsideration?

Comment #5

damien tournoud commented 2 October 2008 at 16:54

I'm not sure I really understand.

# Compare with MD5 hashing result is more accurate.

Huh? Why?

# Replace fields as BLOB can simplify maintenance of DB schema.

That's the only valid point of that patch. You should have put it on top.

# UTF8 + BLOB = perfect storage.

I don't get this. Why should it be unsafe to store UTF8 in a TEXT column on PostgreSQL? TEXT columns are *designed* for this.

Comment #6

hswong3i commented 2 October 2008 at 19:02

Compare with MD5 hashing result is more accurate.

It is because MD5 will give a "complete fingerprint" of the input data (locales_source.source) for comparison, but still keep the fingerprint as 32bytes only. Only compare first 30 characters != compare the complete version of input data, which maybe not safe enough in some rare cases.

UTF8 + BLOB = perfect storage.

As a counter prove that's why MySQL now using BLOB rather than TEXT: because TEXT may not be able to handle ALL UTF8 characters, sometime it maybe filtered out. On the other hand, BLOB is designed as RAW-IN-RAW-OUT storage type: whatever data you feed into it, whatever data you can fetch from it. BLOB will NEVER (most likely) retouch your data, and so suitable for UTF8 storage.

Comment #7

damien tournoud commented 2 October 2008 at 20:28

I'm sorry but you are wrong on both cases:

It is because MD5 will give a "complete fingerprint" of the input data (locales_source.source) for comparison, but still keep the fingerprint as 32bytes only. Only compare first 30 characters != compare the complete version of input data, which maybe not safe enough in some rare cases.

You are confusing the index (on the first 30 characters) with the actual comparison. The database use the index to get potential matches and compare all of them with the actual data to find a true match.

As a counter prove that's why MySQL now using BLOB rather than TEXT: because TEXT may not be able to handle ALL UTF8 characters, sometime it maybe filtered out.

Our MySQL implementation uses BLOB instead of TEXT for one reason (and one reason only): because of text collation. MySQL has two peculiarities unknown from most other DBMS:

- it has a column-by-column collation setting
- it defaults to case-insensitive comparison by default

Here we need a case-sensitive comparison, but because our Schema API doesn't manage collations it was easier to change the TEXT column into a BLOB.

*Any* valid UTF8 data can be stored in a TEXT column, in both MySQL and PostgreSQL.

Comment #8

hswong3i commented 3 October 2008 at 02:14

Please correct me:

The database use the index to get potential matches and compare all of them with the actual data to find a true match.

Is that means index will give a hand for the full-text-based comparison? If so for sure that using MD5 hash should be better because it is something already preprocessed (similar as cache?) but still able to keep the fingerprint of original data.

On the other hand, we are now doing "full-text search + complete 32bytes indexing + max 32bytes data", therefore it should be more SCALABLE than that of "full-text search + first 30 characters indexing + max GB-scale data" operation in theory?

Benchmarking result also show the relatively stable result (340ms ~ 349ms) when compare with original implementation (346ms ~ 368ms). Please refer to benchmarking record set in #2 for more information.

because of text collation... Here we need a case-sensitive comparison

Again MD5 show its simplicity. Because it is already preprocessed and keep its original fingerprint in case-sensitive format, we can simply forget any restriction of any DB engines and so maintenance is simplified :D

Comment #9

gábor hojtsy

he/him

Hungarian

Hungary

commented 3 October 2008 at 10:38

hswong3i: you are contradicting yourself. If the BLOB type is only used to enforce matching, then if we'd introduce md5 based matching, we'd be better to move back to TEXT, right?

Comment #10

hswong3i commented 8 October 2008 at 17:41

Priority:

Normal

» Critical

Status	File	Size
new	locales_source_hash-1223487464.patch	10.39 KB

@Gabor: There is no contradiction, because this is a split issue of http://drupal.org/node/147947

Update based on http://drupal.org/node/316095 founding. Revamp BLOB field with nullable.

Tested with MySQL and PostgreSQL. Stand alone simpletest, pass; simpletest when combine with http://drupal.org/node/299088, also pass.

Comment #11

gábor hojtsy

he/him

Hungarian

Hungary

commented 8 October 2008 at 17:55

hswong3i: The contradiction in your saying is that you argue we should use BLOB, because it allows for better UTF8 storage, while we only use BLOB because it allows for better case sensitive comparison. Since you say we should abandon that field for comparisons and use a hash instead, there is no need whatsoever to use BLOB, and we can migrate to TEXT. Dries also suggested in http://drupal.org/node/147947 that we should make a distinction of textual and binary data, and keep textual data as text.

Comment #12

hswong3i commented 8 October 2008 at 20:11

@gabor: Be honest that http://drupal.org/node/147947 is my initial motivation of this issue. Sometime we don't even have choice between TEXT or BLOB... Some database engine have too many limitation for TEXT, especially storage size limitation, e.g. Oracle and DB2. When working with huge storage requirement, BLOB as storage only + assistance hash column for comparison should be the most balance and efficiency implementation.

BTW, whenever mention "Oracle" people may feel this is a bias implementation and so trigger negative feeling and feedback. Therefore I try to approach the goal with performance, compatibility and stability consideration. Please forgive my contradiction :S

Comment #13

meba commented 11 October 2008 at 14:04

subscribing

Comment #14

hswong3i commented 14 October 2008 at 17:27

Title:

Revamp locales_source.source with hashing: performance, compatibility and stability

» [DBTNG + BLOB + hash]: Revamp locales_source.source with BLOB and MD5 hashing

Status	File	Size
new	dbtng-locales_source_hash-1224005025.patch	10.49 KB

Since #316095: Raise PHP requirement to 5.2.12 for PostgreSQL only already figure out the solution for PostgreSQL + BLOB + NULL + INSERT/UPDATE bug, this patch is now safe for using both null or nullable BLOB field.

Patch reroll via CVS HEAD. Only standardize field type as BLOB. Update coding standards for DBTNG.

P.S. Actually, I would like to add default value for "variables" as it is 'not null' => TRUE by default. But since MySQL will buggy with #300219: [DBTNG]: MySQL should remove TEXT/BLOB default value so I would like to wait and handle this with another issue.

Comment #15

damien tournoud commented 14 October 2008 at 18:37

Priority:	Critical	» Normal
Status:	Needs review	» Closed (won't fix)

@hswong3i: you haven't prove any benefit of that patch, so I'm won't fixing it, again. Please don't open it again unless you have new arguments for that approach.

Comment #16

hswong3i commented 24 November 2008 at 19:45

Status	File	Size
new	dbtng-locales_source_hash-1227555790.patch	10.52 KB

Comment #17

hswong3i commented 27 November 2008 at 05:28

Status	File	Size
new	blob-locales_source_hash-1227760668.patch	10.54 KB

[DBTNG + BLOB + hash]: Revamp locales_source.source with BLOB and MD5 hashing

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

News items

Our community

Documentation

Drupal code base

Governance of community