Closed (won't fix)
Project:
Drupal core
Version:
7.x-dev
Component:
language system
Priority:
Normal
Category:
Feature request
Assigned:
Reporter:
Created:
26 Sep 2008 at 19:13 UTC
Updated:
27 Nov 2008 at 05:28 UTC
Jump to comment: Most recent file
Comments
Comment #1
gábor hojtsy- The code is full of inserts and updates, where you md5(). It might be better to abstract these out.
- Although matching on md5 instead of the string sounds more performance, a performance benchmark would be mandatory for this to be accepted.
Comment #2
hswong3i commentedPerformance is boosted base on benchmarking result, detail:
I keep the change as minimal as possible, which clone most legacy logic but only revamp with TNG DB syntax :D
Comment #3
damien tournoud commented@hswong3i: thanks for the benchmark. Based on your figures, I guess we can conclude that this was not a good idea. The partial index on the first 30 characters seems to perform very well.
Comment #4
hswong3i commented@Damien: Thanks for your comment, too. Anyway, this seems to be over simplify the needs of the patch.
It is hard for me to believe performance as the one and only one requirement of this patch. For sure that we won't accept changes if it degrade performance without meaningful enhancement. As a counter prove by benchmarking, this patch should pass.
On the other hand, besides point 1 (performance boost) we still have 3 more benefits:
Maybe we can have some more reconsideration?
Comment #5
damien tournoud commentedI'm not sure I really understand.
Huh? Why?
That's the only valid point of that patch. You should have put it on top.
I don't get this. Why should it be unsafe to store UTF8 in a TEXT column on PostgreSQL? TEXT columns are *designed* for this.
Comment #6
hswong3i commentedIt is because MD5 will give a "complete fingerprint" of the input data (locales_source.source) for comparison, but still keep the fingerprint as 32bytes only. Only compare first 30 characters != compare the complete version of input data, which maybe not safe enough in some rare cases.
As a counter prove that's why MySQL now using BLOB rather than TEXT: because TEXT may not be able to handle ALL UTF8 characters, sometime it maybe filtered out. On the other hand, BLOB is designed as RAW-IN-RAW-OUT storage type: whatever data you feed into it, whatever data you can fetch from it. BLOB will NEVER (most likely) retouch your data, and so suitable for UTF8 storage.
Comment #7
damien tournoud commentedI'm sorry but you are wrong on both cases:
You are confusing the index (on the first 30 characters) with the actual comparison. The database use the index to get potential matches and compare all of them with the actual data to find a true match.
Our MySQL implementation uses BLOB instead of TEXT for one reason (and one reason only): because of text collation. MySQL has two peculiarities unknown from most other DBMS:
- it has a column-by-column collation setting
- it defaults to case-insensitive comparison by default
Here we need a case-sensitive comparison, but because our Schema API doesn't manage collations it was easier to change the TEXT column into a BLOB.
*Any* valid UTF8 data can be stored in a TEXT column, in both MySQL and PostgreSQL.
Comment #8
hswong3i commentedPlease correct me:
Is that means index will give a hand for the full-text-based comparison? If so for sure that using MD5 hash should be better because it is something already preprocessed (similar as cache?) but still able to keep the fingerprint of original data.
On the other hand, we are now doing "full-text search + complete 32bytes indexing + max 32bytes data", therefore it should be more SCALABLE than that of "full-text search + first 30 characters indexing + max GB-scale data" operation in theory?
Benchmarking result also show the relatively stable result (340ms ~ 349ms) when compare with original implementation (346ms ~ 368ms). Please refer to benchmarking record set in #2 for more information.
Again MD5 show its simplicity. Because it is already preprocessed and keep its original fingerprint in case-sensitive format, we can simply forget any restriction of any DB engines and so maintenance is simplified :D
Comment #9
gábor hojtsyhswong3i: you are contradicting yourself. If the BLOB type is only used to enforce matching, then if we'd introduce md5 based matching, we'd be better to move back to TEXT, right?
Comment #10
hswong3i commented@Gabor: There is no contradiction, because this is a split issue of http://drupal.org/node/147947
Update based on http://drupal.org/node/316095 founding. Revamp BLOB field with nullable.
Tested with MySQL and PostgreSQL. Stand alone simpletest, pass; simpletest when combine with http://drupal.org/node/299088, also pass.
Comment #11
gábor hojtsyhswong3i: The contradiction in your saying is that you argue we should use BLOB, because it allows for better UTF8 storage, while we only use BLOB because it allows for better case sensitive comparison. Since you say we should abandon that field for comparisons and use a hash instead, there is no need whatsoever to use BLOB, and we can migrate to TEXT. Dries also suggested in http://drupal.org/node/147947 that we should make a distinction of textual and binary data, and keep textual data as text.
Comment #12
hswong3i commented@gabor: Be honest that http://drupal.org/node/147947 is my initial motivation of this issue. Sometime we don't even have choice between TEXT or BLOB... Some database engine have too many limitation for TEXT, especially storage size limitation, e.g. Oracle and DB2. When working with huge storage requirement, BLOB as storage only + assistance hash column for comparison should be the most balance and efficiency implementation.
BTW, whenever mention "Oracle" people may feel this is a bias implementation and so trigger negative feeling and feedback. Therefore I try to approach the goal with performance, compatibility and stability consideration. Please forgive my contradiction :S
Comment #13
meba commentedsubscribing
Comment #14
hswong3i commentedSince #316095: Raise PHP requirement to 5.2.12 for PostgreSQL only already figure out the solution for PostgreSQL + BLOB + NULL + INSERT/UPDATE bug, this patch is now safe for using both null or nullable BLOB field.
Patch reroll via CVS HEAD. Only standardize field type as BLOB. Update coding standards for DBTNG.
P.S. Actually, I would like to add default value for "variables" as it is
'not null' => TRUEby default. But since MySQL will buggy with #300219: [DBTNG]: MySQL should remove TEXT/BLOB default value so I would like to wait and handle this with another issue.Comment #15
damien tournoud commented@hswong3i: you haven't prove any benefit of that patch, so I'm won't fixing it, again. Please don't open it again unless you have new arguments for that approach.
Comment #16
hswong3i commentedComment #17
hswong3i commented