Problem/Motivation
Schema is missing generic BINARY and VARBINARY support (especially mysql); this prevents leveraging those DB-level facilities for cases where efficiency (query lookup time, storage usage) is highly desired.
Hex hashes are widely used, for many kinds of situations; they are just string representations of bytes; storing a hash as a VARCHAR (even if using the binary option) does not provide the same level of speed or reduced storage usage as just leveraging DB-level binary data structures. For a MySQL-specific analysis, see this blog post.
Actual measurements by @mfb:
... storing e.g. a SHA-256 hash in a UTF-8 CHAR(64) column uses six times more bytes than a BINARY(32) column.
... storing a hash in an ASCII VARCHAR(64) column is a big improvement, but still uses more than twice as much space as a BINARY(32) column
Here are some real-world numbers, for 4000 rows of SHA-256 hashes as primary key plus an additional index*.
CREATE TABLE `binhash` ( `hash` binary(32) NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4binary(32) size on disk: 640K
CREATE TABLE `aschash` ( `hash` varchar(64) CHARACTER SET ascii NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4varchar(64) ascii size on disk: 9.0M
Steps to reproduce
N/A, this is for new functionality.
Proposed resolution
- Add schema support for BINARY and VARBINARY columns.
- Augment database abstraction layer so that binary data can be easily used for all types of queries, for all supported DB engines.
Remaining tasks
Add schema support(see #50, #52).- Finish test coverage for schema support.
- TBD: add support for proper conversion to/from binary for all types of queries. See #22
User interface changes
None.
API changes
hook_schema: support for type => binary/varbinary.- Proper helpers (escaping, handling of default value, etc).
Data model changes
None
Release notes snippet
TBD
| Comment | File | Size | Author |
|---|---|---|---|
| #67 | 710940-db-binary-type-67.patch | 5.06 KB | jedihe |
Comments
Comment #1
Crell commentedWe're not making any changes to the schema support in Drupal 7. We can come back to this in Drupal 8.
Comment #2
mfbSubscribe. Storing e.g. a SHA-256 hash in a UTF-8 CHAR(64) column uses six times more bytes than a BINARY(32) column.
Comment #3
ebeyrent commentedNow that D8 is in alpha 3 release, was this addressed at all?
Comment #4
danblack commentedComment #5
mfbThis looks great, thanks!
Comment #6
danblack commentedadded test cases.
Also fixed sloppy quoting of DEFAULT strings using the pdo->quote.
Comment #7
sun...and that goes into a separate issue + patch. :-)
No need to use format_string() here, because we know what $table_name contains.
"Table $table_name created."
The patch actually adjusts Postgres + SQLite accordingly, so I'm adjusting the issue title.
Comment #8
danblack commentedquoting fixed in patch in issue 2232425 which is now a requirement for this to work
format_string corrected though was copying from same code base.
Comment #10
danblack commentedAlso discover a PDO exception with this in postgres while testing the #2224847: Automatically shorten cid's in Cache\DatabaseBackend and PhpBackend
What postgres need to escape bindata in some way. Looks like a job for core/lib/Drupal/Core/Database/Query/Condition.php
Comment #11
danblack commentedPostgres needs http://www.php.net/manual/en/function.pg-escape-bytea.php
No obvious place to insert this. https://github.com/doctrine/dbal/pull/452 doesn't seem to include it.
Comment #12
danblack commentedDeferred postgresql support in Update/Select arguments to #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments - seems a little tricky to tackle on its own.
Test cases updated. Postgresql seems to handle schema bits associated with default value using standard quote. Still waiting on #2232425: Database Schema field/column default value is not properly quoted via PDO::quote() to be committed.
Dependency of #2222635: Waiting for table metadata lock on cache_field table
Comment #13
Crell commentedWe need to have someone verify that this works on Postgres and SQLite before we commit. I'm not opposed, though.
Comment #15
danblack commentedI have verified passes the Schema test suite on postgres and sqlite.
I've even DatabaseBackendUnitTest with patch [2222635-14] (with the following change) on postgres and sqlite since postgres doesn't like escaping binary in SELECT statements.
Comment #16
danblack commented12: binary.patch queued for re-testing.
Comment #17
danblack commented8: binary.patch queued for re-testing.
Comment #18
danblack commentedThanks to the commit in #2232425: Database Schema field/column default value is not properly quoted via PDO::quote() this now passes automated tested and is ready for review/committing as is a prerequisite for hash based db caches (#2222635: Waiting for table metadata lock on cache_field table)
Comment #19
pwolanin commented12: binary.patch queued for re-testing.
Comment #20
pwolanin commentedSo, I'm a little confused about the posgres docs for bytea: http://www.postgresql.org/docs/9.1/static/datatype-binary.html
It seems in #12 that's deferred? however, we could convert the binary to hex instead of using the standard escape?
Comment #21
danblack commentedMaybe. However I'm still stuck with then how do we override the standard escape in the db api layer and somehow have an awareness of which field is the bytea to contain escaping?
So if cidhash is to be escaped how is the following expression written to use hex.
Directly inserting it into the query string is one unelegant way:
Comment #22
sunLooks like we need a new
escapeBinary()method for all db drivers - essentially following the existingescapeLike()model?Some drivers might be able to piggy-back onto
quote()in case PDO implements native binary quoting support for them.Comment #23
pwolanin commented@sun - so how would that work? postgres would make it hex and mysql would be a no-op?
Comment #24
danblack commented@sun, @pwolanin - can we take the issue of postgresql compatibility of bytea to #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments as this patch doesn't introduce the incompatibility of bytea and postgres select, it exists currently as postgres already maps blobs to bytea.
As such this is adding a feature and no regressions even though currently you'll need to be careful how you use it if binary types are used in selects with non-ascii values as parameters.
Comment #25
pwolanin commentedHmm, well, in the cache table issue we can make binary work for mysql only, so I don't think we should show the schema API as fully supporting it unless we resolve this (small?) postgres issue.
Of course, getting the testbots to run postgres too wold be a big help.
Comment #26
danblack commentedIt may be small but its really ugly to fix - progress is been made - look in the appropriate bug report.
Comment #27
berdirNote that we support the binary => true/false flag and now also use that for the cid column in cache tables by default. If that is set, then the BINARY flag is used for MySQL.
Comment #28
mfb@Berdir these are two unrelated things. The BINARY type is for storing binary data; the BINARY attribute is for storing text with a defined character set but a binary case-sensitive collation. The BINARY attribute automatically chooses an appropriate collation for the column's character set e.g. utf8_bin.
Comment #29
gangaloo commentedI'm using Drupal 7, no plans to go to 8 yet.
I'm writing up a module with a .install file and using hook_schema() in the .install file.
The install file includes the function for a hook_schema().
Working with encryption and I need to store as type BINARY or VARBINARY.
Looking at the Schema API, BINARY nor VARBINARY is listed.
What I've tried:
Band-aiding a solution to use base64_encode() or bin2hex(), will cost storage space and performance (I believe).
If i apply this patch (Patch #12), would it enable me to store BINARY / VARBINARY for hook_schema()?
thanks in advance,
Comment #30
Crell commentedgangaloo: This issue is about Drupal 8. Please open a new issue for Drupal 7, as the patch won't apply anyway due to other changes in the code base.
Comment #31
bzrudi71 commentedComment #32
catchFor cache tables we're now using varchar_ascii for the cache IDs.
Is there any other use case for this?
Given #1031122: postgres changeField() is unable to convert to bytea column type correctly I'd be very wary about adding even more bytea dependencies to core.
Comment #33
danblack commentedSeems to be enough desire over the years to add this. Doctrine dbal added it years ago. As referenced in the other issues there are still some limitations around binary/varbinary types and these issues have a body of knowledge around them and aren't impossible to fix and nor are they necessary for a variety of uses.
In the mean time however consider that there is a range of functionality provided by this patch that isn't available currently.
Comment #34
catchThere's desire expressed in this issue but the issue doesn't mention varchar_ascii which can store hashes very efficiently, hence me asking what people want to store in it now.
Once we add a feature we can't control what people use it for, so we should try to fix known issues first.
Comment #35
mfbre: #2 storing a hash in an ASCII VARCHAR(64) column is a big improvement, but still uses more than twice as much space as a BINARY(32) column
Comment #36
pwolanin commentedmoving to an appropriate version for possible features
Comment #38
twistor commentedHow is varchar_ascii more efficient than utf8? I thought the only difference was allowed index size. The size on disk is the same.
The efficiency with varbinary comes from not having to encode binary data (hashes, encryption, files) into text via hex or base64.
Comment #39
mfbYou are correct, using ascii instead of utf8mb4 doesn't help size on disk at all (I just checked to verify this :) So we have some improvement here only for index size.
Here are some real-world numbers, for 4000 rows of SHA-256 hashes as primary key plus an additional index*.
CREATE TABLE `binhash` ( `hash` binary(32) NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4binary(32) size on disk: 640K
CREATE TABLE `aschash` ( `hash` varchar(64) CHARACTER SET ascii NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4varchar(64) ascii size on disk: 9.0M
* per MySQL docs:
Comment #47
jedihe commentedTrying a re-roll for 9.1.x. Let's see what the testbot says...
Re-roll was manual, so no interdiff. Summary of important differences:
- fieldSetDefault() changes: dropped due to fieldSetDefault() being deprecated in favor of changeField() (I'm assuming changeField() is already updated via createFieldSql() changes). See change record for fieldSetDefault() deprecation.
- ::_escapeDefaultValue(): already implemented in 9.1.x. (Database\Schema.php).
- Test was ported to new SchemaTest.php, I tried following the pattern in other tests for some calls. Also, commented out tests for now-non-existant fieldSetDefault() et al, with a TODO for using changeField() instead.
Comment #48
jedihe commentedSorry testbot!, I got you yelling all the way through tests due to a silly mistake. It should be fixed now.
quasi-interdiff is just:
Comment #50
jedihe commentedTrying to fix the deprecation...
Comment #51
jedihe commentedNow that we have passes for the three DBMSes, let's try to get some feedback; I'm wondering if testing changeField() is really necessary (see TODO). If not, we can just remove the TODO and continue review + refinements.
Comment #52
jedihe commentedFor anybody interested, #50 applies on 8.8.x. With it, you can define binary fields in hook_schema():
For MySQL queries, HEX()/UNHEX() + db->select() and db->query() (for INSERT) should suffice:
Comment #53
jedihe commentedDid some more research on this today, and I'm now wondering if \PDO::PARAM_LOB can help to have a single way of properly escaping binary data; PDO-style sample from https://stackoverflow.com/a/59360328:
And we should be able to use \PDO::PARAM_LOB via \Drupal\Core\Database\Connection::quote().
Comment #54
daffie commentedComment #55
daffie commentedComment #56
jedihe commentedUpdating IS.
Comment #57
jedihe commentedComment #58
jedihe commentedGiven #50 provides almost complete support for hook_schema() functionality, I suggest we split adding support for proper conversion to/from binary for all types of queries into a new issue. #52 shows how to leverage existing facilities to handle hex hashes. I haven't checked other uses of binary data, though.
Comment #60
daffie commentedI think that it is better to do more in this issue. Lets start using this functionality in at least one place in core. To me that is the best way to see that it works as it should.
Maybe we should first do #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments, before doing this issue.
Comment #64
jedihe commentedJust to report we have been using #50 in a site; the patch has been working fine through various changes in the site codebase/infrastructure:
The site handles ~900k unique visits per month. During this time, the site generated ~1M records in the table having the binary field.
Comment #66
jedihe commentedLet's try with a quick re-roll.
Comment #67
jedihe commentedFixed cspell errors by adding the words to the dictionary.
Comment #68
daffie commented@jedihe: Thank you for working on this. Let me start by saying that I am not against this issue, only that I am not convinced that the benefits of using binary and varbinary fields outway the effort to make the change and the problems we will get in making the change. The only benefit that I see is that we save a bit of harddisk space. We do not live in 1970 any more and harddisk space is cheap. If it makes Drupal faster in a significant way, than this would something we should do.
Comment #69
mfbOn mysql, historically a binary/varbinary column gave you better performance than a blob column by avoiding slow on-disk temporary tables. Version 8.0.13 improved handling of blob columns: they can now be stored in an in-memory temporary table. I'd guess binary/varbinary might still be slightly faster than blob overall due to how the data is stored in table - could be a fun little benchmarking project..
Comment #70
jedihe commented@daffie: thanks for the detailed feedback. I understand that this patch touches on low-level functionality, so potential inclusion will only occur if a very compelling argument is made for it. For now, I'm happy with just having this patch in a working state for MySQL.
Comment #71
ghost of drupal pastConsider https://www.drupal.org/pift-ci-job/2591966 as a use case for (var)binary. I am not even sure how to do this without VARBINARY support, to be fair.
Edit: the module has been released with the appropriate drupalci.yml file to apply a minimal version of this patch.
Comment #72
mfbCan someone sum up what has been tricky about adding support for binary/varbinary, given that blob is already supported? Is there something broken about blob on postgres? (I happen to have a contrib module that stores some binary data in a cache table, so I was assuming this basically worked..)