Problem/Motivation
We don't need to use UTF-8 for all our indexed machine names/UUIDs when they might as well be simple ASCII, especially not if we are looking to implement utf8mb4 in the future, which would further increase index size and could affect query buffer performance. (VAR)CHAR utf8mb4 fields also have a 191-character limitation on their index prefixes (at 4 bytes per character, 191 characters is what fits under InnoDB's default 767-byte index prefix limit), while ASCII would allow for indexing all 255 characters.
Furthermore, not specifying at the schema level which fields should be UTF-8 and which ones should be simple ASCII adds to our technical debt, as developers may use UTF-8 characters in places where we haven't tested that they will actually work. Using non-ASCII characters in a module name prevents us from using hooks, for instance, as PHP only accepts ASCII characters in function names.
Proposed resolution
Originally proposed by @sun & @Damien Tournoud:
#1314214-82: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols):
It's probably worth tidying up our schema (for example: machine name-type keys should probably use an ascii character set so as to reduce the size of the index)
we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have popped up all over the place in Drupal 8).
What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be worked around by simply bumping database version requirements.
Related issues
Remaining tasks
- Review patch
- File a followup issue for ASCII support in database engines other than MySQL
User interface changes
N/A
API changes
- By cleaning up our schema definitions in this issue, we explicitly disallow non-ASCII characters for certain machine names at the database level. If we didn't disallow this already, we now do, and stick to supporting UTF-8 for content only.
- Addition of a new `is_ascii` setting on the string formatter (in addition to the already existing "length" and "case_sensitive" settings).
- Addition of a new `varchar_ascii` type on the schema definition.
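As a sketch of how these two additions fit together (the table and field names here are illustrative, not taken from the patch):

```php
<?php
// Hypothetical hook_schema() entry: the new 'varchar_ascii' type replaces
// 'varchar' for machine-name style columns so the full length can be indexed.
$schema['example_table']['fields']['machine_name'] = [
  'type' => 'varchar_ascii',
  'length' => 255,
  'not null' => TRUE,
  'default' => '',
];

// Hypothetical base field definition: the new 'is_ascii' setting sits
// alongside the existing 'max_length' and 'case_sensitive' settings.
$field = \Drupal\Core\Field\BaseFieldDefinition::create('string')
  ->setSetting('max_length', 255)
  ->setSetting('is_ascii', TRUE);
```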
Beta phase evaluation
| Issue category | Task because this is an API cleanup / performance issue |
|---|---|
| Issue priority | Normal, but affects performance, blocks #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) and reduces fragility |
| Prioritized changes | The main goal of this issue is API clean-up and unblocking full UTF-8 support in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) |
| Disruption | Not disruptive for core because we don't use non-ASCII characters in funny places anyway |
| Comment | File | Size | Author |
|---|---|---|---|
| #122 | interdiff-114-122.txt | 1.9 KB | stefan.r |
| #122 | 1923406-122.patch | 39.23 KB | stefan.r |
| #114 | interdiff-112-114.txt | 627 bytes | stefan.r |
| #114 | 1923406-114.patch | 39.14 KB | stefan.r |
| #112 | 1923406-112.patch | 39.98 KB | stefan.r |
Comments
Comment #1
alexpott commented

The default regex on machine names limits the allowed characters using the regex `[^a-z0-9_]+`, so only ASCII characters are used.

Comment #2
amateescu CreditAttribution: amateescu commented

The improvement @sun proposes is for the schema of the table column that stores machine names :)
Comment #3
stefan.r CreditAttribution: stefan.r commented

Does this still have a chance of going in? If I understand correctly, this is blocking #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) (emojis!).
Comment #4
stefan.r CreditAttribution: stefan.r commented

Here's a start; this would add an 'ascii' key. We'd want to use this on all UUIDs, machine names, non-numeric indexed columns and primary keys where possible. There is no need for all those to be encoded in `utf8`.

Comment #6

stefan.r CreditAttribution: stefan.r commented

Seems MySQL has a problem if we don't declare a length for a varchar field in the schema.
Comment #7
Crell CreditAttribution: Crell at Palantir.net commented

I dislike adding a boolean flag on things. That violates Asimov's Law. (2 is the least likely number in the universe.)
Could this be instead done with a charset key that gets set to ascii (and defaults to something reasonable so 99% of cases don't need to specify it)?
Comment #8
stefan.r CreditAttribution: stefan.r commented

I sort of like the simplicity of only supporting one character set (UTF-8), though. As an exception, ASCII is a subset of UTF-8 that we can use for identifiers, but even latin1, for instance, I believe is not. Having a "charset" key sort of gives the impression we support custom column-level character sets; the question is, do we want that... it may lead to issues down the line. Then there is also the question of collations.

The ascii flag is really not much different from the case_sensitive flag given what we would be using it for (i.e. distinguishing between identifier fields and content fields). Maybe the name is unfortunate and we need to rename it to something more sensible? Or, I don't know, is creating a new type (i.e. something other than "string") still possible at this stage?
Comment #9
Crell CreditAttribution: Crell at Palantir.net commented

The idea here is to use a smaller charset for fields that we know are only going to contain machine names (A-Za-z0-9), correct? What about an "is-machine-name" or "is-trivial" or something like that? If we have to use a boolean value it should be on something that is conceptually boolean, whereas "ascii" is not. We can then map that metadata to mean "oh yeah, use ascii encoding for this field, or something else if we find an even better option later on". (That latter part is a big part of why I dislike "ascii" as a boolean.)
I agree that custom charsets per-field is a can of worms even Pandora doesn't want opened.
Comment #10
stefan.r CreditAttribution: stefan.r commented

Correct, that was the idea... I agree something like that makes more sense than "ascii" for a flag name :)
I'll go with `is_trivial` for now, as it encompasses bundle, machine name, uuid, etc.

Comment #11

amateescu CreditAttribution: amateescu commented

How about `is_alphanumeric`?

Comment #12
stefan.r CreditAttribution: stefan.r commented

That's less confusing but not really truthful; we'd still allow underscores and possibly dots :)
Comment #13
Aki Tendo CreditAttribution: Aki Tendo commented

Wouldn't this be better suited to assert? Or are end users expected to input schemas?
Comment #14
amateescu CreditAttribution: amateescu commented

@stefan.r, I'd go with "friendlier" over "truthful but meaningless" any time :)

@Aki Tendo, yep, schema is user input, even though it's less and less so in Drupal 8.
Comment #15
stefan.r CreditAttribution: stefan.r commented

Comment #16

stefan.r CreditAttribution: stefan.r commented

There may be more columns that could be alphanumeric, but this should cover most of them.
Some may require further validation to make sure we don't try to store non-ASCII data in there, and some may need further conversion. For instance, I think there is one field where we tried to store a raw SHA256 hash; that should ideally be base64'd so we can make it an "alphanumeric" field.
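For instance, a raw SHA256 hash can be made ASCII-safe by base64-encoding it (a generic PHP sketch, not the actual core change):

```php
<?php
// hash() with $raw_output = TRUE returns 32 arbitrary bytes, which are not
// guaranteed to be valid ASCII (or even valid UTF-8) text.
$raw = hash('sha256', 'some data', TRUE);

// base64_encode() turns those bytes into a 44-character ASCII string that
// can safely be stored in an "alphanumeric"/ASCII column.
$ascii = base64_encode($raw);
```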
Comment #18
stefan.r CreditAttribution: stefan.r commented

Comment #21

stefan.r CreditAttribution: stefan.r commented

Comment #22

stefan.r CreditAttribution: stefan.r commented

For easier reviewing, I'll summarize the columns that have been turned into simple ASCII columns:
UUID fields
* All, through the UuidItem class
Language fields
* All, through the LanguageItem class
Content entity storage
* langcode and bundle field (The field instance bundle to which this row belongs), through SqlContentEntityStorageSchema.
Text long fields
* The format machine name, through the TextLongItem class
Text with summary fields
* The format machine name, through the TextWithSummaryItem class
cache_* bin tables
* cid (Unique cache ID)
* checksum (The tag invalidation checksum when this entry was saved.)
Cache table for tracking cache tag invalidations
* tag (Namespace-prefixed tag string)
Configuration table
* collection (Primary Key: Config object collection.)
* name (Primary Key: Config object name.)
Other key value tables
* key_value.collection
* key_value_expire.collection
* key_value.name
* key_value_expire.name
Menu tree storage
* menu_name (the menu name. All links with the same menu name (such as 'tools') are part of the same menu.)
* id (Unique machine name: the plugin ID.)
* parent (The plugin ID for the parent of this link.)
* route_name (The machine name of a defined Symfony Route this menu item represents.)
* provider (the name of the module that generated this link.)
Feeds
* hash (Calculated hash of the feed data, used for validating cache.)
Ban
* ip (IP address)
Comment
* entity_type (The entity_type of the entity to which this comment is a reply.)
* field_name (The field_name of the field that was used to add this comment.)
Dblog
* type (Type of log message, for example "user" or "page not found".)
* hostname (Hostname of the user who triggered the event.)
Entity reference
* target_id, through EntityReferenceItem class
File
* module (The name of the module that is using the file.)
* type (The name of the object type in which the file is used)
* id (The primary key of the object using the file)
* filemime (File MIME type)
Locale
* context (The context this string applies to)
* version (Version of Drupal where the string was last used (for locales optimization))
* language (Language code. References {language}.langcode)
* type (The location type (file, config, path, etc).)
* version (Version of Drupal where the location was found)
* locale_file.project (A unique short name to identify the project the file belongs to.)
* locale_file.langcode (Language code of this translation. References {language}.langcode.)
MenuLinkContent
* bundle (The content menu link bundle)
* menu_name (The menu name. All links with the same menu name (such as "tools") are part of the same menu.)
Node
* langcode (the {language}.langcode of this node.)
* node_grants.realm
Search
* langcode (The {languages}.langcode of the item variant.)
* type (type of item, e.g. node)
Shortcut
* set_name (The {shortcut_set}.set_name that will be displayed for this user.)
Simpletest
* message_group (the message group this message belongs to. For example: warning, browser, user.)
* class_name
System
* batch.token (A string token generated against the current user's session id and the batch id)
* flood.event (Name of event (e.g. contact).)
* flood.identifier (Identifier of the visitor, such as an IP address or hostname.)
* queue.name (The queue name.)
* router.name (Primary Key: Machine name of this route)
* semaphore.name (Primary Key: Unique name)
* semaphore.value (lock ID, is uniqid)
* sessions.sid (A session ID (hashed). The value is generated by Drupal's session handlers)
* sessions.hostname (The IP address that last used this session ID)
* url_alias.langcode (The language code this alias is for)
User
* users_data.name (The identifier of the data)
* users_data.module (The name of the module declaring the variable)
Comment #23
stefan.r CreditAttribution: stefan.r commented

Comment #24

yannickoo commented

I'm currently reviewing #23. Going to make sure that no field is missing in the patch :)
Comment #25
yannickoo commented

Patch looks really good and I have also added the `is_alphanumeric` property to long text's `format` column (which stores the used text format).

Comment #26

yannickoo commented

Comment #31
Crell CreditAttribution: Crell at Palantir.net commented

This is adding to the config schema, not DB schema. I don't think this is needed, is it? (DB schema and Config schema SHOULD be entirely independent.)
We should check for a True value specifically, not just mere presence.
The comment here is a little confusing. The value that setHash() returns must be base64 encoded? That seems over-reach for an interface definition. Rather, it should just state that the hash must contain only US ASCII characters. base64 encoding is only one of many ways of achieving that.
I didn't look at the specific fields that are being flagged in detail, but I'll accept yannickoo and stefan.r's work on that. The above issues should be easy enough to resolve so we can move forward here.
The next step would be to include some benchmarks and a beta analysis table to justify this addition in 8.0. I'm OK with it if it passes beta criteria.
Comment #32
stefan.r CreditAttribution: stefan.r commented

Thanks @Crell.
1. Just to clarify, those are string field storage settings which can be either max_length, case_sensitive or is_alphanumeric and are translated to DB schema settings elsewhere:
We can use this setting in the UuidItem/LanguageItem/TextWithSummaryItem/TextLongItem field types and in some of the BaseFieldDefinitions in Feed/File/MenuLinkContent.
If we do the case_sensitive/binary setting in the config schema, I guess we can do this one as well. Maybe we should rename the config setting to something else, just for consistency with the other settings?
2. The binary / unsigned settings actually also do an `!empty()` check in the MySQL driver; are we sure we want to deviate from those? If needed we could break the previous behavior and do stricter testing there as well, only accepting TRUE.

We were also planning to do a last manual review of every single field, just to make sure there is no way we can somehow insert non-ASCII data into those. We'll also need to document the new hook_schema setting in database.api.php.
Comment #33
stefan.r CreditAttribution: stefan.r commentedis_trivial
for consistency with other field storage settings and to clarify it's different to the schema setting.is_alphanumeric
settingFor future reference, other than the following fields, all other indexed fields should now be set to ASCII:
- aggregator.title
- aggregator.url
- file_managed.uri
- block_content_field_data.info
- taxonomy_term_field_data.name
- locales.source
- menu_link_content_data.link__uri
- menu_tree.route_param_key
- node_field_data.title
- router.pattern_outline
- search_total.word
- shortcut_field_data.link__uri
- url_alias.source
- url_alias.alias
- users_field_data.name
- users_field_data.mail
Comment #34
stefan.r CreditAttribution: stefan.r commented

Comment #36

Crell CreditAttribution: Crell at Palantir.net commented

Re the config schema: Yeesh. I didn't realize it was doing that. I defer to someone from the config system team on what the best way to handle it is. It's irrelevant from a DB system perspective IMO. I would prefer to still call it is_numeric, though, as is_trivial is a rather meaningless term.
Re the type of truthy check, this is where PHP is really annoying. :-) The current code would read TRUE, 1, 5, "true", and "Game of Thrones" as "this is alphanumeric" and FALSE, NULL, '', and not-even-set as "not alphanumeric". I generally prefer tighter checks, but I suppose it's reasonable to allow 1 in addition to TRUE. *sigh* PHP... I could be convinced either way here.
One thing we should realize: technically this means that machine names may no longer have non-Latin characters. I.e., a user in Tokyo would still have to use ASCII letters for a machine name of a View. Are we OK with the implications of that? (I am, but I defer to Gabor on the i18n UX implications of that.)
Comment #37
stefan.r CreditAttribution: stefan.r commented

I have reviewed all occurrences with @klausi and basically we should be fine to turn these fields into ASCII fields. He recommended going with `is_ascii` for both the string setting and the schema setting.

@Crell, as you noted, not all of it is validated or codified to be ASCII; we don't protect users against doing dumb things in code, such as using Japanese machine names. It probably threw an error before, and if it didn't, now it does! We thought this shouldn't be supported anyway: the UI, PHP and the OS may not be able to deal with non-ASCII machine names, so we should probably stick to UTF-8 for content only. For instance, we can't use Japanese characters for a module name because then we can't use hooks, as PHP function names need to be ASCII :)
Log of the manual review:
https://docs.google.com/spreadsheets/d/1g42xm9ZWUJRoBcIXaw96QfBZaUqsalGh...
Comment #38
stefan.r CreditAttribution: stefan.r commented

Updated issue summary, added beta evaluation. If this is green we can proceed to profiling this.
Comment #39
stefan.r CreditAttribution: stefan.r commented

Comment #40

stefan.r CreditAttribution: stefan.r commented

Also, just FYI, this should not affect existing installs. New Drupal installs will just have some columns be ASCII instead of UTF-8.
Comment #42
stefan.r CreditAttribution: stefan.r commented

Comment #43

stefan.r CreditAttribution: stefan.r commented

OK, I am happy with how this looks now :)
Comment #44
stefan.r CreditAttribution: stefan.r commented

How do we best benchmark this? Seems the testbot completes our test suite in 16:02 min, which is about the time it usually takes. Do we have any scenarios that simulate a larger database?
Comment #45
yannickoo commented

Tabs instead of spaces D:
Comment #46
stefan.r CreditAttribution: stefan.r commented

Comment #47

stefan.r CreditAttribution: stefan.r commented

From some initial simple where/order by/like benchmarks on a table with a million rows of random content, I can't really get the character set to affect index size or performance (as long as the content is ASCII). Don't know about more complex operations yet, but maybe any performance gains are just theoretical/negligible?
Comment #48
Damien Tournoud CreditAttribution: Damien Tournoud at Centarro commented

@stefan.r was struggling to prove the performance improvements in this issue and asked me to step in.
Performance is not the point: the point is to reduce the size of the indexes so that it is practical to move to `utf8mb4` by default.

Databases are really smart in how they deal with your data, so don't be surprised that even if our schema doesn't make sense (we store ASCII identifiers in a Unicode column), our indexes are still relatively efficient. As long as the size of the indexes fits in memory, it is not going to make a significant difference to query performance anyway.
This is definitely a huge improvement anyway, +1 from me.
Comment #49
stefan.r CreditAttribution: stefan.r commented

Added two more fields that didn't really need to be UTF-8.
Comment #50
yannickoo commented

I think we can set this issue to RTBC now because this is an API clean-up and does not affect existing installations. We have done an extensive review of that together with others, and Damien, Crell and klausi are also fine with this :)
Another good point is that the patch would unblock #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols).
Comment #51
stefan.r CreditAttribution: stefan.r commented

Comment #52

morgantocker CreditAttribution: morgantocker commented

Morgan from the MySQL team here.
For the example of machine names, BINARY is better than ascii, since it's not really text and collations do not apply.
I would clarify that this may possibly be a micro-optimization:
- utf8mb4 indexes and columns store as variable length with InnoDB (so there should be no direct data size saving).
- Any performance loss is going to be in query buffers that are not variable length (during sort, temporary table or sending results to the client). MySQL 5.7 makes more of these buffers variable length, so in the specific case where there is a heavy perf regression (I expect this to be rare), maybe it will be reasonable to recommend an upgrade?
So maybe to clarify the above: I don't think this feature is required.
Comment #53
stefan.r CreditAttribution: stefan.r commented

@morgantocker did you really mean 5.7? Drupal 8 raising the minimum version to 5.7 is pretty unlikely. But we've proposed raising it to 5.5 in #2473301: Raise MySQL requirement to 5.5.3
We did actually discuss using BINARY with our database maintainer @Damien Tournoud, but as mentioned in #48 performance is not the main point here -- this is about working around the maximum index size requirement (191 characters for utf8mb4 unless we use non-standard settings), and cleaning up the API to implicitly say "don't use non-ASCII characters for these machine-name fields"
Comment #54
morgantocker CreditAttribution: morgantocker commented

@stefan.r: I did not mean raising the minimum version to 5.7. I had stated that those requiring extreme performance have an acceptable solution. But I apologize that we may have been talking about different things. The physical index size will not change with ASCII, but the metadata size will (the maximum theoretical length), and this is your current problem.
Maybe it is easier to clarify with examples:
This issue is suggesting example t3. I was suggesting t2 may be more elegant.
Comment #55
stefan.r CreditAttribution: stefan.r commented

Thanks for clarifying; that would explain why I noticed no size difference in my own tests when changing the encoding.
Indeed, either `BINARY` or a different encoding should work in terms of solving the index size limitation for non-utf8mb4 data, although with `BINARY` a worry would be that we'd have to expand the scope of the current patch and deal with the encoding in Drupal itself, validating that what goes into and comes out of these fields is ASCII-encoded data only. Just so as not to have contrib/custom module developers try saving data in other encodings on those fields :)

Comment #56

stefan.r CreditAttribution: stefan.r commented

Comment #57
Crell CreditAttribution: Crell at Palantir.net commented

Sounds like this is still an open question.
Thanks Morgan for jumping in with more information!
Comment #58
stefan.r CreditAttribution: stefan.r commented

Well, in terms of making sure that data in ASCII fields is actually ASCII and allowing for `utf8mb4` encoding by default, I see 3 ways out here:

a) Reducing all indexes to 191 characters max instead of 255, sticking with `utf8mb4` encoding everywhere, potentially hurting performance (although I guess smaller indexes could be better in some cases?). We'd also need to make sure on the database driver level that everything going in and out is actually ASCII encoded. This should not be a very expensive regex, but it may further hurt performance. We'd also need to save somewhere which fields are ASCII or not, as the DBTNG driver does not have access to the schema, which would mean further overhead and complexity. In any case it would need a more complicated patch :)

b) Going with the current patch, which will make MySQL throw an error if we try to do a select with utf8 data on an ASCII field, and garble the content to a question mark if we insert other data (which we've checked we don't do in core), allowing us to use the full 255 characters for an index. Regardless of the index size, we may also get a slight performance improvement vs option a) in terms of the query buffers, which, as was well explained above, would apply to a lesser extent in MySQL 5.7.
c) The more "correct/elegant" solution of using `(var)binary` instead of `(var)char` on the fields marked with `is_ascii`, as suggested earlier. The worry is this may be confusing in terms of DX, as on the schema level we'd mark fields as both `varbinary` and `is_ascii`, where we also already have a `binary` key that casts to binary for case-sensitive comparisons. This should only be used for fields that are not text; in core we don't do this, but it could lead to further confusion for contrib/custom. Also, the same problem as in a) applies, in that we need to validate the encoding of what comes into and out of those fields on the PHP level. Other than being more elegant, I don't know if there's also a performance improvement vs option b), but this may be undone by the extra overhead of doing the ASCII check in the DB driver.

@Crell/Morgan, any thoughts on this yourselves?
Comment #59
morgantocker CreditAttribution: morgantocker commented

I find it confusing to have more than one character set in an application. This is why I like just calling it binary instead of ascii.
I doubt that reducing indexes to 191 bytes will hurt performance, since most strings I have performed analysis on are selective in the first 20-30 characters. The exception may be something like a list of strings that all start with a common prefix (i.e. http://www.drupal.org/node/ ...). In the pathological case you can recommend that the user explicitly enable innodb_large_prefix.
Comment #60
stefan.r CreditAttribution: stefan.r commented

Thanks. It would be good if we could get Damien Tournoud to chime in on this as well.
Comment #61
Damien Tournoud CreditAttribution: Damien Tournoud at Centarro commented

Conceptually, I actually think that an ASCII / UTF8 distinction makes more sense than a BINARY / UTF8 distinction, as the data that we store in those columns is just *not* binary. From a storage perspective, the end result is the same, but from a semantic perspective it is really not.
Comment #62
pwolanin CreditAttribution: pwolanin at Acquia commented

I tend to agree with DamZ: let's use ASCII and get the validation at the DB level.

Looking at the patch, I think the DX would be better if we defined a new schema API data type instead of a flag.

I don't have a strong feeling, but varascii, varchar_ascii, etc.; some type name that makes it clear that the type is different and ASCII.
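The two styles being contrasted might look roughly like this (hypothetical field specs for illustration; the flag name and type name are the ones discussed in this thread):

```php
<?php
// Flag-based: an ordinary varchar plus a boolean flag on the field spec.
$flag_based = [
  'type' => 'varchar',
  'length' => 255,
  'is_ascii' => TRUE,
];

// Type-based: a dedicated schema API data type, as proposed here.
$type_based = [
  'type' => 'varchar_ascii',
  'length' => 255,
];
```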
Comment #63
stefan.r CreditAttribution: stefan.r commented

Comment #66

stefan.r CreditAttribution: stefan.r commented

That patch seems close; let's see what @Crell says before fixing those test failures.
Comment #67
Crell CreditAttribution: Crell at Palantir.net commented

We already have some type mapping logic, so introducing a custom virtual type is, eh, I guess OK. I'm not crazy about it but can live with it. However, I think the implementation may be more challenging:
Would this work? We map varchar_ascii to VARCHAR, then check for varchar_ascii... which shouldn't be there anymore as it's been replaced by varchar, right?
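Assuming the driver's existing type-map approach, the idea might look like this (an illustrative sketch, not the literal core code):

```php
<?php
// The generic 'type' key is mapped to a driver-specific 'mysql_type': the
// virtual 'varchar_ascii' type still maps to a plain VARCHAR column type.
$map = [
  'varchar:normal' => 'VARCHAR',
  'varchar_ascii:normal' => 'VARCHAR',
];
// When emitting the DDL, a field whose generic type was 'varchar_ascii'
// would additionally get an ascii character set clause appended, e.g.
// "CHARACTER SET ascii COLLATE ascii_general_ci".
```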
Comment #68
stefan.r CreditAttribution: stefan.r commented

OK, great to hear we're close to a solution here!
It should work, but I'll see if I can document that or make it less confusing somehow. The first test is for the 'type' key, the second test is for the 'mysql_type' key.
Comment #69
stefan.r CreditAttribution: stefan.r commented

Comment #70

stefan.r CreditAttribution: stefan.r commented

Updated issue summary. Also note SQLite / PostgreSQL support is outside the scope of this issue, as they don't have the utf8mb4 problem.
Comment #72
stefan.r CreditAttribution: stefan.r commented

Comment #73

stefan.r CreditAttribution: stefan.r commented

Comment #74

stefan.r CreditAttribution: stefan.r commented

Comment #75

catch commented

Comment #76

catch commented

Comment #77
stefan.r CreditAttribution: stefan.r commented

I'm assuming any upgrade path would be outside the scope of this issue? We don't support beta-to-beta upgrades yet, do we?
Comment #78
amateescu CreditAttribution: amateescu commented

Yes, the upgrade path is not in scope for this issue. Beta-to-beta upgrades are provided by the HEAD to HEAD contrib module, and the tag helps to keep track of issues that need a patch over there.
Comment #79
Crell CreditAttribution: Crell at Palantir.net commented

I don't know the config schema well enough to say, but should we rename this since "is_ascii" is no longer parallel with the DB schema?
Shouldn't the is_ascii be gone now?
And... should we be using varchar_ascii or varchar:ascii? We seem to be using colons elsewhere, no? (I've not dug into the schema code in quite some time so I may be missing a context... I'm OK with whatever is most consistent for DX.)
Comment #80
stefan.r CreditAttribution: stefan.r commented

Renaming isn't necessary, as the string settings in config are independent of the database schema. I also discussed that with @pwolanin and we settled on keeping it; for instance, the `case_sensitive` string config setting translates to "binary" on the schema level.

The colons would be confusing as they're already used for size; as far as DX is concerned, an underscore is better.
Indeed the is_ascii on the hook_schema should be gone, I'll post a new patch.
Comment #81
stefan.r CreditAttribution: stefan.r commented

Comment #82

Crell CreditAttribution: Crell at Palantir.net commented

Makes sense to me.
Comment #83
joelpittet commented

Silly question: can or should this be backported to D7?
Comment #84
catch commented

I don't think we can rely on cache IDs being ASCII-only, or if we do, it'd be an API change from 7.x. I have seen URLs with utf8 characters used as cache IDs in 7.x, for example.
Comment #85
catch commented

Also, situations like that make me very tempted to schedule this patch for a couple of days after the next beta, to have some time to flush them out. We should use ASCII as much as possible for index size, but only as much as possible.
Comment #86
hass CreditAttribution: hass commented

The empty line in the array should be removed.
Comment #87
hass CreditAttribution: hass commented

One more empty line.
Comment #88
stefan.r CreditAttribution: stefan.r commented

@catch the worry is that as we switch to the utf8mb4 charset, a primary key (such as the cache ID) can only be 190 characters. I think an ASCII cache ID would be an API break worth having, as the point of the patch is also to avoid using UTF8 identifiers where not desirable... As to the specific D7 example you cite, in D8 we use a hash for that (I think as of #1855260: Page caching broken by accept header-based routing).
There was already a thorough review at the dev days, as far as D8 core is concerned at least it seemed solid, but there is still some uncertainty/doubt so looks like this will still need a final review. Which should be hours, not days, as some of the previous review has already been documented, but if there's no time to do so before the next beta maybe this issue should be set to postponed?
Comment #89
stefan.r CreditAttribution: stefan.r commented

@hass the empty line was actually on purpose, so we don't have to touch the vertically aligned mega-array underneath, and as it's a "special" data type anyway. The mysql and sqlite drivers use empty lines there as well to separate similar groups; it's probably fine for now unless anyone still thinks we need to change the code style in those places.
Comment #90
catch commented

Well, it might be worth having, but it's a completely undocumented restriction at the moment. Additionally, it's only a restriction if you're using MySQL for the cache backend. So it would be possible to develop some code that works fine with another cache backend which then fails if you switch back to MySQL.
Either the mysql backend should normalize to ASCII, or it needs updating in the cache documentation at a higher level.
afaik we don't use a hash for URLs, we just automatically hash if a cache ID is longer than 256 characters, but don't have the issue to hand.
Double-checked and we do validate language codes in \Drupal\language\Form\LanguageFormBase::validateCommon() so this and the rest of the patch should be OK - just cache IDs to figure out.
Comment #91
stefan.r CreditAttribution: stefan.r commented

Thanks for the final review, @catch. Would it help if, as a first step, we did transliteration in `DatabaseBackend::normalizeCid()` and then had the discussion about whether to allow non-US-ASCII characters at all (which is less pressing) in another issue?

So right now this looks like this:
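A simplified sketch of the proposal (hypothetical; the actual `DatabaseBackend::normalizeCid()` in the patch may differ, and real transliteration would use Drupal's transliteration service rather than the crude replacement below):

```php
<?php
// Keep the cache ID as-is only when it is short enough AND pure ASCII;
// otherwise build an ASCII-safe ID from a normalized prefix plus a hash
// of the original ID to preserve uniqueness.
function example_normalize_cid($cid) {
  if (strlen($cid) <= 255 && mb_check_encoding($cid, 'ASCII')) {
    return $cid;
  }
  // Crude stand-in for transliteration: strip non-printable-ASCII bytes
  // from a truncated prefix, keeping it readable for debugging.
  $prefix = preg_replace('/[^\x20-\x7E]/', '', substr($cid, 0, 191));
  return $prefix . ':' . hash('sha256', $cid);
}
```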
Just to be clear, I'm proposing to add a `mb_check_encoding($cid, 'ASCII') === TRUE` check along with the length check, and transliterate the initial part of the cache ID.

Comment #92
stefan.r CreditAttribution: stefan.r commented

Comment #93

stefan.r CreditAttribution: stefan.r commented

Comment #94

stefan.r CreditAttribution: stefan.r commented

Just to solve the cache ID issue in one go, would something like this work for now?
Comment #95
stefan.r CreditAttribution: stefan.r commented

Adding another empty line, just to be consistent and to address @hass's feedback in #86/#87.
Comment #99
pwolanin CreditAttribution: pwolanin at Acquia commented
Yeah, not sure we need to convert cache IDs to ASCII. The shorter index should be OK - possibly just hash/truncate the cache ID to 190?
Comment #100
Crell CreditAttribution: Crell at Palantir.net commented
Having the MySQL cache backend hash at 190 instead of 255 sounds fine to me. Alternatively, I'd also be OK with omitting that field from the conversion here. I have no strong preference.
Comment #101
stefan.r CreditAttribution: stefan.r commented
@pwolanin not sure I follow. Part of the point of this issue was to disallow utf8mb4 where not needed; why disallow it in all machine names except cache IDs? Do we *need* special characters in there?
I think ideally we'd disallow non-ASCII characters on cache IDs altogether (by documenting this on the database cache backend interface) but there seems to be some doubt as to where we might use non-ASCII cache IDs in core, so transliterating them seemed like the most workable solution at least for now.
normalizeCid() exists to work around database limitations; we already have a length limitation, which the normalization works around, so if we add an ASCII limitation here, the normalization would just transliterate in the same way, at least until we confirm that core doesn't use any non-ASCII cache IDs anymore.
Also, limiting utf8mb4 keys to 190 characters would probably have to be done in a followup issue, as I don't know that the cache ID is the only such key.
@Crell: you and @pwolanin are talking about two different things here. It's not "either one or the other", it's either "both of those things" or "neither of them" :) Only utf8mb4 keys have the 190 character limitation, so if we exclude the cid field from conversion here, by definition we'll need to limit its length to 190 as well.
Comment #102
stefan.r CreditAttribution: stefan.r commented
Comment #103
pwolanin CreditAttribution: pwolanin at Acquia commented
Disallowing UTF-8 in cache IDs is an API change per catch above - let's not do that?
Comment #104
stefan.r CreditAttribution: stefan.r commented
Fair enough. Is there a concern with the current patch? (It doesn't pose an API change: it still allows cache IDs to have any sort of format/length and just normalizes them to the database limitations.)
Comment #105
stefan.r CreditAttribution: stefan.r commented
Adding a test.
Comment #106
anavarre
Nit: 80 cols.
Comment #107
stefan.r CreditAttribution: stefan.r commented
Thanks, the test had a problem as well. Interdiff is against the RTBC patch in #81.
Comment #108
stefan.r CreditAttribution: stefan.r commented
Updating a comment.
Comment #109
catch
Tested #107:
Applied this change to node.routing.yml
Added one alias (/node -> /front)
Visited node/addお
Then I looked at the path alias preload cache IDs, which are by path:
So in that case we could have just transliterated and not hashed - but the code always hashes if transliteration is needed, not a massive issue.
However, I'm not sure we should transliterate here at all - it's an extra dependency for the database backend. Instead we could just do the ASCII check and hash without any prefix in that case. That's still ASCII-safe, and for the rare cases where there are non-ASCII cache IDs I don't think we're losing much by ditching the prefix, since we only have the prefix to aid debugging/testing. It comes down to whether the transliteration dependency makes a measurable difference to the internal page cache.
Would be good to add test coverage for the database backend for whatever it ends up doing though.
Comment #111
stefan.r CreditAttribution: stefan.r commented
OK, that works for me as well, if we think it's worth losing the debugging aid for non-ASCII cache IDs.
As to the page cache ID specifically, I guess we could just urlencode that instead upon setting (instead of in the DB backend)?
The hash was to prevent transliteration collisions. I'm sure there's a way to have two different UTF-8 IDs turn into the same ASCII ID if we don't use the hash.
Comment #112
stefan.r CreditAttribution: stefan.r commented
The attached patch hashes any non-ASCII cache IDs so that we can lose the transliteration dependency. There's still test coverage through the tests that were added in #107.
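For context, the behaviour described here could be sketched as follows (a sketch, not the literal patch; the 255 character limit and the Crypt::hashBase64() call are assumed from the existing length normalization):

```php
// Sketch of the approach from this comment: non-ASCII cache IDs are
// replaced wholesale by a hash (no transliteration dependency, no readable
// prefix), while over-long ASCII IDs keep a truncated prefix for debugging.
protected function normalizeCid($cid) {
  $cid_is_ascii = mb_check_encoding($cid, 'ASCII');
  if (strlen($cid) <= 255 && $cid_is_ascii) {
    // Already safe for an ASCII column of indexable length.
    return $cid;
  }
  $hash = Crypt::hashBase64($cid);
  if (!$cid_is_ascii) {
    // No prefix: a bare base64 hash is always ASCII-safe.
    return $hash;
  }
  // ASCII but too long: keep as much of the original ID as fits.
  return substr($cid, 0, 255 - strlen($hash)) . $hash;
}
```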
It also urlencodes the page cache ID, so at least there we'd still have the debugging aid. @catch would that be OK?
Comment #114
stefan.r CreditAttribution: stefan.r commented
OK, since that didn't work, maybe let's leave the urlencoding out, as it's just going to expand the scope of this issue; as catch pointed out, it's not just page caching that uses URLs in the ID.
Comment #115
stefan.r CreditAttribution: stefan.r commented
Just to summarize: relative to the previously RTBCed patch, the current patch just implements the cache ID hashing suggested by @catch, which was the sole outstanding issue.
Maybe someone can validate this so we can move on to the parent issue and commit this after the beta on Wednesday? :)
Comment #116
catch
Moving back to RTBC, this looks good to me now.
Comment #117
catch
Committed/pushed to 8.0.x, thanks!
Opened #2482517: Upgrade path for #1923406.
Back to 7.x for backport discussion.
Also this could use a PSA-style change record for contrib schemas and the API addition.
Comment #119
bzrudi71 CreditAttribution: bzrudi71 commented
FYI, judging by the latest bot logs, this seems to cause SchemaTest failures for PostgreSQL and (maybe) SQLite too :-(
d8pgbot.erwanderbar.de and d8sqlitebot.erwanderbar.de
Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "FULL" LINE 1: SHOW FULL COLUMNS FROM simpletest700187test_table ^: SHOW FULL COLUMNS FROM {test_table}; Array ( ) in Drupal\simpletest\TestBase->run() (line 998 of /opt/local/apache2/htdocs/drupal8/core/modules/simpletest/src/TestBase.php).
Comment #121
catch
Ouch.
Reverted for now.
Comment #122
stefan.r CreditAttribution: stefan.r commentedYes, sorry we had missed that. I had already noticed that and included a fix in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) yesterday, this ports the fix into this patch.
@amateescu will do another SchemaTest run on SQLite/PostgreSQL to confirm.
Comment #123
amateescu CreditAttribution: amateescu for Drupal Association commented
Here are the results on SQLite:
And on PostgreSQL:
It's a bit weird that we have a different number of passes, but... meh :)
Comment #124
stefan.r CreditAttribution: stefan.r commented
Well, the patch is now green on MySQL as well, the reason this was reverted was SchemaTest failing on PostgreSQL/SQLite, and the changes in this patch are isolated to SchemaTest, so I guess this can be committed now?
Comment #125
amateescu CreditAttribution: amateescu commented
Yup, +1 to recommit.
Comment #126
alexpott
Let's do this to see if the other db engines benefit from this too.
Committed d57ee5f and pushed to 8.0.x. Thanks!
Comment #128
alexpott
Adding review commit credit.
Comment #129
webchick
One more.
Comment #130
webchick
Sorry for the noise.
Comment #131
webchick
Hm.
Comment #132
David_Rothstein CreditAttribution: David_Rothstein commented
Based on #117, there is supposed to be a backport discussion here?
Comment #133
stefan.r CreditAttribution: stefan.r commented
Actually, this patch may not need a backport.
Assuming we're going to be backporting #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) to Drupal 7, we'll need to work around the 191 character identifier length limitation somehow. This would also apply to tables defined by contrib/custom modules. So for every field, people can either:
...this is not a decision we can automate in a migration script; we could merely check the whole database schema for unique indexes/primary keys on varchar fields larger than 191 characters, and refuse to migrate otherwise.
With contrib and custom modules this could all get a bit messy, especially if maintainers don't think it through, make fields with existing UTF-8 data ASCII, and cause data loss during the migration of an existing install. Or if they use core fields in unexpected ways, i.e. put UTF-8 characters in machine name fields.
So as far as D7 is concerned, I'm leaning towards only allowing utf8mb4 on setups with innodb_large_prefix enabled, which allows us to bypass the 191 character limit at the database level instead. This is a non-standard setting, but as of MySQL 5.7 it will be the default. This way we would lose the need for a backport of this patch as far as UTF-8 support is concerned.
Comment #134
Arla
The URI field schema is now incomplete: #2462155: Field settings schema missing for some field types (8.0.x)
Comment #138
sir_gon CreditAttribution: sir_gon as a volunteer commented
It is absolutely necessary to implement this...
In addition, I believe it requires some kind of batch job or drush command to convert tables that already exist.
I recently converted my database to utf8mb4 as the 7.50 core suggests... but I have some modules with long file paths as dependencies, and it is impossible to store these paths in {registry} or {system}.
See this issue as an example.
This completely broke my website; I had to restore a dump and manually change the utf8mb4 charset to utf8 in the dump file...
Some websites with a lot of content and high availability requirements cannot afford to restore dumps.
I think this is a very serious issue.
Comment #139
stefan.r CreditAttribution: stefan.r commented
@sir_gon
In order to use utf8mb4 in Drupal 7, you'll need to use InnoDB with innodb_large_prefix turned on, which allows for the full 255 characters and works around the 191 character limitation -- see the requirements listed in settings.php:
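For reference, the Drupal 7 utf8mb4 configuration being referred to looks something like this (a sketch; the database credentials are placeholders, and the full requirements are documented in settings.php itself):

```php
// Drupal 7 settings.php sketch. utf8mb4 additionally requires MySQL 5.5.3+
// and, before MySQL 5.7.7, InnoDB with innodb_large_prefix enabled.
$databases['default']['default'] = array(
  'driver' => 'mysql',
  'database' => 'databasename',
  'username' => 'username',
  'password' => 'password',
  'host' => 'localhost',
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_general_ci',
);
```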
IMO this issue should be marked won't fix for Drupal 7, as we have utf8mb4 support now (albeit only if innodb_large_prefix is enabled). In Drupal 8 we were able to implement this issue as we were still in beta, so there utf8mb4 also works with innodb_large_prefix turned off, but for Drupal 7 it seems too late to implement this anymore -- we'd have to identify all core and contrib fields that could be turned into ASCII, and implement other solutions for fields that need more than 191 characters and can't be ASCII (including possibly truncating them).
Comment #140
hass CreditAttribution: hass commented
Why is innodb_file_per_table=true required? This causes so many file handles (error "Too many open files") that a virtualized system runs out of resources (beancounter). Aside from this, it does not change data encoding. I know I cannot run 6 Drupal databases on my vserver this way, but without file-per-table it is fine.
Comment #141
chx CreditAttribution: chx at Smartsheet, Migrate Rocks commented
Because that's how MySQL works before 5.7.6: you need the DYNAMIC row format for innodb_large_prefix, and that is only supported with file-per-table tablespaces in 5.5 and 5.6. We cannot require MySQL 5.7 yet; it's not even a year old and only Ubuntu has an LTS release with it.
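Concretely, on MySQL 5.5/5.6 that combination of settings might look like this (an illustrative my.cnf fragment, not a recommendation from this thread):

```ini
; my.cnf fragment (MySQL 5.5/5.6): long index prefixes require the Barracuda
; file format, per-table tablespaces and the large-prefix flag together.
[mysqld]
innodb_file_format = Barracuda
innodb_file_per_table = 1
innodb_large_prefix = 1
```

Tables must additionally be created or altered with ROW_FORMAT=DYNAMIC (or COMPRESSED) for the long prefixes to take effect.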
Comment #142
hass CreditAttribution: hass commented
We should warn users that the beancounter limits need to be very large or their server WILL stop working. Maybe we should warn if we already know that a minimum value does not fit.
My host has a relatively high limit, but others typically have half that. A single Drupal install with file-per-table can take your site down intermittently. This is impossible to solve without data loss if you have already converted your database.
Comment #145
quietone CreditAttribution: quietone at PreviousNext commented
No longer relevant to D8, remove tag.