Problem/Motivation
We don't need to use UTF-8 for all our indexed machine names/UUIDs when they might as well be simple ASCII, especially not if we are looking to implement utf8mb4 in the future, which would further increase index size and could affect query buffer performance. (VAR)CHAR utf8mb4 fields also have a 191-character limitation on their index prefixes (at 4 bytes per character, 191 characters is what fits under InnoDB's default 767-byte index prefix limit), while ASCII would allow for indexing all 255 characters.
Furthermore, not specifying at the schema level which fields should be UTF-8 and which ones should be simple ASCII adds to our technical debt, as developers may use UTF-8 characters in places where we haven't tested that they will actually work. Using non-ASCII characters in a module name prevents us from using hooks, for instance, as PHP only accepts ASCII characters in function names.
Proposed resolution
Originally proposed by @sun & @Damien Tournoud:
#1314214-82: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols):
It's probably worth tidying up our schema (for example: machine name-type keys should probably use an ascii character set so as to reduce the size of the index)
we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have popped up all over the place in Drupal 8).
What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be worked around by simply bumping database version requirements.
Related issues
Remaining tasks
- Review patch
- File a followup issue for ASCII support in database engines other than MySQL
User interface changes
N/A
API changes
- By cleaning up our schema definitions in this issue, we explicitly disallow non-ASCII characters for certain machine names at the database level. If we didn't disallow this already, we now do, and stick to supporting UTF-8 for content only.
- Addition of a new `is_ascii` setting on the string formatter (in addition to the already existing "length" and "case_sensitive" settings).
- Addition of a new `varchar_ascii` type on the schema definition.
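As a sketch of how these two additions fit together (the table and field names here are illustrative, not taken from the patch):

```php
<?php
// Hypothetical hook_schema() entry: the new 'varchar_ascii' type replaces
// 'varchar' for machine-name style columns so the full length can be indexed.
$schema['example_table']['fields']['machine_name'] = [
  'type' => 'varchar_ascii',
  'length' => 255,
  'not null' => TRUE,
  'default' => '',
];

// Hypothetical base field definition: the new 'is_ascii' setting sits
// alongside the existing 'max_length' and 'case_sensitive' settings.
$field = \Drupal\Core\Field\BaseFieldDefinition::create('string')
  ->setSetting('max_length', 255)
  ->setSetting('is_ascii', TRUE);
```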
Beta phase evaluation
| Issue category | Task because this is an API cleanup / performance issue |
|---|---|
| Issue priority | Normal, but affects performance, blocks #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) and reduces fragility |
| Prioritized changes | The main goal of this issue is API clean-up and unblocking full UTF-8 support in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) |
| Disruption | Not disruptive for core because we don't use non-ASCII characters in funny places anyway |
| Comment | File | Size | Author |
|---|---|---|---|
| #122 | interdiff-114-122.txt | 1.9 KB | stefan.r |
| #122 | 1923406-122.patch | 39.23 KB | stefan.r |
| #114 | interdiff-112-114.txt | 627 bytes | stefan.r |
| #114 | 1923406-114.patch | 39.14 KB | stefan.r |
| #112 | 1923406-112.patch | 39.98 KB | stefan.r |
Comments
Comment #1
alexpott commented

The default regex on machine names limits the allowed characters using the regex `[^a-z0-9_]+`, so only ASCII characters are used.

Comment #2
amateescu CreditAttribution: amateescu commented

The improvement @sun proposes is for the schema of the table column that stores machine names :)
Comment #3
stefan.r CreditAttribution: stefan.r commented

Does this still have a chance of going in? If I understand correctly, this is blocking #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) (emojis!).
Comment #4
stefan.r CreditAttribution: stefan.r commented

Here's a start; this would add an 'ascii' key. We'd want to use this on all UUIDs, machine names, non-numeric indexed columns and primary keys where possible. There is no need for all those to be encoded in `utf8`.

Comment #6

stefan.r CreditAttribution: stefan.r commented

Seems MySQL has a problem if we don't declare a length for a varchar field in the schema.
Comment #7
Crell CreditAttribution: Crell at Palantir.net commented

I dislike adding a boolean flag on things. That violates Asimov's Law. (2 is the least likely number in the universe.)
Could this be instead done with a charset key that gets set to ascii (and defaults to something reasonable so 99% of cases don't need to specify it)?
Comment #8
stefan.r CreditAttribution: stefan.r commented

I sort of like the simplicity of only supporting one character set (UTF-8), though. As an exception, ASCII is a subset of UTF-8 that we can use for identifiers, but even latin1, for instance, I believe is not. Having a "charset" key sort of gives the impression we support custom column-level character sets; the question is, do we want that... it may lead to issues down the line. Then there is also the question of collations.

The ascii flag is really not much different from the case_sensitive flag given what we would be using it for (i.e. distinguishing between identifier fields and content fields). Maybe the name is unfortunate and we need to rename it to something more sensible? Or, I don't know, is creating a new type (i.e. something other than "string") still possible at this stage?
Comment #9
Crell CreditAttribution: Crell at Palantir.net commented

The idea here is to use a smaller charset for fields that we know are only going to contain machine names (A-Za-z0-9), correct? What about an "is-machine-name" or "is-trivial" or something like that? If we have to use a boolean value it should be on something that is conceptually boolean, whereas "ascii" is not. We can then map that metadata to mean "oh yeah, use ascii encoding for this field, or something else if we find an even better option later on". (That latter part is a big part of why I dislike "ascii" as a boolean.)
I agree that custom charsets per-field is a can of worms even Pandora doesn't want opened.
Comment #10
stefan.r CreditAttribution: stefan.r commented

Correct, that was the idea... I agree something like that makes more sense than "ascii" for a flag name :)
I'll go with `is_trivial` for now, as it encompasses bundle, machine name, uuid, etc.

Comment #11

amateescu CreditAttribution: amateescu commented

How about `is_alphanumeric`?

Comment #12
stefan.r CreditAttribution: stefan.r commented

That's less confusing but not really truthful; we'd still allow underscores and possibly dots :)
Comment #13
Aki Tendo CreditAttribution: Aki Tendo commented

Wouldn't this be better suited to assert? Or are end users expected to input schemas?
Comment #14
amateescu CreditAttribution: amateescu commented

@stefan.r, I'd go with "friendlier" over "truthful but meaningless" any time :)

@Aki Tendo, yep, schema is user input, even though it's less and less so in Drupal 8.
Comment #15
stefan.r CreditAttribution: stefan.r commented

Comment #16

stefan.r CreditAttribution: stefan.r commented

There may be more columns that could be alphanumeric, but this should cover most of them.
Some may require further validation to make sure we don't try to store non-ASCII data in there, and some may need further conversion. For instance, I think there is one field where we tried to store a raw SHA256 hash; that should ideally be base64'd so we can make it an "alphanumeric" field.
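For instance, a raw SHA256 hash can be made ASCII-safe by base64-encoding it (a generic PHP sketch, not the actual core change):

```php
<?php
// hash() with $raw_output = TRUE returns 32 arbitrary bytes, which are not
// guaranteed to be valid ASCII (or even valid UTF-8) text.
$raw = hash('sha256', 'some data', TRUE);

// base64_encode() turns those bytes into a 44-character ASCII string that
// can safely be stored in an "alphanumeric"/ASCII column.
$ascii = base64_encode($raw);
```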
Comment #18
stefan.r CreditAttribution: stefan.r commented

Comment #21

stefan.r CreditAttribution: stefan.r commented

Comment #22

stefan.r CreditAttribution: stefan.r commented

For easier reviewing, I'll summarize the columns that have been turned into simple ASCII columns:
UUID fields
* All, through the UuidItem class
Language fields
* All, through the LanguageItem class
Content entity storage
* langcode and bundle field (The field instance bundle to which this row belongs), through SqlContentEntityStorageSchema.
Text long fields
* The format machine name, through the TextLongItem class
Text with summary fields
* The format machine name, through the TextWithSummaryItem class
cache_* bin tables
* cid (Unique cache ID)
* checksum (The tag invalidation checksum when this entry was saved.)
Cache table for tracking cache tag invalidations
* tag (Namespace-prefixed tag string)
Configuration table
* collection (Primary Key: Config object collection.)
* name (Primary Key: Config object name.)
Other key value tables
* key_value.collection
* key_value_expire.collection
* key_value.name
* key_value_expire.name
Menu tree storage
* menu_name (the menu name. All links with the same menu name (such as 'tools') are part of the same menu.)
* id (Unique machine name: the plugin ID.)
* parent (The plugin ID for the parent of this link.)
* route_name (The machine name of a defined Symfony Route this menu item represents.)
* provider (the name of the module that generated this link.)
Feeds
* hash (Calculated hash of the feed data, used for validating cache.)
Ban
* ip (IP address)
Comment
* entity_type (The entity_type of the entity to which this comment is a reply.)
* field_name (The field_name of the field that was used to add this comment.)
Dblog
* type (Type of log message, for example "user" or "page not found".)
* hostname (Hostname of the user who triggered the event.)
Entity reference
* target_id, through EntityReferenceItem class
File
* module (The name of the module that is using the file.)
* type (The name of the object type in which the file is used)
* id (The primary key of the object using the file)
* filemime (File MIME type)
Locale
* context (The context this string applies to)
* version (Version of Drupal where the string was last used (for locales optimization))
* language (Language code. References {language}.langcode)
* type (The location type (file, config, path, etc).)
* version (Version of Drupal where the location was found)
* locale_file.project (A unique short name to identify the project the file belongs to.)
* locale_file.langcode (Language code of this translation. References {language}.langcode.)
MenuLinkContent
* bundle (The content menu link bundle)
* menu_name (The menu name. All links with the same menu name (such as "tools") are part of the same menu.)
Node
* langcode (the {language}.langcode of this node.)
* node_grants.realm
Search
* langcode (The {languages}.langcode of the item variant.)
* type (type of item, e.g. node)
Shortcut
* set_name (The {shortcut_set}.set_name that will be displayed for this user.)
Simpletest
* message_group (the message group this message belongs to. For example: warning, browser, user.)
* class_name
System
* batch.token (A string token generated against the current user's session id and the batch id)
* flood.event (Name of event (e.g. contact).)
* flood.identifier (Identifier of the visitor, such as an IP address or hostname.)
* queue.name (The queue name.)
* router.name (Primary Key: Machine name of this route)
* semaphore.name (Primary Key: Unique name)
* semaphore.value (lock ID, is uniqid)
* sessions.sid (A session ID (hashed). The value is generated by Drupal's session handlers)
* sessions.hostname (The IP address that last used this session ID)
* url_alias.langcode (The language code this alias is for)
User
* users_data.name (The identifier of the data)
* users_data.module (The name of the module declaring the variable)
Comment #23
stefan.r CreditAttribution: stefan.r commented

Comment #24

yannickoo commented

I'm currently reviewing #23. Going to make sure that no field is missing in the patch :)
Comment #25
yannickoo commented

Patch looks really good and I have also added the `is_alphanumeric` property to long text's `format` column (which stores the used text format).

Comment #26

yannickoo commented

Comment #31
Crell CreditAttribution: Crell at Palantir.net commented

This is adding to the config schema, not DB schema. I don't think this is needed, is it? (DB schema and Config schema SHOULD be entirely independent.)
We should check for a True value specifically, not just mere presence.
The comment here is a little confusing. The value that setHash() returns must be base64 encoded? That seems over-reach for an interface definition. Rather, it should just state that the hash must contain only US ASCII characters. base64 encoding is only one of many ways of achieving that.
I didn't look at the specific fields that are being flagged in detail, but I'll accept yannickoo and stefan.r's work on that. The above issues should be easy enough to resolve so we can move forward here.
The next step would be to include some benchmarks and a beta analysis table to justify this addition in 8.0. I'm OK with it if it passes beta criteria.
Comment #32
stefan.r CreditAttribution: stefan.r commented

Thanks @Crell.
1. Just to clarify, those are string field storage settings which can be either max_length, case_sensitive or is_alphanumeric and are translated to DB schema settings elsewhere:
We can use this setting in the UuidItem/LanguageItem/TextWithSummaryItem/TextLongItem field types and in some of the BaseFieldDefinitions in Feed/File/MenuLinkContent.
If we do the case_sensitive/binary setting in the config schema, I guess we can do this one as well. Maybe we should rename the config setting to something else, just for consistency with the other settings?
2. The binary / unsigned settings actually also do an `!empty()` check in the MySQL driver; are we sure we want to deviate from those? If needed we could break the previous behavior and do stricter testing there as well, only accepting TRUE.

We were also planning to do a last manual review of every single field, just to make sure there is no way we can somehow insert non-ASCII data into those. We'll also need to document the new hook_schema setting in database.api.php.
Comment #33
stefan.r CreditAttribution: stefan.r commentedis_trivial
for consistency with other field storage settings and to clarify it's different to the schema setting.is_alphanumeric
settingFor future reference, other than the following fields, all other indexed fields should now be set to ASCII:
- aggregator.title
- aggregator.url
- file_managed.uri
- block_content_field_data.info
- taxonomy_term_field_data.name
- locales.source
- menu_link_content_data.link__uri
- menu_tree.route_param_key
- node_field_data.title
- router.pattern_outline
- search_total.word
- shortcut_field_data.link__uri
- url_alias.source
- url_alias.alias
- users_field_data.name
- users_field_data.mail
Comment #34
stefan.r CreditAttribution: stefan.r commented

Comment #36

Crell CreditAttribution: Crell at Palantir.net commented

Re the config schema: Yeesh. I didn't realize it was doing that. I defer to someone from the config system team on what the best way to handle it is. It's irrelevant from a DB system perspective IMO. I would prefer to still call it is_numeric, though, as is_trivial is a rather meaningless term.
Re the type of truthy check, this is where PHP is really annoying. :-) The current code would read TRUE, 1, 5, "true", and "Game of Thrones" as "this is alphanumeric" and FALSE, NULL, '', and not-even-set as "not alphanumeric". I generally prefer tighter checks, but I suppose it's reasonable to allow 1 in addition to TRUE. *sigh* PHP... I could be convinced either way here.
One thing we should realize: technically this means that machine names may no longer have non-Latin characters. I.e., a user in Tokyo would still have to use ASCII letters for a machine name of a View. Are we OK with the implications of that? (I am, but I defer to Gabor on the i18n UX implications of that.)
Comment #37
stefan.r CreditAttribution: stefan.r commented

I have reviewed all occurrences with @klausi and basically we should be fine to turn these fields into ASCII fields. He recommended going with `is_ascii` for both the string setting and the schema setting.

@Crell, as you noted, not all of it is validated or codified to be ASCII; we don't protect users against doing dumb things in code, such as using Japanese machine names. It probably threw an error before, and if it didn't, now it does! We thought this shouldn't be supported anyway: the UI, PHP and the OS may not be able to deal with non-ASCII machine names, so we should probably stick to UTF-8 for content only. For instance, we can't use Japanese characters for a module name because then we can't use hooks, as PHP function names need to be ASCII :)
Log of the manual review:
https://docs.google.com/spreadsheets/d/1g42xm9ZWUJRoBcIXaw96QfBZaUqsalGh...
Comment #38
stefan.r CreditAttribution: stefan.r commented

Updated issue summary, added beta evaluation. If this is green we can proceed to profiling this.
Comment #39
stefan.r CreditAttribution: stefan.r commented

Comment #40

stefan.r CreditAttribution: stefan.r commented

Also, just FYI, this should not affect existing installs. New Drupal installs will just have some columns be ASCII instead of UTF-8.
Comment #42
stefan.r CreditAttribution: stefan.r commented

Comment #43

stefan.r CreditAttribution: stefan.r commented

OK, I am happy with how this looks now :)
Comment #44
stefan.r CreditAttribution: stefan.r commented

How do we best benchmark this? Seems the testbot completes our test suite in 16:02 min, which is about the time it usually takes. Do we have any scenarios that simulate a larger database?
Comment #45
yannickoo commented

Tabs instead of spaces D:
Comment #46
stefan.r CreditAttribution: stefan.r commented

Comment #47

stefan.r CreditAttribution: stefan.r commented

From some initial simple where/order by/like benchmarks on a table with a million rows of random content, I can't really get the character set to affect index size or performance (as long as the content is ASCII). Don't know about more complex operations yet, but maybe any performance gains are just theoretical/negligible?
Comment #48
Damien Tournoud CreditAttribution: Damien Tournoud at Centarro commented

@stefan.r was struggling to prove the performance improvements in this issue and asked me to step in.
Performance is not the point: the point is to reduce the size of the indexes so that it is practical to move to `utf8mb4` by default.

Databases are really smart in how they deal with your data, so don't be surprised that even if our schema doesn't make sense (we store ASCII identifiers in a Unicode column), our indexes are still relatively efficient. As long as the size of the indexes fits in memory, it is not going to make a significant difference to query performance anyway.
This is definitely a huge improvement anyway, +1 from me.
Comment #49
stefan.r CreditAttribution: stefan.r commented

Added two more fields that didn't really need to be UTF-8.
Comment #50
yannickoo commented

I think we can set this issue to RTBC now because this is an API clean-up and does not affect existing installations. We have done an extensive review of that together with others, and Damien, Crell and klausi are also fine with this :)
Another good point is that the patch would unblock #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols).
Comment #51
stefan.r CreditAttribution: stefan.r commented

Comment #52

morgantocker CreditAttribution: morgantocker commented

Morgan from the MySQL team here.
For the example of machine names, BINARY is better than ascii, since it's not really text and collations do not apply.
I would clarify that this may possibly be a micro-optimization:
- utf8mb4 indexes and columns store as variable length with InnoDB (so there should be no direct data size saving).
- Any performance loss is going to be in query buffers that are not variable length (during sort, temporary table or sending results to the client). MySQL 5.7 makes more of these buffers variable length, so in the specific case where there is a heavy perf regression (I expect this to be rare), maybe it will be reasonable to recommend an upgrade?
So maybe to clarify the above: I don't think this feature is required.
Comment #53
stefan.r CreditAttribution: stefan.r commented

@morgantocker did you really mean 5.7? Drupal 8 raising the minimum version to 5.7 is pretty unlikely. But we've proposed raising it to 5.5 in #2473301: Raise MySQL requirement to 5.5.3
We did actually discuss using BINARY with our database maintainer @Damien Tournoud, but as mentioned in #48 performance is not the main point here -- this is about working around the maximum index size requirement (191 characters for utf8mb4 unless we use non-standard settings), and cleaning up the API to implicitly say "don't use non-ASCII characters for these machine-name fields"
Comment #54
morgantocker CreditAttribution: morgantocker commented

@stefan.r: I did not mean raising the minimum version to 5.7. I had stated that those requiring extreme performance have an acceptable solution. But I apologize that we may have been talking about different things. The physical index size will not change with ASCII, but the metadata size will (the maximum theoretical length), and this is your current problem.
Maybe it is easier to clarify with examples:
This issue is suggesting example t3. I was suggesting t2 may be more elegant.
Comment #55
stefan.r CreditAttribution: stefan.r commented

Thanks for clarifying; that would explain why I noticed no size difference in my own tests when changing the encoding.
Indeed, either `BINARY` or a different encoding should work in terms of solving the index size limitation for non-utf8mb4 data, although with `BINARY` a worry would be that we'd have to expand the scope of the current patch and deal with the encoding in Drupal itself, validating that what goes into and comes out of these fields is ASCII-encoded data only. Just so as not to have contrib/custom module developers try saving data in other encodings on those fields :)

Comment #56

stefan.r CreditAttribution: stefan.r commented

Comment #57
Crell CreditAttribution: Crell at Palantir.net commented

Sounds like this is still an open question.
Thanks Morgan for jumping in with more information!
Comment #58
stefan.r CreditAttribution: stefan.r commented

Well, in terms of making sure that data in ASCII fields is actually ASCII and allowing for `utf8mb4` encoding by default, I see 3 ways out here:

a) Reducing all indexes to 191 characters max instead of 255, sticking with `utf8mb4` encoding everywhere, potentially hurting performance (although I guess smaller indexes could be better in some cases?). We'd also need to make sure on the database driver level that everything going in and out is actually ASCII encoded. This should not be a very expensive regex, but it may further hurt performance. We'd also need to save somewhere which fields are ASCII or not, as the DBTNG driver does not have access to the schema, which would mean further overhead and complexity. In any case it would need a more complicated patch :)

b) Going with the current patch, which will make MySQL throw an error if we try to do a select with utf8 data on an ASCII field, and garble the content to a question mark if we insert other data (which we've checked we don't do in core), allowing us to use the full 255 characters for an index. Regardless of the index size, we may also get a slight performance improvement vs option a) in terms of the query buffers, which, as was well explained above, would apply to a lesser extent in MySQL 5.7.
c) The more "correct/elegant" solution of using `(var)binary` instead of `(var)char` on the fields marked with `is_ascii`, as suggested earlier. The worry is this may be confusing in terms of DX, as on the schema level we'd mark fields as both `varbinary` and `is_ascii`, where we also already have a `binary` key that casts to binary for case-sensitive comparisons. This should only be used for fields that are not text; in core we don't do this, but it could lead to further confusion for contrib/custom. Also, the same problem as in a) applies, in that we need to validate the encoding of what comes into and out of those fields on the PHP level. Other than being more elegant, I don't know if there's also a performance improvement vs option b), but this may be undone by the extra overhead of doing the ASCII check in the DB driver.

@Crell/Morgan, any thoughts on this yourselves?
Comment #59
morgantocker CreditAttribution: morgantocker commented

I find it confusing to have more than one character set in an application. This is why I like just calling it binary instead of ascii.
I doubt that reducing indexes to 191 bytes will hurt performance, since most strings I have performed analysis on are selective in the first 20-30 characters. The exception may be something like a list of strings that all start with a common prefix (i.e. http://www.drupal.org/node/ ...). In the pathological case you can recommend that the user explicitly enable innodb_large_prefix.
Comment #60
stefan.r CreditAttribution: stefan.r commented

Thanks. It would be good if we could get Damien Tournoud to chime in on this as well.
Comment #61
Damien Tournoud CreditAttribution: Damien Tournoud at Centarro commented

Conceptually, I actually think that an ASCII / UTF8 distinction makes more sense than a BINARY / UTF8 distinction, as the data that we store in those columns is just *not* binary. From a storage perspective, the end result is the same, but from a semantic perspective it is really not.
Comment #62
pwolanin CreditAttribution: pwolanin at Acquia commented

I tend to agree with DamZ: let's use ASCII and get the validation at the DB level.

Looking at the patch, I think the DX would be better if we defined a new schema API data type instead of a flag.

I don't have a strong feeling, but varascii, varchar_ascii, etc.; some type name that makes it clear that the type is different and ASCII.
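The two styles being contrasted might look roughly like this (hypothetical field specs for illustration; the flag name and type name are the ones discussed in this thread):

```php
<?php
// Flag-based: an ordinary varchar plus a boolean flag on the field spec.
$flag_based = [
  'type' => 'varchar',
  'length' => 255,
  'is_ascii' => TRUE,
];

// Type-based: a dedicated schema API data type, as proposed here.
$type_based = [
  'type' => 'varchar_ascii',
  'length' => 255,
];
```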
Comment #63
stefan.r CreditAttribution: stefan.r commented

Comment #66

stefan.r CreditAttribution: stefan.r commented

That patch seems close; let's see what @Crell says before fixing those test failures.
Comment #67
Crell CreditAttribution: Crell at Palantir.net commented

We already have some type mapping logic, so introducing a custom virtual type is, eh, I guess OK. I'm not crazy about it but can live with it. However, I think the implementation may be more challenging:
Would this work? We map varchar_ascii to VARCHAR, then check for varchar_ascii... which shouldn't be there anymore as it's been replaced by varchar, right?
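Assuming the driver's existing type-map approach, the idea might look like this (an illustrative sketch, not the literal core code):

```php
<?php
// The generic 'type' key is mapped to a driver-specific 'mysql_type': the
// virtual 'varchar_ascii' type still maps to a plain VARCHAR column type.
$map = [
  'varchar:normal' => 'VARCHAR',
  'varchar_ascii:normal' => 'VARCHAR',
];
// When emitting the DDL, a field whose generic type was 'varchar_ascii'
// would additionally get an ascii character set clause appended, e.g.
// "CHARACTER SET ascii COLLATE ascii_general_ci".
```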
Comment #68
stefan.r CreditAttribution: stefan.r commented

OK, great to hear we're close to a solution here!
It should work, but I'll see if I can document that or make it less confusing somehow. The first test is for the 'type' key, the second test is for the 'mysql_type' key.
Comment #69
stefan.r CreditAttribution: stefan.r commented

Comment #70

stefan.r CreditAttribution: stefan.r commented

Updated issue summary. Also note SQLite / PostgreSQL support is outside the scope of this issue, as they don't have the utf8mb4 problem.
Comment #72
stefan.r CreditAttribution: stefan.r commented

Comment #73

stefan.r CreditAttribution: stefan.r commented

Comment #74

stefan.r CreditAttribution: stefan.r commented

Comment #75

catch commented

Comment #76

catch commented

Comment #77
stefan.r CreditAttribution: stefan.r commented

I'm assuming any upgrade path would be outside the scope of this issue? We don't support beta-to-beta upgrades yet, do we?
Comment #78
amateescu CreditAttribution: amateescu commented

Yes, the upgrade path is not in scope for this issue. Beta-to-beta upgrades are provided by the HEAD to HEAD contrib module, and the tag helps to keep track of issues that need a patch over there.
Comment #79
Crell CreditAttribution: Crell at Palantir.net commented

I don't know the config schema well enough to say, but should we rename this since "is_ascii" is no longer parallel with the DB schema?
Shouldn't the is_ascii be gone now?
And... should we be using varchar_ascii or varchar:ascii? We seem to be using colons elsewhere, no? (I've not dug into the schema code in quite some time so I may be missing a context... I'm OK with whatever is most consistent for DX.)
Comment #80
stefan.r CreditAttribution: stefan.r commented

Renaming isn't necessary, as the string settings in config are independent of the database schema. I also discussed that with @pwolanin and we settled on keeping it; for instance, the `case_sensitive` string config setting translates to "binary" on the schema level.

The colons would be confusing as they're already used for size; as far as DX is concerned, an underscore is better.
Indeed the is_ascii on the hook_schema should be gone, I'll post a new patch.
Comment #81
stefan.r CreditAttribution: stefan.r commented

Comment #82

Crell CreditAttribution: Crell at Palantir.net commented

Makes sense to me.
Comment #83
joelpittet commented

Silly question: can or should this be backported to D7?
Comment #84
catch commented

I don't think we can rely on cache IDs being ASCII-only, or if we do, it'd be an API change from 7.x. I have seen URLs with utf8 characters used as cache IDs in 7.x, for example.
Comment #85
catch commented

Also, situations like that make me very tempted to schedule this patch for a couple of days after the next beta, to have some time to flush them out. We should use ASCII as much as possible for index size, but only as much as possible.
Comment #86
hass CreditAttribution: hass commented

The empty line in the array should be removed.
Comment #87
hass CreditAttribution: hass commented

One more empty line.
Comment #88
stefan.r CreditAttribution: stefan.r commented

@catch the worry is that as we switch to the utf8mb4 charset, a primary key (such as the cache ID) can only be 190 characters. I think an ASCII cache ID would be an API break worth having, as the point of the patch is also to avoid using UTF8 identifiers where not desirable... As to the specific D7 example you cite, in D8 we use a hash for that (I think as of #1855260: Page caching broken by accept header-based routing).
There was already a thorough review at the dev days, as far as D8 core is concerned at least it seemed solid, but there is still some uncertainty/doubt so looks like this will still need a final review. Which should be hours, not days, as some of the previous review has already been documented, but if there's no time to do so before the next beta maybe this issue should be set to postponed?
Comment #89
stefan.r CreditAttribution: stefan.r commented

@hass the empty line was actually on purpose, so we don't have to touch the vertically aligned mega-array underneath, and as it's a "special" data type anyway. The mysql and sqlite drivers use empty lines there as well to separate similar groups; it's probably fine for now unless anyone still thinks we need to change the code style in those places.
Comment #90
catch commented

Well, it might be worth having, but it's a completely undocumented restriction at the moment. Additionally, it's only a restriction if you're using MySQL for the cache backend. So it would be possible to develop some code that works fine with another cache backend which then fails if you switch back to MySQL.
Either the mysql backend should normalize to ASCII, or it needs updating in the cache documentation at a higher level.
afaik we don't use a hash for URLs, we just automatically hash if a cache ID is longer than 256 characters, but don't have the issue to hand.
Double-checked and we do validate language codes in \Drupal\language\Form\LanguageFormBase::validateCommon() so this and the rest of the patch should be OK - just cache IDs to figure out.
Comment #91
stefan.r CreditAttribution: stefan.r commented

Thanks for the final review, @catch. Would it help if, as a first step, we did transliteration in `DatabaseBackend::normalizeCid()` and then had the discussion about whether to allow non-US-ASCII characters at all (which is less pressing) in another issue?

So right now this looks like this:
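A simplified sketch of the proposal (hypothetical; the actual `DatabaseBackend::normalizeCid()` in the patch may differ, and real transliteration would use Drupal's transliteration service rather than the crude replacement below):

```php
<?php
// Keep the cache ID as-is only when it is short enough AND pure ASCII;
// otherwise build an ASCII-safe ID from a normalized prefix plus a hash
// of the original ID to preserve uniqueness.
function example_normalize_cid($cid) {
  if (strlen($cid) <= 255 && mb_check_encoding($cid, 'ASCII')) {
    return $cid;
  }
  // Crude stand-in for transliteration: strip non-printable-ASCII bytes
  // from a truncated prefix, keeping it readable for debugging.
  $prefix = preg_replace('/[^\x20-\x7E]/', '', substr($cid, 0, 191));
  return $prefix . ':' . hash('sha256', $cid);
}
```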
Just to be clear, I'm proposing to add a `mb_check_encoding($cid, 'ASCII') === TRUE` check along with the length check, and transliterate the initial part of the cache ID.

Comment #92
stefan.r CreditAttribution: stefan.r commented

Comment #93

stefan.r CreditAttribution: stefan.r commented

Comment #94

stefan.r CreditAttribution: stefan.r commented

Just to solve the cache ID issue in one go, would something like this work for now?
Comment #95
stefan.r CreditAttribution: stefan.r commented

Adding another empty line, just to be consistent and to address @hass's feedback in #86/#87.
Comment #99
pwolanin CreditAttribution: pwolanin at Acquia commented
Yeah, not sure we need to convert cache IDs to ASCII. The shorter index should be OK - possibly just hash/truncate the cache ID to 190?
Comment #100
Crell CreditAttribution: Crell at Palantir.net commented
Having the MySQL cache backend hash at 190 instead of 255 sounds fine to me. Alternatively, I'd also be OK with omitting that field from the conversion here. I have no strong preference.
Comment #101
stefan.r CreditAttribution: stefan.r commented
@pwolanin not sure I follow. Part of the point of this issue was to disallow utf8mb4 where not needed; why disallow it in all machine names except cache IDs? Do we *need* special characters in there?
I think ideally we'd disallow non-ASCII characters on cache IDs altogether (by documenting this on the database cache backend interface) but there seems to be some doubt as to where we might use non-ASCII cache IDs in core, so transliterating them seemed like the most workable solution at least for now.
normalizeCid() exists to work around database limitations; we already have a length limitation, which the normalization works around, so if we add an ASCII limitation here, the normalization would just transliterate in the same way, at least until we confirm that core doesn't use any non-ASCII cache IDs anymore.
Also, limiting utf8mb4 keys to 190 characters would probably have to be done in a followup issue, as I don't know that the cache ID is the only such key.
@Crell: you and @pwolanin are talking about two different things here. It's not "either one or the other", it's either "both of those things" or "neither of them" :) Only utf8mb4 keys have the 190 character limitation, so if we exclude the cid field from conversion here, by definition we'll need to limit its length to 190 as well.
Comment #102
stefan.r CreditAttribution: stefan.r commented
Comment #103
pwolanin CreditAttribution: pwolanin at Acquia commented
Disallowing UTF-8 in cache IDs is an API change per catch above - let's not do that?
Comment #104
stefan.r CreditAttribution: stefan.r commented
Fair enough. Is there a concern with the current patch? (It doesn't pose an API change: it still allows cache IDs to have any sort of format/length and just normalizes them to the database limitations.)
Comment #105
stefan.r CreditAttribution: stefan.r commented
Adding a test.
Comment #106
anavarre
Nit: 80 cols.
Comment #107
stefan.r CreditAttribution: stefan.r commented
Thanks, the test had a problem as well. Interdiff is against the RTBC patch in #81.
Comment #108
stefan.r CreditAttribution: stefan.r commented
Updating a comment.
Comment #109
catch
Tested #107:
Applied this change to node.routing.yml
Added one alias (/node -> /front)
Visited node/addお
Then I looked at the path alias preload cache IDs, which are by path:
So in that case we could have just transliterated and not hashed - but the code always hashes if transliteration is needed, not a massive issue.
However, I'm not sure we should transliterate here at all - it's an extra dependency for the database backend. Instead we could just do the ASCII check and hash without any prefix in that case. That's still ASCII-safe, and for the rare cases where there are non-ASCII cache IDs I don't think we're losing much by ditching the prefix, since we only have the prefix to aid debugging/testing. It comes down to whether the transliteration dependency makes a measurable difference to the internal page cache.
Would be good to add test coverage for the database backend for whatever it ends up doing though.
Comment #111
stefan.r CreditAttribution: stefan.r commented
OK, that works for me as well, if we think it's worth losing the debugging aid for non-ASCII cache IDs.
As to the page cache ID specifically, I guess we could just urlencode that instead upon setting (instead of in the DB backend)?
The hash was to prevent transliteration collisions. I'm sure there's a way to have two different UTF-8 IDs turn into the same ASCII ID if we don't use the hash.
Comment #112
stefan.r CreditAttribution: stefan.r commented
The attached patch hashes any non-ASCII cache IDs so that we can lose the transliteration dependency. There's still test coverage through the tests that were added in #107.
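For context, the behaviour described here could be sketched as follows (a sketch, not the literal patch; the 255 character limit and the Crypt::hashBase64() call are assumed from the existing length normalization):

```php
// Sketch of the approach from this comment: non-ASCII cache IDs are
// replaced wholesale by a hash (no transliteration dependency, no readable
// prefix), while over-long ASCII IDs keep a truncated prefix for debugging.
protected function normalizeCid($cid) {
  $cid_is_ascii = mb_check_encoding($cid, 'ASCII');
  if (strlen($cid) <= 255 && $cid_is_ascii) {
    // Already safe for an ASCII column of indexable length.
    return $cid;
  }
  $hash = Crypt::hashBase64($cid);
  if (!$cid_is_ascii) {
    // No prefix: a bare base64 hash is always ASCII-safe.
    return $hash;
  }
  // ASCII but too long: keep as much of the original ID as fits.
  return substr($cid, 0, 255 - strlen($hash)) . $hash;
}
```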
It also urlencodes the page cache ID, so at least there we'd still have the debugging aid. @catch would that be OK?
Comment #114
stefan.r CreditAttribution: stefan.r commented
OK, since that didn't work, maybe let's leave the urlencoding out, as it's just going to expand the scope of this issue; as catch pointed out, it's not just page caching that uses URLs in the ID.
Comment #115
stefan.r CreditAttribution: stefan.r commented
Just to summarize: relative to the previously RTBCed patch, the current patch just implements the cache ID hashing suggested by @catch, which was the sole outstanding issue.
Maybe someone can validate this so we can move on to the parent issue and commit this after the beta on Wednesday? :)
Comment #116
catch
Moving back to RTBC, this looks good to me now.
Comment #117
catch
Committed/pushed to 8.0.x, thanks!
Opened #2482517: Upgrade path for #1923406.
Back to 7.x for backport discussion.
Also this could use a PSA-style change record for contrib schemas and the API addition.
Comment #119
bzrudi71 CreditAttribution: bzrudi71 commented
FYI, judging by the latest bot logs, this seems to cause SchemaTest failures for PostgreSQL and (maybe) SQLite too :-(
d8pgbot.erwanderbar.de and d8sqlitebot.erwanderbar.de
Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "FULL" LINE 1: SHOW FULL COLUMNS FROM simpletest700187test_table ^: SHOW FULL COLUMNS FROM {test_table}; Array ( ) in Drupal\simpletest\TestBase->run() (line 998 of /opt/local/apache2/htdocs/drupal8/core/modules/simpletest/src/TestBase.php).
Comment #121
catch
Ouch.
Reverted for now.
Comment #122
stefan.r CreditAttribution: stefan.r commentedYes, sorry we had missed that. I had already noticed that and included a fix in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) yesterday, this ports the fix into this patch.
@amateescu will do another SchemaTest run on SQLite/PostgreSQL to confirm.
Comment #123
amateescu CreditAttribution: amateescu for Drupal Association commented
Here are the results on SQLite:
And on PostgreSQL:
It's a bit weird that we have a different number of passes, but... meh :)
Comment #124
stefan.r CreditAttribution: stefan.r commented
Well, the patch is now green on MySQL as well, the reason this was reverted was SchemaTest failing on PostgreSQL/SQLite, and the changes in this patch are isolated to SchemaTest, so I guess this can be committed now?
Comment #125
amateescu CreditAttribution: amateescu commented
Yup, +1 to recommit.
Comment #126
alexpott
Let's do this to see if the other db engines benefit from this too.
Committed d57ee5f and pushed to 8.0.x. Thanks!
Comment #128
alexpott
Adding review commit credit.
Comment #129
webchick
One more.
Comment #130
webchick
Sorry for the noise.
Comment #131
webchick
Hm.
Comment #132
David_Rothstein CreditAttribution: David_Rothstein commented
Based on #117, there is supposed to be a backport discussion here?
Comment #133
stefan.r CreditAttribution: stefan.r commented
Actually, this patch may not need a backport.
Assuming we're going to be backporting #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) to Drupal 7, we'll need to work around the 191 character identifier length limitation somehow. This would also apply to tables defined by contrib/custom modules. So for every field, people can either:
...this is not a decision we can automate in a migration script; we could merely check the whole database schema for unique indexes/primary keys on varchar fields larger than 191 characters, and refuse to migrate otherwise.
With contrib and custom modules this could all get a bit messy, especially if maintainers don't think it through, make fields with existing UTF-8 data ASCII, and cause data loss during the migration of an existing install. Or if they use core fields in unexpected ways, i.e. put UTF-8 characters in machine name fields.
So as far as D7 is concerned, I'm leaning towards only allowing utf8mb4 on setups with innodb_large_prefix enabled, which allows us to bypass the 191 character limit at the database level instead. This is a non-standard setting, but as of MySQL 5.7 it will be the default. This way we would lose the need for a backport of this patch as far as UTF-8 support is concerned.
Comment #134
Arla
The URI field schema is now incomplete: #2462155: Field settings schema missing for some field types (8.0.x)
Comment #138
sir_gon CreditAttribution: sir_gon as a volunteer commented
It is absolutely necessary to implement this...
In addition, I believe it requires some kind of batch job or drush command to convert tables that already exist.
I recently converted my database to utf8mb4 as the 7.50 core suggests... but I have some modules with long file paths as dependencies, and it is impossible to store these paths in {registry} or {system}.
See this issue as an example.
This completely broke my website; I had to restore a dump and manually change the utf8mb4 charset to utf8 in the dump file...
Some websites with a lot of content and high availability requirements cannot afford to restore dumps.
I think this is a very serious issue.
Comment #139
stefan.r CreditAttribution: stefan.r commented
@sir_gon
In order to use utf8mb4 in Drupal 7, you'll need to use InnoDB with innodb_large_prefix turned on, which allows for the full 255 characters and works around the 191 character limitation -- see the requirements listed in settings.php:
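For reference, the Drupal 7 utf8mb4 configuration being referred to looks something like this (a sketch; the database credentials are placeholders, and the full requirements are documented in settings.php itself):

```php
// Drupal 7 settings.php sketch. utf8mb4 additionally requires MySQL 5.5.3+
// and, before MySQL 5.7.7, InnoDB with innodb_large_prefix enabled.
$databases['default']['default'] = array(
  'driver' => 'mysql',
  'database' => 'databasename',
  'username' => 'username',
  'password' => 'password',
  'host' => 'localhost',
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_general_ci',
);
```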
IMO this issue should be marked won't fix for Drupal 7, as we have utf8mb4 support now (albeit only if innodb_large_prefix is enabled). In Drupal 8 we were able to implement this issue as we were still in beta, so there utf8mb4 also works with innodb_large_prefix turned off, but for Drupal 7 it seems too late to implement this anymore -- we'd have to identify all core and contrib fields that could be turned into ASCII, and implement other solutions for fields that need more than 191 characters and can't be ASCII (including possibly truncating them).
Comment #140
hass CreditAttribution: hass commented
Why is innodb_file_per_table=true required? This causes so many file handles (error "Too many open files") that a virtualized system runs out of resources (beancounter). Aside from this, it does not change data encoding. I know I cannot run 6 Drupal databases on my vserver this way, but without file-per-table it is fine.
Comment #141
chx CreditAttribution: chx at Smartsheet, Migrate Rocks commented
Because that's how MySQL works before 5.7.6: you need the DYNAMIC row format for innodb_large_prefix, and that is only supported with file-per-table tablespaces in 5.5 and 5.6. We cannot require MySQL 5.7 yet; it's not even a year old and only Ubuntu has an LTS release with it.
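Concretely, on MySQL 5.5/5.6 that combination of settings might look like this (an illustrative my.cnf fragment, not a recommendation from this thread):

```ini
; my.cnf fragment (MySQL 5.5/5.6): long index prefixes require the Barracuda
; file format, per-table tablespaces and the large-prefix flag together.
[mysqld]
innodb_file_format = Barracuda
innodb_file_per_table = 1
innodb_large_prefix = 1
```

Tables must additionally be created or altered with ROW_FORMAT=DYNAMIC (or COMPRESSED) for the long prefixes to take effect.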
Comment #142
hass CreditAttribution: hass commented
We should warn users that the beancounter limits need to be very large or their server WILL stop working. Maybe we should warn if we already know that a minimum value does not fit.
My host has a relatively high limit, but others typically have half that. A single Drupal install with file-per-table can take your site down intermittently. This is impossible to solve without data loss if you have already converted your database.
Comment #145
quietone CreditAttribution: quietone at PreviousNext commented
No longer relevant to D8, remove tag.