Problem

  • When configuring a different database default collation than MySQL's default collation for utf8, utf8_general_ci, and new tables are created by Drupal, then Drupal does not respect that database default collation and uses utf8_general_ci instead.

Goal

  • Make Drupal respect the default collation of the database under MySQL.

Details

  • MySQL 5.1 contains an incompatible, i.e. not backwards-compatible change for the utf8_general_ci collation, which mainly affects primary keys, unique keys, and indexes.
    • Among other (incompatible) changes, the German letter sharp-s (ß) is interpreted differently (as a single character s), which can easily lead to duplicate key errors. For example, when having two usernames Reißer and Reiser.
    • The upgrade notes http://dev.mysql.com/doc/refman/5.1/en/upgrading-from-previous-series.html state:

      Incompatible change: Character set or collation changes were made in MySQL 5.1.21, 5.1.23, and 5.1.24
      that may require table indexes to be rebuilt.
      For details, see Section 2.12.3, "Checking Whether Table Indexes Must Be Rebuilt".

    • Further documentation on http://dev.mysql.com/doc/refman/5.1/en/checking-table-incompatibilities.... states:

      MySQL 5.1.24, 5.4.0 (Bug#27877)
      Affects indexes that use the utf8_general_ci or ucs2_general_ci collation for columns that contain 'ß' LATIN SMALL LETTER SHARP S (German).
      Affected tables can be detected by CHECK TABLE ... FOR UPGRADE as of MySQL 5.1.30, 5.4.0 (see Bug#40053).

    • The actual bug report and commit that lead to the change: http://bugs.mysql.com/bug.php?id=43593
  • The solution for this incompatible collation change is to convert the database/tables/columns into utf8_unicode_ci, which works for all languages (not only German), but is often mentioned to be slower due to multi-byte comparisons (though I couldn't find any benchmarks that prove this).
  • By default, Drupal's MySQL database driver appends the following to the CREATE TABLE statement:
    DEFAULT CHARACTER SET UTF8
    

    As this suffix does not contain the COLLATE statement, MySQL always defaults to the default collation of the current connection, i.e. utf8_general_ci.

  • Users who managed to get around this incompatible collation change in utf8_general_ci (by converting their data to utf8_unicode_ci) may think that they resolved this issue, but they have not and they cannot. Installing new modules, creating new content types (CCK), adding fields, and creating any arbitrary table, triggers aforementioned default character set and collation, i.e.:

    While existing (converted) tables are using utf8_unicode_ci, every new table is using utf8_general_ci.

  • The bogus mix of different collations stays hidden, until you try to use a module that exposes the mismatch due to advanced SQL expressions. For example, Pathauto will trigger the following error message:
    user warning: Illegal mix of collations (utf8_general_ci, IMPLICIT) and (utf8_unicode_ci, IMPLICIT) for operation '='
    query:
    SELECT COUNT(nid) FROM node n LEFT JOIN url_alias alias ON CONCAT('node/', CAST(n.nid AS CHAR)) = alias.src
    WHERE alias.src IS NULL AND n.type IN ('page', 'location', 'webform')
    

    That is, because CAST() is using the default collation of the connection: http://dev.mysql.com/doc/refman/5.1/en/charset-convert.html (see below regarding SET NAMES)

Procedure

  • To resolve this critical bug, we need to remove the DEFAULT CHARACTER SET UTF8 suffix from CREATE TABLE statements, so as to make MySQL derive the default database collation.

Links

Notes

  • Originally, I assumed that our initialization of the MySQL database connection would be guilty, since
    SET NAMES utf8;
    

    "officially" means that the default collation of the server is used (utf8_general_ci for charset utf8). Only SET CHARACTER SET utf8 should respect the default collation of the database for the current connection. At least, that's what is precisely stated in http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html

    However, further manual testing revealed that tables are created in the proper default collation even when leaving the SET NAMES as is.

    Though I didn't test the CAST() use-case from above yet. (!)

    If the MySQL manual is correct, then we may have to change our database connection initialization from SET NAMES utf8 into

    SET CHARACTER SET utf8;
    

    as that would default the default connection collation to the default database collation -- instead of the default server collation for the character set utf8.

  • As of now, it seems like above mentioned suffix for CREATE TABLE is Drupal's only enforcement of the utf8 character set for created tables. By removing the suffix, tables are created in the default character set and collation of the database, which may not be utf8. This also affects Drupal's installer.

    I'm actually not sure about the implications, but theoretically, it shouldn't really matter in which character set and/or collation the data is stored - the data should be properly converted during getting and setting. (But you surely may not read all stored data via other tools, such as phpMyAdmin.)

    Anyhow, I do not pretend to be an expert in this area, but we may need to add a installation requirement to ensure that the default collation of the database is utf8_*, so as to prevent storing data in non-utf8 collations.

  • For international Drupal users, which also includes mono-lingual Drupal installations in a different language than English, the default collation of utf8_general_ci effectively is unusable starting with MySQL 5.1. While you can probably install Drupal, you potentially will not be able to import an exported database dump of that installation due to duplicate key errors -- which is where this entire story began for me.

    Therefore, one further question is whether we should not require and enforce utf8_unicode_ci during installation.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

sun’s picture

Although probably intended to be more generic, the following manual page describes the difference between SET NAMES and SET CHARACTER SET better:
http://dev.mysql.com/doc/refman/5.1/en/set-option.html

zroger’s picture

you potentially will not be able to import an exported database dump of that installation due to duplicate key errors

I've seen this soooo many times and was never able to track it down. Working on numerous multilingual sites and importing production databases into my local environment, this seems like it could very well be the culprit.

I'm not sure how I'll be able to verify the patch, since I was never able to reliably replicate the errors, but I'll try anyways. Anyhow, thanks for taking the time to write this up.

jpmckinney’s picture

FileSize
1.12 KB

Thanks for the detailed report. I'm worried for users who have poorly configured databases on shared hosts whose default collation is something ridiculous and worse than utf8_general_ci. Here's a patch where we enforce utf8_unicode_ci.

sun’s picture

Why is this bug report critical? If it also applies to D6?

See http://drupal.org/project/issues/search?text=Illegal+mix+of+collations

Because no one was able to track down the real problem, and instead assumed a bogus database configuration. "Use the defaults" or "Install from scratch" were common replies to close those issues.

With the incompatible collation change in MySQL 5.1, we can no longer ignore this bug. Drupal destroys data here.

sun’s picture

Status: Needs review » Needs work

@jpmckinney: I fear that we cannot force any default. Why? Because that would trigger the very same bug for existing installations, just vice-versa (and this time, also including English Drupal installations, which ran just fine with utf8_general_ci previously).

However, we also need to discuss the questions I raised in the notes.

Damien Tournoud’s picture

Default installation are rarely UTF8. Debian, which powers most of our test slaves, is defaulting to latin1, and that explains the failure in the original patch.

Damien Tournoud’s picture

Let's just include the collation in both the CREATE TABLE and the SET NAMES, default to utf8_unicode_ci and make that configurable in the database connection array.

The only thing I'm worried about is how to migrate existing installations.

sun’s picture

@Damien: After having spent days with debugging this overall charset/collation issue, I fear that we can't enforce any default, unless we require users to update their settings.php before upgrading to D7. When enforcing any default, users have to make 100% sure that their existing data matches.

Based on http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/, we cannot safely convert data. Not sure whether that post is outdated or applies to certain versions of MySQL only. I've used approach #1 to convert a couple of my sites without unwanted schema changes.

Thinking out aloud - can we think of a way to dynamically apply utf8_unicode_ci, in case the default database collation does not start with utf8?

klonos’s picture

subscribing...

cog.rusty’s picture

Taking a step back, just two questions:
- Is there anything that guarantees that other "collation incompatibilities which affect indexes" won't also creep in utf8_unicode_ci in a future MySQL version?
- Is there anything here which can't be fixed by rebuilding the indexes after a MySQL version change?

sun’s picture

Status: Needs work » Needs review
FileSize
2.3 KB

Can we identify the default database collation during upgrade? We perhaps can. Don't sue me for this patch, it's just an idea.

@cog.rusty:

1) No, there is no guarantee. There's even no guarantee that existing sites are not purposively using a different collation already to fix other, custom issues they experienced on their site. But if Drupal core enforces any default, then utf8_unicode_ci is the most safe collation.

2) Nope, as mentioned in the OP: For me, the entire story began when I tried to import a MySQL 5.0 database dump into 5.1. Interestingly, when migrating a database via mysqlhotcopy across servers and versions, then you are able to run that database, run mysql upgrade commands, and those will simply kill (delete) any bogus records due to this collation change.

jpmckinney’s picture

@cog.rusty: Future versions of MySQL, PHP, Apache, etc. may of course introduce bugs. But that's not a reason to avoid existing bugs (:

jpmckinney’s picture

Status: Needs review » Needs work

The last submitted patch, drupal.default-collation.11.patch, failed testing.

jpmckinney’s picture

@sun: in #11, you store 'default_collation', but then you never use it? I don't see other occurrences of 'default_collation'.

sun’s picture

Status: Needs work » Needs review
FileSize
4.19 KB

Of course you don't see any usage of that, because I merely threw away that patch as idea ;) This one may provide some more clues. Still not sure whether it makes sense and is reliable.

Status: Needs review » Needs work

The last submitted patch, drupal.default-collation.16.patch, failed testing.

sun’s picture

Please bear in mind: This has to be backported to D6.

I guess there are two different ways to solve this issue:

  1. Remove the explicit DEFAULT CHARACTER SET utf8, so as to always use the default collation of the database. And add a requirement check to Drupal's installer, so as to only allow installation if the collation of the database uses a utf8 charset, i.e. the default collation starts with utf8_*.

    Therefore, new installations can use utf8_general_ci or utf8_unicode_ci, or any other utf8 collation of MySQL, i.e. whatever makes sense for the site/user. Existing installations can use whatever they used, and we simply add a step to the handbook pages for upgrading.

    Pro: We do not have to fiddle with the collation in run-time code. Keep the database configuration away from Drupal.

    Con: While we can suggest and direct users to set the default collation to utf8_unicode_ci, most users probably won't have any clue of what they are doing. Second, I'm not sure whether you can alter the default database collation on shared hosts.

  2. Keep the explicit, but enhanced DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci, so as to force all sites to use utf8_unicode_ci. But make the collation configurable in settings.php.

    For existing installations,

    • either try to identify the proper/actual collation in use by determining the database's default collation.
    • or simply default to utf8_general_ci, because Drupal created any new tables in that collation anyway. If there is a collation mismatch in existing tables, then the database administrator needs to resolve that through other means anyway.

    Pro: Drupal uses a collation that should work for all international users by default. Except for existing sites, we can also assume a certain collation for debugging.

    Con: On existing sites, utf8_general_ci (or an auto-determined value) has to end up in settings.php somehow; failure to do so will lead to "Illegal mix of collations" errors. Second, Drupal overrides the database configuration, which feels a bit poorly.

sun’s picture

FileSize
943 bytes

To save work for others, attached is a custom.install module update function that converts an existing database and tables/columns into utf8_unicode_ci.

After doing so, I was able to finally test Pathauto's CAST() SQL expression. And it turns out that the MySQL manual about SET NAMES is correct and we have to change the database connection initialization,

  1.   mysqli_query($connection, 'SET CHARACTER SET utf8');
    

    if we want to respect the default collation of the database, i.e. mapping to aforementioned 1) in #18.

  2.   mysqli_query($connection, 'SET NAMES utf8 COLLATE utf8_unicode_ci');
    

    if we want to force a globally configured collation, i.e. mapping to aforementioned 2) in #18.

EDIT: After converting the collations, Schema module reports no mismatches.

cog.rusty’s picture

@sun:

A configurable (2) seems fine at first glance (except if there are tables from other non-Drupal applications which disagree).

But I think (1) has a problem. There are many Drupal databases out there with default database charset set to latin1 and Drupal tables set to utf8. If (1), unlike SET NAMES, gets the connection collation from the database default, that would be a big problem.

Crell’s picture

Can I just say for the record "screw you, MySQL!"

Given that everything else in Drupal is UTF8, do we need to support non-UTF8 character sets in the first place? Is there a reason why we'd want to allow other collations, and more importantly what else would that break? (Eg, all HTTP responses are forced to UTF8, as I recall.)

sun’s picture

@Crell: Yes, Drupal should require/enforce a UTF8 character set. However, not necessarily the collation. Although the utf8_unicode_ci should work for all languages, there may be valid reasons for having to use a different collation.

In particular, each collation comes with different transliteration and sorting order rules. As can be read in the MySQL bug report I linked to, the commit additionally introduced a new utf8_german3_ci collation for German, which intends to retain the original behavior of utf8_general_ci (WTF?)...

The documentation page about Unicode character sets skims the detailed differences of those utf8 collations. (EDIT: Note the user comment about Hebrew on that page, for example of a possible application-specific need to use a language-specific collation.)

Damien Tournoud’s picture

Most of the time, collation mismatch can be easily resolved by MySQL.

There are some special cases where the server cannot resolve that (see http://dev.mysql.com/doc/refman/5.0/en/charset-collate-tricky.html), and those become problematic. I would consider that a bug in the module that issues those queries.

In the Pathauto query, would changing:

SELECT COUNT(nid) FROM node n LEFT JOIN url_alias alias ON CONCAT('node/', CAST(n.nid AS CHAR)) = alias.src
WHERE alias.src IS NULL AND n.type IN ('page', 'location', 'webform')

to

SELECT COUNT(nid) FROM node n LEFT JOIN url_alias alias ON alias.src = CONCAT('node/', CAST(n.nid AS CHAR))
WHERE alias.src IS NULL AND n.type IN ('page', 'location', 'webform')

Fix the problem?

cog.rusty’s picture

It doesn't seem to work. I tried it in a database where the database, the table, and the column were all set to utf8_unicode_ci. Apparently CONCAT() enforces the default collation of the utf8 character set (utf8_general_ci), which it gets from the CAST().

And now that I think of it, no amount of enforcement of utf8_unicode_ci will change that...

sun’s picture

I already explained in the OP as well as in #19 that CAST() always uses the collation of the database connection. SET NAMES triggers utf8_general_ci, since it's missing COLLATE. Either we add that, or we replace SET NAMES with SET CHARACTER SET, which uses the default collation of the database for the connection.

However, overall we have to decide on how to move forward, by choosing one of the options in #18 + #19. Either respect the database configuration, or globally enforce a certain collation, optionally alterable via settings.php.

cog.rusty’s picture

I just tested it, you are right, it comes from the connection collation.

At this point, I think either a workaround for CASTs must be found or a configurable collation must be enforced.

Trusting the database collation is no good, because people have been doing one-click database creation from their cpanels for years, without anybody or anything instructing them to change the charset or the collation.

Damien Tournoud’s picture

I don't really get it. What part of CONCAT('node/', CAST(n.nid AS CHAR)) = alias.src actually fails? The CONCAT()? The equality operator?

I don't see how the CONCAT() could fail if the string literal and the CAST() default to the connection collation.

cog.rusty’s picture

My test showed that the problem is the CAST().

Using a database connection with utf8_general_ci (because of Drupal's SET NAMES utf8) and a table and a column with utf8_unicode_ci (because the user wanted it so), and running a similar query:

- with column = CONCAT('literal1', 'literal2') there is no problem.
- with column = CONCAT('literal1', CAST(15 as char)) a conflict with utf8_general_ci occurs.

It seems that CONCAT with only literals does not become utf8_general_ci.

The result is exactly the same whether the database collation is set to utf8_general_ci or to utf8_unicode_ci or even to latin1_swedish_ci.

andypost’s picture

subscribe

sun’s picture

Status: Needs work » Needs review
FileSize
7.73 KB

Attached patch implements option 2) of #18 and #19, i.e. enforcing utf8_unicode_ci.

Status: Needs review » Needs work

The last submitted patch, drupal.default-collation.30.patch, failed testing.

EvanDonovan’s picture

Subscribing.

I have had CAST() issues with "latin1_swedish_ci" set as default on CPanel before, and had to change collation manually on all tables. I didn't realize that this was considered a bug in Drupal now. I think that Drupal should enforce utf8_unicode_ci, but have it be configurable in settings.php.

sun’s picture

Status: Needs work » Needs review
FileSize
7.74 KB

Hopefully, it's able to access the connection options this time.

Status: Needs review » Needs work

The last submitted patch, drupal.default-collation.33.patch, failed testing.

Berdir’s picture

Fatal error: Cannot access protected property DatabaseConnection_mysql::$connectionOptions

I'm working on this...

Berdir’s picture

Status: Needs work » Needs review
FileSize
7.86 KB

Ok, added a getter to fetch the collation, let's try this.

sun’s picture

I think that getter already exists.

Status: Needs review » Needs work

The last submitted patch, drupal.default-collation.37.patch, failed testing.

sun’s picture

oh dear lord.

CREATE TABLE `test` (
  `name` varchar(255) collate utf8_unicode_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO `test` (`name`) VALUES
('abcde\\f'),
('abc%\\_');

SELECT
name,
hex(name),
name LIKE 'abc%\\\\_',
name COLLATE utf8_general_ci LIKE 'abc%\\\\_'
FROM `test`;

name      hex(name)        name LIKE 'abc%\\\\_'   name COLLATE utf8_general_ci LIKE 'abc%\\\\_'
abcde\f   61626364655C66   0                       1
abc%\_    616263255C5F     0                       1

WTF?

cog.rusty’s picture

Hmm... So, literals seem to be utf8_general_ci.

What was the database collation and the connection collation?
Or does this come from the default collation of utf8?

---- Edit:
Duh, I missed the point. The right column is absurd.

sun’s picture

Sorry for not following up earlier, but it seems you realized the absurdity in the meantime. I left out the hex(name COLLATE utf8_general_ci) column in that dump, because the result was identical to the regular hex() column.

What's also not visible: It seems that none of the % and _ wildcards are behaving the same under utf8_unicode_ci. Only name LIKE 'abc%' returns 1 on both rows, everything else does not.

cog.rusty’s picture

I am surprised nobody got scared and jumped in to say "Don't touch that!" >;-)

arjenlentz’s picture

Changing the default collation on tables to utf8_unicode_ci is likely to break other languages/datasets, the fact that it appears to "solve" this one particular issue does not validate it as a generically applicable solution.

I am verifying with my former colleagues at MySQL what the exact before/after behaviour is (see my comment on http://bugs.mysql.com/43593), to be able to validate/verify both the original assertion in this Drupal bug report as well as the MySQL bugfix.

For both before/after, we need to know whether it relates to
- transcription (like u umlaut can be transcribed to ue and be sorted after)
- sort order (and in what way, position in alphabet please)
- equivalence (two different chars regarded as the same)
* added question: does collation ever introduce equivalence, or always merely order?

The assertion in the drupal bug is that the bugfix actually makes rows disappear as
duplicates on rebuilding of the table. this would indicate equivalence. Now, as far as I
know a "'ß' LATIN SMALL LETTER SHARP S" -similar to all other German characters- is either separate next to S, separate at end of alphabet, or transcribed to SS, so I don't see how it could ever trigger a duplicate.

So we need answers to the above before discussing patches.

ShannonK’s picture

Subscribing. BUT....

I unfortunately don't speak techie, so, can someone let me know if I'm understanding at least a little of what you guys are saying here. Is the consensus that rookies (like me) should wait on this if we're receiving this error?

In my case, I was creating a view for a calendar view and I got the following message:

* user warning: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' query: SELECT s.uid, v.updated FROM views_object_cache v INNER JOIN sessions s ON v.sid = s.sid WHERE s.sid != 'a766bd11b4b631924015d810fed49984' and v.name = 'date_browser' and v.obj = 'view' ORDER BY v.updated ASC in /home/shannon/public_html/modules/views/views_ui.module on line 249.
* user warning: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' query: SELECT s.uid, v.updated FROM views_object_cache v INNER JOIN sessions s ON v.sid = s.sid WHERE s.sid != 'a766bd11b4b631924015d810fed49984' and v.name = 'date_browser' and v.obj = 'view' ORDER BY v.updated ASC in /home/shannon/public_html/modules/views/views_ui.module on line 249.

For what it's worth I have Drupal 6.16 and MySQL 5.1.30. We just have the English language on our site. My host is Hostgator.

From reading your posts this sounds pretty complicated. Should I wait for a patch? I'm not sure what I should do...

cog.rusty’s picture

No, there is nothing to wait for, you should try to fix the problem. This discussion, when and if comes to a conclusion, won't help you with your problem in any way.

Your problem is that:
- Table "views_object_cache" has collation utf8_genereral_ci
- Table "sessions" has collation utf8_unicode_ci
This may be an indication (or not) that there are more tables with the wrong collation.

But this is not the place to discuss it. Take a look at a similar support question that I answered today at http://drupal.org/node/800016 and ask there for any clarification you want.

YesCT’s picture

In preparation for the new "major" issue priority, I'm tagging this (not actually critical) issue as priority-major.
See: http://drupal.org/node/669048#comment-3019824 (#669048: We should have four priorities not three (introduce a new 'major' priority)) for more information about the coming major issue priority.

Crell’s picture

Another data point I just discovered the hard way:

Even when using MySQL 5.0, which doesn't fatal on mismatched collations, if you try to join table A and table B on A.foo to B.foo, and A.foo and B.foo are using different character collations, *all indexes will be ignored* and MySQL will slowly go through and convert and compare each field individually. This results in queries whose performance sucks for reasons that have nothing to do with the query.

I hate character encodings.

Has there been any other progress on this issue? Ajren, did your MySQL colleagues have anything to add?

Crell’s picture

So statement 1: MySQL's collation handling sucks. We cannot fix the fact that it sucks. We are not going to try.

Statement 2: Given that, what can we do to Drupal to at least keep from making matters worse? That's as much as we're going to be able to do, so let's focus on that. This issue is marked critical so we really should try to resolve it. :-)

sun’s picture

I'm still baffled. Current status and circular reference is:

  • utf8_general_ci in MySQL 5.1 leads to duplicate key errors, due to collation changes in utf8_general_ci. Solution: Switch all collations to utf8_unicode_ci + fix core to force/respect the collation.
  • utf8_unicode_ci in MySQL 5.1 leads to exceptional character comparison effects, especially but perhaps not limited to LIKE expressions. See #39 and #41. Solution: Back to utf8_general_ci? Which of both is worse - not being able to store or not being able to find stored data?

I'm kinda lost. If we were able to simply state "Stay away from MySQL 5.1"... but obviously this ain't possible.

The only nice thing to discover in this issue so far is that Drupal's database tests revealed aforementioned weird LIKE expression bug under utf8_unicode_ci.

Berdir’s picture

I guess we just have to add utf8_nottotallybroken_ci :)

Are there any mysql collation experts we could ask? (In or outsite the drupal community)

cog.rusty’s picture

@sun,
What they admitted for utf8_general_ci was only index errors, for which the suggested re-import. Is there any bug report for irreparable duplicate key errors?

Crell’s picture

I've pinged an old contact of mine at MySQL. Hopefully he remembers me. :-)

datacharmer’s picture

@sun,
I have been looking at the bug report for a while, and although I grasp the general issue, I yet fail to nail the problem down to a specific case.
Even the discussion in the MySQL bug report (Bug#40053) has some fuzzy angles that I don't understand if they are relevant to this issue or not.

Could you provide a contained SQL test case that shows the problem? Specifically, I need a (possibly short) snippet of SQL, including the CREATE TABLE, INSERT, and settings of all "%character%" and
"%collation%" variables.
Something that I can run on MySQL 5.0.91 and get a pass (if I understand your problem correctly), and then run on 5.1.48 and get a fail.
Or, if my assumptions about MySQL versions were wrong, I need a set of instructions that gave you the expected results in your previous database (which you need to identify with version and settings) and the new one.

Thanks

Giuseppe

Crell’s picture

Component: database system » mysql database

(Note for those playing at home: datacharmer is the MySQL contact I mentioned. :-) Yay for Vendor involvement!)

andypost’s picture

This issue leads to interesting troubles with localization because strings are searched in {locales_source} by exact string. So better to add some hash(string) column to this table to prevent comparison on strings as l10n_server already does. Also there's discussion #532512: Plural string storage is broken, editing UI is missing

andypost’s picture

Confirm this bug on Debian - mysql 5.1.48-1

Code to reproduce this bug (edit by chx: added SET NAMES utf8)

SET NAMES utf8;
DROP TABLE demo;
CREATE TABLE `demo` (
  `test` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`test`)
) CHARSET=utf8 COLLATE=utf8_general_ci;
INSERT INTO demo VALUES ('Reißer');
INSERT INTO demo VALUES ('Reiser');

ERROR 1062 (23000): Duplicate entry 'Reiser' for key 'PRIMARY'

cog.rusty’s picture

Hmm... This looks like a straight bug introduced in MySQL 5.1 with utf_general_ci. It is not just a conversion thing, and doesn't happen in MySQL 5.0 with any collation.

catch’s picture

Like Crell I'm not sure what Drupal can do here. Is there a MySQL bug report for this?

klonos’s picture

...well, according to #54 datacharmer is the right person to answer that question ;)

@datacharmer: ping?

chx’s picture

I can't reproduce on 5.1.41-3ubuntu12.3-log (Ubuntu) Edit: Oh, I can! Needed SET NAMES UTF8, added.

chx’s picture

#39 looks broken to my eyes too many backslashes in the LIKE but then I might be mistaken.

This issue is not trivial to fix by far. Because the way we store data it might be insolvable. You have a Swedish - Hungarian site, what collation should node.title be, utf8_swedish_ci or utf8_hungarian_ci?

sun’s picture

Priority: Critical » Normal

So, everyone is able to replicate this bug. Nice.

I wanted to suggest to get back to the decision in #18 - whether to enforce a specific collation or make Drupal respect the default collation of the database.

That, I thought, would "solve" the pressing issue to some extent, as people, who have to use a different collation, were actually able to use a different collation. As of now, Drupal enforces the one and only utf8_general_ci for all new tables and columns, so there is no workaround for this situation. Since we can be sure that this bug will occur with MySQL 5.1.x, regardless of whether MySQL will fix anything, some hosting providers will expose this bug. When having to deal with those, it would be nice, if you could just go ahead and change the database default collation to utf8_unicode_ci (or whatever you have to use), and Drupal would respect that database configuration setting for all tables and columns being created.

At least, that's how I got around this weird collation issue. By changing the driver's connection initialization, new tables + columns created by modules are all in the expected database collation (the default, instead of enforced utf8_general_ci).

The next best consideration and decision on this issue was that, in general, such issues could be completely avoided, if Drupal was using and enforcing utf8_unicode_ci as collation. Technically, that's the fix I applied, too. (i.e., utf8_unicode_ci instead of any other specific collation)

However, in trying to adapt this solution for D7, our database tests revealed an even more weird collation issue with utf8_unicode_ci, which has been dumped into #39. (@chx: The backslashes are required for escaping, those SQL statements are 1:1 what D7's current database tests are doing, and therefore, also the reason why the patches in here did not pass.)

Technically speaking, certain records should no longer be found (albeit still existing) when using certain LIKE expressions/wildcards (there may be further quirks). This, however, did not yield any fatal errors or misbehaviors on my sites yet. Thus, it's still a viable solution to switch to utf8_unicode_ci - even if it may not be 100% free of bugs.

In summary:

  1. There is no working solution for MySQL 5.1 currently. Its collations are broken.
  2. Core's current enforcement of utf8_general_ci does not allow to work around these issues by configuring a different database default collation on a per-site basis, based on own/custom requirements. This can be considered a bug on its own, although the counter-argument was raised that enforcing a certain collation potentially makes more (collation) bugs relicable. Not sure whether I agree with this, as it's primarily the charset that matters. In any case, we need to ensure that Drupal is installed with a utf8 character set. We need to prevent "Illegal mix of collations" errors. Effectively, that's what this entire issue is about.
  3. Our current database tests are failing when enforcing utf8_unicode_ci. Those failures are kinda weird, eventually destroying the option to enforce a different default collation. But perhaps, we might also have a bug in our mysql driver that leads to these failures. Although the comparison failures can be replicated with the test SQL statements in #39 outside of Drupal.
  4. As hosting providers are starting to adopt MySQL 5.1, more and more international Drupal users will face this issue. You can't really ask your provider to downgrade.

Anyway, demoting from critical for now. Although this issue breaks entire sites, core and contributed functionality.

Damien Tournoud’s picture

I do also believe that #39 is bogus. But that doesn't meant that there are no issue with utf8_unicode_ci collations.

sun’s picture

If #39 is bogus, then our database tests may be bogus, too? If that would be the case, then we might be able to make one of the patches here work?

Damien Tournoud’s picture

Confirmed #39.

This is documented in http://bugs.mysql.com/bug.php?id=39808

Damien Tournoud’s picture

mysql> SELECT '|' COLLATE utf8_unicode_ci LIKE '%||%' ESCAPE '|';
+----------------------------------------------------+
| '|' COLLATE utf8_unicode_ci LIKE '%||%' ESCAPE '|' |
+----------------------------------------------------+
|                                                  0 | 
+----------------------------------------------------+
1 row in set (0.00 sec)

=> Wrong result.

mysql> SELECT '|' COLLATE utf8_general_ci LIKE '%||%' ESCAPE '|';
+----------------------------------------------------+
| '|' COLLATE utf8_general_ci LIKE '%||%' ESCAPE '|' |
+----------------------------------------------------+
|                                                  1 | 
+----------------------------------------------------+
1 row in set (0.00 sec)

=> Correct result.

That's on MySQL 5.0.51a-24+lenny4.

Damien Tournoud’s picture

Exactly the same result on 5.1.47-MariaDB-mariadb82.

Damien Tournoud’s picture

It seems that some double escaping is going on, because for '%|||%' it's the other way around.

Damien Tournoud’s picture

This is fun:

SELECT '|' COLLATE utf8_unicode_ci LIKE '|' ESCAPE '|';
=> 0 (correct, the pattern is invalid)

SELECT '|' COLLATE utf8_unicode_ci LIKE '||' ESCAPE '|';
=> 1 (correct)

SELECT '|' COLLATE utf8_unicode_ci LIKE '|||' ESCAPE '|';
=> 0 (correct, the pattern is invalid)

SELECT '|' COLLATE utf8_unicode_ci LIKE '||%' ESCAPE '|';
=> 1 (correct)

SELECT '|' COLLATE utf8_unicode_ci LIKE '%||' ESCAPE '|';
=> 0 (incorrect)

SELECT '|' COLLATE utf8_unicode_ci LIKE '%||%' ESCAPE '|';
=> 0 (incorrect)

SELECT '|' COLLATE utf8_unicode_ci LIKE '%|||%' ESCAPE '|';
=> 1 (incorrect)
Crell’s picture

Yuck yuck yuck. Agreed on downgrading this issue, though, as it's not Drupal's issue but MySQL's.

So in terms of possible workarounds, what is the problem with #37 (give or take a conversion script for site upgrades)? That does seem like the least-bad solution to me at the moment...

JamesDay’s picture

If you get duplicate errors when you try to load data, please correct your data. That's where the problem is.

1. The root cause is a correction in the MySQL utf8_general_ci collation that in 5.0 and earlier versions wrongly failed to treat s and ss as the same character: http://bugs.mysql.com/bug.php?id=27877
2. The import problem happens because the data has duplicate keys. They always were duplicates according to the German language sorting rules. MySQL just failed to detect them as duplicates because the collation was wrong. With the fix it will correctly identify them as duplicates.
3. If you get duplicate key errors when you import data from a 5.0 or earlier version into a 5.1 version, the correct action is to locate and correct the duplicate records in your data. Switching collations in Drupal or the database completely misses the point: the data needs correction, it shouldn't have been possible to enter it originally.
4. After the correction in 5.1 the utf8_general_ci collation works as intended and it's fine for Drupal to continue to use it. The easiest solution for Drupal is probably to continue to use it and help those who are importing data with duplicate values that were saved in earlier MySQL versions.
5. It seems that potential character set or collation issues both within Drupal and the MySQL utf8_unicode_ci collation have been found as a result of investigating this and those probably merit fixes.

We will seek to avoid in the future:
a. Correcting this in the way we did. It would have been less disruptive for upgraders to introduce a new collation with the corrected order under a different name and let people choose to use the incorrect or correct rules. The utf8_german3_ci mentioned here was one option that could have worked better if we'd used it: http://bugs.mysql.com/bug.php?id=43593 . Still messy, but perhaps less so.
b. The upgrade inconvenience that resulted from this change.

Sorry for the inconvenience. We got it wrong in the first place and then used a less good solution than we could have. We've learned from the experience and don't plan to repeat it.

This issue is gradually fading over time, because the number of people who have never had a chance to store data with duplicate values because they have always used 5.1 or later is gradually increasing. It's only an issue for those who've been using 5.0 or earlier and have those duplicates. For this reason it doesn't seem to merit drastic work on the part of Drupal. Encouraging new users of Drupal to use 5.1 or later so they are never exposed to this issue would be good.

This post isn't the official view of Oracle. For that, seek out a company spokesperson.

James Day, MySQL Principal Technical Support Engineer, Oracle UK

Crell’s picture

Thanks, James! So to confirm, it sounds like your recommendation is:

- Use utf8_general_ci across all of Drupal.

- Discourage use of utf8_unicode_ci on the grounds that it's still buggy.

- If you have data collision problems when upgrading (as noted in the original post), that's a MySQL problem, not a Drupal problem, so punt.

Does that render this issue a won't fix?

sun’s picture

Thanks for your input, @JamesDay. I entirely disagree with 1. and 3. of your list, and can prove with test data, but that rather belongs into MySQL's issue tracker (not exactly sure into which issue though; seems like bug #27877 led to actual collation change, so I'll try there).

Therefore, this topic keeps on doing what it already started to do, it identified

1) a regression caused by the change applied via http://bugs.mysql.com/bug.php?id=27877 in MySQL 5.1 regarding utf8_general_ci. To be corrected/fixed over there.

2) a bug in MySQL's utf8_unicode_ci collation handling. To be fixed over in http://bugs.mysql.com/bug.php?id=39808, as @Damien noted.

3) a deficiency in Drupal's mysql driver, as it prevents usage of other collations than utf8_general_ci, which can be required, especially when dealing with multilingual data or when upgrading to MySQL 5.1. Drupal needs to support other collations, but enforce the utf8 character set.

To fix 3), I think a patch along the lines of #37 is required -- minus the switch to default to utf8_unicode_ci. The remaining changes in that patch intend to allow to set the default collation for MySQL via $databases in settings.php.

sun’s picture

Status: Needs work » Needs review
FileSize
5.98 KB

Basically, this.

Damien Tournoud’s picture

Status: Needs review » Reviewed & tested by the community

Yep, #74 looks like all that we need to do for Drupal right now. Given more choice is always good.

sun’s picture

To fix 1), I've added some details to http://bugs.mysql.com/bug.php?id=27877#c317080 - although I suspect it's not the right issue.

Crell’s picture

#74 looks good to me, and is as much as we can do on the Drupal end. Oy!

Dries’s picture

Status: Reviewed & tested by the community » Fixed

#74 looks good to me too. Committed to CVS HEAD.

sun’s picture

Over in http://bugs.mysql.com/bug.php?id=27877#c317175, MySQL started to suggest to not use utf8_general_ci anymore, because the collation won't be changed anymore. As a resolution, usage of utf8_unicode_ci has been suggested.

I'm trying hard to explain the critical problem of the collation changes over there.

However, if that really happens to be the end of the story, then we may have to adjust Drupal's installer to allow to select the collation to put into settings.php (if installing on MySQL).

Regular users won't understand that option, so ideally, we'd simply install in a better collation by default, if there wasn't that other bug with utf8_unicode_ci.

__
Bottom line:

Does all of this really happen by coincidence? Or is it possible that it is someone's plan to change things in a way that make the software's default behavior suddenly incompatible and unusable for a relatively large share of its users?

andypost’s picture

edit: As I understand mysql totally ignores all http://en.wikipedia.org/wiki/Diacritic
This looks terrible!

Same trobles with russian language for utf8_general_ci

mysql> SET NAMES utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT version();
+-------------------+
| version()         |
+-------------------+
| 5.0.51a-24+lenny4 |
+-------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE demo (id VARCHAR(255), PRIMARY KEY (id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO demo VALUES ('е');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO demo VALUES ('Ё');
ERROR 1062 (23000): Duplicate entry 'Ё' for key 1

Same troubles with utf8_unicode_ci


mysql> SET NAMES utf8 COLLATE utf8_unicode_ci;
Query OK, 0 rows affected (0.24 sec)

mysql> SELECT version();
+-------------------+
| version()         |
+-------------------+
| 5.0.51a-24+lenny4 |
+-------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE demo (id VARCHAR(255), PRIMARY KEY (id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO demo VALUES ('е');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO demo VALUES ('Ё');
ERROR 1062 (23000): Duplicate entry 'Ё' for key 1

the same with

mysql> SELECT version();
+-----------+
| version() |
+-----------+
| 5.1.48-1  |
+-----------+
1 row in set (0.00 sec)

PS: Italian has more troubles http://bugs.mysql.com/bug.php?id=43593#c298890

bjaspan’s picture

subscribe

MustangGB’s picture

Priority: Normal » Major
MustangGB’s picture

Tag update

JamesDay’s picture

sun, for Drupal to support different collations is be good because that is one way to handle these differences.

andypost, you have started a nice discussion among my Russian co-workers about IO and IE. :) You may find this bug report where we helped with a problem encountered by someone who had introduced a custom collation to do this of interest: http://bugs.mysql.com/bug.php?id=51976 . It's one of the cases where locals have differing views, as may users of Russian dialects in different countries, and your comment appears to reflect Belarusian use more then Russian, though opinions there differ also.

For the future there are some standards for how to handle non-default locales. Anyone who is curious might want to read more about UCA + CLDR, particularly http://cldr.unicode.org/index/cldr-spec/collation-guidelines . It won't help in the short term but I expect that, at no particular date, with no guarantee, there will be improvement in how this sort of thing is handled. Then people with different opinions, or different regional dialects of a particular language, may more easily get behavior that matches what they expect.

James Day, MySQL Principal Technical Support Engineer, Oracle UK

Pierre Paul Lefebvre’s picture

Subscribe too

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

markus_petrux’s picture

I'm not re-opening in the hope that this comment is read by those involved in this discussion, which might be enough to answer the following:

Could #74 be backported to D6?

sun’s picture

Version: 7.x-dev » 6.x-dev
Status: Closed (fixed) » Patch (to be ported)

Yes, that's doable. If no one beats me to it, I'll try to come up with an equivalent patch.

sun’s picture

Status: Patch (to be ported) » Needs review
FileSize
3.58 KB

Alright. I think this is pretty complete. Needs testing.

Since at least here in Germany and over there in Russia it's quite common to store values in German or Russian, I've specifically added a note about MySQL 5.1+ using a default that's incompatible with those languages. Most probably however, that note also applies to other languages.

At least the original bug with German SHARP-S remains, but MySQL entirely fails to understand that they introduced a critical regression, and I'm not sure what else I could try to make them understand. Also, contrary to follow-ups on this issue, the follow-ups on the MySQL bug report officially suggest to use utf8_unicode_ci as replacement collation, so expert/vendor advice on whether utf8_unicode_ci should or could be used seems to heavily digress. As of now, it is not safe to install and use Drupal with the default collation of MySQL 5.1+, which means that Drupal's installer should actually ask the user whether the site may store non-English data, and dynamically set utf8_unicode_ci or another collation... while utf8_unicode_ci is not safe to use either, because of the LIKE wildcard bug (issue)... it's a total mess. And as mentioned before, all of this leads to gazillions of mysterious bug reports filed against arbitrary projects on drupal.org. From my perspective, we could as well simply drop support for MySQL 5.1+, because right now, I am unable to properly and safely store and read Drupal content in German.

Let's face it: If we forget about that database's name for a moment, then we simply wouldn't support it, or at least, we'd support <=5.0 only.

markus_petrux’s picture

Thanks!

Here's an updated patch to fix the following issues:

1) The mysql driver was not patched (only the mysqli driver was).
2) Expose $db_collation as a global variable for processing of settings.php.

Tested and seems to work well.

klonos’s picture

It's been half a year now. Can we get this in?

Status: Needs review » Needs work

The last submitted patch, drupal-DRUPAL-6.collation.90.patch, failed testing.

marcoka’s picture

That is somehow extreme important. Today i did some custom querys with a .SELECT.. WHERE name =''Teppichboden". With utf8_general_ci this returns the row with Teppichboden and Teppichböden.

klonos’s picture

...and that is bad because?

sun’s picture

Status: Needs work » Reviewed & tested by the community

Thanks, Markus! Let's get this sucker in.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Ok, committed, thanks.

bobburns’s picture

This post went on and on with no solution. It is usually a database problem caused by a database set up as general_ci instead of unicode. Fantastico is known to cause this among other automated setup programs for Drupal. Since these posts get indexed in search engines - a solution posted is important or it is garbage in garbage out.

Simply convert the database and the tables.

Use a simple php program below

// your connection
mysql_connect("localhost","username","password");
mysql_select_db("databasename");

// convert code
$res = mysql_query("SHOW TABLES");
while ($row = mysql_fetch_array($res))
{
    foreach ($row as $key => $table)
    {
        mysql_query("ALTER TABLE " . $table . " CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci");
        echo $key . " =&gt; " . $table . " CONVERTED<br />";
    }
}

Where username, password and databasename are all your current values

Save it as something like "convert.php" (no quote marks of course) and push it up to the root of your server

Navigate to the server from your browser and run the script - and it will return "CONVERTED" for all tables it converts.

Delete the file when you are done

The mixed mode error should go away.

It fixes errors like this below - the trigger module seems to bring the warning about mixed mode collations

user warning: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' query: SELECT aa.aid, a.type FROM trigger_assignments aa LEFT JOIN actions a ON aa.aid = a.aid WHERE aa.hook = 'nodeapi' AND aa.op = 'presave' ORDER BY weight in /home/saltatv/public_html/MYSITE.com/profesionales/modules/trigger/trigger.module on line 146.

EvanDonovan’s picture

Status: Fixed » Needs work

@bobburns: Are you on the latest D6? Also, does the patch fix the issue for people who already have a database with mixed collations - I forget.

chx’s picture

Status: Needs work » Fixed

There is nothing to work on.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

MustangGB’s picture

Can I just confirm the summary of this issue from an outsiders point of view

In the long run we would prefer to use utf8_unicode_ci
However it's implementation is buggy and therefore currently not recommended
For now the use of utf8_general_ci is preferred

Is this correct?

If so, does this apply to MySQL 5.1 only, are the bugs still present in 5.5 and can utf8_unicode_ci be used in conjunction with 5.5 without issues?

asb’s picture

sun’s picture

@bobburns + others seeking for a collation conversion script:

#19 contained a schema + data conversion script as a Drupal module update.

This module update was executed on multiple, larger databases already, and ran fairly quick, so a batch doesn't seem to be required.

I might add that conversion functionality as UI feature to Demo module at some point. Proper export/import handling for database/connection/schema/data collations in light of this issue was already changed via #765186: Collation not dumped for tables/columns, triggers nebulous MySQL 5.0 => 5.1 incompatibilities.

Lastly, acquia also has some good news on this issue, but I don't know whether I'm allowed to disclose it.

sun’s picture

benheinzeroth’s picture

Issue summary: View changes

Updated issue summary.

bburg’s picture

Component: mysql database » ajax system

Posting here as an FYI to future Drupalers with collation problems in their databases (this issue is the most relevant in the top search results).

I've been playing around with this on a database that appeared to be initialized with default collation set to latin1_swedish_ci. This was causing problems with Feeds importer trying to import content with some diacritical marks to fields stored in that collation. e.g.

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE2\x80\xA8' for column 'body_value' at row 1

Converting the tables to utf8 seems to fix the issue, but you can't just blanket convert all tables/columns to use utf8_unicode_ci collation because some columns will need to be store their information with utf8_bin instead. The _ci in these collation names refers to case-insensitive, which also means that they are accent insensitive (at least partly). So "Sandal" == "sandal", and "für" == "fur" in these collations. As they are considered to be equal with a case-insensitive collation, they will not be considered be unique entries in a primary key column and you will get an error when trying to convert.

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'für' for key 'PRIMARY'

Instances where this popped up as a problem for me were the uri column in the file_managed table and word in search_total. I am currently working to update the script in #19 to work in D7 and consider the binary/case sensitivity status of a table before altering it.

bburg’s picture

Component: ajax system » database system

mysql database no longer appears to be an available component, and the form defaulted to the top item.

bburg’s picture

FileSize
2.83 KB

Here is a the updated version of the script I came up with, which will work with D7, and consider whether the collation was binary or case-insensitive. Un-tested anywhere except my local environment, so use at your own risk.

bburg’s picture

The script in #108 will cause updates to the Drupal search index to stall (a cron task) as it doesn't update the word column in the search_index table to also use utf8_bin collation. I'll follow up with an update to the script, but in the meantime, the query to fix this is:

alter table search_index modify column word varchar(50) CHARACTER SET utf8 COLLATE utf8_bin;

SergFromSD’s picture

Hey bburg,

Just starting a new project for a site in German and came across this discussion. As you were the last person to respond (about a year ago) wanted to know if this is still an issue and if you end up using utf8_bin for your database collation in D7 instead of utf8_general_ci? Did you have an issue with the drupal tables defaulting to utf8_general_ci as originally reported when this thread was created? This is a project where the content will be imported for tables that are encoded with in latin1_swedish_ci.

Thanks

bburg’s picture

FileSize
2.92 KB

@sgarcia.sd@gmail.com

I managed to dig up what I used for that script. I'm not sure how much it deviated from #109