Problem
- When configuring a different database default collation than MySQL's default collation for utf8,
utf8_general_ci
, and new tables are created by Drupal, then Drupal does not respect that database default collation and usesutf8_general_ci
instead.
Goal
- Make Drupal respect the default collation of the database under MySQL.
Details
- MySQL 5.1 contains an incompatible, i.e. not backwards-compatible change for the
utf8_general_ci
collation, which mainly affects primary keys, unique keys, and indexes.- Among other (incompatible) changes, the German letter sharp-s (
ß
) is interpreted differently (as a single characters
), which can easily lead to duplicate key errors. For example, when having two usernamesReißer
andReiser
. - The upgrade notes http://dev.mysql.com/doc/refman/5.1/en/upgrading-from-previous-series.html state:
Incompatible change: Character set or collation changes were made in MySQL 5.1.21, 5.1.23, and 5.1.24
that may require table indexes to be rebuilt.
For details, see Section 2.12.3, "Checking Whether Table Indexes Must Be Rebuilt". - Further documentation on http://dev.mysql.com/doc/refman/5.1/en/checking-table-incompatibilities.... states:
MySQL 5.1.24, 5.4.0 (Bug#27877)
Affects indexes that use the utf8_general_ci or ucs2_general_ci collation for columns that contain 'ß' LATIN SMALL LETTER SHARP S (German).
Affected tables can be detected by CHECK TABLE ... FOR UPGRADE as of MySQL 5.1.30, 5.4.0 (see Bug#40053). - The actual bug report and commit that lead to the change: http://bugs.mysql.com/bug.php?id=43593
- Among other (incompatible) changes, the German letter sharp-s (
- The solution for this incompatible collation change is to convert the database/tables/columns into
utf8_unicode_ci
, which works for all languages (not only German), but is often mentioned to be slower due to multi-byte comparisons (though I couldn't find any benchmarks that prove this). - By default, Drupal's MySQL database driver appends the following to the
CREATE TABLE
statement:
DEFAULT CHARACTER SET UTF8
As this suffix does not contain the
COLLATE
statement, MySQL always defaults to the default collation of the current connection, i.e.utf8_general_ci
. - Users who managed to get around this incompatible collation change in
utf8_general_ci
(by converting their data toutf8_unicode_ci
) may think that they resolved this issue, but they have not and they cannot. Installing new modules, creating new content types (CCK), adding fields, and creating any arbitrary table, triggers aforementioned default character set and collation, i.e.:While existing (converted) tables are using
utf8_unicode_ci
, every new table is usingutf8_general_ci
. - The bogus mix of different collations stays hidden, until you try to use a module that exposes the mismatch due to advanced SQL expressions. For example, Pathauto will trigger the following error message:
user warning: Illegal mix of collations (utf8_general_ci, IMPLICIT) and (utf8_unicode_ci, IMPLICIT) for operation '=' query: SELECT COUNT(nid) FROM node n LEFT JOIN url_alias alias ON CONCAT('node/', CAST(n.nid AS CHAR)) = alias.src WHERE alias.src IS NULL AND n.type IN ('page', 'location', 'webform')
That is, because CAST() is using the default collation of the connection: http://dev.mysql.com/doc/refman/5.1/en/charset-convert.html (see below regarding
SET NAMES
)
Procedure
- To resolve this critical bug, we need to remove the
DEFAULT CHARACTER SET UTF8
suffix fromCREATE TABLE
statements, so as to make MySQL derive the default database collation.
Links
- List of default character sets and their default collations: http://dev.mysql.com/doc/refman/5.1/en/charset-charsets.html
Notes
- Originally, I assumed that our initialization of the MySQL database connection would be guilty, since
SET NAMES utf8;
"officially" means that the default collation of the server is used (
utf8_general_ci
for charsetutf8
). OnlySET CHARACTER SET utf8
should respect the default collation of the database for the current connection. At least, that's what is precisely stated in http://dev.mysql.com/doc/refman/5.1/en/charset-connection.htmlHowever, further manual testing revealed that tables are created in the proper default collation even when leaving the
SET NAMES
as is.Though I didn't test the CAST() use-case from above yet. (!)
If the MySQL manual is correct, then we may have to change our database connection initialization from
SET NAMES utf8
intoSET CHARACTER SET utf8;
as that would default the default connection collation to the default database collation -- instead of the default server collation for the character set utf8.
- As of now, it seems like above mentioned suffix for
CREATE TABLE
is Drupal's only enforcement of theutf8
character set for created tables. By removing the suffix, tables are created in the default character set and collation of the database, which may not be utf8. This also affects Drupal's installer.I'm actually not sure about the implications, but theoretically, it shouldn't really matter in which character set and/or collation the data is stored - the data should be properly converted during getting and setting. (But you surely may not read all stored data via other tools, such as phpMyAdmin.)
Anyhow, I do not pretend to be an expert in this area, but we may need to add a installation requirement to ensure that the default collation of the database is
utf8_*
, so as to prevent storing data in non-utf8 collations. - For international Drupal users, which also includes mono-lingual Drupal installations in a different language than English, the default collation of
utf8_general_ci
effectively is unusable starting with MySQL 5.1. While you can probably install Drupal, you potentially will not be able to import an exported database dump of that installation due to duplicate key errors -- which is where this entire story began for me.Therefore, one further question is whether we should not require and enforce
utf8_unicode_ci
during installation.
Comment | File | Size | Author |
---|---|---|---|
#111 | utf8_collate_update-111.txt | 2.92 KB | bburg |
#108 | utf8_collate_update.txt | 2.83 KB | bburg |
#90 | drupal-DRUPAL-6.collation.90.patch | 5.23 KB | markus_petrux |
#89 | drupal-DRUPAL-6.collation.89.patch | 3.58 KB | sun |
#74 | drupal.default-collation.74.patch | 5.98 KB | sun |
Comments
Comment #1
sunAlthough probably intended to be more generic, the following manual page describes the difference between
SET NAMES
andSET CHARACTER SET
better:http://dev.mysql.com/doc/refman/5.1/en/set-option.html
Comment #2
zroger CreditAttribution: zroger commentedI've seen this soooo many times and was never able to track it down. Working on numerous multilingual sites and importing production databases into my local environment, this seems like it could very well be the culprit.
I'm not sure how I'll be able to verify the patch, since I was never able to reliably replicate the errors, but I'll try anyways. Anyhow, thanks for taking the time to write this up.
Comment #3
jpmckinney CreditAttribution: jpmckinney commentedThanks for the detailed report. I'm worried for users who have poorly configured databases on shared hosts whose default collation is something ridiculous and worse than utf8_general_ci. Here's a patch where we enforce utf8_unicode_ci.
Comment #4
sunWhy is this bug report critical? If it also applies to D6?
See http://drupal.org/project/issues/search?text=Illegal+mix+of+collations
Because no one was able to track down the real problem, and instead assumed a bogus database configuration. "Use the defaults" or "Install from scratch" were common replies to close those issues.
With the incompatible collation change in MySQL 5.1, we can no longer ignore this bug. Drupal destroys data here.
Comment #5
sun@jpmckinney: I fear that we cannot force any default. Why? Because that would trigger the very same bug for existing installations, just vice-versa (and this time, also including English Drupal installations, which ran just fine with utf8_general_ci previously).
However, we also need to discuss the questions I raised in the notes.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commentedDefault installation are rarely UTF8. Debian, which powers most of our test slaves, is defaulting to
latin1
, and that explains the failure in the original patch.Comment #7
Damien Tournoud CreditAttribution: Damien Tournoud commentedLet's just include the collation in both the CREATE TABLE and the SET NAMES, default to
utf8_unicode_ci
and make that configurable in the database connection array.The only thing I'm worried about is how to migrate existing installations.
Comment #8
sun@Damien: After having spent days with debugging this overall charset/collation issue, I fear that we can't enforce any default, unless we require users to update their settings.php before upgrading to D7. When enforcing any default, users have to make 100% sure that their existing data matches.
Based on http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/, we cannot safely convert data. Not sure whether that post is outdated or applies to certain versions of MySQL only. I've used approach #1 to convert a couple of my sites without unwanted schema changes.
Thinking out aloud - can we think of a way to dynamically apply
utf8_unicode_ci
, in case the default database collation does not start withutf8
?Comment #9
klonossubscribing...
Comment #10
cog.rusty CreditAttribution: cog.rusty commentedTaking a step back, just two questions:
- Is there anything that guarantees that other "collation incompatibilities which affect indexes" won't also creep in utf8_unicode_ci in a future MySQL version?
- Is there anything here which can't be fixed by rebuilding the indexes after a MySQL version change?
Comment #11
sunCan we identify the default database collation during upgrade? We perhaps can. Don't sue me for this patch, it's just an idea.
@cog.rusty:
1) No, there is no guarantee. There's even no guarantee that existing sites are not purposively using a different collation already to fix other, custom issues they experienced on their site. But if Drupal core enforces any default, then utf8_unicode_ci is the most safe collation.
2) Nope, as mentioned in the OP: For me, the entire story began when I tried to import a MySQL 5.0 database dump into 5.1. Interestingly, when migrating a database via mysqlhotcopy across servers and versions, then you are able to run that database, run mysql upgrade commands, and those will simply kill (delete) any bogus records due to this collation change.
Comment #12
jpmckinney CreditAttribution: jpmckinney commented@cog.rusty: Future versions of MySQL, PHP, Apache, etc. may of course introduce bugs. But that's not a reason to avoid existing bugs (:
Comment #13
jpmckinney CreditAttribution: jpmckinney commentedComment #15
jpmckinney CreditAttribution: jpmckinney commented@sun: in #11, you store 'default_collation', but then you never use it? I don't see other occurrences of 'default_collation'.
Comment #16
sunOf course you don't see any usage of that, because I merely threw away that patch as idea ;) This one may provide some more clues. Still not sure whether it makes sense and is reliable.
Comment #18
sunPlease bear in mind: This has to be backported to D6.
I guess there are two different ways to solve this issue:
DEFAULT CHARACTER SET utf8
, so as to always use the default collation of the database. And add a requirement check to Drupal's installer, so as to only allow installation if the collation of the database uses a utf8 charset, i.e. the default collation starts withutf8_*
.Therefore, new installations can use
utf8_general_ci
orutf8_unicode_ci
, or any other utf8 collation of MySQL, i.e. whatever makes sense for the site/user. Existing installations can use whatever they used, and we simply add a step to the handbook pages for upgrading.Pro: We do not have to fiddle with the collation in run-time code. Keep the database configuration away from Drupal.
Con: While we can suggest and direct users to set the default collation to
utf8_unicode_ci
, most users probably won't have any clue of what they are doing. Second, I'm not sure whether you can alter the default database collation on shared hosts.DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci
, so as to force all sites to useutf8_unicode_ci
. But make the collation configurable in settings.php.For existing installations,
utf8_general_ci
, because Drupal created any new tables in that collation anyway. If there is a collation mismatch in existing tables, then the database administrator needs to resolve that through other means anyway.Pro: Drupal uses a collation that should work for all international users by default. Except for existing sites, we can also assume a certain collation for debugging.
Con: On existing sites,
utf8_general_ci
(or an auto-determined value) has to end up insettings.php
somehow; failure to do so will lead to "Illegal mix of collations" errors. Second, Drupal overrides the database configuration, which feels a bit poorly.Comment #19
sunTo save work for others, attached is a custom.install module update function that converts an existing database and tables/columns into
utf8_unicode_ci
.After doing so, I was able to finally test Pathauto's
CAST()
SQL expression. And it turns out that the MySQL manual aboutSET NAMES
is correct and we have to change the database connection initialization,if we want to respect the default collation of the database, i.e. mapping to aforementioned 1) in #18.
if we want to force a globally configured collation, i.e. mapping to aforementioned 2) in #18.
EDIT: After converting the collations, Schema module reports no mismatches.
Comment #20
cog.rusty CreditAttribution: cog.rusty commented@sun:
A configurable (2) seems fine at first glance (except if there are tables from other non-Drupal applications which disagree).
But I think (1) has a problem. There are many Drupal databases out there with default database charset set to latin1 and Drupal tables set to utf8. If (1), unlike SET NAMES, gets the connection collation from the database default, that would be a big problem.
Comment #21
Crell CreditAttribution: Crell commentedCan I just say for the record "screw you, MySQL!"
Given that everything else in Drupal is UTF8, do we need to support non-UTF8 character sets in the first place? Is there a reason why we'd want to allow other collations, and more importantly what else would that break? (Eg, all HTTP responses are forced to UTF8, as I recall.)
Comment #22
sun@Crell: Yes, Drupal should require/enforce a UTF8 character set. However, not necessarily the collation. Although the utf8_unicode_ci should work for all languages, there may be valid reasons for having to use a different collation.
In particular, each collation comes with different transliteration and sorting order rules. As can be read in the MySQL bug report I linked to, the commit additionally introduced a new
utf8_german3_ci
collation for German, which intends to retain the original behavior ofutf8_general_ci
(WTF?)...The documentation page about Unicode character sets skims the detailed differences of those utf8 collations. (EDIT: Note the user comment about Hebrew on that page, for example of a possible application-specific need to use a language-specific collation.)
Comment #23
Damien Tournoud CreditAttribution: Damien Tournoud commentedMost of the time, collation mismatch can be easily resolved by MySQL.
There are some special cases where the server cannot resolve that (see http://dev.mysql.com/doc/refman/5.0/en/charset-collate-tricky.html), and those become problematic. I would consider that a bug in the module that issues those queries.
In the Pathauto query, would changing:
to
Fix the problem?
Comment #24
cog.rusty CreditAttribution: cog.rusty commentedIt doesn't seem to work. I tried it in a database where the database, the table, and the column were all set to utf8_unicode_ci. Apparently CONCAT() enforces the default collation of the utf8 character set (utf8_general_ci), which it gets from the CAST().
And now that I think of it, no amount of enforcement of utf8_unicode_ci will change that...
Comment #25
sunI already explained in the OP as well as in #19 that CAST() always uses the collation of the database connection.
SET NAMES
triggersutf8_general_ci
, since it's missingCOLLATE
. Either we add that, or we replaceSET NAMES
withSET CHARACTER SET
, which uses the default collation of the database for the connection.However, overall we have to decide on how to move forward, by choosing one of the options in #18 + #19. Either respect the database configuration, or globally enforce a certain collation, optionally alterable via settings.php.
Comment #26
cog.rusty CreditAttribution: cog.rusty commentedI just tested it, you are right, it comes from the connection collation.
At this point, I think either a workaround for CASTs must be found or a configurable collation must be enforced.
Trusting the database collation is no good, because people have been doing one-click database creation from their cpanels for years, without anybody or anything instructing them to change the charset or the collation.
Comment #27
Damien Tournoud CreditAttribution: Damien Tournoud commentedI don't really get it. What part of
CONCAT('node/', CAST(n.nid AS CHAR)) = alias.src
actually fails? The CONCAT()? The equality operator?I don't see how the CONCAT() could fail if the string literal and the CAST() default to the connection collation.
Comment #28
cog.rusty CreditAttribution: cog.rusty commentedMy test showed that the problem is the CAST().
Using a database connection with utf8_general_ci (because of Drupal's
SET NAMES utf8
) and a table and a column with utf8_unicode_ci (because the user wanted it so), and running a similar query:- with
column = CONCAT('literal1', 'literal2')
there is no problem.- with
column = CONCAT('literal1', CAST(15 as char))
a conflict with utf8_general_ci occurs.It seems that CONCAT with only literals does not become utf8_general_ci.
The result is exactly the same whether the database collation is set to utf8_general_ci or to utf8_unicode_ci or even to latin1_swedish_ci.
Comment #29
andypostsubscribe
Comment #30
sunAttached patch implements option 2) of #18 and #19, i.e. enforcing utf8_unicode_ci.
Comment #32
EvanDonovan CreditAttribution: EvanDonovan commentedSubscribing.
I have had CAST() issues with "latin1_swedish_ci" set as default on CPanel before, and had to change collation manually on all tables. I didn't realize that this was considered a bug in Drupal now. I think that Drupal should enforce utf8_unicode_ci, but have it be configurable in settings.php.
Comment #33
sunHopefully, it's able to access the connection options this time.
Comment #35
BerdirI'm working on this...
Comment #36
BerdirOk, added a getter to fetch the collation, let's try this.
Comment #37
sunI think that getter already exists.
Comment #39
sunoh dear lord.
WTF?
Comment #40
cog.rusty CreditAttribution: cog.rusty commentedHmm... So, literals seem to be utf8_general_ci.
What was the database collation and the connection collation?
Or does this come from the default collation of utf8?
---- Edit:
Duh, I missed the point. The right column is absurd.
Comment #41
sunSorry for not following up earlier, but it seems you realized the absurdity in the meantime. I left out the
hex(name COLLATE utf8_general_ci)
column in that dump, because the result was identical to the regular hex() column.What's also not visible: It seems that none of the % and _ wildcards are behaving the same under utf8_unicode_ci. Only
name LIKE 'abc%'
returns 1 on both rows, everything else does not.Comment #42
cog.rusty CreditAttribution: cog.rusty commentedI am surprised nobody got scared and jumped in to say "Don't touch that!" >;-)
Comment #43
arjenlentz CreditAttribution: arjenlentz commentedChanging the default collation on tables to utf8_unicode_ci is likely to break other languages/datasets, the fact that it appears to "solve" this one particular issue does not validate it as a generically applicable solution.
I am verifying with my former colleagues at MySQL what the exact before/after behaviour is (see my comment on http://bugs.mysql.com/43593), to be able to validate/verify both the original assertion in this Drupal bug report as well as the MySQL bugfix.
For both before/after, we need to know whether it relates to
- transcription (like u umlaut can be transcribed to ue and be sorted after)
- sort order (and in what way, position in alphabet please)
- equivalence (two different chars regarded as the same)
* added question: does collation ever introduce equivalence, or always merely order?
The assertion in the drupal bug is that the bugfix actually makes rows disappear as
duplicates on rebuilding of the table. this would indicate equivalence. Now, as far as I
know a "'ß' LATIN SMALL LETTER SHARP S" -similar to all other German characters- is either separate next to S, separate at end of alphabet, or transcribed to SS, so I don't see how it could ever trigger a duplicate.
So we need answers to the above before discussing patches.
Comment #44
ShannonK CreditAttribution: ShannonK commentedSubscribing. BUT....
I unfortunately don't speak techie, so, can someone let me know if I'm understanding at least a little of what you guys are saying here. Is the consensus that rookies (like me) should wait on this if we're receiving this error?
In my case, I was creating a view for a calendar view and I got the following message:
For what it's worth I have Drupal 6.16 and MySQL 5.1.30. We just have the English language on our site. My host is Hostgator.
From reading your posts this sounds pretty complicated. Should I wait for a patch? I'm not sure what I should do...
Comment #45
cog.rusty CreditAttribution: cog.rusty commentedNo, there is nothing to wait for, you should try to fix the problem. This discussion, when and if comes to a conclusion, won't help you with your problem in any way.
Your problem is that:
- Table "views_object_cache" has collation utf8_genereral_ci
- Table "sessions" has collation utf8_unicode_ci
This may be an indication (or not) that there are more tables with the wrong collation.
But this is not the place to discuss it. Take a look at a similar support question that I answered today at http://drupal.org/node/800016 and ask there for any clarification you want.
Comment #46
YesCT CreditAttribution: YesCT commentedIn preparation for the new "major" issue priority, I'm tagging this (not actually critical) issue as priority-major.
See: http://drupal.org/node/669048#comment-3019824 (#669048: We should have four priorities not three (introduce a new 'major' priority)) for more information about the coming major issue priority.
Comment #47
Crell CreditAttribution: Crell commentedAnother data point I just discovered the hard way:
Even when using MySQL 5.0, which doesn't fatal on mismatched collations, if you try to join table A and table B on A.foo to B.foo, and A.foo and B.foo are using different character collations, *all indexes will be ignored* and MySQL will slowly go through and convert and compare each field individually. This results in queries whose performance sucks for reasons that have nothing to do with the query.
I hate character encodings.
Has there been any other progress on this issue? Ajren, did your MySQL colleagues have anything to add?
Comment #48
Crell CreditAttribution: Crell commentedSo statement 1: MySQL's collation handling sucks. We cannot fix the fact that it sucks. We are not going to try.
Statement 2: Given that, what can we do to Drupal to at least keep from making matters worse? That's as much as we're going to be able to do, so let's focus on that. This issue is marked critical so we really should try to resolve it. :-)
Comment #49
sunI'm still baffled. Current status and circular reference is:
I'm kinda lost. If we were able to simply state "Stay away from MySQL 5.1"... but obviously this ain't possible.
The only nice thing to discover in this issue so far is that Drupal's database tests revealed aforementioned weird LIKE expression bug under utf8_unicode_ci.
Comment #50
BerdirI guess we just have to add utf8_nottotallybroken_ci :)
Are there any mysql collation experts we could ask? (In or outsite the drupal community)
Comment #51
cog.rusty CreditAttribution: cog.rusty commented@sun,
What they admitted for utf8_general_ci was only index errors, for which the suggested re-import. Is there any bug report for irreparable duplicate key errors?
Comment #52
Crell CreditAttribution: Crell commentedI've pinged an old contact of mine at MySQL. Hopefully he remembers me. :-)
Comment #53
datacharmer CreditAttribution: datacharmer commented@sun,
I have been looking at the bug report for a while, and although I grasp the general issue, I yet fail to nail the problem down to a specific case.
Even the discussion in the MySQL bug report (Bug#40053) has some fuzzy angles that I don't understand if they are relevant to this issue or not.
Could you provide a contained SQL test case that shows the problem? Specifically, I need a (possibly short) snippet of SQL, including the CREATE TABLE, INSERT, and settings of all "%character%" and
"%collation%" variables.
Something that I can run on MySQL 5.0.91 and get a pass (if I understand your problem correctly), and then run on 5.1.48 and get a fail.
Or, if my assumptions about MySQL versions were wrong, I need a set of instructions that gave you the expected results in your previous database (which you need to identify with version and settings) and the new one.
Thanks
Giuseppe
Comment #54
Crell CreditAttribution: Crell commented(Note for those playing at home: datacharmer is the MySQL contact I mentioned. :-) Yay for Vendor involvement!)
Comment #55
andypostThis issue leads to interesting troubles with localization because strings are searched in {locales_source} by exact string. So better to add some hash(string) column to this table to prevent comparison on strings as l10n_server already does. Also there's discussion #532512: Plural string storage is broken, editing UI is missing
Comment #56
andypostConfirm this bug on Debian - mysql 5.1.48-1
Code to reproduce this bug (edit by chx: added SET NAMES utf8)
ERROR 1062 (23000): Duplicate entry 'Reiser' for key 'PRIMARY'
Comment #57
cog.rusty CreditAttribution: cog.rusty commentedHmm... This looks like a straight bug introduced in MySQL 5.1 with utf_general_ci. It is not just a conversion thing, and doesn't happen in MySQL 5.0 with any collation.
Comment #58
catchLike Crell I'm not sure what Drupal can do here. Is there a MySQL bug report for this?
Comment #59
klonos...well, according to #54 datacharmer is the right person to answer that question ;)
@datacharmer: ping?
Comment #60
chx CreditAttribution: chx commentedI can't reproduce on 5.1.41-3ubuntu12.3-log (Ubuntu) Edit: Oh, I can! Needed SET NAMES UTF8, added.
Comment #61
chx CreditAttribution: chx commented#39 looks broken to my eyes too many backslashes in the LIKE but then I might be mistaken.
This issue is not trivial to fix by far. Because the way we store data it might be insolvable. You have a Swedish - Hungarian site, what collation should node.title be, utf8_swedish_ci or utf8_hungarian_ci?
Comment #62
sunSo, everyone is able to replicate this bug. Nice.
I wanted to suggest to get back to the decision in #18 - whether to enforce a specific collation or make Drupal respect the default collation of the database.
That, I thought, would "solve" the pressing issue to some extent, as people, who have to use a different collation, were actually able to use a different collation. As of now, Drupal enforces the one and only utf8_general_ci for all new tables and columns, so there is no workaround for this situation. Since we can be sure that this bug will occur with MySQL 5.1.x, regardless of whether MySQL will fix anything, some hosting providers will expose this bug. When having to deal with those, it would be nice, if you could just go ahead and change the database default collation to utf8_unicode_ci (or whatever you have to use), and Drupal would respect that database configuration setting for all tables and columns being created.
At least, that's how I got around this weird collation issue. By changing the driver's connection initialization, new tables + columns created by modules are all in the expected database collation (the default, instead of enforced utf8_general_ci).
The next best consideration and decision on this issue was that, in general, such issues could be completely avoided, if Drupal was using and enforcing utf8_unicode_ci as collation. Technically, that's the fix I applied, too. (i.e., utf8_unicode_ci instead of any other specific collation)
However, in trying to adapt this solution for D7, our database tests revealed an even more weird collation issue with utf8_unicode_ci, which has been dumped into #39. (@chx: The backslashes are required for escaping, those SQL statements are 1:1 what D7's current database tests are doing, and therefore, also the reason why the patches in here did not pass.)
Technically speaking, certain records should no longer be found (albeit still existing) when using certain LIKE expressions/wildcards (there may be further quirks). This, however, did not yield any fatal errors or misbehaviors on my sites yet. Thus, it's still a viable solution to switch to utf8_unicode_ci - even if it may not be 100% free of bugs.
In summary:
Anyway, demoting from critical for now. Although this issue breaks entire sites, core and contributed functionality.
Comment #63
Damien Tournoud CreditAttribution: Damien Tournoud commentedI do also believe that #39 is bogus. But that doesn't meant that there are no issue with utf8_unicode_ci collations.
Comment #64
sunIf #39 is bogus, then our database tests may be bogus, too? If that would be the case, then we might be able to make one of the patches here work?
Comment #65
Damien Tournoud CreditAttribution: Damien Tournoud commentedConfirmed #39.
This is documented in http://bugs.mysql.com/bug.php?id=39808
Comment #66
Damien Tournoud CreditAttribution: Damien Tournoud commentedThat's on MySQL 5.0.51a-24+lenny4.
Comment #67
Damien Tournoud CreditAttribution: Damien Tournoud commentedExactly the same result on 5.1.47-MariaDB-mariadb82.
Comment #68
Damien Tournoud CreditAttribution: Damien Tournoud commentedIt seems that some double escaping is going on, because for '%|||%' it's the other way around.
Comment #69
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is fun:
Comment #70
Crell CreditAttribution: Crell commentedYuck yuck yuck. Agreed on downgrading this issue, though, as it's not Drupal's issue but MySQL's.
So in terms of possible workarounds, what is the problem with #37 (give or take a conversion script for site upgrades)? That does seem like the least-bad solution to me at the moment...
Comment #71
JamesDay CreditAttribution: JamesDay commentedIf you get duplicate errors when you try to load data, please correct your data. That's where the problem is.
1. The root cause is a correction in the MySQL utf8_general_ci collation that in 5.0 and earlier versions wrongly failed to treat s and ss as the same character: http://bugs.mysql.com/bug.php?id=27877
2. The import problem happens because the data has duplicate keys. They always were duplicates according to the German language sorting rules. MySQL just failed to detect them as duplicates because the collation was wrong. With the fix it will correctly identify them as duplicates.
3. If you get duplicate key errors when you import data from a 5.0 or earlier version into a 5.1 version, the correct action is to locate and correct the duplicate records in your data. Switching collations in Drupal or the database completely misses the point: the data needs correction, it shouldn't have been possible to enter it originally.
4. After the correction in 5.1 the utf8_general_ci collation works as intended and it's fine for Drupal to continue to use it. The easiest solution for Drupal is probably to continue to use it and help those who are importing data with duplicate values that were saved in earlier MySQL versions.
5. It seems that potential character set or collation issues both within Drupal and the MySQL utf8_unicode_ci collation have been found as a result of investigating this and those probably merit fixes.
We will seek to avoid in the future:
a. Correcting this in the way we did. It would have been less disruptive for upgraders to introduce a new collation with the corrected order under a different name and let people choose to use the incorrect or correct rules. The utf8_german3_ci mentioned here was one option that could have worked better if we'd used it: http://bugs.mysql.com/bug.php?id=43593 . Still messy, but perhaps less so.
b. The upgrade inconvenience that resulted from this change.
Sorry for the inconvenience. We got it wrong in the first place and then used a less good solution than we could have. We've learned from the experience and don't plan to repeat it.
This issue is gradually fading over time, because the number of people who have never had a chance to store data with duplicate values because they have always used 5.1 or later is gradually increasing. It's only an issue for those who've been using 5.0 or earlier and have those duplicates. For this reason it doesn't seem to merit drastic work on the part of Drupal. Encouraging new users of Drupal to use 5.1 or later so they are never exposed to this issue would be good.
This post isn't the official view of Oracle. For that, seek out a company spokesperson.
James Day, MySQL Principal Technical Support Engineer, Oracle UK
Comment #72
Crell CreditAttribution: Crell commentedThanks, James! So to confirm, it sounds like your recommendation is:
- Use utf8_general_ci across all of Drupal.
- Discourage use of utf8_unicode_ci on the grounds that it's still buggy.
- If you have data collision problems when upgrading (as noted in the original post), that's a MySQL problem, not a Drupal problem, so punt.
Does that render this issue a won't fix?
Comment #73
sunThanks for your input, @JamesDay. I entirely disagree with 1. and 3. of your list, and can prove with test data, but that rather belongs into MySQL's issue tracker (not exactly sure into which issue though; seems like bug #27877 led to actual collation change, so I'll try there).
Therefore, this topic keeps on doing what it already started to do, it identified
1) a regression caused by the change applied via http://bugs.mysql.com/bug.php?id=27877 in MySQL 5.1 regarding utf8_general_ci. To be corrected/fixed over there.
2) a bug in MySQL's utf8_unicode_ci collation handling. To be fixed over in http://bugs.mysql.com/bug.php?id=39808, as @Damien noted.
3) a deficiency in Drupal's mysql driver, as it prevents usage of other collations than utf8_general_ci, which can be required, especially when dealing with multilingual data or when upgrading to MySQL 5.1. Drupal needs to support other collations, but enforce the utf8 character set.
To fix 3), I think a patch along the lines of #37 is required -- minus the switch to default to utf8_unicode_ci. The remaining changes in that patch intend to allow to set the default collation for MySQL via $databases in settings.php.
Comment #74
sunBasically, this.
Comment #75
Damien Tournoud CreditAttribution: Damien Tournoud commentedYep, #74 looks like all that we need to do for Drupal right now. Given more choice is always good.
Comment #76
sunTo fix 1), I've added some details to http://bugs.mysql.com/bug.php?id=27877#c317080 - although I suspect it's not the right issue.
Comment #77
Crell CreditAttribution: Crell commented#74 looks good to me, and is as much as we can do on the Drupal end. Oy!
Comment #78
Dries CreditAttribution: Dries commented#74 looks good to me too. Committed to CVS HEAD.
Comment #79
sunOver in http://bugs.mysql.com/bug.php?id=27877#c317175, MySQL started to suggest to not use utf8_general_ci anymore, because the collation won't be changed anymore. As a resolution, usage of utf8_unicode_ci has been suggested.
I'm trying hard to explain the critical problem of the collation changes over there.
However, if that really happens to be the end of the story, then we may have to adjust Drupal's installer to allow to select the collation to put into settings.php (if installing on MySQL).
Regular users won't understand that option, so ideally, we'd simply install in a better collation by default, if there wasn't that other bug with utf8_unicode_ci.
__
Bottom line:
Does all of this really happen by coincidence? Or is it possible that it is someone's plan to change things in a way that make the software's default behavior suddenly incompatible and unusable for a relatively large share of its users?
Comment #80
andypostedit: As I understand mysql totally ignores all http://en.wikipedia.org/wiki/Diacritic
This looks terrible!
Same trobles with russian language for utf8_general_ci
Same troubles with utf8_unicode_ci
the same with
PS: Italian has more troubles http://bugs.mysql.com/bug.php?id=43593#c298890
Comment #81
bjaspan CreditAttribution: bjaspan commentedsubscribe
Comment #82
MustangGB CreditAttribution: MustangGB commentedComment #83
MustangGB CreditAttribution: MustangGB commentedTag update
Comment #84
JamesDay CreditAttribution: JamesDay commentedsun, for Drupal to support different collations is be good because that is one way to handle these differences.
andypost, you have started a nice discussion among my Russian co-workers about IO and IE. :) You may find this bug report where we helped with a problem encountered by someone who had introduced a custom collation to do this of interest: http://bugs.mysql.com/bug.php?id=51976 . It's one of the cases where locals have differing views, as may users of Russian dialects in different countries, and your comment appears to reflect Belarusian use more then Russian, though opinions there differ also.
For the future there are some standards for how to handle non-default locales. Anyone who is curious might want to read more about UCA + CLDR, particularly http://cldr.unicode.org/index/cldr-spec/collation-guidelines . It won't help in the short term but I expect that, at no particular date, with no guarantee, there will be improvement in how this sort of thing is handled. Then people with different opinions, or different regional dialects of a particular language, may more easily get behavior that matches what they expect.
James Day, MySQL Principal Technical Support Engineer, Oracle UK
Comment #85
Pierre Paul Lefebvre CreditAttribution: Pierre Paul Lefebvre commentedSubscribe too
Comment #87
markus_petrux CreditAttribution: markus_petrux commentedI'm not re-opening in the hope that this comment is read by those involved in this discussion, which might be enough to answer the following:
Could #74 be backported to D6?
Comment #88
sunYes, that's doable. If no one beats me to it, I'll try to come up with an equivalent patch.
Comment #89
sunAlright. I think this is pretty complete. Needs testing.
Since at least here in Germany and over there in Russia it's quite common to store values in German or Russian, I've specifically added a note about MySQL 5.1+ using a default that's incompatible with those languages. Most probably however, that note also applies to other languages.
At least the original bug with German SHARP-S remains, but MySQL entirely fails to understand that they introduced a critical regression, and I'm not sure what else I could try to make them understand. Also, contrary to follow-ups on this issue, the follow-ups on the MySQL bug report officially suggest to use utf8_unicode_ci as replacement collation, so expert/vendor advice on whether utf8_unicode_ci should or could be used seems to heavily digress. As of now, it is not safe to install and use Drupal with the default collation of MySQL 5.1+, which means that Drupal's installer should actually ask the user whether the site may store non-English data, and dynamically set utf8_unicode_ci or another collation... while utf8_unicode_ci is not safe to use either, because of the LIKE wildcard bug (issue)... it's a total mess. And as mentioned before, all of this leads to gazillions of mysterious bug reports filed against arbitrary projects on drupal.org. From my perspective, we could as well simply drop support for MySQL 5.1+, because right now, I am unable to properly and safely store and read Drupal content in German.
Let's face it: If we forget about that database's name for a moment, then we simply wouldn't support it, or at least, we'd support <=5.0 only.
Comment #90
markus_petrux CreditAttribution: markus_petrux commentedThanks!
Here's an updated patch to fix the following issues:
1) The mysql driver was not patched (only the mysqli driver was).
2) Expose $db_collation as a global variable for processing of settings.php.
Tested and seems to work well.
Comment #91
klonosIt's been half a year now. Can we get this in?
Comment #93
marcoka CreditAttribution: marcoka commentedThat is somehow extreme important. Today i did some custom querys with a .SELECT.. WHERE name =''Teppichboden". With utf8_general_ci this returns the row with Teppichboden and Teppichböden.
Comment #94
klonos...and that is bad because?
Comment #95
sunThanks, Markus! Let's get this sucker in.
Comment #96
Gábor HojtsyOk, committed, thanks.
Comment #97
bobburns CreditAttribution: bobburns commentedThis post went on and on with no solution. It is usually a database problem caused by a database set up as general_ci instead of unicode. Fantastico is known to cause this among other automated setup programs for Drupal. Since these posts get indexed in search engines - a solution posted is important or it is garbage in garbage out.
Simply convert the database and the tables.
Use a simple php program below
Where username, password and databasename are all your current values
Save it as something like "convert.php" (no quote marks of course) and push it up to the root of your server
Navigate to the server from your browser and run the script - and it will return "CONVERTED" for all tables it converts.
Delete the file when you are done
The mixed mode error should go away.
It fixes errors like this below - the trigger module seems to bring the warning about mixed mode collations
user warning: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' query: SELECT aa.aid, a.type FROM trigger_assignments aa LEFT JOIN actions a ON aa.aid = a.aid WHERE aa.hook = 'nodeapi' AND aa.op = 'presave' ORDER BY weight in /home/saltatv/public_html/MYSITE.com/profesionales/modules/trigger/trigger.module on line 146.
Comment #98
EvanDonovan CreditAttribution: EvanDonovan commented@bobburns: Are you on the latest D6? Also, does the patch fix the issue for people who already have a database with mixed collations - I forget.
Comment #99
chx CreditAttribution: chx commentedThere is nothing to work on.
Comment #101
MustangGB CreditAttribution: MustangGB commentedCan I just confirm the summary of this issue from an outsiders point of view
In the long run we would prefer to use utf8_unicode_ci
However it's implementation is buggy and therefore currently not recommended
For now the use of utf8_general_ci is preferred
Is this correct?
If so, does this apply to MySQL 5.1 only, are the bugs still present in 5.5 and can utf8_unicode_ci be used in conjunction with 5.5 without issues?
Comment #102
asb CreditAttribution: asb commentedProbably related: #998474: Duplicate entries in search_index (MySQL error ERROR 1062)
Comment #103
sun@bobburns + others seeking for a collation conversion script:
#19 contained a schema + data conversion script as a Drupal module update.
This module update was executed on multiple, larger databases already, and ran fairly quick, so a batch doesn't seem to be required.
I might add that conversion functionality as UI feature to Demo module at some point. Proper export/import handling for database/connection/schema/data collations in light of this issue was already changed via #765186: Collation not dumped for tables/columns, triggers nebulous MySQL 5.0 => 5.1 incompatibilities.
Lastly, acquia also has some good news on this issue, but I don't know whether I'm allowed to disclose it.
Comment #104
sunhttps://blueprints.launchpad.net/percona-server/+spec/utf8-general50-ci-5.1
Comment #105.0
benheinzeroth CreditAttribution: benheinzeroth commentedUpdated issue summary.
Comment #106
bburgPosting here as an FYI to future Drupalers with collation problems in their databases (this issue is the most relevant in the top search results).
I've been playing around with this on a database that appeared to be initialized with default collation set to latin1_swedish_ci. This was causing problems with Feeds importer trying to import content with some diacritical marks to fields stored in that collation. e.g.
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE2\x80\xA8' for column 'body_value' at row 1
Converting the tables to utf8 seems to fix the issue, but you can't just blanket convert all tables/columns to use utf8_unicode_ci collation because some columns will need to be store their information with utf8_bin instead. The _ci in these collation names refers to case-insensitive, which also means that they are accent insensitive (at least partly). So "Sandal" == "sandal", and "für" == "fur" in these collations. As they are considered to be equal with a case-insensitive collation, they will not be considered be unique entries in a primary key column and you will get an error when trying to convert.
SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'für' for key 'PRIMARY'
Instances where this popped up as a problem for me were the uri column in the file_managed table and word in search_total. I am currently working to update the script in #19 to work in D7 and consider the binary/case sensitivity status of a table before altering it.
Comment #107
bburgmysql database no longer appears to be an available component, and the form defaulted to the top item.
Comment #108
bburgHere is a the updated version of the script I came up with, which will work with D7, and consider whether the collation was binary or case-insensitive. Un-tested anywhere except my local environment, so use at your own risk.
Comment #109
bburgThe script in #108 will cause updates to the Drupal search index to stall (a cron task) as it doesn't update the word column in the search_index table to also use utf8_bin collation. I'll follow up with an update to the script, but in the meantime, the query to fix this is:
alter table search_index modify column word varchar(50) CHARACTER SET utf8 COLLATE utf8_bin;
Comment #110
SergFromSD CreditAttribution: SergFromSD commentedHey bburg,
Just starting a new project for a site in German and came across this discussion. As you were the last person to respond (about a year ago) wanted to know if this is still an issue and if you end up using utf8_bin for your database collation in D7 instead of utf8_general_ci? Did you have an issue with the drupal tables defaulting to utf8_general_ci as originally reported when this thread was created? This is a project where the content will be imported for tables that are encoded with in latin1_swedish_ci.
Thanks
Comment #111
bburg@sgarcia.sd@gmail.com
I managed to dig up what I used for that script. I'm not sure how much it deviated from #109