Since drupal by design, deals with everything in UTF-8, the default installation should probably make at least some of the SQL queries upon every connection to mysql:
mysql_query('SET NAMES "UTF8"');
mysql_query("SET collation_connection='utf8_general_ci'");
mysql_query("SET collation_server='utf8_general_ci'");
mysql_query("SET character_set_client='utf8'");
mysql_query("SET character_set_connection='utf8'");
mysql_query("SET character_set_results='utf8'");
mysql_query("SET character_set_server='utf8'");
This would allow drupal to be installed on a wider variety of servers by dealing with the case where the database is set up with another character set (say latin1), and not under the drupal administrator's control.
Comments
Comment #1
Wesley Tanaka commentedComment #2
Wesley Tanaka commentedpatch still applies against 4.7.0-test2
Comment #3
dries commentedNot sure we want to execute that for each request ...
Comment #4
Wesley Tanaka commentedWould a
be acceptable to you?
Comment #5
Wesley Tanaka commented... although i don't fully understand your concern. Since they're just altering settings in memory, they get executed very quickly.
Comment #6
Steven commentedAFAIK we have avoided doing this for compatibility with MySQL 3.x (which could probably be dropped now).
However, we also need to see what impact this has on existing databases. Drupal has been sending a UTF-8 bytestream for a long time, but non-UTF-8 MySQLs have been interpreting it as plain Latin-1 (or other) data.
For example, if the database character set was not UTF-8 (e.g. Latin-1), your patch results in corrupted data and breaks collation (on MySQL 4.1). I'm not sure what the situation is on MySQL 4.0 and earlier, but I know that some character set handling changed in 4.1.
Comment #7
Steven commentedOk, I did my homework. It turns out it's actually only since MySQL 4.1 that there is UTF-8 support. wtanaka's patch results in errors on MySQL 4.0 and earlier, so it is definitely not acceptable.
Current situation
On MySQL 4.0 and earlier (with no UTF-8 support), we have to keep doing what we do now: store UTF-8 encoded data in Latin-1 tables/columns and live with the fact that MySQL does not interpret non-ASCII correctly (but leaves it otherwise unharmed).
On MySQL 4.1 and up, this approach does not work because some unprintable Latin-1 bytes/characters (used by UTF-8) will be actively removed by MySQL. This has gone unnoticed for a while because it only affects a small set of characters, but it's a bug nevertheless.
DB storage
What wtanaka's patch does is tell MySQL 4.1 that the data we provide is UTF-8 (rather than the default charset, e.g. Latin-1), and that we want UTF-8 back. But, it says nothing about the encoding used to actually store the data, which should be UTF-8 as well.
The problem is that we have no control over the database storage charset from inside Drupal, and neither do most Drupal users on average hosts. Plus, if a user has table prefixing, they might not want to change the database charset for all tables.
What we can do is force UTF-8 per table in database.mysql. Using conditional comments, we can hide them from MySQL 4.0.x and lower too:
Upgrading
There is one sore point: existing users of MySQL 4.1+, with a non-UTF-8 database charset (e.g. Latin1).
They have Drupal's UTF-8 data sitting in their database, but MySQL thinks it's Latin1. So if we ask MySQL for UTF-8, it will try to convert this 'Latin1' to UTF-8, mangling the data in the process. Vice-versa, if we give MySQL UTF-8 data, it will convert it to Latin1, losing all non-Western characters in the process.
The solution is to go over each table and change the character set for all existing columns. The actual data does not change, only its interpretation for MySQL does (see http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html).
The patch
Attached is a patch which wraps wtanaka's commands in a MySQL4.1+version check. It also contains an updates.inc addition which performs the table conversions for MySQL4.1+ databases which are not in UTF-8. It uses the new update system to be granular at the table level to avoid timeouts.
Right now, it contains a list of all core tables (we cannot convert the entire database, due to database-prefixing issues).
What about contrib
First, all contributed modules will need to add a charset definition to their .mysql files for 4.7.
Second, we need to decide how to deal with converting contributed modules' tables. The lazy way is to say that each contributed module should copy system_update_167() into modulename.install and fill in their own list of tables. The pro-active way is to expand the list in system_update_167() to include the most important / maintained module's tables as well, and check the database (with SHOW TABLES) to see if they are present.
I'm leaning towards the lazy way: contrib modules need to be changed anyway, and we can make an easy template .install file for people to use. It's exactly what the upgrade system was designed for.
For completeness: what about pgsql? It only has per-database character sets, and they cannot be changed unless you drop/recreate the database, so it's really not our concern.
Comment #8
Steven commentedApparently dangling commas are all the rage these days.
Comment #9
Cvbge commented4.7 INSTALL.pgsql.txt gives
createdbexample which creates database with UNICODE encoding.In theory the database could have any encoding (http://www.postgresql.org/docs/8.1/static/multibyte.html) and it would transparently convert to it iff the client [PHP] encoding is set right - i.e. to UTF8 (because as I understand, Drupal allways sends UTF8 encoded characters to the database).
But there are two problems:
For example I had a database in unicode. I've dumped it, then created new database with LATIN2 encoding and tried to restore the dump. I got one warning:
WARNING: ignoring unconvertible UTF-8 character 0xc3a8The character came from drupal.org aggregator and probably contained some weird character. It should be safe to use not-unicode encoding in server with unicode encoding in client if the client uses only characters in the server encoding range (but this is allmost impossibile in most cases)
So for postgresql I'd suggest:
Impact for existing users:
Since drupal 4.6 was quite broken with postgres I suggest we fix this for 4.7.
Comment #10
Steven commentedYou're right that it's also an issue on PgSQL, but I don't know if PgSQL actually removes unprintable Latin1/Latin2/... characters like MySQL 4.1+ does. This is really the reason the "Put UTF-8 in Latin1/2"-approach has become a problem for us.
Of course, with PgSQL we have the advantage that we can expect some db knowledge and control from those who use it, so perhaps dropping/restoring the database is not that big of a problem. But we cannot do this automatically from within update.php and we definitely need to tell the admin about this.
I've updated the patch with an encoding check for pgsql which should work according to docs, but which I cannot test myself. And the message it outputs needs to be changed to something useful and/or include a link to instructions on how to convert your database.
Cvbge: can you test and update the patch with instructions?
Comment #11
Cvbge commentedPatch no longer applies.
I think instead of
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database()we can useSHOW server_encoding. We should also test for aliases of 'UNICODE' encoding, that is for 'UTF8' (http://www.postgresql.org/docs/8.1/static/multibyte.html)Besides this and more documentation in INSTALL.pgsql.txt I think we should also add a warning message to the admin/settings (is that possibile?) if the encoding is wrong, so users installing and not upgrading drupal will get the message too.
It will remove (and from version 8.1 will replace with it's hex value? Or throw error? Docs/human information is not clear to me...) unconvertable characters when client and server encoding differ. But by default the client and server encoding are the same. I've done a test (http://cvbge.org/encoding_tests/) and if I didn't do something wrong, the results indicate that in such case everything will work (at least for common 1-byte encodings).
So I'd put something like this: if (server encoding != UTF8) warning(UTF8 encoding recommended)
Comment #12
Steven commentedWouldn't 'SHOW server_encoding' tell you the default server-wide encoding for databases rather than the actual encoding for the current database? We need to test this.
Comment #13
Cvbge commentedNo, tests showed it's per-database, documentation also says "Shows the server-side character set encoding. At present, this parameter can be shown but not set, because the encoding is determined at database creation time."
Comment #14
gregglesI marked two other issues as duplicates of this one:
http://drupal.org/node/26990
http://drupal.org/node/39557
Comment #15
Steven commentedHere is an updated patch. The pgsql check was changed as per cvbge's comments. The check was also added to admin/settings. The updates.inc code was tweaked (thanks Neil).
Comment #16
Steven commentedBecause contrib modules need to run this update too for their own tables, I've refactored it so it is reusable. A contrib module will only need a
mymodule.installfile like this:I'll add this to the module upgrading guide (and post it on the mailinglist) once committed.
Comment #17
Cvbge commentedThis cvs $id thing changing date from YYYY-MM-DD to YYYY/MM/DD is really annoying...
I've tested the patch with postgresql. Made changes described below. Have following questions/concerns:
1. Shouldn't _system_update_utf8() go to update.php ?
2. After clicking save in admin/settings the drupal_set_message() about postgresql wrong encoding gets duplicated, so it looks like this:
Changes:
I've put the warning in system_update_169() inside t().
I've also lowered the strength of the warning: from
"Your PostgreSQL database is set up with the wrong character encoding ('. $encoding .'). Please recreate it with UTF-8/Unicode encoding"
to
"Your PostgreSQL database is set up with the wrong character encoding (%encoding). It is possibile it will not work as expected. It is advised to recreate it with UTF-8/Unicode encoding."
Also changed url for postgresql docs to something more appropriate.
The drupal_set_message() in admin/settings with severity error makes the form unfolded. I've changed it to 'status'. It's still clearly visible. The same for watchdog() call in updates.inc, change from WATCHDOG_ERROR TO WATCHDOG_WARNING.
As discussed, I've removed the pg_set_client_encoding() from database.pgsql.inc
Comment #18
rkerr commentedFor what it's worth, we used only the following commands with MySQL 4.1 and that seems to be enough for the connection to work properly. (MySQL server is already configured to be UTF-8 by default).
in database.mysq.inc, db_connect():
mysql_query("SET NAMES utf8;");
mysql_query("SET CHARACTER SET utf8;");
In my.cnf:
[mysqld]
default-character-set=utf8
character-set-server=utf8
[client]
default-character-set=utf8
A useful command for converting MySQL tables is:
ALTER TABLE my_table CONVERT TO CHARACTER SET 'utf8'
Comment #19
Steven commentedFor the message duplication, we'll have to if ($_POST) the check in system.module I suppose. I don't understand your comment about "severity error makes the form unfolded" though. Are you referring to the collapsible fieldsets? They remain collapsed for me in Safari and Firefox.
Oh and the lack of t() in updates.inc was intentional. We cannot depend on the localization system to work there.
I'll make a new patch soon.
rkerr: Your query is no good, because we are dealing with UTF-8 data stored with the wrong character set attribute. We only want to re-interpret the data, not convert it.
Comment #20
Cvbge commentedIndeed. It's working now. They were not collapsed previously though :/
Still, I'd left that as a warning/status message.
We can't? Why?
Comment #21
Wesley Tanaka commentedprobably because localization related tables may have changed schema.
Comment #22
Steven commentedOkay, final patch committed.
I applied a slight modification to the update system, after discussion with drumm. It's essentially a bugfix for distinguishing newly installed modules from modules with no updates. They both used to be indicated with '0', which meant that if you updated a contrib module and re-enabled it, its schema would get bumped up whether you had run update.php or not. That is fixed now: never-installed modules are -1, installed-but-updateless modules are 0.
Comment #23
drewish commentedI'm getting a whole pack of errors like:
user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\access DEFAULT CHARACTER SET utf8' at line 1 query: ALTER TABLE \access DEFAULT CHARACTER SET utf8 in /usr/local/www/drupal/includes/database.mysql.inc on line 118.Not sure if it matters but the machine is running PHP 5.1, though I've already upgraded one machine running 5.0.4...
Comment #24
rkerr commentedSteven:
we are dealing with UTF-8 data stored with the wrong character set attribute. We only want to re-interpret the data, not convert it.
I thought the "convert to character set" query only converts table and field character sets in one shot, and doesn't actually modify the data in the table.
Otherwise, you may convert the _table_ itself to utf-8, but you could still have text fields in the table that are the old character set. At least that was my experience a few months ago.
Comment #25
drewish commentedI started trying to debug the error on my local machine and ran into the error that Syscrusher describes in this email sent to the dev mailing list:
My local machine is running MySQL 4.1.12a-nt and PHP 5.0.5. I wonder if it's related to this message that gordon posted to the dev list:
Comment #26
dries commentedI get these now using MySQL 4.1.12-max:
(Looks like that query is executed twice per page view?)
Comment #27
dries commentedNevermind, I didn't ran the update script. Running update.php fixed my problem.
Comment #28
chx commentedI guess then it's fixed? (I do not do dare to change an issue status set by Dries so I leave it to him to set this back to fixed.)
Comment #29
Steven commentedI fixed the PHP5 issue, but I can't reproduce the collation issue. It's odd, because I did see a similar error during the patch development, on the node edit form (which does LOWER('username') = LOWER(user)). But not for the access table (I made sure there are rows in it, before doing the upgrade).
I'm not sure if there is anything we can do to avoid it though.
Dries: no
<pre>tags please.Comment #30
markus_petrux commentedWhat about contrib modules? Should module authors change that?
Does the INSTALL.mysql.txt file needs a node about this? I mean, the database can be defined with a default charset, etc.
Or the UPGRADE.txt file could add a note to perform something like:
...the user running the web server process may not have permission to alter the database.
Comment #31
markus_petrux commented>> What about contrib modules? Should module authors change that?
Wow! I have just seen it, through the last commit made by the votingapi module. And also seen the Converting 4.6 modules to 4.7 document.
Very nice!
I believe there is more documentation for module authors that needs to be updated:
:-?
Comment #32
Steven commentedThe main reason we don't want to change the database encoding ourselves is because it could affect sites where Drupal shares the database through prefixing.
As far as the creation of the database goes, I'm not sure. The method of setting it per table means we don't care what the database encoding is.
I'm marking this issue as fixed. The upgrade warning is not something I think we can hide easily.
Comment #33
drewish commentedSorry to re-open this but I just updated to HEAD and tried re-running the 169 update on the machine mentioned in comment #23 and I'm still getting the same errors:
user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\access DEFAULT CHARACTER SET utf8' at line 1 query: ALTER TABLE \access DEFAULT CHARACTER SET utf8 in /usr/local/www/drupal/includes/database.mysql.inc on line 118.I looked around and most other ALTER TABLE statments in upgrades.inc don't prefix the table name with a backslash. What's the reason for including the backslashes on lines 1547 and 1549?
Comment #34
Steven commented*sigh*
http://www.php.net/manual/en/language.types.string.php
Don't you just love nasty changes in minor revisions?
Committed a fix.
Comment #35
drewish commentedThanks Steven, that fixed it. What do you mean? Major breaks in BC on minor versions is one of the best parts of using PHP :p My friend used to justify litering saying that it made jobs, I guess the PHP core developers are just trying to create billable hours for PHP consultants ;)
Comment #36
ax commentedstill buggy:-
-
-
-
in _system_update_utf8() doesn't take into account the associative array form of
in _system_update_utf8() won't work, as db_fetch_array() returns an
associative array, but list() only works on numerical arrays and assumes
the numerical indices start at 0.
strange commit #24358: Make utf-8 update PHP5 compatible ...
buggy and redundant:
as a SET NAMES 'x' statement is equivalent to these three statements:
and don't set character_set_server|collation_server - what about other dbs?
so the following should be sufficient:
and questionable (it breaks some things for my customized Drupal):
i see more and more session variables and globals. this is usually an indication for poor application design. oop could help.
#4 shouldn't be discussed here
Comment #37
ax commentedComment #38
Steven commentedAttached a patch which addresses 1, 2 and 3.
About #4: the whole upgrade system is based around sessions so we can do updates that take longer than the standard 30 second PHP time limit.
Comment #39
Steven commentedComment #40
Wesley Tanaka commentedI'm pretty sure that the original list of statements is not always equivalent to SET NAMES 'utf8' on its own. I had been using SET NAMES 'utf8' on its own, and ran into some kind of collation-related error (not with drupal) a while back, and adding the latter 6 lines (without modifying the database) caused the error to go away.
I didn't investigate at all so unfortunately, I don't know what the minimal set is. But I do think that at least one of the latter 6 variable settings has a non-0 effect.
From the mysql documentation, it would seem that the server character set and collation are not affected by SET NAMES.
Those settings do have effects, and I imagine it would cause problems somewhere if they weren't set to UTF8.
There's a hint that literal strings take the server character set, which I think would cause errors in statements like db_query("SELECT * FROM {table} WHERE a < '%s'", $string); if the server collation weren't set to a UTF-8 compatible collation.
So I imagine that
should probably be left in.
Comment #41
Steven commentedwtanaka: hmm I can't really see how you would conclude that. The part about the literal string being in Latin1 on that MySQL page is just a simplification to keep the explanation simple, I think. The snippet explains collations and how they apply to MySQL built-in vars, nothing more.
The text on http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html is pretty specific about which variables are used when.
Of course it's possible there is some inconsistency or bug in some mysql setups, but it's something we would need to deal with specifically, when it can be reproduced.
Committed the fixes above.
Comment #42
Wesley Tanaka commentedI thought that "The literal string '@' has the server character set (latin1 by default)" was pretty clear in specifying that literals take the server character set.
the (currently single) comment on http://dev.mysql.com/doc/refman/4.1/en/charset-collate-tricky.html:
appears to be talking about the error that I had seen before with only SET NAMES 'UTF8' being called on each connection.
Comment #43
Wesley Tanaka commentedI looked at the link you posted,
http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
but it doesn't actually seem to say anything one way or the other about when the server character set is used.
That leads me to the conclusion that said page is incomplete, since the two links I posted in comment #40 mention uses of the server character set.
Comment #44
sja1 commentedJust a quick comment, as I noticed further up discussion about automatically updating Mysql databases from Latin1 to UTF8. I recently updated (manually) a Mysql 4.1 database for a Spanish language site from Latin1 to UTF8, and discovered that in addition to adding SET NAMES utf8 to database.mysql.inc, I had to delete the contents of the locales_source and locales_target tables, and then reinstall the Spanish .po file. Until I did that all accented characters in translated strings would show up garbled.
I believe the translations are stored in Blob format, and so are left unchanged (presumably in Latin1) when the rest of the text columns in the database get converted to UTF8. If there is no way to automate conversion of the blobs, a note should be output to users alerting them to the need to:
1. Export their translation (in case they have made custom changes to the translation they want to save)
2. Uninstall the translation prior to updating the database (otherwise they may not be able to get back into drupal. Translated strings will contain garbled characters tend to break the html, and can make it impossible to log in and/or administrate the system using Drupal's admin interface)
3. Reinstall their .po file only after the database has been updated.
I found I didn't have to do any conversion to the .po file before reinstalling it. I simply reinstalled the same .po file after converting the database, and it worked.
Comment #45
sja1 commentedOops, didn't realize I had changed the title of the whole thread. This is just to set it back...
Comment #46
(not verified) commented