Since drupal by design, deals with everything in UTF-8, the default installation should probably make at least some of the SQL queries upon every connection to mysql:

  mysql_query('SET NAMES "UTF8"');
  mysql_query("SET collation_connection='utf8_general_ci'");
  mysql_query("SET collation_server='utf8_general_ci'");
  mysql_query("SET character_set_client='utf8'");
  mysql_query("SET character_set_connection='utf8'");
  mysql_query("SET character_set_results='utf8'");
  mysql_query("SET character_set_server='utf8'");

This would allow drupal to be installed on a wider variety of servers by dealing with the case where the database is set up with another character set (say latin1), and not under the drupal administrator's control.

Comments

Wesley Tanaka’s picture

Status: Active » Needs review
StatusFileSize
new838 bytes
Wesley Tanaka’s picture

patch still applies against 4.7.0-test2

dries’s picture

Not sure we want to execute that for each request ...

Wesley Tanaka’s picture

Would a

foreach ($config_variable as $sql) {
  mysql_query($sql);
}

be acceptable to you?

Wesley Tanaka’s picture

... although i don't fully understand your concern. Since they're just altering settings in memory, they get executed very quickly.

Steven’s picture

Status: Needs review » Needs work

AFAIK we have avoided doing this for compatibility with MySQL 3.x (which could probably be dropped now).

However, we also need to see what impact this has on existing databases. Drupal has been sending a UTF-8 bytestream for a long time, but non-UTF-8 MySQLs have been interpreting it as plain Latin-1 (or other) data.

For example, if the database character set was not UTF-8 (e.g. Latin-1), your patch results in corrupted data and breaks collation (on MySQL 4.1). I'm not sure what the situation is on MySQL 4.0 and earlier, but I know that some character set handling changed in 4.1.

Steven’s picture

Status: Needs work » Needs review
StatusFileSize
new24.44 KB

Ok, I did my homework. It turns out it's actually only since MySQL 4.1 that there is UTF-8 support. wtanaka's patch results in errors on MySQL 4.0 and earlier, so it is definitely not acceptable.

Current situation
On MySQL 4.0 and earlier (with no UTF-8 support), we have to keep doing what we do now: store UTF-8 encoded data in Latin-1 tables/columns and live with the fact that MySQL does not interpret non-ASCII correctly (but leaves it otherwise unharmed).

On MySQL 4.1 and up, this approach does not work because some unprintable Latin-1 bytes/characters (used by UTF-8) will be actively removed by MySQL. This has gone unnoticed for a while because it only affects a small set of characters, but it's a bug nevertheless.

DB storage
What wtanaka's patch does is tell MySQL 4.1 that the data we provide is UTF-8 (rather than the default charset, e.g. Latin-1), and that we want UTF-8 back. But, it says nothing about the encoding used to actually store the data, which should be UTF-8 as well.

The problem is that we have no control over the database storage charset from inside Drupal, and neither do most Drupal users on average hosts. Plus, if a user has table prefixing, they might not want to change the database charset for all tables.

What we can do is force UTF-8 per table in database.mysql. Using conditional comments, we can hide them from MySQL 4.0.x and lower too:

CREATE TABLE tablename (
  ...
) TYPE=MyISAM
/*!40100 DEFAULT CHARACTER SET utf8 */ ;

Upgrading
There is one sore point: existing users of MySQL 4.1+, with a non-UTF-8 database charset (e.g. Latin1).

They have Drupal's UTF-8 data sitting in their database, but MySQL thinks it's Latin1. So if we ask MySQL for UTF-8, it will try to convert this 'Latin1' to UTF-8, mangling the data in the process. Vice-versa, if we give MySQL UTF-8 data, it will convert it to Latin1, losing all non-Western characters in the process.

The solution is to go over each table and change the character set for all existing columns. The actual data does not change, only its interpretation for MySQL does (see http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html).

The patch
Attached is a patch which wraps wtanaka's commands in a MySQL4.1+version check. It also contains an updates.inc addition which performs the table conversions for MySQL4.1+ databases which are not in UTF-8. It uses the new update system to be granular at the table level to avoid timeouts.

Right now, it contains a list of all core tables (we cannot convert the entire database, due to database-prefixing issues).

What about contrib
First, all contributed modules will need to add a charset definition to their .mysql files for 4.7.

Second, we need to decide how to deal with converting contributed modules' tables. The lazy way is to say that each contributed module should copy system_update_167() into modulename.install and fill in their own list of tables. The pro-active way is to expand the list in system_update_167() to include the most important / maintained module's tables as well, and check the database (with SHOW TABLES) to see if they are present.

I'm leaning towards the lazy way: contrib modules need to be changed anyway, and we can make an easy template .install file for people to use. It's exactly what the upgrade system was designed for.

For completeness: what about pgsql? It only has per-database character sets, and they cannot be changed unless you drop/recreate the database, so it's really not our concern.

Steven’s picture

StatusFileSize
new24.66 KB

Apparently dangling commas are all the rage these days.

Cvbge’s picture

For completeness: what about pgsql? It only has per-database character sets, and they cannot be changed unless you drop/recreate the database, so it's really not our concern.

4.7 INSTALL.pgsql.txt gives createdb example which creates database with UNICODE encoding.

In theory the database could have any encoding (http://www.postgresql.org/docs/8.1/static/multibyte.html) and it would transparently convert to it iff the client [PHP] encoding is set right - i.e. to UTF8 (because as I understand, Drupal allways sends UTF8 encoded characters to the database).

But there are two problems:

  1. UNICODE have more characters than e.g. LATIN2, and it's not possibile to convert some character from unicode to utf.
    For example I had a database in unicode. I've dumped it, then created new database with LATIN2 encoding and tried to restore the dump. I got one warning:
    WARNING: ignoring unconvertible UTF-8 character 0xc3a8
    The character came from drupal.org aggregator and probably contained some weird character. It should be safe to use not-unicode encoding in server with unicode encoding in client if the client uses only characters in the server encoding range (but this is allmost impossibile in most cases)
  2. PHP by default does not set the client encoding, or more precisely the encoding is automatically set to the database encoding. So if database is set to LATIN2, then the client encoding is also set to LATIN2. This should work theoretically, but I'm not certain. Furthermore, some text functions/operators might not work as expected. So I'd say it's better to manually set client encoding on connection.

So for postgresql I'd suggest:

  • update INSTALL.pgsql.txt to explicitly suggest using UNICODE (UTF8) as database encoding and add an explanation somewhere.
  • modify db_connect() to use pg_set_client_encoding('UNICODE');

Impact for existing users:

  • if someone used UNICODE as database encoding, there will be no difference
  • if someone used not-UNICODE as database encoding, but UNICODE as client encoding, there will be no difference [but this situation is unlikely]
  • if someone used not-UNICODE as database and client encoding, his not standard ASCII characters will be broken when displayed. I believe it is possibile to fix this by either dumping/restoring the database, or changing to drupal source to set different encoding ("incorrect" one ;))

Since drupal 4.6 was quite broken with postgres I suggest we fix this for 4.7.

Steven’s picture

StatusFileSize
new24.71 KB

You're right that it's also an issue on PgSQL, but I don't know if PgSQL actually removes unprintable Latin1/Latin2/... characters like MySQL 4.1+ does. This is really the reason the "Put UTF-8 in Latin1/2"-approach has become a problem for us.

Of course, with PgSQL we have the advantage that we can expect some db knowledge and control from those who use it, so perhaps dropping/restoring the database is not that big of a problem. But we cannot do this automatically from within update.php and we definitely need to tell the admin about this.

I've updated the patch with an encoding check for pgsql which should work according to docs, but which I cannot test myself. And the message it outputs needs to be changed to something useful and/or include a link to instructions on how to convert your database.

Cvbge: can you test and update the patch with instructions?

Cvbge’s picture

Status: Needs review » Needs work

Patch no longer applies.

I think instead of SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database() we can use SHOW server_encoding. We should also test for aliases of 'UNICODE' encoding, that is for 'UTF8' (http://www.postgresql.org/docs/8.1/static/multibyte.html)

Besides this and more documentation in INSTALL.pgsql.txt I think we should also add a warning message to the admin/settings (is that possibile?) if the encoding is wrong, so users installing and not upgrading drupal will get the message too.

I don't know if PgSQL actually removes unprintable Latin1/Latin2/... characters like MySQL 4.1+ does.

It will remove (and from version 8.1 will replace with it's hex value? Or throw error? Docs/human information is not clear to me...) unconvertable characters when client and server encoding differ. But by default the client and server encoding are the same. I've done a test (http://cvbge.org/encoding_tests/) and if I didn't do something wrong, the results indicate that in such case everything will work (at least for common 1-byte encodings).

So I'd put something like this: if (server encoding != UTF8) warning(UTF8 encoding recommended)

Steven’s picture

I think instead of SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database() we can use SHOW server_encoding.

Wouldn't 'SHOW server_encoding' tell you the default server-wide encoding for databases rather than the actual encoding for the current database? We need to test this.

Cvbge’s picture

No, tests showed it's per-database, documentation also says "Shows the server-side character set encoding. At present, this parameter can be shown but not set, because the encoding is determined at database creation time."

greggles’s picture

I marked two other issues as duplicates of this one:

http://drupal.org/node/26990
http://drupal.org/node/39557

Steven’s picture

Status: Needs work » Needs review
StatusFileSize
new25.84 KB

Here is an updated patch. The pgsql check was changed as per cvbge's comments. The check was also added to admin/settings. The updates.inc code was tweaked (thanks Neil).

Steven’s picture

StatusFileSize
new26.64 KB

Because contrib modules need to run this update too for their own tables, I've refactored it so it is reusable. A contrib module will only need a mymodule.install file like this:

function mymodule_update_1() {
  return _system_update_utf8(array('table1', 'table2', 'table3'));
}

I'll add this to the module upgrading guide (and post it on the mailinglist) once committed.

Cvbge’s picture

StatusFileSize
new28.63 KB

This cvs $id thing changing date from YYYY-MM-DD to YYYY/MM/DD is really annoying...

I've tested the patch with postgresql. Made changes described below. Have following questions/concerns:

1. Shouldn't _system_update_utf8() go to update.php ?
2. After clicking save in admin/settings the drupal_set_message() about postgresql wrong encoding gets duplicated, so it looks like this:

    * Your PostgreSQL database is set up with the wrong character encoding (LATIN2). It is possibile it will not work as expected. It is advised to recreate it with UTF-8/Unicode encoding. More information can be found in the PostgreSQL documentation.
    * The configuration options have been saved.
    * Your PostgreSQL database is set up with the wrong character encoding (LATIN2). It is possibile it will not work as expected. It is advised to recreate it with UTF-8/Unicode encoding. More information can be found in the PostgreSQL documentation.

Changes:

I've put the warning in system_update_169() inside t().

I've also lowered the strength of the warning: from
"Your PostgreSQL database is set up with the wrong character encoding ('. $encoding .'). Please recreate it with UTF-8/Unicode encoding"
to
"Your PostgreSQL database is set up with the wrong character encoding (%encoding). It is possibile it will not work as expected. It is advised to recreate it with UTF-8/Unicode encoding."

Also changed url for postgresql docs to something more appropriate.

The drupal_set_message() in admin/settings with severity error makes the form unfolded. I've changed it to 'status'. It's still clearly visible. The same for watchdog() call in updates.inc, change from WATCHDOG_ERROR TO WATCHDOG_WARNING.

As discussed, I've removed the pg_set_client_encoding() from database.pgsql.inc

rkerr’s picture

For what it's worth, we used only the following commands with MySQL 4.1 and that seems to be enough for the connection to work properly. (MySQL server is already configured to be UTF-8 by default).

in database.mysq.inc, db_connect():
mysql_query("SET NAMES utf8;");
mysql_query("SET CHARACTER SET utf8;");

In my.cnf:
[mysqld]
default-character-set=utf8
character-set-server=utf8

[client]
default-character-set=utf8

A useful command for converting MySQL tables is:
ALTER TABLE my_table CONVERT TO CHARACTER SET 'utf8'

Steven’s picture

For the message duplication, we'll have to if ($_POST) the check in system.module I suppose. I don't understand your comment about "severity error makes the form unfolded" though. Are you referring to the collapsible fieldsets? They remain collapsed for me in Safari and Firefox.

Oh and the lack of t() in updates.inc was intentional. We cannot depend on the localization system to work there.

I'll make a new patch soon.

rkerr: Your query is no good, because we are dealing with UTF-8 data stored with the wrong character set attribute. We only want to re-interpret the data, not convert it.

Cvbge’s picture

I don't understand your comment about "severity error makes the form unfolded" though. Are you referring to the collapsible fieldsets? They remain collapsed for me in Safari and Firefox.

Indeed. It's working now. They were not collapsed previously though :/
Still, I'd left that as a warning/status message.

Oh and the lack of t() in updates.inc was intentional. We cannot depend on the localization system to work there.

We can't? Why?

Wesley Tanaka’s picture

probably because localization related tables may have changed schema.

Steven’s picture

Status: Needs review » Fixed

Okay, final patch committed.

I applied a slight modification to the update system, after discussion with drumm. It's essentially a bugfix for distinguishing newly installed modules from modules with no updates. They both used to be indicated with '0', which meant that if you updated a contrib module and re-enabled it, its schema would get bumped up whether you had run update.php or not. That is fixed now: never-installed modules are -1, installed-but-updateless modules are 0.

drewish’s picture

I'm getting a whole pack of errors like:
user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\access DEFAULT CHARACTER SET utf8' at line 1 query: ALTER TABLE \access DEFAULT CHARACTER SET utf8 in /usr/local/www/drupal/includes/database.mysql.inc on line 118.

Not sure if it matters but the machine is running PHP 5.1, though I've already upgraded one machine running 5.0.4...

rkerr’s picture


Steven:
we are dealing with UTF-8 data stored with the wrong character set attribute. We only want to re-interpret the data, not convert it.

I thought the "convert to character set" query only converts table and field character sets in one shot, and doesn't actually modify the data in the table.

Otherwise, you may convert the _table_ itself to utf-8, but you could still have text fields in the table that are the old character set. At least that was my experience a few months ago.

drewish’s picture

I started trying to debug the error on my local machine and ran into the error that Syscrusher describes in this email sent to the dev mailing list:

> Eek! I happened to be building a new sandbox site for myself and checked out
> CVS just before or just after you committed that update. When I try to run the
> update.php script against my previous 4.7 HEAD database, it loops endlessly
> doing.....something. I'm not sure what it's up to, but the behavior is
> repeatable. Maybe I'm the only one to whom this will happen, because I may have
> gotten a weird half-updated Drupal snapshot, but I thought I ought to report it.
> In this case, no big deal -- just a sandbox, and I can nuke it and start over.
> But I wouldn't want this to happen on a production site.

My local machine is running MySQL 4.1.12a-nt and PHP 5.0.5. I wonder if it's related to this message that gordon posted to the dev list:

> Just thought that you would like to know that this doesn't work under
> php 5. there is this error
> 
> Fatal error: Only variables can be passed by reference
> in /var/www/surf/database/updates.inc on line 1496
dries’s picture

I get these now using MySQL 4.1.12-max:

Warning: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation 'like' query: SELECT * FROM access WHERE status = 1 AND type = 'host' AND LOWER('192.168.0.153') LIKE LOWER(mask) in /Users/dries/Sites/drupal-cvs/includes/database.mysql.inc on line 118

Warning: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation 'like' query: SELECT * FROM access WHERE status = 0 AND type = 'host' AND LOWER('192.168.0.153') LIKE LOWER(mask) in /Users/dries/Sites/drupal-cvs/includes/database.mysql.inc on line 118

(Looks like that query is executed twice per page view?)

dries’s picture

Status: Fixed » Active

Nevermind, I didn't ran the update script. Running update.php fixed my problem.

chx’s picture

I guess then it's fixed? (I do not do dare to change an issue status set by Dries so I leave it to him to set this back to fixed.)

Steven’s picture

I fixed the PHP5 issue, but I can't reproduce the collation issue. It's odd, because I did see a similar error during the patch development, on the node edit form (which does LOWER('username') = LOWER(user)). But not for the access table (I made sure there are rows in it, before doing the upgrade).

I'm not sure if there is anything we can do to avoid it though.

Dries: no <pre> tags please.

markus_petrux’s picture

What about contrib modules? Should module authors change that?

Does the INSTALL.mysql.txt file needs a node about this? I mean, the database can be defined with a default charset, etc.

CREATE DATABASE dbname DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Or the UPGRADE.txt file could add a note to perform something like:

ALTER DATABASE dbname DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

...the user running the web server process may not have permission to alter the database.

markus_petrux’s picture

>> What about contrib modules? Should module authors change that?

Wow! I have just seen it, through the last commit made by the votingapi module. And also seen the Converting 4.6 modules to 4.7 document.

Very nice!

I believe there is more documentation for module authors that needs to be updated:

:-?

Steven’s picture

Status: Active » Fixed

The main reason we don't want to change the database encoding ourselves is because it could affect sites where Drupal shares the database through prefixing.

As far as the creation of the database goes, I'm not sure. The method of setting it per table means we don't care what the database encoding is.

I'm marking this issue as fixed. The upgrade warning is not something I think we can hide easily.

drewish’s picture

Sorry to re-open this but I just updated to HEAD and tried re-running the 169 update on the machine mentioned in comment #23 and I'm still getting the same errors:
user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\access DEFAULT CHARACTER SET utf8' at line 1 query: ALTER TABLE \access DEFAULT CHARACTER SET utf8 in /usr/local/www/drupal/includes/database.mysql.inc on line 118.

I looked around and most other ALTER TABLE statments in upgrades.inc don't prefix the table name with a backslash. What's the reason for including the backslashes on lines 1547 and 1549?

  if (count($convert_to_binary)) {
    // Convert text columns to binary
    $ret[] = update_sql("ALTER TABLE \{$table} ". implode(', ', $convert_to_binary));
    // Convert binary columns to UTF-8
    $ret[] = update_sql("ALTER TABLE \{$table} ". implode(', ', $convert_to_utf8));
  }
Steven’s picture

*sigh*
http://www.php.net/manual/en/language.types.string.php

Before PHP 5.1.1, backslash in \{$var} hasn't been printed.

Don't you just love nasty changes in minor revisions?

Committed a fix.

drewish’s picture

Thanks Steven, that fixed it. What do you mean? Major breaks in BC on minor versions is one of the best parts of using PHP :p My friend used to justify litering saying that it made jobs, I guess the PHP core developers are just trying to create billable hours for PHP consultants ;)

ax’s picture

Title: drupal does not work out of the box on mysql set up with non-utf8 char set » Drupal does not work out of the box on mysql set up with non-utf8 char set
Status: Fixed » Active
    still buggy:
  1.     $url = parse_url($GLOBALS['db_url']);
        $db_name = substr($url['path'], 1);
    

    in _system_update_utf8() doesn't take into account the associative array form of

        $db_url = array (
          'default' => 'mysql://username:password@localhost/databasename'),
          'second' => 'mysql://username:password@localhost/databasename'),
        )
    
  2.     list(, $create) = db_fetch_array(db_query('SHOW CREATE DATABASE `%s`', $db_name));
        if (preg_match('/utf8/i', $create)) {
    

    in _system_update_utf8() won't work, as db_fetch_array() returns an
    associative array, but list() only works on numerical arrays and assumes
    the numerical indices start at 0
    .

    strange commit #24358: Make utf-8 update PHP5 compatible ...

  3. buggy and redundant:

  4.     mysql_query('SET NAMES "utf8"', $connection);
        mysql_query('SET collation_connection="utf8_general_ci"', $connection);
        mysql_query('SET collation_server="utf8_general_ci"', $connection);
        mysql_query('SET character_set_client="utf8"', $connection);
        mysql_query('SET character_set_connection="utf8"', $connection);
        mysql_query('SET character_set_results="utf8"', $connection);
        mysql_query('SET character_set_server="utf8"', $connection);
    

    as a SET NAMES 'x' statement is equivalent to these three statements:

    mysql> SET character_set_client = x;
    mysql> SET character_set_results = x;
    mysql> SET character_set_connection = x;

    Setting character_set_connection to x also sets collation_connection to the default collation for x.
    and don't set character_set_server|collation_server - what about other dbs?

    and don't set character_set_server|collation_server - what about other dbs?

    so the following should be sufficient:

        mysql_query('SET NAMES "utf8"', $connection);
    
  5. and questionable (it breaks some things for my customized Drupal):

  6.     $_SESSION['update_utf8'] = $tables;
        $_SESSION['update_utf8_total'] = count($tables);
    

    i see more and more session variables and globals. this is usually an indication for poor application design. oop could help.

#4 shouldn't be discussed here

ax’s picture

Status: Active » Needs work
Steven’s picture

StatusFileSize
new2.69 KB

Attached a patch which addresses 1, 2 and 3.

About #4: the whole upgrade system is based around sessions so we can do updates that take longer than the standard 30 second PHP time limit.

Steven’s picture

Status: Needs work » Needs review
Wesley Tanaka’s picture

Status: Needs review » Needs work

I'm pretty sure that the original list of statements is not always equivalent to SET NAMES 'utf8' on its own. I had been using SET NAMES 'utf8' on its own, and ran into some kind of collation-related error (not with drupal) a while back, and adding the latter 6 lines (without modifying the database) caused the error to go away.

I didn't investigate at all so unfortunately, I don't know what the minimal set is. But I do think that at least one of the latter 6 variable settings has a non-0 effect.

From the mysql documentation, it would seem that the server character set and collation are not affected by SET NAMES.

Those settings do have effects, and I imagine it would cause problems somewhere if they weren't set to UTF8.

There's a hint that literal strings take the server character set, which I think would cause errors in statements like db_query("SELECT * FROM {table} WHERE a < '%s'", $string); if the server collation weren't set to a UTF-8 compatible collation.

So I imagine that

  mysql_query("SET collation_server='utf8_general_ci'");
  mysql_query("SET character_set_server='utf8'");

should probably be left in.

Steven’s picture

Status: Needs work » Fixed

wtanaka: hmm I can't really see how you would conclude that. The part about the literal string being in Latin1 on that MySQL page is just a simplification to keep the explanation simple, I think. The snippet explains collations and how they apply to MySQL built-in vars, nothing more.

The text on http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html is pretty specific about which variables are used when.

Of course it's possible there is some inconsistency or bug in some mysql setups, but it's something we would need to deal with specifically, when it can be reproduced.

Committed the fixes above.

Wesley Tanaka’s picture

I thought that "The literal string '@' has the server character set (latin1 by default)" was pretty clear in specifying that literals take the server character set.

the (currently single) comment on http://dev.mysql.com/doc/refman/4.1/en/charset-collate-tricky.html:

Posted by Thierry Danard on November 5 2004 9:34pm [Delete] [Edit]

Obvious collating is not always so obvious to the database engine (version 4.1).

A query with no collating instructions such as select concat(mycolumn, '%') from mytable does not work unless "mycolumn" is of the same character set as '%'.

In my case, the entire database uses UTF-8, and by default, '%' is assumed to be latin1, causing an error to be triggered.

appears to be talking about the error that I had seen before with only SET NAMES 'UTF8' being called on each connection.

Wesley Tanaka’s picture

I looked at the link you posted,
http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html

but it doesn't actually seem to say anything one way or the other about when the server character set is used.

That leads me to the conclusion that said page is incomplete, since the two links I posted in comment #40 mention uses of the server character set.

sja1’s picture

Title: Drupal does not work out of the box on mysql set up with non-utf8 char set » Careful with translations - locale_source & locale_target may need special treatment

Just a quick comment, as I noticed further up discussion about automatically updating Mysql databases from Latin1 to UTF8. I recently updated (manually) a Mysql 4.1 database for a Spanish language site from Latin1 to UTF8, and discovered that in addition to adding SET NAMES utf8 to database.mysql.inc, I had to delete the contents of the locales_source and locales_target tables, and then reinstall the Spanish .po file. Until I did that all accented characters in translated strings would show up garbled.

I believe the translations are stored in Blob format, and so are left unchanged (presumably in Latin1) when the rest of the text columns in the database get converted to UTF8. If there is no way to automate conversion of the blobs, a note should be output to users alerting them to the need to:

1. Export their translation (in case they have made custom changes to the translation they want to save)
2. Uninstall the translation prior to updating the database (otherwise they may not be able to get back into drupal. Translated strings will contain garbled characters tend to break the html, and can make it impossible to log in and/or administrate the system using Drupal's admin interface)
3. Reinstall their .po file only after the database has been updated.

I found I didn't have to do any conversion to the .po file before reinstalling it. I simply reinstalled the same .po file after converting the database, and it worked.

sja1’s picture

Title: Careful with translations - locale_source & locale_target may need special treatment » Drupal does not work out of the box on mysql set up with non-utf8 char set

Oops, didn't realize I had changed the title of the whole thread. This is just to set it back...

Anonymous’s picture

Status: Fixed » Closed (fixed)