Hello, everyone!

Here is my route to upgrading the UTF-8 data stored in MySQL as latin1 to a utf8-based charset and collation.

First, a conversion shell script:

1  DBFROM=dp
2  DBTO=dp
3  LOGIN=dp
4  PASS=XXX
5  mysqldump  --extended-insert=FALSE --default-character-set=latin1  -u $LOGIN -p$PASS $DBFROM >dp.sql
6  cat dp.sql |sed  -e 's/DEFAULT CHARSET=latin1;/DEFAULT CHARSET=utf8 COLLATE utf8_bin;/'>dp2.sql
7  cat dp2.sql |sed  -e 's/SET NAMES latin1/SET NAMES utf8/'>dp3.sql
8  echo " drop database $DBTO; create database $DBTO character set utf8 collate utf8_bin;"|mysql -u $LOGIN -p$PASS
9  mysql -u $LOGIN -p$PASS $DBTO <dp3.sql

Let us go over the lines.

Lines 1-4 configure the connection options.

Line 5 dumps the contents of the database. The --default-character-set=latin1 option disables the recoding of the date to the current locale of the system. Since the MySQL considers your UTF-8 data to be in latin1, this may result in data corruption. We also set --extended-insert=FALSE to easily detect the errors.

Line 6 passes over the dump of the data and changes the declaration of the default charset to UTF-8. It also specifies the collation scheme to be utf8_bin. This is the trickiest point and it took me a couple of hours to figure it out. Your fake latin1 data uses the latin1_swedish_ci collation by default where ci stands for case-insensitive. If you dump your data with the latin1 charset and latin1_swedish_ci collation and try to insert with the UTF-8 charset and its default utf8_general_ci collation, you will likely get unique key contraint errors as MySQL will start to consider the cyrillic А to be equal to а, and the greek Ρ to be equal to ρ just and it earlier considered the latin A to be equal to a.

Line 7 modifies the header of the dump file so that the import always happens in utf-8 mode, notwithstanding the system locale.

Line 8 drops the database and creates it with the needed charset and collation parameters on by default.

Line 9 is the actual import.

You may want to clean search tables and convert their collation back to utf8_general_ci which will allow you to make case-insensitive searches:

delete from search_index;
delete from search_total;
alter table search_index character set utf8 collate utf8_general_ci;
alter table search_total character set utf8 collate utf8_general_ci;

There is also one patch that we should apply to Drupal before your site becomes completely usable:

--- /home/dp/drupal-4.6.5/modules/user.module→2005-11-29 21:14:29.000000000 +0100
+++ dp/modules/user.module→ 2006-12-25 10:48:26.000000000 +0100
@@ -56,7 +56,11 @@
       $query .= "u.uid = %d AND ";
       $params[] = $value;
     }
-    else {
+    else if ($key == 'name') {
+      $query .= "u.name = '%s' AND ";
+      $params[] = $value;
+    }
+   else {
       $query .= "LOWER(u.$key) = '%s' AND ";
       $params[] = strtolower($value);
     }

The patch gets rid of the dangerous LOWER() function on username search. Remember that by specifying the utf8_bin collation on the users table we DO allow usenames that differ only in case.

Comments

mikhailian’s picture

Oh, yep.... to keep all the info in one place, one more patch is needed for Drupal versions < 4.7.

--- /home/dp/drupal-4.6.5/includes/database.mysql.inc→2005-12-07 16:41:17.000000000 +0100
+++ dp/includes/database.mysql.inc→ 2006-12-24 22:49:25.000000000 +0100
@@ -30,7 +30,10 @@
   $connection = mysql_connect($url['host'], $url['user'], $url['pass'], TRUE) or die(mysql_error());
   mysql_select_db(substr($url['path'], 1), $connection) or die('unable to select database');

+  mysql_query('SET NAMES utf8');
   return $connection;
 }

This patch configures the mysql connection to use UTF-8 for reading an writing.

pwolanin’s picture

I think this is overly complicated- it worked for me to mysqldump as latin1 and then import as utf8 (I'll post the link).

---
Work: BioRAFT

mikhailian’s picture

Not for a multilingual website with a ~500Mb database and many thousands users.

dkruglyak’s picture

My problem is documented over here http://drupal.org/node/108052, but I am not sure from this thread which solution to use and why. I only have an english DB. Please help!

mrknowitall’s picture

I had my lead tech test this on our mysqldump and he said it worked fine. I tried doing it myself and found it complicated also. I think it is all in the who and how.

Jason Lee
http://www.anonycryption.com

Steven’s picture

I would not advise using this method. Changing Drupal's usernames to be case sensitive has a bunch of other consequences, and changing the database to use the utf8_bin collation everywhere by default is certainly not right.

--
If you have a problem, please search before posting a question.

mikhailian’s picture

Nice to here from you, Steven

Unfortunately, I had only two options. The one is what I described above and the other was about going through the database and changing manually everything that could not be inserted due to the unique key constraints enforced by the utf8_general_ci collation.

BTW, what is wrong about the utf8_bin? The order of characters in the Unicode standard reflects more or less the order of letters in the national alphabets. Given the number of questionable decisions taken for utf8_general_ci (e.g. bumping together CYRILLIC CAPITAL LETTER IE, CYRILLIC CAPITAL LETTER IO, CYRILLIC SMALL LETTER IE, CYRILLIC SMALL LETTER IO), utf8_bin looks quite appealing even for general use.

As for the changing Drupal's usernames to be case-sensitive, I am well aware of the potential problems and I even introduced a bugreport to attract attention to the issue.

AliBey’s picture

My story is a bit different. My administrator upgraded SQL 4 to 5 and I have lost all Polish chars. I was (and am) not able to get the backup (don't ask why), so what I did was:
changed the encoding for tables to utf8_unicode_ci
downloaded the broken backup of the database
using regex changed in backup file the broken Polish chars to proper value and saved the file to both ANSI and UTF8 encoding.
When I uploaded the file in both encodings back to Drupal database, the fonts remained messy.

Does anyone have any idea what I may do?

Albert

AliBey’s picture

it works, I simply changed wrong table. Sorry for mess:)

RoiDanton’s picture

Thanks for your posting! That script made my day!!

But whats about this here:

delete from search_index;
delete from search_total;
alter table search_index character set utf8 collate utf8_general_ci;
alter table search_total character set utf8 collate utf8_general_ci;

Do I need them? How do I execute them?

And are there patches for drupal 5.x installations??

Acert93’s picture

http://drupal.org/node/187689#comment-638628

I would recommend reading this thread on moving a database from 4.0 to 4.1+ and dealing with the UTF8 issue. Jirka and Wim covered a lot of ground with all the variables and how to best migrate the data.

glass.dimly’s picture

This is simpler than it appears:

mysql -uusername -ppassword --default-character-set=utf8 database < backup.sql

Worked for fixing drupal 4.7 to drupal 5 upgrade that introduced garbage characters.