I recently migrated to a new server and have run into a "garbled text" issue. At first I didn't know what the issue was, but after a couple differnt threads it was narrowed down to the fact my old host was running MySQL 4.0 (Latin1 encoding) and my new one is running MySQL 4.1 (UTF8 text encoding).

I have been searching for an official resolution to this issue as I have read a lot of contradictory (and "hit-and-miss" advice). Specifically:

(1) What is the correct method to move from a MySQL 4.0 server to a MySQL 4.1 server?

(2) If you made the mistake of moving your database from MySQL 4.0 to 4.1, and users added content, is there a process to preserve this content while correctly updating the database?

Like I said, there is a lot of support requests about this issue (Drupal Search, Google Search of the Drupal Site) although I haven't read an official suggestion from Drupal themselves. Some of the suggestions I have read:

- Do a MySQL dump, dropping the table for Latin1 and restore the database and change the character set to UTF8 (also here)

- Drupal has an [unsupported?] upgrade module already in Drupal (Does it work with 4.7? 5?)

- Run inconv

- Manually replace latin1 with utf8

- Use the ALTER DATABASE command to change the default character set and then alter each table with ALTER TABLE for every table you find with an error message.

- Alter the drupal update.php file by inserting some code provided by Heine that appears to update each table

- A user noted possible permission issues with using a 4.0 database on a 4.1 server and pointed to an article on MySQL.com for advice on how to resolve permission issues, but I haven't seen any other discussion of this or how to properly impliment

In many of the threads you will see "this didn't work for me" and "you need to do this, too" type of comments. It would be great if Drupal developers who are proficient in MySQL could give definitive instructions on how to resolve this issue. Support requests are quite frequent in this regards so it would cut down on forum clutter AND give concrete, Drupal approved directions to resolving the issue. To recap:

(1) What is the correct method to move from a MySQL 4.0 server to a MySQL 4.1 server?

(2) If you made the mistake of moving your database from MySQL 4.0 to 4.1, and users added content, is there a process to preserve this content while correctly updating the database?

Thanks!

Comments

Project:Documentation» Drupal core
Version:» 5.3
Component:Developer Guide» mysql database

While this is a plea for documentation, it is,m until answered, a support request. Moving queues.

Sorry about putting this in the incorrect area sepeck. If someone knows the correct way to deal with this issue I would be happy to write a small tutorial for the documentation.

Anyhow, thanks.

I have been toying around with a few of the suggestions. I couldn't get iconv to work (and I am not 100% sure if that works anyhow), but I was able to remove the garbled text when dumped from MySQL 4.0.

Step #1: Do a mysqldump on the 4.0 server as follows:

mysqldump -uusername -ppassword --add-drop-table --default-character-set=latin1 database > backup41.sql

Step #2: Populate the empty database on the 4.1 server as follows:

mysql -uusername -ppassword --default-character-set=utf8 database < backup41.sql

What did this do? This removed the odd  and other random garbles.

What didn't this do? Looking at the MySQL database, tables still have under "Collation" have "latin1_swedish_ci"

Is this a problem? If so, how do I resolve this?

Also, worth noting, this does NOT work if you migrated your database to the 4.1 server. It MUST be dumped from the 4.0 server and then populated on the 4.1 server. Dumping from the 4.1 and repopulating didn't remove garbled text (ugh! 2 weeks of lost updates).

Thanks for any advice anyone can offer.

A small followup:

I setup a new test account using the Drupal MySQL 4.1 database, and when browsing the tables in PHPMyAdmin the charset is listed as:

COLLATION (fresh Drupal install with 4.1 MySQL DB on a server with MySQL 4.1)
utf8_general_ci

As noted above, using the mysqldump command with "--add-drop-table --default-character-set=latin1" the tables still show under Collation latin1_swedish_ci:

COLLATION (DB on MySQL 4.0 drumped and repopulated on a 4.1 server using the method mentioned previously)
latin1_swedish_ci

So where am I at with this issue? I am not quite sure what the next appropriate step to correctly migrate the database.

The garbled text is gone, but the database still indicates latin1 as the charset. Is this acceptible? Will this cause problems when upgrading to Drupal 5 and beyond?

+1 to icluding some more visible info somewhere in the handbook, rather than just repetitive support requests in the queue.

I ran into this while doing my test clones from production site to localhost: The production webhosting runs MySQL 4.0.x, while my local install is on 4.1.x.

- I agree this is confusing, the site was totally down after the move and I didn't know why.

- In some cases, I can't use MySQL commands directly, no access to the webhosting command line (terminal). I believe this is the case for several sites on shared webhosts.

- Currently I'm moving the site repeatedly each time I update the test install, so I got a bit of experience for my particular use case (which is moving to another server):

* I've the MySQL Dumper utility installed on the hosting (using .htaccess / .htpasswd authentication!), because my database is too big to be dumped by PhpMyAdmin (time limit on php execution). MySQL dumper have batch processing (multi-request) and so works for me.

* I export (backup) the database using this utility.

* On the MySQL 4.1 install, I tweaked the MySQL dumper a bit: In the file mysqldumper/inc/mysql.php, at the end of function MSD_mysql_connect() (just above the final 'return;'), I added the line: $res=MSD_query('SET NAMES utf8');. This makes sure the incoming data in later step are treated as UTF-8.

* I create the database on the new MySQL 4.1 server (using normal PhpMyAdmin, for it's better features), with the collation "utf8-general". THIS IS IMPORTANT!!! Initially, i logically chosen my native "utf8-czech" and it seemed to work for a while, but later problems came: All new tables (newly enabled modules for example) are created as 'utf8-general', and joined queries then fail with "Illegal mix of collations" error, bringing the site down again! (I had to manually change all tables and columns to general, so better use 'utf8-general' from the start.)

* I import the backup file (restore) to the new server, via the tweaked MySQL Dumper. (That's important: If the 'SET NAMES utf8' query was not involved, the database will get broken - single utf-8 bytes re-encoded as utf-8 again - and I can't see how to correct that later.)

* Moved site works :-)

I'm sure there are also another ways to accomplish it, but this works smoothly for me.

Component:mysql database» documentation
Category:support» task

Oh, and this IS about documentation. Neither of us here is asking how to do it. We're asking to have it documented in handbook section, to get rid of repetitive support requests.

Thanks for the feedback Jirka. I was concerned that the Collation tables still had latin1_swedish_ci, and your comments about new content being added as utf8_general confirmed that doing the mysqldump and dropping the latin1 charset was only a partial solution.

I am a novice at the commandline, so if someone could point out the last few steps to resolve this I will make sure to submit some Documentation for Drupal ASAP. This issues has appeared a lot on the forums and on other sites (as I showed in my first links). Some concrete instructions from Drupal would cut down on support requests and resolve the confusion I have found in many threads trying to address this issue. With more hosts moving to UTF8 based MySQL versions I see this issue becoming more common, so increased visibility of the "suggestion resolution path" would be a great service to the community :)

Until someone writes it, it is a support request. Once the proper way is determined, then it can be turned into documentation.... by anyone with a registered drupal.org account.

Status:Active» Postponed (maintainer needs more info)

I think we're trying to collect the experience together here, so then we may build some text on top of that, and call for someone able to include it somewhere. (That's a "task" as I understand it ;) A more experienced MySQL expert is probably needed here, too...

I hope to find time, to go through all the links and related pages, and bring here some initial list of points, but I can't promise given my limited time.

Also more experience is welcome, so setting to "needs more info" which seems fitting.

After browsing the forums even more, hoping I overlooked the solution, I found a number of user requests from over a year ago (!) that had the same issue and no response/resolution was ever offered. As this as a long standing issue and needs to be documented, I went ahead and created a small bounty for anyone willing to offer their assistance in writing up some documentation for migrating an existing Drupal site from a server running MySQL 4.0 to MySQL 4.1:

http://drupal.org/node/191823

I know it isn't a lot, but it should be easy money for an experienced Drupal/MySQL developer--and it would be a great assistance to many users and cut down on the support questions.

Status:Postponed (maintainer needs more info)» Active

OK, I see there's not much movement here... Let my try with some initial text for further improvements...

The text below definitely needs some review and feedback, as it's mostly just a synthesis of available resources. I tested very little of this myself. Improvements welcome, and also some native speaker's correction is probably needed. (BTW, I'm NOT applying for the mentioned bounty - this is no expert's reliable text, and also the arrangements would be more trouble than it's worth, if applicable)

------------------------------------------------------

MYSQL 4.0 -> 4.1 MIGRATION

The MySQL version 4.1 introduced some changes to character set handling, which are often giving us trouble if moving a Drupal site from one to the other.

Basically, Drupal always uses utf-8 encoding, while MySQL 4.0 and lower doesn't support utf-8. It doesn't really matter: Drupal still sends utf-8 to the database, and receive it back fine, only the database engine thinks it's Latin1. No problem with that.

4.1 supports utf-8, so Drupal is happy with that even more. But while upgrading the database server, or moving the site to 4.1, we need to change the character set from latin1 to utf-8 WITHOUT actually converting data (because we already have the data in utf-8, really), and this is a bit tricky job.

Additionally, there are so called "collations". We need to remember, that Drupal always uses the "utf8_general_ci" one, but old version defaults so some Swedish Latin1. If you have different collations on your various Drupal tables, you'll probably encounter serious problems due to the "Illegal mix of collations" error. Both character set (default, where applicable) and collation are set for whole database, single tables, and single columns in them!

IF YOU ARE MOVING THE SITE FROM 4.0 TO 4.1, YOU NEED TO BACK-UP PROPERLY, AND CONVERT YOUR DATABASE IMMEDIATELY, WITHOUT TRYING TO RUN THE SITE INBETWEEN.

There are a few different scenarios:

1. UPGRADED SERVER TO 4.1, DATA STAYING THE SAME

In this case you have the utf-8 data already there, but your encoding/collation is wrong. You MAY NOT use normal "ALTER TABLE ... CHANGE CHARACTER SET" commands, because then MySQL will "convert" your data accordingly, messing it up.

There's a nice trick: We change the data columns from text/varchar types to corresponding binary types (so MySQL forgets about the encoding/collation without really doing anything with your data), and then back (with correct "utf8_general_ci").

Good news is, that there's already an implementation of this inside Drupal ( _system_update_utf8($tables) and update_convert_table_utf8($table) ), but it doesn't do its job automatically... You need to call these somehow.

There was some sort of a helper module, which seems to be for 4.7.x branch only:
http://aymanh.com/mysql-4-0-to-4-1-encoding-problem-and-solution

There's a code snippet for virtually any branch:
http://mostrey.be/drupal_character_encoding_mysql

It's still a good idea to double check with PhpMyAdmin or similar tool, whether your database was corrected properly, all tables processed, and "utf8_general_ci" everywhere. If you have just some other collation (for example "utf8_czech_ci"), you may safely change it manually.

Alternatively, you may proceed in the same way as scenario 2. below (export, then import correctly), but while exporting from 4.0, you must use the old "Latin1" character set, to avoid damage. Then, you need to change that to correct "utf8" in the file, and import to 4.1. According to http://textsnippets.com/posts/show/84 , the correct command-line steps are:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql

2. MOVING DATA FROM 4.0 SERVER TO (OTHER) 4.1 SERVER

You need to export (dump, backup - however you call this) the data from the old 4.0 database, and import to 4.1 along with correct utf-8 encoding information, so that the database engine understand the data correctly and doesn't do some unwanted "conversion".

The new database (on MySQL 4.1) should be created as "utf8_general_ci", which is somewhere inherited to newly created tables, so otherwise you risk problems later.

For command line, there are quite a few suggestions, generally agreeing on:

mysqldump -uusername -ppassword --add-drop-table --default-character-set=latin1 database > backup41.sql
mysql -uusername -ppassword --default-character-set=utf8 database < backup41.sql

If you're unable to access the command line, you may use some of the web-based tools, such as PhpMyAdmin ( http://www.phpmyadmin.net/ ) or for large databases MySQL Dumper ( http://www.mysqldumper.de/ ), which is able to handle large data avoiding php timeout problems. These may be installed to your webserver, and are also necessary for doing backups. Make sure that such applications are always protected from unauthorized access! If there's no other option (MySQL Dumper), consider at least basic HTTP authentication: http://qdig.sourceforge.net/Tips/HttpAuthGuide

Dumping the old data should make no problems; you should use the character set "Latin1" if set explicitly somewhere.

While Importing the data, you need to enforce utf-8 encoding for your new database AND connection between the database and the used web-based application. This depends on the exact application used, but if it didn't work straight away, you're likely to succeed trying the following:
- Set "utf8" and "utf8_general_ci" everywhere you can in the application
- Change all occurences of "Latin1" to "utf8" in your data file, if any. (NOTE: Unless you search only for database commands in the file, this may also affect your data! For example, you might have an article posted on your site, speaking about Latin1 encoding, which may get changed accidentally, so be careful! Also be sure, that the software (text editor in worst case) is capable of utf-8 handling. Always try on a COPY of your data first!)
- Enforce the connection is utf-8, by ensuring that the application executes the query "SET NAMES utf8" right after connection to the database. (If necessary, add it. For MySQL Dumper, it's a line $res=MSD_query('SET NAMES utf8'); added just above final return; of function MSD_mysql_connect() in the file mysqldumper/inc/mysql.php.)

Now, you should have all your tables in the utf-8 encoding, "utf8_general_ci" collation, with correct data. If you're doing this first time, please double-check through PhpMyAdmin, or some other way. As far as you're on utf-8 correctly, you may correct the collation from national ones to general manually, if needed.

3. "OOPS! I'VE IMPORTED DATA IN A WRONG WAY, CHARACTERS ARE MESSED UP, AND NO BACKUP"

If your server was just upgraded, and you didn't do anything with your data yet, then no need to worry: Your data is probably OK, just misunderstood by the new version. You need to proceed through scenario 1. above, immediately.

If you've just imported a backup/dump file from the old server - OOPS! This is a slopy way, and you should always avoid getting to this point. If you have ANY backup, please use it, and proceed along the above lines again. If not, well... There's a chance, but THIS IS UNSAFE AND UNTESTED!

Most probably, MySQL 4.1 received your utf-8 data with original "Latin1" encoding info, and so "converted" to utf-8 again. Don't try to run your site in this state - any new data will lower the chance of successfull recovery! Your probably only hope is (after doing a backup!) the suggestion from the first scenario:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql

The first command forces "Latin1", so MySQL should "convert" your data back to the original state. You may only hope, that there's no significant distortion from the double-conversion, and no new significant data added inbetween. The rest is the same as scenario 1.

If going through this unsafe way, be prepared for unexpected problems later. Do frequent back-ups, test your site extensively before going live again, and keep an eye on your logs closely.

4. MIGRATING DOWN FROM 4.1 TO 4.0

This is probably applicable if moving a site to testing environment, which is not fully up-to-date, and should be rare.

You need to dump / import your data in a normal way, using (on the command line) the switch --compatible=mysql40 while exporting from 4.1.

See also the official MySQL page: http://dev.mysql.com/doc/refman/4.1/en/charset-upgrading.html

That looks good. I would suggest putting it in the Troubleshooting FAQ section.

There's an easier way to do this. The update.php script comes with a function convert_to_utf8. I wrote a small guide on http://mostrey.be/drupal_character_encoding_mysql and answering the bounty on http://drupal.org/node/191823 am writing a full-fledged tutorial to contribute. This method has been tested by multiple people with success.

Yes, I've seen that (and taken ideas from there, honestly). Your page deals with scenario #1, and is really good on that, as far as I can tell.

But we also need to cover the scenario of moving site (dump from 4.0 -> import to 4.1), where the update.php function doesn't help: On 4.0 there's nothing to convert yet. On 4.1 data are already *really* messed up, as the conversion is performed during import (assuming the database was created as utf-8, as needed for Drupal). That's why there are other steps.

I also thought some background would be good, so people know what is it all about.

It would be really nice, if someone else review this, perhaps some feedback from people dealing with the various scenarios. I can't be really sure about my points, since I really tested only the scenario #2 with MySQL Dumper, and there seems to be quite a lot of possible problems.

(As for the bounty, you can take it, I already said I'm not applying to that.)

I would like to thank you BOTH for all the work you have put into this issue.

I will be hard at work on this tonight and will test out a couple of the scenarios (I have one site that has a lot of data added after it moved to Drupal 4.1, so old stuff has the garbled text and there is new content on top of it... and I have a half dozen sites moved, but no content changes, with the old 4.0 databases still live on the old server).

I will let everyone know how this all turns out.

Ok, I tried running the modified update.php, and while it removed garbled text on data added while the database was on the 4.0 server, all data added when the site was moved to the 4.1 server was truncated partially through the content (at random lengths at that) and was causing significant errors. So right now I am looking at just moving a copy of the old database still on the 4.0 server (Scenario #2).

I have 2 general questions. #1, Wim's method has you move the database to the 4.1 and then run the modified update.php file. From my tests this did remove the garbled text and converted the collation fields to utf8_general_ci and the content appeared to be ok. Would everyone agree this is a safe method? I ask because Jirka's comments in post#14 indicate the update.php method is great for sites where the MySQL server version was changed, but not for moving the database from one sever (w/ 4.0) to another server (w/ 4.1).

I did a test where a database was migrated from a 4.0 server to a 4.1 server, and ran the update.php method. The garbled text was gone and the database (via PHPMyAdmin) was listing collation tables as utf8_general_ci, but I am curious if this could lead to any issues I am not aware of? If NOTHING was added to the site between moving it from the two servers, is this a safe method? Or, as Jirka noted, the data is already *really* messed up?

The second question relates Jirka's approach to scenario 2 as well "2. MOVING DATA FROM 4.0 SERVER TO (OTHER) 4.1 SERVER". The first half is easy enough to accomplish:

mysqldump -uusername -ppassword --add-drop-table --default-character-set=latin1 database > backup41.sql
mysql -uusername -ppassword --default-character-set=utf8 database < backup41.sql

And indeed it does remove garbled text and, as Jirka notes, does not alter the tables (i.e. they are still listed as "latin1_swedish_ci" instead of "utf8_general_ci"). Specifically, I am wondering if you could elaborate a little on the following steps:

(A) - Set "utf8" and "utf8_general_ci" everywhere you can in the application
(B) - Change all occurences of "Latin1" to "utf8" in your data file, if any. (NOTE: Unless you search only for database commands in the file, this may also affect your data! For example, you might have an article posted on your site, speaking about Latin1 encoding, which may get changed accidentally, so be careful! Also be sure, that the software (text editor in worst case) is capable of utf-8 handling. Always try on a COPY of your data first!)
(C) - Enforce the connection is utf-8, by ensuring that the application executes the query "SET NAMES utf8" right after connection to the database. (If necessary, add it. For MySQL Dumper, it's a line $res=MSD_query('SET NAMES utf8'); added just above final return; of function MSD_mysql_connect() in the file mysqldumper/inc/mysql.php.)

I labeled the above a,b, and c.

In example (A), what do you mean "in the application"? Do you mean an application like PHPMyAdmin? If so

In example (B), if you have never mentioned UTF8 or Latin1 on your website, would a mass edit (through a text editor like 'Text Pad') of search&replace work? Is it as simple as

Latin1 => UTF8
latin1_swedish_ci => uft8_general_ci

Is this case sensative?

I am testing as much of this out as possible and will do my part to try to offer feedback and offer any assistance in clarification of the instructions where possible. Both of the samples Jirka and Wim offered were great first efforts. Wow guys, thanks.

This documentation is going to be a great help to a lot of users guys. Kudos!

Well, if you're able to import old dump into 4.1 without breaking it (at low-level), then you can happily proceed through modified update.php. I'm unsure on this, perhaps you should look into your database *files* (i.e. where the engine stores it) after import, to see how it really is. MySQL 4.1 is also able to convert from database to the query results, so what you see through queries is not that reliable here. Then we can decide whether my steps are necessary or not. I'm unsure whether I have some serious misunderstanding here, going to double-check as time permits.

Secondly, my scenario #2 is in fact two alternative scenarios: First half for command line (just pasted from somewhere, I never did it this way, as I've no command line access on live server), second half, alternatively, for web-based applications (which is my case). The latter is highly dependent on the exact app used, so that's why I employed so vague advice. Try-and-error, basically, having not much tested myself. If moving data through web apps, I really found it easiest to just add "SET NAMES utf8" to the app, and so get the DB right immediately, than to tweak update.php afterwards (if it even works for this case).

So, (A) is about PhpMyAdmin and friends, yes, and untested as I said (I just hope there might be some application out there, allowing the user to configure "SET NAMES ...." through UI, instead of tweaking code)
(B) Yes, it might work, as far as your editor handle utf-8 correctly, and doesn't get too flooded with the long file. I've no idea about case-sensitivity, should be checked against MySQL manual.

As for content added in the garbled site-state, I can't see any easy solution here: You have both *good* and *bad* data in the same tables, so no general conversion is good for both. If there's not much of such contents, you might want to paste into wordpad, delete the posts, and re-submit to repaired site afterwards. Otherwise you probably need some more complicated helper script, to take the new contents only (based on timestamps), move to a temporary table and delete originals, then convert the main table, an re-insert the new rows from temporary table back. I think such a script isn't written yet, and I'm unsure how it should look like. Better have backups, and solve the problem during post-update testing before going live, really.

Stay tuned, going to test a few things now...

Thanks for the reply Jirka.

Just a small note: When I do the following:

mysqldump -uusername -ppassword --add-drop-table --default-character-set=latin1 database > backup41.sql
mysql -uusername -ppassword --default-character-set=utf8 database < backup41.sql

It does remove the garbled text (odd A's), BUT it doesn't alter the COLLATION tables, which are still latin1_swedish_ci (instead of utf8_general_ci).

Ok, I did a test for scenario #2, to be absolutely sure:

- I took a 4.0 database dump from live server. It was done through web-based "MySQL Dumper" application - unfortunately I can't test any other way, because PhpMyAdmin dies on timeout due to database size, I've no access to the live server's command line, and I've no testing 4.0 install.
- I examined the file: There's absolutely NO mention of character set or collations in it.
- I created three empty databases: x1, x2, x3. All three as "utf8_general_ci", to avoid future trouble with "Illegal mix of collations".
- To x1, I imported the dump through the very same (unpatched) MySQL Dumper application, i.e. WITHOUT the "SET NAMES utf8" query executed.
- To x2, I imported it the same way, but WITH "SET NAMES utf8" (this is the way I normally do it)
- To x3, I imported it from command line using mysql -u**** -p**** --default-character-set=utf8 x3 < ****.sql. Unfortunately, it crashed after very few tables imported, due to duplicate entry error. This proves, that web-apps are not necessarily usable in conjunction with other tools. But however, enough data were already imported to allow further examination.
- I observed the database through PhpMyAdmin: All three databases, all tables, and all table-columns were set to "utf8_general_ci".
- x1 was garbled, x2 not garbled, x3 also not garbled.
- I executed the update.php snippet linked above on x1, and it didn't help. Still garbled.

Conclusion: This confirms my text above (the long one):

- All tables were created as "utf8_general_ci", although for x1 there was absolutely no mention of it (neither in the file, nor the application), so this tells us that it depends on the database default (which I created as utf8 general). This means that for future new tables created by Drupal, database probably needs to be set to "utf8_general_ci".

- x2 employed "SET NAMES utf8" (the same does Drupal itself, BTW), x3 (command line) employed "--default-character-set=utf8" which seems to do exactly the same. Both resulting in correct database without need for further update.php fixes.

- x1 didn't employ any "SET NAMES", and resulted in data garbled beyond repair, update.php snippet didn't help any more.

- My 4.0 file didn't contain any encoding information. I don't know if there may be some, and how the scenario changes then. But at the very least, my use case is one of possibly happening ones.

SO:

* If transferring data via export from 4.0 -> SQL dump file without encoding info -> import to 4.1, it's absolutely necessary to enforce utf8 on the import:

Import:
SQL file (no info) -> connection (enforced utf8) -> DB is happy -> correct data inside (utf8)
Runtime:
correct data inside (utf8) -> DB is happy -> connection (enforced utf8) -> Drupal is happy (utf8)

The SET NAMES query is about the encoding of CONNECTION between database and php - if wrong, database engine will do unwanted "conversion" and store garbled data inside:

Import:
SQL file (no info) -> connection (assumed to be Latin1) -> DB converts -> bad data inside (utf8 garbled)
Runtime:
bad data inside (utf8 garbled) -> DB is happy -> connection (enforced utf8) -> Drupal receives bad data :-(

No simple/reliable help then.

* The update.php snippet doesn't work in any way in this scenario, as it only changes database columns' encoding-labels, not really data inside. It's either unnecessary here, or unable to fix the problem.

Update.php fix attempt:
bad data inside (utf8-garbled) -> change to binary -> still garbled data inside (no encoding label) -> change to utf8 text -> still bad data inside (utf8 garbled)

* Collation is most likely taken from database-wide default. You should set/change your database encoding/collation before import, or else correct all tables/rows' collations on imported tables manually.

* This is all my "scenario #2", and the above proves that it's entirely different from #1.

------

* Scenario #1, which I can't test, means that database engine got upgraded, or in some other way old *inside stored* data came to MySQL 4.1 install. I'm unsure if this may also occur on import, if there was some explicit Latin1 info in the SQL file - this is not my case, and I don't know whether such a thing exists.
So data are OK, but encoding-labels wrong. Again the engine does unwanted conversion and so garbles the data, but in this case it happens on the way out:

Runtime:
correct data inside (mislabelled as Latin1) -> DB converts -> connection (enforced utf8) -> Drupal receives bad data :-(

* In this scenario, update.php snippet helps:

Update fix:
correct data inside (mislabelled as Latin1) -> change to binary -> still correct data inside (no encoding label) -> change to utf8 text -> still correct data inside (utf8)
An then runtime:
correct data inside (utf8) -> DB is happy -> connection (enforced utf8) -> Drupal is happy (utf8)

* But you shouldn't store any new content to the site before the update.php fix:

Write operation before fix:
Drupal sends utf8 -> connection (enforced utf8) -> DB converts -> stored as real Latin1 row, between other utf8 rows (just mislabelled as Latin1)

Then, you have different encodings mixed in the same table, and therefore no simple way to fix.

-------------
I hope I made myself clear. ;)

Thanks Jirka. I think I will have to try the MySQLDumpe app.

As for the commandline, I have tried it on a couple accounts and am getting the same issue: Collation labels aren't changing from latin1_swedish_ci to utf8_general_ci. The random Â's inserted into the text and the screwy ascii (e.g. alt+0149 • would get converted oddly) are gone though. e.g. I just tried this again with demo.pluswebdesign.com. Before it had the random insertion of Â's I mentioned before, now they are gone. To do this I:

1. Logged into the server running MySQL 4.0.

2. Did a database dump: mysqldump -udrupal -ppassword --add-drop-table --default-character-set=latin1 drupal > droplatin.sql

3. Downloaded the droplatin.sql and uploaded it to the server running MySQL 4.1.

4. I then populated the new account and database on the server running MySQL 4.1 by executing mysql -udrupal -ppassword --default-character-set=utf8 drupal < droplatin.sql

5. Random Â's are gone.

6. Collation tables *still* list latin1_swedish_ci instead of utf_general_ci

Below is a link of a pic of my PHPMyAdmin (an example of the random  issue on another Drupal site).

http://demo.pluswebdesign.com/files/latin1.jpg

Does anyone know why this isn't working for me? What did I do wrong that did work for Jirka?

- Joshua

And for anyone wanting to take a look at my MySQL databases...

• My Database from the 4.0 server after backing up via mysqldump -udrupal -ppassword --add-drop-table --default-character-set=latin1 drupal > droplatin.sql

http://www.pluswebdesign.com/files/demo40droplatin.sql

• A backup of my Database, after it imported to the 4.1 server via mysql -udrupal -ppassword --default-character-set=utf8 drupal < droplatin.sql

http://www.pluswebdesign.com/files/demobackup41.sql

• Finally, a mysqldump of the site on the 4.0 server without using the drop-latin1 command, instead just a normal dump via mysqldump -udrupal -ppassword drupal > 40backupnolatindrop.sql

http://www.pluswebdesign.com/files/40backupnolatindrop.sql

Edit: fixed links.

Took a quick look on that...

- Your dumps from 4.0 are both the same as far as I can see, both without any encoding info. So it probably doesn't matter whether you used "--default-character-set=latin1" or not. It may not hurt, still.
- Your dumps seem to be the same as my dumps via MySQL Dumper (as for the important bits), so probably no other scenarios revealed here.
- You used "--default-character-set=utf8" on import, and THAT's the correct trick.
- If you didn't use "--default-character-set=utf8", you'll be garbled, and update.php fix won't help, right? Can you try THAT, to confirm or deny the thoughts in this thread?
- The collation: Set utf8_general_ci on the 4.1 database before importing (preferrably while creating the new DB).
- We need to check, whether there's some switch on command line to enforce collation just like we did with charset. I'm unsure here, and don't have the time to search now.

After we fully understand your case, can we review the suggested documentation above, to identify any unclear/incorrect bits? This is supposed to be a documentation request.

*Jirka* If you didn't use "--default-character-set=utf8", you'll be garbled, and update.php fix won't help, right? Can you try THAT, to confirm or deny the thoughts in this thread?

Joshua: Let me first define garbled (for my experience): the only typological issues I have found is random insertion of the letter  as well as advanced ascii characters (like • alt+0149) which ended up looking quite odd (typically 3 different letters).

If the database was exported (from 4.0) and then imported (to 4.1) with a standard dump I get garbled text. I have not extensively tested the update.php route (due to an error message and truncated text), but from the little I have seen (will test more) the garbled text was removed *and* the collation tables were set to utf8_general_ci.

If the database was exported (from 4.0) and then imported (to 4.1) with --default-character-set=utf8 I do not get garbled text. Collation tables in my various attempts at this are still latin1_swedish_ci. I don't know what I am looking for in regards to a "correct" or "corrupted" database, but I could try running the modded update.php after this step as well for others to compare?

*Jirka* The collation: Set utf8_general_ci on the 4.1 database before importing (preferrably while creating the new DB). - We need to check, whether there's some switch on command line to enforce collation just like we did with charset. I'm unsure here, and don't have the time to search now.

Joshua: Yes. But before it can be documented we need to find out how and test it :) Kind of like the mysqldump with the drop latin1/add utf8 which I found on the forums. It seems to remove garbled text--which resulted in some users thinking the problem was fully resolved--but it doesn't update the collation, and from reading other threads here at Drupal this seems important?

*Jirka* After we fully understand your case, can we review the suggested documentation above, to identify any unclear/incorrect bits? This is supposed to be a documentation request.

Correct. It would be great if someone just knew how to resolve this through various methods and could just give us documentation. Right now we are kind of flying by the seat of our pants, checking out suggestions from various posts, testing them, and verifying what they do/don't do, and any potential issues so we can get to the point of writing up a nice set of documentation.

I can write nice documentation once we figure this stuff out... my problem is I have no clue what I am looking for. I am more than happy to post my results, test things out, and give feedback. If you needed a hosting account with commandline access I could accomodate this.

I just am not proficient enough on the technical end in regards to MySQL to say "this works" or "this is broken". e.g. You noted potential issues with update.php because it is ran after the DB is on the 4.1 server--but I wouldn't even know where to begin to identify such issues. So I am limited to being a tester and sounding board and offering any resources I have available.

Your feedback has been EXCELLENT btw. It would be great if a couple other MySQL/Drupal gurus here could chime in. 4.0 => 4.1/5.0 migrations are going to become more common as time goes on and there are literally dozens of support requests on the Drupal site (and on a number of blogs, see my links above) so this is a known issue that needs addressing. I am sure all those suffering these issues appreciate the time all those who have contributed here and in other threads to address this issue. Major kudos :)

*Jirka* We need to check, whether there's some switch on command line to enforce collation just like we did with charset. I'm unsure here, and don't have the time to search now.

Joshua: You got me thinking and I started searching and browsing some of the links I posted and part of the answer may be in there. Note especially the posts by ca_grover ericatkins post in the first link (although he doesn't say if you should do this before or after you migrate to the new server?)

http://drupal.org/node/60409
http://drupal.org/node/68100
http://www.uberellis.com/drupal/drupal-tip-6-mysql-4-0-and-mysql-4-1
http://www.google.com/search?hl=en&q=alter+table+Mysql+4.0+4.1+DRUPAL

In the first link ericatkins suggests running the SQL command (in PHPMyAdmin or other) for each table, something like:

...
ALTER TABLE blocks CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE blocks_roles CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE book CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE boxes CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE cache CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
...

I guess the question is: if this is done on the 4.1 server, is damage already done to the data?

(My testing for "garbled" is maybe more strict: My site is non-english, almost every content-string contains characters like ěščřžýáíé and similar in nearly every word. These go into total mess, several random chars replacing each of these, so it's really easy to spot. Just a side-note, why I think I see every problem for sure.)

Otherwise, I have suspicion that there's more that matters: If you import with utf8 connection enforced, but into Latin1 database, MySQL might convert data really and correctly into Latin1 (and store so), and then convert back to utf8 for Drupal, effectively working nicely but with Latin1 internal storage. If this is true, then update.php snippet should BREAK the data in this edge case, it would be a performance problem (double conversion), and first of all, the content will be distorted, as Latin1 is not capable of storing full set of utf8 characters (for example the Czech ěšščřčžž is in Latin2 only, not Latin1).

I hope to find time, and test this soon.

#24: I'm almost sure that the key here is, to have utf8 data already (whichever scenario it was), and only then you'll be able to ALTER TABLE to fix collation. If changing CHARACTER SET, you in fact perform data-conversion on-demand, breaking the data.

I guess the question is: if this is done on the 4.1 server, is damage already done to the data?

I think, that:
- If you did the import in a wrong way (--default-chartacter-set.....), then yes, damage is done.
- If you did the move correctly (needs more testing to define the term exactly), but have "Latin1" labels on the columns, then no, damage is not done yet, but you're going to do damage by ALTER TABLE
- If you did the import correctly and have utf8 labels (from creating the database, or after the update.php snippet), then no damage done and no danger of doing damage later.
- If you did the import in the hypothetical way mentioned in #25 (if that exists, going to test), then damage done, and you're going to revert part of it by ALTER TABLE, but part of the damage is irreversible.

I hope to be able, after more testing, to summarize the whole thing into some kind of a chart.

OK, there's no point in discussing partial info and partial experience all over again. I did a full set of tests:
- Took three SQL dumps: "FA" From 4.0 (one table cut from MySQL Dumper backups of my live site), "FB" from 4.1 (the same table from my localhost working copy of the site, via PhpMyAdmin), "FC" from 4.1 but exported in the 4.0 compatibility mode. The last one ("FC") is identical to "FA" and gave the same results, so I'm not listing it below. It just shows a possible way to migrate backwards (#11 - scenario 4).
- Created a big set of databases on testing local MySQL 4.1.19 server, one half ("DL") as "latin1_swedish_ci", other ("DU") as "utf8_general_ci".
- Imported the files into the databases from command line, enforcing utf8 ("CU") / latin1 ("CL") / none ("CN") for each combination. The option without encoding ("CN" - i.e. no --default-character-set=... specified) gave identical results to "CL", so I'm not listing it below.
- Observed collations / data visually via PhpMyAdmin on each resulting database.
- Copied MySQL's internal data files aside, and examined the hexadecimal byte-streams in there (manually as root user, to ensure no intrvence from MySQL engine).
- Based on these observations created the table and options shown below
- Tested all the fixes on that single table in each database, all worked exactly as expected. So all the below is 100% tested by me (unless I say that I'm not sure about something).

       +-----------------+-----------------+
       |       CU        |        CL       |
+---+--+-----------------+-----------------+
|   |  |latin1_swedish_ci|latin1_swedish_ci|
|   |  | Data stored: G1 | Data stored: OK |
|   |FA|Data visible: V? |Data visible: VD |
|   |  |   Fix: QA       |   Fix: QC       |
|   |  |  Safe: No :(    |  Safe: Yes :)   |
|DL +--+-----------------+-----------------+
|   |  | utf8_general_ci | utf8_general_ci |
|   |  | Data stored: OK | Data stored: G2 |
|   |FB|Data visible: OK |Data visible: VD |
|   |  |   Fix: --       |   Fix: QB+QC    |
|   |  |  Safe: Yes :)   |  Safe: Unsure :/|
+---+--+-----------------+-----------------+
|   |  | utf8_general_ci | utf8_general_ci |
|   |  | Data stored: OK | Data stored: G2 |
|   |FA|Data visible: OK |Data visible: VD |
|   |  |   Fix: --       |   Fix: QB+QC    |
|   |  |  Safe: Yes :)   |  Safe: Unsure :/|
|DU +--+-----------------+-----------------+
|   |  | utf8_general_ci | utf8_general_ci |
|   |  | Data stored: OK | Data stored: G2 |
|   |FB|Data visible: OK |Data visible: VD |
|   |  |   Fix: --       |   Fix: QB+QC    |
|   |  |  Safe: Yes :)   |  Safe: Unsure :/|
+---+--+-----------------+-----------------+

Database:
* DL - Default "latin1_swedish_ci" set before import.
* DU - Default "utf8_general_ci" set before import.
This is set while creating the database, or through ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, assuming that you have permission. Should be set to avoid further problems with "Illegal mix of collations".

File:
* FA - Dump from MySQL 4.0.x (have no encoding information). The same may be also taken from 4.1.x using the backward compatibility switch --compatible=mysql40
* FB - Dump from MySQL 4.1.x (contains encoding information)

Connection character set on import:
* CU - Enforced as utf8, import done as mysql --user=name --password=pass --default-character-set=utf8 db_name < file.sql
* CL - Enforced as latin1 mysql --user=name --password=pass --default-character-set=utf8 db_name < file.sql, or none specified (the same result, latin1 is default) mysql --user=name --password=pass db_name < file.sql

Collation: "latin1_swedish_ci" or "utf8_general_ci", as seen on the table through PhpMyAdmin (after import).

Data stored: Actual state of data inside MySQL data-storage directory (analyzed in .MYD files on disk as raw byte-stream)
* OK - Valid utf8 data, all OK.
* G1 - Data stored in latin1 encoding (i.e. converted from utf8), some characters missing (for example "Ž" stored well in latin1 encoding as #8E, but "Ř" doesn't exist in latin1, so replaced with a question-mark).
* G2 - Data garbled by double utf8 encoding (i.e. "Ř" in correct utf8 is #C5 #98, but since the input stream was understood as latin1, each of these two bytes was "converted" to utf8 again, resulting in #C3 #85 #CB #9C).

Data visible: Data in tables, as seen through PhpMyAdmin (or through Drupal itself)
* V? - Visually correct, because MySQL does a conversion for you now on every query executed (i.e. bad for performance, too), but some special characters are replaced by question-marks. Depending on the language used, there may be none, few, many, or *all* question-marks.
* VD - Each special character turned into *two* other random characters. (Again, frequency depends on language used.)

Fix: Which queries you need to execute...
* QA - Your data got converted to latin1 on import. You need to convert back, but you'll not be able to get rid of question-marks - this damage is permanent. For each table, execute:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
* QB - Your data got converted from (fake) latin1 to (double-encoded) incorrect utf8. You need to convert back, before proceeding to better fix. For each table, execute:
ALTER TABLE table_name CONVERT TO CHARACTER SET latin1 COLLATE latin1_swedish_ci;
* QC - Your data are OK now, but you need to correct the encoding/collation info. This is tricky and needs to respect the existing structure of the table, so it's best done with a php code. The function update_convert_table_utf8('table_name'); inside update.php does this, and is used in the Heine's snippet linked above. The final queries, as generated for each table differently, are something like:

ALTER TABLE drup_boxes DEFAULT CHARACTER SET utf8;
ALTER TABLE drup_boxes CHANGE `body` `body` longblob DEFAULT '' NULL, CHANGE `info` `info` varbinary(128) DEFAULT '' NOT NULL;
ALTER TABLE drup_boxes CHANGE `body` `body` longtext CHARACTER SET utf8 DEFAULT '' NULL, CHANGE `info` `info` varchar(128) CHARACTER SET utf8 DEFAULT '' NOT NULL;

Safe:
* Yes - Your data are untouched, unchanged, the whole fix only affected MySQL table settings. If you didn't save anything to the database before fix (if any), you're 100% fine.
* Unsure - Your data undergone double conversion, binary data routed through the latin1 encoding. This should be safe theoretically, but still it's unsure whether a character or two got lost due to mapping the character set onto a different one, twice.
* No - Your data undergone double conversion, text routed through the incomplete latin1 encoding. Any characters not supported by latin1 (quite a LOT!) got changed to question-marks, permanently. This probably only affects your content, as Drupal usually don't use special chars internally, so the site probably won't crash, but the content is broken.

If you have the original file still (you *should* have it, since you just imported), better start over again in a better way, instead of going through the "Unsure" or even "No" scenarios.

If your server / hosting provider upgraded to 4.1 without really moving your data, then you suddenly found yourself in the state DL/FA/CL (see the table), and should proceed to the fix immediatelly (this is what wmonstrey describes). It would be also good to fix the database default collation for future tables.

So, as for Acert93 - You still have collation as latin1_swedish_ci, because you ended up in the DL/FA/CU state. If you haven't any latin1-incompatible characters in your content, QA is the fix for you, but if there are any "bad question-marks" - no luck... (Funnily, newly added content is probably OK in this case.) Better import again, into utf8-defaulting database (see "DU"). If you're unable to change your database defaults (some hostings are limitig your permissions), then better import as "CL", and then fix as "QC". But I'm unsure whether the db-default latin1 will be trouble in future with any new Drupal tables :-/ It might.

Some highlights:
* Note that there are TWO different "garbled data" options. You may even get third, if using the Q3 fix on DL/FA/CU case (latin1 stream misunderstood to be utf8, probably resulting in completely invalid data, but this shouldn't ever happen unless you tried a wrong fix).
* MySQL lies about the data: See difference DL/FA/CL vs. DU/FA/CL - you see the same data, but you don't have the same data, and you need a different fix! Also all the "garbled" problems only apply on characters above basic 7-bit ASCII, i.e. English is OK always and everywhere, only foreign characters or special punctuation and the like gets broken.
* MySQL 4.1 dump enforces utf8_general_ci on tables, while imported.
* For MySQL 4.0 dump it's taken from database defaults while imported to 4.1.
* For both you need to use "--default-character-set=utf8", unless the dump already contain "SET NAMES utf8" inside (depends on the way/application used to dump). But even then, it may not hurt to use it.
* Regardless the other things, you want to have the same collations on all Drupal tables; if there are differences, Drupal may stop working due to "Illegal mix of collations" error. (I encountered this, quite horrible, sidebars missing, half the features dead....) It should be "utf8_general_ci", and NOT for example "utf8_czech_ci" (or whichever else national one).
* There's no command line switch to enforce collation alone, my suspicion was false here. We need to act along the table, to get the right collation in the end.

Just did a small edit to the above, minor corrections.

(Also note, that on IE6/Windows you'll be sometimes unable to see my example Czech characters quoted in the text. If you see just empty squares instead - it's basically the very same problem (like the question marks) of utf8 text routed through incomplete latin1-alike encoding, demonstrating that Windows have the same problem, too. Luckily, this only happens on the screen, if there are Czech (or other) characters on a site that claims to be english. Unrelated to MySQL.)

I'm going to do a revision of my first text, with the new information included somehow, but don't expect anything till Thursday at least. My spare time is pretty overloaded now, and this is quite time-expensive.

Wow Jirka, good stuff. I am going to be running through your posts and doing some tests to validate over the next couple days as well as try some new code Wim so kindly sent me.

I have a question: What should I be looking at in the DB to determine the status of the export/import?

Are there any queries that can be run in PHPMyAdmin or the Commandline that will let me know what errors, if any, have cropped up? A couple simple tests for people to see the status of their DB would be helpful in indicating the health of their DB and direct them toward any changes.

The state is especially indicated by the process you've just done (importing), i.e. which character set switch used, and where the file came from. Otherwise, I think you may find things out by examining the (default) collation on database, on tables in question, and how your data looks like (i.e. OK/double garbled chars/question marks), and then comparing to my table. But really, I need a bit of time to revise my initial text, to include some directions "if this then check that and do xyz..."

I think this documentation is now heading towards 3 sections:
1. Basic introduction and a bit of background
2. Common cases (i.e. few of my "scenarios", mostly suggesting to drop garbled database and re-import the file in a better way - that's really easiest and safest way, mostly)
3. Other ways out (i.e. my table, with some basic guidelines to determine where you currently are, and so which fix to use)
...and 2 appendixes:
A. Full background (i.e. what happens inside, roughly, to understand the processes here)
B. Use of web-based applications instead of command line.

Finally, my revised text is here. I'm going to let it just hang here in thread for a while, in hope of some feedback, identifying any unclear bits, and perhaps some corrections from a native speaker. If no-one complains, I'll then try to put it in the Drupal handbook, somehow. I'm getting quite tired of this, now :-/

-----------------------------------------------------------------------------

MYSQL 4.0 -> 4.1 MIGRATION

The MySQL version 4.1 introduced some changes to character set handling, which are often giving us trouble if moving a Drupal site from one to the other (i.e. from 4.0 or below, to to 4.1 or above). Basically, Drupal always uses utf-8 encoding, but MySQL 4.0 and lower doesn't support it, so utf-8 data are treated as default Latin 1 inside the MySQL 4.0 engine. It doesn't matter there, as MySQL output echoes the same data as sent in still, so Drupal works fine, but it matters a lot in the migration process, where MySQL might attempt to "convert" the data from Latin 1 to utf-8, not knowing that we have utf-8 in there already, and so the site gets broken then.

There are a few typical symptoms, and typical scenarios, (all discussed later in more detail):

- "Garbled text": Any special characters (non-english, advanced punctuation and the like) in your site's contents turned into seemingly random pairs of weird characters after moving the site to 4.1. (Typically, "ÅÃÄ" are frequently seen, always followed by some other non-english character).

- "Question marks": Your contents is more or less fine, but some special characters (non-english, advanced punctuation and the like) are replaced by question-marks after the move to 4.1.

- "Illegal mix of collations": Your site worked after the move to 4.1 somehow, but later crashed badly (typically after running update.php on the next upgrade, installing a new module, or the like). Missing sidebars, virtually any part of the pages vanished, no way to access even administrative pages - all that is possible. Looking into the logs (manually in the 'watchdog' table in worst case), you'll see error messages "Illegal mix of collations".

IMPORTANT: All these symptoms mean, that your database is NOT healthy after the move. Don't try to run your site in such a state: Any new data stored to the database are only making things worse! You need to fix your database during the migration process, or immediately after, before going live again! Keep backups! If you did a wrong step, you'll most probably need to restore your database from backup, and try again; it's usually easier and more safe, than fixing broken data afterwards. If you don't have any backup, create one now! Also double-check your collations after you're done, to avoid running into the "Illegal mix of collations" problem later.

Note that basic English characters are the same across all encodings, so your English contents (as well as Drupal's internal data) is not affected, and you might not even notice that there's a problem, if running English site. But still, some things may be broken.

SCENARIO 1: UPGRADED SERVER TO 4.1, WITHOUT REALLY MOVING DATA

If you (or your hosting provider) just upgraded MySQL to 4.1+, without really moving the data to another machine or another database, you've correct utf-8 data already in your database, but most probably you've wrong encoding information (for details, see (B) in the table below). You see the "Garbled text" problem, and your tables are "latin1_swedish_ci" in the database.

You have two options:

A. Alter encoding/collation directly in database. This must be done in a tricky way (changing columns to binary types and back), to avoid real conversion of stored data.

The operation, however simple for the server, is not so easy to do manually, so probably the best way to accomplish this is a code snippet (as seen in multiple related posts, originally provided by Heine):

function convert_to_utf8($tables) {
  foreach($tables as $list) {
  update_convert_table_utf8($list);
  }
}
convert_to_utf8(array(
  'access', 'accesslog', 'aggregator_category',
  'aggregator_category_feed', 'aggregator_category_item',
  'aggregator_feed', 'aggregator_item', 'authmap', 'blocks',
  'book', 'boxes', 'cache', 'comments', 'contact',
  'node_comment_statistics', 'client', 'client_system', 'files',
  'filter_formats', 'filters', 'flood', 'forum', 'history',
  'locales_meta', 'locales_source', 'locales_target', 'menu',
  'node', 'node_access', 'node_revisions', 'profile_fields',
  'profile_values', 'url_alias', 'permission', 'poll', 'poll_votes',
  'poll_choices', 'role', 'search_dataset', 'search_index',
  'search_total', 'sessions', 'sequences', 'node_counter',
  'system', 'term_data', 'term_hierarchy', 'term_node',
  'term_relation', 'term_synonym', 'users', 'users_roles', 'variable',
  'vocabulary', 'vocabulary_node_types', 'watchdog'
));

This needs to be pasted into your update.php file, just below the lines:
  update_fix_schema_version();
  update_fix_watchdog_115();
  update_fix_watchdog();
  update_fix_sessions();

You need to look at your database first, and update the long list in the snippet, to include all the drupal tables you have. Then run update.php, so that you see the page where you select versions (no need to proceed further). This snippet uses a special function, which is already present in Drupal code, and performs the necessary table alterations automatically.

Be sure to always keep a copy of the original (unchanged) file, and return it back to original place afterwards.

Still, you need to change the default encoding/collation of your database, to avoid future "Illegal mix of collations" problems.
mysql --user=username --password=password --execute="ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"

B. Export and re-import. This option is more complicated and resources-expensive for the server, and depends on your permissions on the system, but might be easier from user's perspective, especially if you feel uneasy about editing Drupal files manually. It's basically the same process, as migration between two different servers:

First, export your database (from the newly upgraded 4.1 server) in backwards-compatible mode:
mysqldump --user=username --password=password --default-character-set=latin1 --compatible=mysql40 dbname > dump.sql
This gives you exactly the same backup file, as if you did the dump before MySQL upgrade, i.e. removing any bad encoding information, and keeping the seemingly "Latin 1" data unchanged. (Better check, whether the file is correct, with all your tables inside, before proceeding further.)

Then, you need to empty your database, by removing all the tables, and change it's default encoding/collation to "utf8_general_ci":
mysql --user=username --password=password --execute="DROP TABLE table_name, table_name...... ; ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"
This may be a bit annoying, as you need to specify names of all tables to drop - PhpMyAdmin might be a better option here. OR even better, if you have the necessary permissions, and no other tables are in the same database (i.e. from other applications than Drupal, on the same server), you may simply just drop the whole old database instead, and create a new one (as "utf8_general_ci"):
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
Both these ways should result in an empty database with default encoding/collation "utf8_general_ci".

Finally, import the file back into the database:
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql

Now, your database should be perfectly OK, "utf8_general_ci" set on all tables, and Drupal working fine. You're done.

2. MOVING DATA FROM 4.0 SERVER TO (OTHER) 4.1 SERVER

If you're migrating to a different server (which may as well be localhost for testing, or the like), you need to export the data from the old 4.0 database, and import to 4.1 along with correct utf-8 encoding information, so that the database engine understand the data correctly and doesn't do some unwanted "conversion". This is important; if you imported your file in a wrong way, your data got broken, and you'll probably need to try again from scratch, rather than fixing the broken data (which is always a bit unsafe).

On the old 4.0 server, export your database:
mysqldump --user=username --password=password --default-character-set=latin1 dbname > dump.sql

On the new 4.1 server, create the new database with "utf8_general_ci":
mysql --user=username --password=password --execute="CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
OR alternatively alter your existing database:
mysql --user=username --password=password --execute="ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"
If the 4.1 database already was "utf8_general_ci", you may proceed happily to the next step; you needn't to do any of these two options then.

Import the file to your new database:
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql

Now, your database should be perfectly OK, "utf8_general_ci" set on all tables, and Drupal working fine. You're done.

3. MOVING DATA FROM 4.1 SERVER TO 4.0

This is probably rare - it might only happen, if you're moving the site to a local server for testing, having the local server not upgraded yet. It's rather simple: Export in backwards-compatible mode, then import to 4.0:

From 4.1:
mysqldump --user=username --password=password --default-character-set=latin1 --compatible=mysql40 dbname > dump.sql

To 4.0:
mysql --user=username --password=password dbname < dump.sql

4. ENCODING/COLLATION SETTINGS

MySQL keeps track of your data's nature by using an encoding/collation information on various places. While collation (the part with various nation-names in it) is only affecting some operations like sorting and comparing, and so may be changed safely, encoding (utf8 or latin1) is addressing the nature of your data, and so needs to be always correct:

- Table columns: The encoding here describes your data directly. This is the most important bit for your database health; any change done to single columns is likely to perform a real conversion of the data, so better don't try to alter encoding here. (Scenario 1. option A. shows how to alter encoding without really changing data.)

- Tables: The info on tables is used as a default for any new columns created inside the table. You may change this to utf8 with queries like ALTER TABLE table_name CHARACTER SET utf8 COLLATE utf8_general_ci;.

- Database: The info on database is used as a default for any new tables created inside the database. This is important for imports from 4.0 dumps, as these contain no information about tables encoding/collation. You may change the database defaults to utf8 with the query ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;, if you have permissions for that.

- Connection: This is the encoding of data coming in/out of MySQL engine, i.e. data processed by Drupal and shown on your screen. It's specified through the --default-character-set=utf8 switch for each new connection (or with SET NAMES utf8 query, as seen in Drupal itself, as well as *some* MySQL 4.1 dumps). Default is latin1, which we don't want.

If this is different from the table column's encoding, MySQL performs a conversion, attempting to present more or less correct data. Being that the case, you may encounter the "Garbled" or "Question marks" problems, either on data read from OR stored to the database, or at the very least a bit worse performance. This is where all the damage happens on migration.

Generally, whichever way you choose to manipulate your data, you need all these four settings to be "utf8" and "utf8_general_ci", or at the very least you need to have all table columns as "utf8_general_ci". Having table columns on latin1 means conversion (bad performance, and question-marks problem), and having table columns different from each other means "Illegal mix of collations" (MySQL can't perform certain operations (joins) between different collations).

5. ALTERNATIVE WAYS TO MANIPULATE YOUR DATABASE

Sometimes it happens, that you can't use the shown command-line examples, because you're on a different system, you've not enough permissions, or you have no access to the command line at all. If your hosting provider don't let you in, you may install some of the popular web-based applications, such as for example the popular PhpMyAdmin, or for large databases some other ones, like MySQL Dumper, which is able to handle large data avoiding php timeout problems. You may then execute your operations using these applications - sometimes it's even more comfortable.

Make sure that such applications are always protected from unauthorized access! If there's no other option (MySQL Dumper), consider at least basic HTTP authentication.

This page can't describe all the options of all applications in detail - please refer to corresponding manuals for that. As for the examples provided here - for dumping/importing data, you need to use the corresponding functionality provided, while other queries may be usually run directly:

For example, from this:
mysql --user=username --password=password --execute="DROP TABLE table_name, table_name...... ; ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"
You need to take the actual query:
DROP TABLE table_name, table_name...... ; ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
And just paste it to the "SQL" input of PhpMyAdmin.

There's one important bit, though: The connection encoding. To emulate the --default-character-set=utf8 switch, it's necessary to execute SET NAMES utf8 query on every page load. Some applications do this, some not. PhpMyAdmin lets you choose the connection, while MySQL Dumper (speaking of version 1.21-b6) needs to be hacked a bit to import your MySQL 4.0 dump well: Add a line $res=MSD_query('SET NAMES utf8'); just above final return; of function MSD_mysql_connect() in the file mysqldumper/inc/mysql.php.

6. WHAT ELSE MAY HAPPEN, AND HOW TO FIX IT

Depending on how the migration was done (attempted), you may find yourself in four different states:

                   +----------------------------+-----------------------+
                   |Import done with the switch | Import done without   |
                   |--default-character-set=utf8|encoding specification,|
                   |  or SET NAMES utf8 query   |   or with latin1      |
+--------+---------+----------------------------+-----------------------+
|        |File from|                            |                       |
|        |MySQL 4.0|                            |           (B)         |
|        |(encoding|            (A)             | (Also happens after   |
|        |info not |                            | MySQL upgrade without |
|Database|included)|                            | really moving data)   |
|default +---------+----------------------------+-----------------------+
| latin1 |File from|                            |                       |
|        |MySQL 4.1|                            |                       |
|        |(with    |            (C)             |           (D)         |
|        | encoding| (You need to fix database  |(You need to fix datab.|
|        |  info)  |  default collation too!)   |default collation too!)|
+--------+---------+----------------------------+-----------------------+
|Database|No matter|                            |                       |
|default | which   |            (C)             |           (D)         |
| utf8   |  file   |                            |                       |
+--------+---------+----------------------------+-----------------------+

(A) - You have "latin1_swedish_ci" on your tables and columns, your contents looks more or less correct, but some special characters are replaced by question-marks. Currently, your data are stored in Latin 1 encoding, which can't handle all the characters, and requires MySQL to perform a conversion on each query.

It's probably best to use your backup now, and start the migration over again.

If you have no backup to revert to, you may try to execute the following query for each of your Drupal tables:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
This will convert your data and tables back to correct utf-8 state (including any data submitted to the site in the broken state, even), but however, the question-marks in your contents will stay. This is caused by latin1 being incapable to store full range of utf-8 characters, so the damage is permanent, and this way should not be considered safe.

You still need to correct the database defaults, to avoid running into the "Illegal mix of collations" error later:
ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

(B) - You have "latin1_swedish_ci" on your tables and columns, and your contents looks "Garbled". But still, your data are stored as correct utf-8 and no damage done yet! You have only just wrong encoding information in your database (causing MySQL to break your data on the way out).

The fix is not as simple as executing a query or two. Please read the scenario 1. above for details, or revert to a backup if this is a result of unsuccessfull import just done. The fix is safe, however, unless some new data got submitted in garbled state.

(C) - Congratulations! You have "utf8_general_ci" on your tables and columns, correct data inside, and Drupal working well. Unless you need to correct the database defaults, you're done already, and completely safe.

(D) - You have "utf8_general_ci" on your tables and columns, but your contents is "Garbled". Unfortunately, that's really how your data are currently stored (after double utf-8 encoding, all two-byte characters turned into four bytes, and so into pairs of weird characters, one per original byte).

It's probably best to use your backup now, and start the migration over again.

If you have no backup to revert to, you may try to convert your data back with the following query for each of your Drupal tables:
ALTER TABLE table_name CONVERT TO CHARACTER SET latin1 COLLATE latin1_swedish_ci;
If everything went well, this took you back to the (B) state, which is still wrong, but allows you to proceed to an other fix at least. Now you may perform the steps described in scenario 1. above, to fix the remaining problem. But since your data undergone double conversion, this way should not be considered safe.

Depending on the circumstances, you might still need to correct the database defaults, to avoid running into the "Illegal mix of collations" error later:
ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

7. LINKS

See also the official MySQL page.

More discussion may be also found on the original issue and various pages linked from there.

StatusFileSize
new5.48 KB

Here's the documentation I created already, with more backup and more information on how to enhance the update.php file. As per Heine's advice I'll create a separate script that does this, so users can simple download and use it without having to mingle with the update.php script.

Great, this one (very similar to your previous version I already reviewed) is very detailed and fine-guiding, but covers only a subpart of my documentation - namely, it's only scenario 1. approach A., and only fixes the state (B). Doesn't help much for scenario 2., and doesn't fix states (A) and (D).

But still, I appreciate your work. Really! I hope we can merge it all together somehow, or perhaps append one to the other, as additional explanation in more detail / broader scope. Especially, if you'll have the special script ready-to-use, then I'm going to remove my scenario 1. part A. paragraph, in favor to a reference to your text, which will be then much more usable than update.php tweaking. But your text, along with the script attached, will then need to live on Drupal.org (either appended to the page I'm about to submit, or separately). If we're going to set up single page together (which will be easier to find for users), of course we need to credit both of us ;)

It's indeed in everybody's interest to put this on one handy handbook page. I sent Acert93 a script last week that should get rid of the nasty "A" characters he had on both his old and new content, so after I receive feedback from him and after making a stand-alone script (that automatically detects all tables) we can give a swing at a first version on d.o?

Where should the documentation live? http://drupal.org/about/drupal.org-FAQ? chx mentioned this will become more and more of a hot issue so it should probably get a prominent place.

Sepeck suggested in #12 to put it into Troubleshooting FAQ, and I think that might be the right place: If I was searching this kind of information, I'll go there for sure.

This task is getting quite long in terms of time, a whole week of zero activity here in the issue now... So I decided to go forward, and submit the existing text to the Troubleshooting FAQ handbook section now, so that users may benefit from the already-usable text. Holding this back for long periods of time doesn't help anything.

The page: http://drupal.org/node/198184

Note that I submitted a further revised/improved text - after more testing I adjusted the notes about database default collations and "Illegal mix of collations" problem, as well as a few typofixes and a mention of (#32) separate helper script coming.

This issue stays open for further improvements, and I'm going to edit these into that page immediately, as they come.

Great work! I'll be uploading the stand-alone script tomorrow or this weekend so section 1.A. can simplified.

StatusFileSize
new3.36 KB

Here is the stand-alone script. It will check for the list of tables and run update_convert_table_utf8 on each of them. Could you include this in the documentation? The documentation should probably include a warning that use of this script comes with no support whatsoever and that the user should take a back-up before using it.

Thanks! I'm going to test this today or tomorrow, and fix any minor coding style issues found (seems there are a few). Also I need to check how this works with prefixed table names; we probably need to only change the Drupal tables with appropriate prefix (if any prefix used), other tables may belong to other applications (with some other encoding).
Stay tuned... ;)

Interesting. Catching prefixes is done automatically, with ALTER TABLE {'. $table .'}. I'm not sure how it can be checked if a certain table belongs to Drupal or not? Should I provide a form first which displays a checklisted list of all available tables so the user can uncheck any tables he doesn't want to convert?

Btw if you find any code style issues, be sure to check them against update.php as well.

Actually using {} doesn't cover the prefix problem since I'm getting the full table names straight out of the database, so it's best not to use it at all. Or to find another way to pick out all drupal tables.

I have some ideas, and I'm going to work on this right now. Currently setting up testing data; you may expect my proposals around midnight.

StatusFileSize
new13.73 KB

This was quite bigger work than I expected... Finally, I've rewritten 99% of the script.

The use cases I see are quite wide-ranged. For one thing, there may be table prefixes (as mentioned above) and non-drupal tables in the same database. Even worse, Drupal supports also per-table prefixes, which makes the whole thing pretty complex.

I managed to make the script auto-discover Drupal tables with all the prefixing stuff, but it's still not 100% sure: If someone runs some other application on the same database without prefixes (being just lucky that the other application uses different table names than Drupal), then there's no way to tell what's Drupal and what not. Additionally, the user may want to convert non-Drupal tables too (because the other application also uses utf8 and so suffers from the migration all the same), OR to avoid tables wrongly identified as Drupal (because the other application is not utf8). OR even convert just *some* of the non-drupal tables, as is the case of myself:

I'm testing on real database from my production site: Drupal 5.x with quite a few contribs have 131 tables with a prefix; another 79 tables belong to 3 other applications - two of them running also utf8 (I want to convert these too), the last one is on Windows 1250 encoding (so I don't want to convert it).

In this light, the script definitely needs some real user interface, with additional screen for manual table selection. So now it works quite similar to update.php (themed output, the same access check...), offers to process drupal/other/all tables, or to select tables manually on extra screen. To allow processing of non-drupal tables, the queries are performed without Drupal's native prefixing (i.e. no curly braces, intentionally).

Another problem is, that we don't know on which Drupal version the user is. The MySQL server upgrade occurs independently of Drupal versions. As for now, it's most likely to be somehow linked to 6.x coming (because 6.x is the first version requiring 4.1 minimum), but we still don't know from which version the upgrade is, and whether the user attempts to fix the database before (upgrading environment first), or after ("OOPS, help me!") the Drupal upgrade.

This means that our script may land on top of virtually any Drupal version, and suffer from various API differences between them. It would be great, if we do with just *one* script (as oposed to separate scripts for various Drupal versions). But since we're also likely to run with *really unhealthy* database (that's the whole point here, after all), I guess it's no big deal to just avoid as much of Drupal API's as possible. So now the script works the same for Drupal 4.7.x, 5.x, and 6.x (I tested) - a bit of code is in fact duplicated in the script, but not that much (luckily, there are no big API changes related to our task). I avoided Forms API (heavily changed) and Batch API (not available on older versions); the full bootstrap seems to be unavoidable, though (initially I worked on database bootstrap only, but there were nasty side-effects even with maintenance pages theming, and the usual uid=1 access check also needs full bootstrap). The access check fashioned after update.php now also includes both the 4.7.x/5.x and 6.x checks (that bit changed too) for best compatibility.

Yet another big problem was revealed, when I started testing on real data: Although the whole thing is just altering tables, without really processing data on php-level, the operations are still *very* time-expensive while performed on large tables. My site is not that big - a few thousands of nodes, 800 users, the biggest table is search index with about 15000 entries. But on my test install (rather slow desktop), it ran into php time limit immediately, after just few first tables from a long list. Because of this, neither of the previous scripts and snippets work for sites of my size (and there are *much* bigger out there).

The solution is a bit scary for me (not much experience on this), but obvious: Batch processing (multi request). I cloned the bare minimum from old update.php versions (no JavaScript, no themed progress-bars etc.), aiming to preserve the cross-version compatibility and potential broken-database safety. And fortunately, it works :) Even some of the single tables are nearly dangerous time-wise, but I guess that it should be OK on modern hardware for virtually any site. (The time chunks are increased from 1 to 2 seconds, because with all the code already executed in start-up phase of the script, there was very little time actually left.)

So finally the shape of our script became very similar to update.php bare skeleton. I successfully tested it on 4.7.10, 5.5 and 6.x-dev fresh installs, as well as a copy of 5.x production site mentioned (with prefixes and non-drupal tables); all fine.

Now attaching the script for testing. To use it - just copy the file (renamed back to mysqlfix.php) into your Drupal root directory (where update.php is), and point your browser to it.

I'm only going to put it into the documentation page after someone confirm that it works fine. I'm also unsure how to do it (no attachments in the handbook), probably I'll just link to the file-attachment here on issue (and include the nasty guidelines to rename from .txt to .php because of nasty limited attachments here). It's not nice, but I can't see a better way: This sort of material clearly belongs into the CVS (perhaps sandbox area or the like), but Drupal.org policy only allows this sort of contributions to the members who also maintain a module at the same time (as I learned in http://drupal.org/node/196928 ). This script is not going to be a module, though, due to it's obvious update-nature: It's not a good idea to limit access only to users, who already browsed through their broken site all the way to modules administration page, and so polluted their database by various broken entries.

Feedback welcome.

Wow, amazing job you did there! I just tested it on 4.7.10, 5.5 and 6-dev and it works as advertised. It's indeed best to have one script and in that light I agree it's better not to use FormAPI but regular HTML forms. The batch-processing, manually selecting of tables, it all works. Kudos!
I have a cvs account so I can put it up in a sandbox. I don't have the time now so I'd update the handbook page with a link to http://drupal.org/files/issues/mysqlfix.php_.txt, and we can update it when the script is available on cvs.

Thanks for testing. I've updated the manual page with the above link, and also changed the surrounding/related text a bit (just updated in-text references, nothing really new).

If you can put the script into sandbox, somewhere, it would be really nice. I believe that's where it should live, but I seem unable to get a cvs account myself, without going to maintain modules really. (As a sidenote, I'm also looking for this sort of help at http://drupal.org/node/196928).

Here is a direct link to the file (download starts automatically):
http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/wmostrey/my...

StatusFileSize
new249 bytes

Thanks a lot, now going to update the manual page to final version. I'm going to link to http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/wmostrey/my... though, to avoid hardcoding the file revision number (just in case any changes occur in future).

I also noticed that I did a mistake in the previous file: There was no $Id$ tag, so CVS can't record revision number into the file. Attached patch fixes that, if you would be so nice to commit it (just to make things clean).

Thanks again.

OK this has been updated. Thank you.

Thanks from me, too :)

I assume we're done here, then? Going to mark this fixed? Acert93?

Just a quick update.

• I gave the script a whirl on a site that was moved from a 4.0 server to another 4.1 server. This site had new content added... unfortunately I ran into the same issue of stripped tags and whatnot. The script is great--autodetects tables, ensures you are logged in as your base user, and for content not added on the 4.1 server it treats properly.

If either of you is interested in seeing the 'end result' of using the script on a DB migrated from 4.0 to 4.1 and then content added, I can email you a link to download the database. I think the lesson here is: If your database was incorrectly migrated, STOP! Any content you add to your website after migrating to MySQL 4.1+ (from 4.0) will be lost. Correctly migrate your database ASAP! A hard lesson to learn, hopefully this documentation will allow others to quickly identify the signs of a bad migration and fix it sooner than I did!

• I have read through all the (excellent!) documentation you both have written. I am amazed at how non-native speakers have such an excellent command of the English language. There are a couple rough spots I will offer some small corrections, but I must note that the end result is great. It is even formatted well and flows great.

• I am testing some of the various methods. It appears my databases on the 4.1 servers are "unhealthy" and cannot be fixed, so I am going to go back to the 4.0 server and test out Method #2 as well as Method #5. Just a note to those reading this: PHPMyAdmin has some pretty strict session limits and database sizes, so if your site is large (I cannot remember if it is 2MB or 10MB for PHPMyAdmin) you will want to skip forward to using MySQLDumper. Important! Note the final paragraph in #5 where Jirka outlines the small hack you will need to employ to get the correct connection encoding.

Sorry I didn't respond sooner. In my efforts to work on this I allowed some other stuff to get backed up that required my attention. I know many in the Drupal community will appreciate the excellent work you both contributed to the community. Per Wim's request, the bounty will be contributed to the Drupal Association.

As I run through the various methods I will leave feedback in regards to the documentation and any unexpected results (or questions) a novice user, like myself, may have. So far, I must say, it is excellently documented and explained in exacting detail. Great work guys!

Joshua

Small side-note: I got bitten by PhpMyAdmin limits too, previously. The nasty lesson here is, that when you first test your back-up workflow with PhpMyAdmin, on the tiny database of a newly installed site, all works well. But as your site grows bigger, PhpMyAdmin starts to fail silently, producing incomplete (!) yet sensibly looking (!!!) dumps, keeping you under impression that all is well backed-up (if you just check that the file is some megabytes long, no error message came, and *some* pile of Drupal tables is really inside). Until you need to really restore from these back-ups: I learned this lesson, fortunately, on nothing worse than moving to localhost for testing (so no harm done, really), but it was a nasty surprise to see that all my weekly-backups are totally unusable half a year back! I think it's php time-limit what kills PhpMyAdmin (no batch processing in there, I wonder why), so the actual size-limit depends on the server config (speed) a bit randomly.

This is out of scope for the documentation here, although I now feel like adding a tiny sidenote in there too, on next edit. And of course - any improvements and English fixes are welcome ;)

Hi Jirka,

thanks for this piece of hard work, I've tried to work around the manual update procedures of Drupal databases for some time and now jumped into the cold water with your script on five Drupal sites.

According to my first experiences after going live again, four of the sites survived (mostly) fine, and one became somehow damaged. I've not yet done in-depth-testing, but the results of the conversion appear pretty good so far (no comparison to the horrors I experienced when migrating from Debian/Woody to Debian/Sarge a few months ago which included the ugly MySQL transition and switch to UTF8).

Some notes:

* Before, some of the sites were not able to process non-latin characters (like Greek, Russian, or Arabic); this seems to work now.

* My largest and oldest site usually tends to be the most problematic one; with some 22k nodes and over two years of OS and Drupal updates, there are lots of old stuff which cause incompatibilities, and very often the pure size limits lots of batch operations (e.g. with pathauto 1.x); on this site, the script worked fine, also.

* Running mysqlfix.php outputs *lots* of warnings/error messages, on *all* five sites; I have no idea if this indicates any danger (sorry, I wasn't fast enogh to copy any of the messages, before the screen was refreshed)

* On the large site, at least one block's content was erased partially; it's content was cut off after an Umlaut (input format: Full HTML). Also, content from the "Amazontools" module was garbled ("Färöer" becomes "F�r�er").

* The conversion seems to cause *very* heavy server load *after* finishing; "mytop" displays *lots* of database activity, and it consumes currently all CPU resources oy my server for hours (100% CPU load on a Dual-Core AMD Opteron Processor 1212 HE); "mytop" displays *lots* of threads like:

Query SELECT t.word AS realword, i.word FROM search_total t LEFT JOIN search_index i ON t.word...
Query DELETE FROM search_index WHERE fromsid = 898 AND fromtype = 'node'
Query DELETE FROM search_total WHERE word = '12j?hrige'

I hope that this will go away sometime soon, but if you have to migrate several sites, I'd recommend to do this one site a day, or so.

* On the damaged site, *all* content after Umlauts and/or non-ASCII characters was cut off, e.g.

** in several blocks,
** in all taxonomy terms,
** in the menu,
** in all forum's titles, and descriptions,
** in the display of titles in the "book module",
** in the display of normal node's titles (e.g. in ./tracker), and
** even the site's title (which also included a special character).

Not damaged was (at least as it appears at the moment):

** the contents of the nodes itself,
** the node's titles itself,
** most of the rest of the site, if not mentioned before.

I'm not sure how to proceed with this site; as it appears, the messed stuff can be repaired in a few hours, and the site *does* now process non-latin characters (like Greece); however, would you recommend to fall back to the backup? Any suggestions are welcome!

Currenty, I can't perform any more tests since the sites aren't responding properly due to heavy CPU load; if the site starts to respond again, I'll do some further testing.

Regards, -asb

It would be really interesting to know something about your broken site - note that (as said in the documentation) the script doesn't solve everything what's possible to happen, it's only just for one of the possible cases. What's the collation on table columns in that database, how it all happened (i.e. imported from a file, or only just upgraded; any attempts to save content/do settings before database fix, and the like)? Having content cut on umlauts smells to me like having a *really* Latin1 content converted with the script, which is *not* what we want. Certainly, figuring out which of the described cases applies to you is a bit difficult part, still. Perhaps there were some items saved to database before the fix? Also big CPU load might mean some MySQL encoding conversion still going on.

> It would be really interesting to know something about your broken site

I'll try to provide whatever you need (and I'm able to figure out).

Overall characteristics: All five sites are hosted on the same server; besides five separate Drupal installations (no shared codebase), there are two other major apps running (MediaWiki, Gallery2); the server is a Dual-Core AMD Opteron Processor 1212 HE with 2 GB of RAM. Operating system is Debian GNU/Linux 3.1 "Etch", all services are directly from the Debian repositories (MySQL 5.0.32, PHP 5.2.0-8+etch7Apache 2.2.3), except the web applications (Drupal 5.5, MW, G2), which are installed from tarballs.

> What's the collation on table columns in that database, how it all happened (i.e. imported from a file, or only just upgraded;
> any attempts to save content/do settings before database fix, and the like)?

The Drupal database in the damaged site consists of 108 tables, according to PhpMyAdmin; now, all have the collation "utf8_general_ci", type: "MyISAM"; it's similar with the large site, it has 250 tables which now all have "utf8_general_ci". All sites are live data which I backed up before running yor migration script (one never knows... :-)

Yes, I made attempts to fix the encoding issue before, and there seem to have been also conversions done during the Debian upgrade precedure (both several months ago); I never fully figured out what they did and what they didn't do (e.g., Debian made a full switch to UTF8 during the upgrade from Sarge to Etch, but it doesn't convert any text documents like manual pages, because the conversion script wasn't ready when releasing Etch). This definitely applies for the large site and most probably also for the damaged one, which both were set up on Sarge and thus with MySQL 4.x. The three other sites are younger an might have been installed on "Etch" (thus from the beginning MySQL 5.x).

I scribbled down some experiences on drupal.org (http://drupal.org/node/135486) and on my site (http://www.kefk.org/portal/asb/upgrade.von.sarge.auf.etch.erfahrungsberi..., in German). When I researched about this, I stumbled over http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_lati... and stopped working on this topic, which I discovered to be beyond by understanding. The migration from Sarge to Etch was, in my case, accompanied with a switch of the hardware; the migration was done with help of the "Site backup script" (http://drupal.org/node/59370).

> Perhaps there were some items saved to database before the fix?

Sorry, I'm not sure waht you mean by this.

> Also big CPU load might mean some MySQL encoding conversion still going on.

The 100% load on both CPU core remained for ~12 hrs; recently, I stopped and restarted mysqld, currently the load appears normal.

An indication that there is a "lag" of encoding conversions after finishing the script is that on the large site menu entries start to become cut off, as in the one (much smaller) damaged site; however, so far, taxonomy in not (yet) affected.

While I'm writing this, the restarted mysqld started spreading processes again an reached a pretty heavy CPU load (100%/98%) again, but still is accessible; in mytop, I tried to inspect these processes, e.g.:

SELECT t.word AS realword, i.word FROM search_total t LEFT JOIN search_index i ON t.word = i.word WHERE i.word IS NULL

I don't know much about SQl, but this doesn't look like an encoding conversion to me...?

Regarding the garbeling of the contents of the tables from "Amazontools module" (which is broken in many ways an unmaintained for months): The module pulls in data from Amazon's REST API (or whatever this is called); I have absolutely no idea in which encoding this data arrives or is processed; noteworthy here is that (a) while the node's titles are garbled, (b) the node's content seems not to be damaged. It remains to be seen, *if* conversion processes are still running, *if* this will change.

However, this seem to be two completely different things: (a) content cut off after Umlauts ("Ernährung" becomes "Ern"), and (b) characters garbled ("Wörterbuch" becomes "w�rterbuch").

Greetings & regards, -asb

A small followup: I have used Method #2 (mysqldump from 4.0 server, create and modify DB on 4.1 server, populate DB) to good success. Data isn't garbled and the collation is correct. I have done this with 3 sites, one over 90MB. The one issue I did run into was when populating the database I had some "duplicate entry" errors in the index_total table (which halted the DB repopulation), specifically words with special characters like sauté and jalapeño, so I had to manually remove these entries from the database. I start my vacation this weekend so I will have more time to go through the (excellent!) instructions.

I had duplicates too, but that's entirely different isue (hopefully fixed by a 6.x patch now). I was only able to repopulate my DB through MySQL Dumper when dumped via that one too.

asb: Sorry for no response from me yet, but I can't figure anything useful from the info provided, in my currently limited time. In fact the point (one of...) in this issue is, that way too many things may happen, so we attempted to document it so that the user may figure things himself, having full access to his system. I hope to see more sense in that later, in less hurry.

Hi Jirka,

I know that I'm pretty much on my own with this ;)

Currently, the excessive stress for over 36 hrs. seems to have killed mysqld; it refuses to restart, and one process that is running for over 10 hrs can't be killed with a SIGTERM (had to send a SIGKILL). Very spooky.

Dec 20 13:25:00 <host> mysqld[16080]: Per second averages calculated from the last 16 seconds
Dec 20 13:25:00 <host> mysqld[16080]: ----------
Dec 20 13:25:00 <host> mysqld[16080]: SEMAPHORES
Dec 20 13:25:00 <host> mysqld[16080]: ----------
Dec 20 13:25:00 <host> mysqld[16080]: OS WAIT ARRAY INFO: reservation count 183614, signal count 148986
Dec 20 13:58:52 <host> mysqld[16080]: --Thread 2898451376 has waited at btr0cur.c line 366 for 2191.00 seconds the semaphore:
Dec 20 13:58:52 <host> mysqld[16080]: S-lock on RW-latch at 0xad5d574c created in file dict0dict.c line 3689
Dec 20 13:58:52 <host> mysqld[16080]: a writer (thread id 2968525744) has reserved it in mode  exclusive
Dec 20 14:10:41 <host> mysqld[16080]: number of readers 0, waiters flag 1
Dec 20 14:10:41 <host> mysqld[16080]: Last time read locked in file btr0cur.c line 366

As it seems, I'll have to restart mysqld on a regual basis, until this is over.

BTW: The three so-far-unaffected smaller Drupal sites continue to remain "clean" and unharmed.

Greetings, -asb

asb: Keep in mind that I know so little about this, I shouldn't even say anything. However, this:

SELECT t.word AS realword, i.word FROM search_total t LEFT JOIN search_index i ON t.word = i.word WHERE i.word IS NULL

comes from search module, as far as I know:

// Find words that were deleted from search_index, but are still in
// search_total. We use a LEFT JOIN between the two tables and keep only the
// rows which fail to join.
$result = db_query("SELECT t.word AS realword, i.word FROM {search_total} t LEFT JOIN {search_index} i ON t.word = i.word WHERE i.word IS NULL");

The comment here suggests to me that this is caused by reindexing your content -- perhaps you'd be better off starting your reindexing from scratch as opposed to letting it rebuild? Have you tried disabling the search module to see if this traffic goes away?

Also #52 mentions error messages were present on the convertion process, which were not recorded for our analysis here. I would love to see these, before trying to make some statements here. Perhaps some tables failed to be converted? (Might be visible through PhpMyAdmin as collation on the tables/columns being not utf8_general_ci somewhere)

And then again, acert93 in #55 had yet another problem with search tables having duplicates (which I have too on my production site, strictly taken, but MySQL Dumper seems to overcome that somehow). I saw in the issue queue, that the search indexing was changed entirely not so long ago (6.x), and the duplicates were mentioned in there too. If I remember correctly, it was said to be results of broken cron runs in the past (possibly violating MySQL rules for some table keys or the like - can't find the link now).

As far as I know, you can safely Truncate (empty) the search indexing tables (three of them - do I remember correctly?), and then force your site to reindex after you're done with fixes. That will take a while to complete, though, depending of size of the site. Temporarily disabling search module is a good test, indeed.

Hi Keith,

> "SELECT t.word AS realword [...] comes from search module [...]
> The comment here suggests to me that this is caused by reindexing your content

It's very well possible, even if I don't understand, why this could cause problems on _two_ sites while leaving _three_ other sites completely unaffected.

However, I followed your suggestion:

* Disabling search module on the two weird sites: CPU load decreases significantly (one core at 100%, the other betwen ~20 and 90%, no need to restart mysqld every few hours anymore - good).

* Starting a full reindex on site A, finished after a few days; CPU load is at it's peak again, continously (didn't try site B which is large, and reindexing will take weeks) - not good.

Since this shouldn't be related to encoding conversions anymore, I filed an issue (http://drupal.org/node/203806); so far, one reply:

> I can't really suggest anything just note that Drupal 6 solves this.

Whatever this means, there might be issues with the search module (as http://drupal.org/node/134670 suggests) that might be independent of the migrations script. Or maybe I should have truncated the tables "search_index", "search_dataset", and "search_total" in PhpMyAdmin instead of just reindexing (via Drupal)?

Regards, -asb

What if you try the SQL command in #11 of this issue: http://drupal.org/node/143160 (after reading through the issue to see if it sounds related to your problem)?

> What if you try the SQL command in #11 of this issue: http://drupal.org/node/143160 (after reading through the issue to see
> if it sounds related to your problem)?

I ran the SQL statement

mysql> select word, sid, type, count(*) from search_index where fromsid=0 and sid<10000 group by word, sid, type having count(*) > 1;

on all my five Drupal sites, and all five *did* return rows, so basically the bug "search_index has duplicate (sid, word, type, fromsid=0) entries" appears to exist. However, I'm not sure if *this* bug is causing my problems with the search index (Queries from search module are running for hours, http://drupal.org/node/203806); at least the queries in question already start popping up again. I'll run a backup and then truncate (empty) the search tables in PhpMyAdmin, I think.

Regards, -asb

subscribing.

I just wrote a little module for Chinese and Japanese users to recover garbled text after mysql migration.

Only now do I find this issue.

Status:Active» Fixed

The documentation is there. Further support requests should go in the forum, chat or other support areas.

Status:Fixed» Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

Try this to recover UTF-8 strings stored in non UTF-8 columns:
SELECT CONVERT(BINARY `broken_column_name` USING utf8) FROM `table_name`

Just wanted to give people another scripting option to convert their tables from latin to utf8.