Hi,

I'm not exactly sure how to phrase this appropriately: Either "Debian GNU/Linux Etch corrupts Drupal 5.1", or "Drupal 5.1 is potentially incompatible with Etch", at least for non-English websites under some circumstances; I'm trying to find out, what these exact circumstances might be, and how to fix the failures caused by the upgrade.

_The Problem_

After upgrading Sarge to Etch, Debian includes PHP 5, MySQL 5, and Apache 2.2. With this upgrade, some 3rd party modules became disfunctional in Drupal 5.1, and all special characters, all German Umlaute, and all texts with non-English characters broke/got corrupted/were displayed wrongly. This applies _only_ to Drupal sites, other pages (static via Apache, or LAMP systems like MediaWiki or Gallery2) are not affected. Also, Drupal lost some configuration information due to the upgrade (e.g. site name, site description) which could be entered afterwards again. As it seems, Drupal is accepting Umlaute and special characters again (e.g. in the site name; currently I can't test it on nodes due to failures of 3rd party modules when previewing). Because of this behaviour, I suspect a problem with the transition/conversion from MySQL4 to MySQL5 which only affects _existing_ data (in my case, some 15.000 nodes).

Example 1 with some umlauts and typographic characters (after the corruption):

Cf. Bassam Tibi: „Im sunnitischen Islam gilt ‚kutman/Verheimlichung der eigenen Position’ als Sünde“, cf. Tibi 2005: 365 f. Nach Ansicht des Islamwissenschaftlers Hans Peter Raddatz hätten die Sunniten die Praxis der Taqiyya übernommen, „um durch Glaubensverleugnung bzw. –verwässerung den Eindruck der Liberalisierung und Verwestlichung zu erzeugen“; unter dem „Schutzschirm dieser Verschleierungstaktik“ habe sich die „Strategie der schleichenden Islamisierung“ ungestört verfolgen lassen; Hans-Peter Raddatz: Von Allah zum Terror. Der Djihad und die Deformierung des Westens. München 2002: 179.

That should read:

Cf. Bassam Tibi: „Im sunnitischen Islam gilt ‚kutman/Verheimlichung der eigenen Position’ als Sünde“, cf. Tibi 2005: 365 f. Nach Ansicht des Islamwissenschaftlers Hans Peter Raddatz hätten die Sunniten die Praxis der Taqiyya übernommen, „um durch Glaubensverleugnung bzw. –verwässerung den Eindruck der Liberalisierung und Verwestlichung zu erzeugen“; unter dem „Schutzschirm dieser Verschleierungstaktik“ habe sich die „Strategie der schleichenden Islamisierung“ ungestört verfolgen lassen; Hans-Peter Raddatz: Von Allah zum Terror. Der Djihad und die Deformierung des Westens. München 2002: 179.

Example 2 with some Arabic and German text (also after the corruption):

Îhâm (ايهام, „Täuschung gegenüber Ungläubigen“). Gestützt werden sollen diese Taktiken durch Sure 3,54 („Und sie schmiedeten Ränke. Aber (auch) Gott schmiedete Ränke. Er kann es am besten“, Paret; „Und sie schmiedeten eine List, und Allah schmiedete eine List; und Allah ist der beste Listenschmied“, Rassoul).

That should read:

Îhâm (ايهام, „Täuschung gegenüber Ungläubigen“). Gestützt werden sollen diese Taktiken durch Sure 3,54 („Und sie schmiedeten Ränke. Aber (auch) Gott schmiedete Ränke. Er kann es am besten“, Paret; „Und sie schmiedeten eine List, und Allah schmiedete eine List; und Allah ist der beste Listenschmied“, Rassoul).

That applies to (a) one fresh installation of Drupal 5 on one site, and (b) on one many-times upgraded Drupal 5 (coming from 4.x); thus, the problem is definitely _not_ derived from an old setup.

_Possible Causes_

I'm suspecting, that changes in MySQL 5 are causing the trouble. Basically, this is neither a merely Debian, nor a merely MySQL problem, since other applications on the same host are working fine; e.g., MediaWiki running on the same host is displaying/delivering umlaute, special characters, and non-English languages perfectly. So this seems to be a Drupal problem.

According to phpMyAdmin - 2.9.1.1-Debian-3, the databases are configured as follows (relevant sections extracted):

* Server Version: 5.0.32-Debian_7etch1-log
* MySQL-Zeichensatz: UTF-8 Unicode (utf8)
* character set client: utf8
* (Globaler Wert): latin1
* character set connection: utf8
* (Globaler Wert): latin1
* character set database: latin1
* character set filesystem: binary
* character set results: utf8
* (Globaler Wert): latin1
* character set server: latin1
* character set system: utf8
* collation connection: utf8_unicode_ci
* (Globaler Wert): latin1_swedish_ci
* collation database: latin1_swedish_ci
* collation server: latin1_swedish_ci

* Drupal database: Typ: "MyISAM"; Kollation: "latin1_swedish_ci" (all tables, except table "access", with collation "utf8_general_ci")
* MediaWiki database: Typ: mostly "InnoDB", some "MyISAM"; Kollation: "latin1_swedish_ci" (without exception)

Possibly the database server is somewhow misconfigured; this is a rootserver, and I had to configure it myself without being a skilled database admin. Maybe, other non-English hosts don't have this problem. However, since MediaWiki is running fine, not everything can be wrong.

With some Googling, I discoverd a description of similar problems with another LAMP system [1]; there are some hints ("Upgrading to utf-8 Before" with "iconv -f iso-8859-1 -t utf-8 iso-db.sql > utf8-db.sql"); and: "I had problems with collations because my sysadmin forgot to specify the default charset when Mysql was originally installed. If you have collation problems and your columns are set to swedish [...]": approach: "ALTER TABLE {table} CONVERT TO character set latin1 COLLATE latin1_bin;". This might solve the described problems, or might cause even worse ones.

Maybe someone with better knowledge of MySQL and Drupal can commt on this?

Thanks & Greetings, -asb

[1] http://meta.wikimedia.org/wiki/Help:Upgrading_MediaWiki

Comments

Standart’s picture

There's no connection to Debian. Since MySQL 4.1 the charset matters. You'll find a lot of posts about this if you search a bit. Basically, as I understood Drupal always used UTF8 but MySQL didn't care. Starting with MySQL 4.1 Drupal (or any other web app) has to specify the charset for the connection if different from default (often latin1 in MySQL). So what you have is UTF8 data in a database and a database management system that thinks it's latin1. You need to change the charset settings of your data like this: http://drupal.org/node/84214#comment-154188 or maybe in a smarter way. You'll find one by searching...

asbdpl’s picture

> Basically, as I understood Drupal always used UTF8 but MySQL didn't care. Starting with MySQL 4.1 Drupal (or any other web app)
> has to specify the charset for the connection if different from default (often latin1 in MySQL). So what you have is UTF8 data in a
> database and a database management system that thinks it's latin1.

That sounds possible, but I don't think that it applies to the problems caused by Etch. Your explanation does _not_ explain

* why segments of the site's configuration were lost during the upgrade from Sarge to Etch (e.g. site name and description), and
* why I can enter _now_ Umlaute, special characters, and non-English languange, while the old content is corrupted.

If Drupal since MySQL 4.1 is supposed to specify the charset for the connection, as you said, but it obviously does not, this would be a mjor flaw in Drupal.

Also, if Drupal does not specify the charset for the connection, as you implied, this would not change due to a upgrade of the operating system, since Drupal 5.1 isn't being upgraded by Etch. If I really had UTF8 data in the MySQL database and a database management system - are you referring to MySQL or Drupal? - that thinks it's latin1, I wouldn't be able now to enter non-English languange correctly: If those charsets are not specified, then every non-English content I enter _now_ should still be corrupted, like the old content - right.

Thus I'm pretty sure, that Etch has done something else; what additionally makes me suspect this are entys in /var/log/syslos occuring during the upgrade to Etch:

[...]
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.nodewords                                   OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.notify                                      OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.old_revisions                               OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.page_title                                  OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.path                                        OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.permission                                  OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.poll                                        OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.poll_choices                                OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.poll_votes                                  OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.profile_fields                              OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.profile_values                              OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.project_comments                            OK
Apr 11 00:05:03 h652544 /etc/mysql/debian-start[2200]: drupal.project_issue_state                         OK
[...]

I don't know yet what these entries do mean, but definitely Etch touched the corrupted tables during the upgrade.

Greeting, -asb

Standart’s picture

I also use Etch with various versions of Drupal (4.7.6 and 5.1).

DBMS (database management system) is MySQL opposed to the database that is basically some files.

Drupal tells MySQL >= 4.1 the collatin of the connection (UTF8) since Drupal 4.6.10 or so.

As you upgraded to MySQL >= 4.1 it _now_ understands Drupal's setting of the connection collation and thus you can use umlauts correctly _now_.

Some of your settings are screwed 'cause Drupal stores these settings in the DB -- mostly serialized -- so these are possibly also corrupted.

I did a conversion like it is described in the link above. I heard there's also an option to change the collation of each column by "converting" to a binary format first and to UTF8 afterwards. Thus you may prevent the data to be converted 'cause you already have UTF8-encoded data in your DB. Only your DBMS interprets it as latin1.

No idea about your logs. I don't think it means anything...

asbdpl’s picture

However I tried the steps described at http://textsnippets.com/posts/show/84 as far as possible. THESE STEPS ARE DAGEROUS!

* chgrep doesn't exist on Debian
* sed -i "" 's/latin1/utf8/g' dump.sql doesn't work ("sed: couldn`t read s/latin1/utf8/g : No such file or directory" respectively "sed: kann s/latin1/utf8/g nicht lesen: Datei oder Verzeichnis nich
t gefunden")
* sed -r 's/latin1/utf8/g' dump.sql > dump_utf.sql creates dumps that can't be imported ("ERROR 1062 (23000) at line 1481: Duplicate entry 'verständlich' for key 1")

Also, the _unmodified_ dump can't be imported: "ERROR 1062 (23000) at line 5244: Duplicate entry '???' for key 1"

Just great. Now the databaase is not only corrupted, but dropped and the backup can't be imported. Just geat :-(

ANYBODY TRYING THIS: MAKE SURE MYSQL IMPORTS THE DUMP FILE PROPERLY BEFORE DROPPING THE OLD DATABASE!!!

Regards, -asb

asbdpl’s picture

It's even worse than I thought. The comman sed -r 's/latin1/utf8/g' dump.sql > dump_utf.sql does not only corrupt the database dump even more, but does not work at all.

When I do a forced import with

mysql --user=username --password=password --default-character-set=utf8 --force dbname < dump_utf.sql

the database wasn't converted/repaired. All broken Umlaute, special caharacters, and broken non-English characters are still there.

I don't believe this :-(

Regards, -asb

d242’s picture

Has this been solved yet?

I have run into exactly the same problem as the initial poster (non-English site). I upgraded from Debian Sarge to Debian Etch, which in turn updated the MySQL database from 4.0 to 5.0.

I am running Drupal 4.7.4.

We have umlauts in usernames etc, which means some users cannot log into their accounts due to charset incompatibility.

I have tried dropping, importing, re-importing, converting etc to no avail.

I would appreciate any pointers to a working solution.

asbdpl’s picture

> I have run into a similar UTF-8 issue with Drupal 4.7 and Etch MySQL 5.0.
> Did you manage to solve the problem?

I worked around the problem partially with dumping, dropping, and recreating the database. Since the Drupal dumps seem to be messed up by 3rd party modules regularily, they usually can't be reimported properly; I tried this on several Drupal installations (some several years old, and others just set up a few weeks ago) and always ran into the same troubles: I had to force (--force) the reimport of the dumps which basically is not a good idead but some kind of last resort before losing your site(s) completely.

Result: On one site, I lost about 10,000 nodes completely which had non-standard node types (meaning: nodes, that were created and/or altered by 3rd party modules likes Amazontools, Bookreview, Moviereview, etc.); they are simply gone. Also, over the hole sites, I lost _all_ quotes in greek and arabic languages as well as most special typographical characters (ligaturen, n-dashes, m-dashes, etc.). German umlaute and some non-US-characters of the remaining 14,000+ nodes seem to be OK now.

Also I'd like to point out that - in my opinion - not Etch is the problem but the Unicode support in Drupal which imho sucks; Etch simply made me notice problems which simply didn't show up before. By default, the MySQL tables seem not be be handled properly by Drupal, and as we al know, it doesn't support non-US-characters in URLs at all (opposed to, fo example, MedaiWiki). In my MediaWIki setups, the upgrade from Sarge to Etch caused _no_ problems, even if the Unicode usage in MediaWiki is far more extensive (including _working_ interwiki links to >100 exotic languages, and any kind of weird characters in URLs).

I scribbled down some of my experiences at http://www.kefk.org/portal/asb/upgrade.von.sarge.auf.etch.erfahrungsberi... also, there you will also find references to other approaches which didn't work for me, but might work for other setups; namely "Turning MySQL data in latin1 to utf8 utf-8", http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_lati..., will explain, why all salvation attempts of average-skilled sysadmins with limited time and knowledge are doomed to fail, as long the Unicode support inside Drupal is not extended. However, the main problem are imho the 3rd party modules.

Greetings & regards, asb

d242’s picture

Thanks for your prompt reply, asb!

Luckily I am only using a few 3rd party modules. I managed to correct the issue with the help of the utf8upgrade module. In the end the solution was quite straightforward. The module can be found at http://aymanh.com/mysql-4-0-to-4-1-encoding-problem-and-solution

While searching for an answer I read the O'ReillyNet article you mentioned in your reply and bit my lip!

Good luck!