By jochenh on
HI,
I am not quite sure how to explain this one but somehow the body text and in some cases the titles of my postings are seeing some strange punctuation. I believe this might have started to happen after upgrading from 4.69 to 4.74 but I am not 100% sure. Basically a title that used to be INDEX: LifeStraw - cleans dirty water anywhere is now suddenly INDEX: LifeStraw‚Ñ¢ cleans dirty water anywhere.
I checked out the database tables and the encoding for the node_revisions table is the same for both 4.69 and 4.74 so I really have no clue how this changed...
If anybody has any advice on what could be causing this I would appreciate it
thanks
- jochen
Comments
I believe this might have
Did you do this update recently? Did you run /update.php after replacing the files?
What encoding is that and what is the collation?
Also,
- Did your MySQL version change lately (form 4.0 or earlier to 4.1 or newer)?
- Did you do any backup/restore lately?
I am having the same problem
I am having the same problem - on my upgrade test runs.
No change in MySQL version (4.1 before and after), but yes I did backup / restore. Backed up original DB, created a test DB using restore and got this problem after running update on it.
Any suggestions ???
These appear to be Unicode
These appear to be Unicode multi-byte characters displayed as multiple non-Unicode single-byte characters. It could happen (for example)
- when editing your database dump with a text editor not capable of Unicode
- when moving data between databases with different default character set/collation settings
- when using different character set/collation settings for dumping and for loading the backup
If it not the first case above, check your database tables to see whether the character set is utf8 and the collation is utf8_general_ci.
Very strange...
Looked at my 4.6.9 database, the encoding was latin1 / latin1_swedish_ci (how did THAT happen?). When I run SELECT * FROM node I do see node titles encoded strangely.
Once conversion to 4.7.3 is complete encoding is utf8 / utf8_general_ci as you say is needed. Data looks the same on SELECT. I used backup / restore through webmin before running update.
Interestingly after changing collation on my 4.7.3 tables (using MySQL Admin) there are no changes to how Drupal displays data. I changed node and node_revision table (back to latin1) and still have weird chars in "Recent blog posts" block. Questions:
1) How should I change the backup procedure on final conversion ?
2) Can I do anything to fix my test conversion / upgrade DB?
Stranger still...
OK, I tried another conversion using mysqldump, as suggested here: http://drupal.org/node/108052
The problem is the same. Looks like the update.php converts latin1 into utf8 without fixing the special characters. Can I fix (convert?) the encoding before running update.php script?
Do I have UTF-8 text stored in latin1 tables?
To be sure I understand, when I created my Drupal 4.6 DB on latin1 database, it got populated (by Drupal 4.6 operation) with text that is UTF encoded and this is source of the problem during upgrade?
I found a solution for converting the data (http://drupal.org/node/105151), but looks like it is being debated whether it is a good one. Who is right? Anyone can help?
Drupal stores text in UTF-8
Drupal stores text in UTF-8 even if the database says otherwise, so that MySQL 4.0 or earlier without Unocode support can be used. Problems are often caused in Fantastico installations. The latin1_swedish_ci collation is the default collation for the latin1 character set, and it is wrong. The collation has to do with character ordering and matching, so it affects searches and comparisons.
The trick in http://drupal.org/node/105151 for exporting the dump as if it was latin1 and then treating it as utf8 is often used for taking dumps from old MySQL versions. The disagreements ("didn't work") usually come from using a method for a "similar" but different problem. It wouldn't harm to test it by loading this dump in a new testing utf8 database.
Also try the script in http://drupal.org/node/107720, comment #4 (on a testing database) which applies "ALTER TABLE ... CHARACTER SET utf8" etc to all tables and columns in your database.
My knowledge of this is rather empirical, so I can not make an exact diagnosis.
Bad Unicode Scanner
I have had this problem as well, after my 4.6 -> 4.7 conversion. It looks to me like the conversion from unicode stored in latin happened twice. After reading this article: http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_lati... I decided to write an after-the-fact converter. This is complicated by the fact that either my version of php or my editor can't handle the charcters in question. But my browser and phpmyadmin can, so I ran SELECT HEX('–') on all the bad triples, to get the hex values. I put this is a helper module, but it could be run from a page. Also, I just stripped out the unicode, but it shouldn't be too hard to turn this into a bad unicode to good unicode converter.
didn't work for me
this code just doesn't seem to work for me.
not sure why.
i've tried it as a page and as a block. it doesn't return "Done with Scan" any idea what might be wrong?
php DOES work, just this script doesn't work...
trouble more or less fixed
i'm not sure where the above code is supposed to go, but it was too complex for me and never worked.
http://drupal.org/node/89210
has a workable solution that can be pasted into a page.