HI,

I am not quite sure how to explain this one but somehow the body text and in some cases the titles of my postings are seeing some strange punctuation. I believe this might have started to happen after upgrading from 4.69 to 4.74 but I am not 100% sure. Basically a title that used to be INDEX: LifeStraw - cleans dirty water anywhere is now suddenly INDEX: LifeStraw‚Ñ¢ cleans dirty water anywhere.

I checked out the database tables and the encoding for the node_revisions table is the same for both 4.69 and 4.74 so I really have no clue how this changed...

If anybody has any advice on what could be causing this I would appreciate it

thanks
- jochen

Comments

cog.rusty’s picture

I believe this might have started to happen after upgrading from 4.69 to 4.74 but I am not 100% sure.

Did you do this update recently? Did you run /update.php after replacing the files?

I checked out the database tables and the encoding for the node_revisions table is the same for both 4.69 and 4.74

What encoding is that and what is the collation?

Also,
- Did your MySQL version change lately (form 4.0 or earlier to 4.1 or newer)?
- Did you do any backup/restore lately?

dkruglyak’s picture

I am having the same problem - on my upgrade test runs.

No change in MySQL version (4.1 before and after), but yes I did backup / restore. Backed up original DB, created a test DB using restore and got this problem after running update on it.

Any suggestions ???

cog.rusty’s picture

These appear to be Unicode multi-byte characters displayed as multiple non-Unicode single-byte characters. It could happen (for example)
- when editing your database dump with a text editor not capable of Unicode
- when moving data between databases with different default character set/collation settings
- when using different character set/collation settings for dumping and for loading the backup

If it not the first case above, check your database tables to see whether the character set is utf8 and the collation is utf8_general_ci.

dkruglyak’s picture

Looked at my 4.6.9 database, the encoding was latin1 / latin1_swedish_ci (how did THAT happen?). When I run SELECT * FROM node I do see node titles encoded strangely.

Once conversion to 4.7.3 is complete encoding is utf8 / utf8_general_ci as you say is needed. Data looks the same on SELECT. I used backup / restore through webmin before running update.

Interestingly after changing collation on my 4.7.3 tables (using MySQL Admin) there are no changes to how Drupal displays data. I changed node and node_revision table (back to latin1) and still have weird chars in "Recent blog posts" block. Questions:

1) How should I change the backup procedure on final conversion ?

2) Can I do anything to fix my test conversion / upgrade DB?

dkruglyak’s picture

OK, I tried another conversion using mysqldump, as suggested here: http://drupal.org/node/108052

mysqldump --user myusername --password=mypassword opt nameofmydatabase > sql.dump
mysql -u myusername nameofmydatabase -p < path/to/database/dump

The problem is the same. Looks like the update.php converts latin1 into utf8 without fixing the special characters. Can I fix (convert?) the encoding before running update.php script?

dkruglyak’s picture

To be sure I understand, when I created my Drupal 4.6 DB on latin1 database, it got populated (by Drupal 4.6 operation) with text that is UTF encoded and this is source of the problem during upgrade?

I found a solution for converting the data (http://drupal.org/node/105151), but looks like it is being debated whether it is a good one. Who is right? Anyone can help?

cog.rusty’s picture

Drupal stores text in UTF-8 even if the database says otherwise, so that MySQL 4.0 or earlier without Unocode support can be used. Problems are often caused in Fantastico installations. The latin1_swedish_ci collation is the default collation for the latin1 character set, and it is wrong. The collation has to do with character ordering and matching, so it affects searches and comparisons.

The trick in http://drupal.org/node/105151 for exporting the dump as if it was latin1 and then treating it as utf8 is often used for taking dumps from old MySQL versions. The disagreements ("didn't work") usually come from using a method for a "similar" but different problem. It wouldn't harm to test it by loading this dump in a new testing utf8 database.

Also try the script in http://drupal.org/node/107720, comment #4 (on a testing database) which applies "ALTER TABLE ... CHARACTER SET utf8" etc to all tables and columns in your database.

My knowledge of this is rather empirical, so I can not make an exact diagnosis.

starbow’s picture

I have had this problem as well, after my 4.6 -> 4.7 conversion. It looks to me like the conversion from unicode stored in latin happened twice. After reading this article: http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_lati... I decided to write an after-the-fact converter. This is complicated by the fact that either my version of php or my editor can't handle the charcters in question. But my browser and phpmyadmin can, so I ran SELECT HEX('–') on all the bad triples, to get the hex values. I put this is a helper module, but it could be run from a page. Also, I just stripped out the unicode, but it shouldn't be too hard to turn this into a bad unicode to good unicode converter.

function bad_unicode_triples_scan() {
  $bad_uni_triples_map = array ( 
    'C3A2E282ACE2809C' => "\'", // ` -> '
    'C3A2E282ACE2809D' => "\'", // ' -> '
    'C3A2E282ACE284A2' => "\'", // ’ -> '
    'C3A2E282ACCB9C' => "\'",   // ‘ -> '
    'C3A2E282ACC593' => "\"",   // “ -> "
    'C3A2E282ACC29D' => "\"",   // ” -> "
    'C382C2BB' => '>>',         // » -> >>
    'C3A2E282ACC2A6' => '...',   // … -> ...
    'C3A2E282ACC2A2' => '*',   // • -> *
  );

  $tables_map = array ();
  $tables_map[] = array( 'table' => 'node', 'field' => 'title' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'title' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'body' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'teaser' ); 
  $tables_map[] = array( 'table' => 'flexinode_data', 'field' => 'textual_data' ); 

  foreach( $bad_uni_triples_map as $badhex => $good ) {
    foreach( $tables_map as $map ) {
      $table = $map['table'];
      $field = $map['field'];
      $sql = "UPDATE {$table} SET $field=REPLACE($field, UNHEX('$badhex'), '$good') WHERE $field REGEXP UNHEX('$badhex')";
//      drupal_set_message( $sql );
      db_query( $sql );
    }     
  }
  return "Done with scan";
}
kvarnelis@drupal.org’s picture

this code just doesn't seem to work for me.

not sure why.

i've tried it as a page and as a block. it doesn't return "Done with Scan" any idea what might be wrong?

php DOES work, just this script doesn't work...

kvarnelis@drupal.org’s picture

i'm not sure where the above code is supposed to go, but it was too complex for me and never worked.

http://drupal.org/node/89210

has a workable solution that can be pasted into a page.