Errors with punctuation in node body and titles

By jochenh on 9 Jan 2007 at 18:38 UTC

HI,

I am not quite sure how to explain this one but somehow the body text and in some cases the titles of my postings are seeing some strange punctuation. I believe this might have started to happen after upgrading from 4.69 to 4.74 but I am not 100% sure. Basically a title that used to be INDEX: LifeStraw - cleans dirty water anywhere is now suddenly INDEX: LifeStraw‚Ñ¢ cleans dirty water anywhere.

I checked out the database tables and the encoding for the node_revisions table is the same for both 4.69 and 4.74 so I really have no clue how this changed...

If anybody has any advice on what could be causing this I would appreciate it

thanks
- jochen

Comments

I believe this might have

cog.rusty commented 9 January 2007 at 19:53

I believe this might have started to happen after upgrading from 4.69 to 4.74 but I am not 100% sure.

Did you do this update recently? Did you run /update.php after replacing the files?

I checked out the database tables and the encoding for the node_revisions table is the same for both 4.69 and 4.74

What encoding is that and what is the collation?

Also,
- Did your MySQL version change lately (form 4.0 or earlier to 4.1 or newer)?
- Did you do any backup/restore lately?

I am having the same problem

dkruglyak commented 9 January 2007 at 22:39

I am having the same problem - on my upgrade test runs.

No change in MySQL version (4.1 before and after), but yes I did backup / restore. Backed up original DB, created a test DB using restore and got this problem after running update on it.

Any suggestions ???

These appear to be Unicode

cog.rusty commented 9 January 2007 at 23:33

These appear to be Unicode multi-byte characters displayed as multiple non-Unicode single-byte characters. It could happen (for example)
- when editing your database dump with a text editor not capable of Unicode
- when moving data between databases with different default character set/collation settings
- when using different character set/collation settings for dumping and for loading the backup

If it not the first case above, check your database tables to see whether the character set is utf8 and the collation is utf8_general_ci.

Very strange...

dkruglyak commented 10 January 2007 at 07:15

Looked at my 4.6.9 database, the encoding was latin1 / latin1_swedish_ci (how did THAT happen?). When I run SELECT * FROM node I do see node titles encoded strangely.

Once conversion to 4.7.3 is complete encoding is utf8 / utf8_general_ci as you say is needed. Data looks the same on SELECT. I used backup / restore through webmin before running update.

Interestingly after changing collation on my 4.7.3 tables (using MySQL Admin) there are no changes to how Drupal displays data. I changed node and node_revision table (back to latin1) and still have weird chars in "Recent blog posts" block. Questions:

1) How should I change the backup procedure on final conversion ?

2) Can I do anything to fix my test conversion / upgrade DB?

Stranger still...

dkruglyak commented 10 January 2007 at 08:50

OK, I tried another conversion using mysqldump, as suggested here: http://drupal.org/node/108052

mysqldump --user myusername --password=mypassword opt nameofmydatabase > sql.dump
mysql -u myusername nameofmydatabase -p < path/to/database/dump

The problem is the same. Looks like the update.php converts latin1 into utf8 without fixing the special characters. Can I fix (convert?) the encoding before running update.php script?

Do I have UTF-8 text stored in latin1 tables?

dkruglyak commented 10 January 2007 at 09:58

To be sure I understand, when I created my Drupal 4.6 DB on latin1 database, it got populated (by Drupal 4.6 operation) with text that is UTF encoded and this is source of the problem during upgrade?

I found a solution for converting the data (http://drupal.org/node/105151), but looks like it is being debated whether it is a good one. Who is right? Anyone can help?

Drupal stores text in UTF-8

cog.rusty commented 10 January 2007 at 15:19

Drupal stores text in UTF-8 even if the database says otherwise, so that MySQL 4.0 or earlier without Unocode support can be used. Problems are often caused in Fantastico installations. The latin1_swedish_ci collation is the default collation for the latin1 character set, and it is wrong. The collation has to do with character ordering and matching, so it affects searches and comparisons.

The trick in http://drupal.org/node/105151 for exporting the dump as if it was latin1 and then treating it as utf8 is often used for taking dumps from old MySQL versions. The disagreements ("didn't work") usually come from using a method for a "similar" but different problem. It wouldn't harm to test it by loading this dump in a new testing utf8 database.

Also try the script in http://drupal.org/node/107720, comment #4 (on a testing database) which applies "ALTER TABLE ... CHARACTER SET utf8" etc to all tables and columns in your database.

My knowledge of this is rather empirical, so I can not make an exact diagnosis.

Bad Unicode Scanner

starbow commented 25 January 2007 at 18:34

I have had this problem as well, after my 4.6 -> 4.7 conversion. It looks to me like the conversion from unicode stored in latin happened twice. After reading this article: http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_lati... I decided to write an after-the-fact converter. This is complicated by the fact that either my version of php or my editor can't handle the charcters in question. But my browser and phpmyadmin can, so I ran SELECT HEX('â€“') on all the bad triples, to get the hex values. I put this is a helper module, but it could be run from a page. Also, I just stripped out the unicode, but it shouldn't be too hard to turn this into a bad unicode to good unicode converter.

function bad_unicode_triples_scan() {
  $bad_uni_triples_map = array ( 
    'C3A2E282ACE2809C' => "\'", // ` -> '
    'C3A2E282ACE2809D' => "\'", // ' -> '
    'C3A2E282ACE284A2' => "\'", // ’ -> '
    'C3A2E282ACCB9C' => "\'",   // ‘ -> '
    'C3A2E282ACC593' => "\"",   // “ -> "
    'C3A2E282ACC29D' => "\"",   // ” -> "
    'C382C2BB' => '>>',         // » -> >>
    'C3A2E282ACC2A6' => '...',   // … -> ...
    'C3A2E282ACC2A2' => '*',   // • -> *
  );

  $tables_map = array ();
  $tables_map[] = array( 'table' => 'node', 'field' => 'title' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'title' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'body' ); 
  $tables_map[] = array( 'table' => 'node_revisions', 'field' => 'teaser' ); 
  $tables_map[] = array( 'table' => 'flexinode_data', 'field' => 'textual_data' ); 

  foreach( $bad_uni_triples_map as $badhex => $good ) {
    foreach( $tables_map as $map ) {
      $table = $map['table'];
      $field = $map['field'];
      $sql = "UPDATE {$table} SET $field=REPLACE($field, UNHEX('$badhex'), '$good') WHERE $field REGEXP UNHEX('$badhex')";
//      drupal_set_message( $sql );
      db_query( $sql );
    }     
  }
  return "Done with scan";
}

didn't work for me

kvarnelis@drupal.org commented 3 March 2007 at 17:20

this code just doesn't seem to work for me.

not sure why.

i've tried it as a page and as a block. it doesn't return "Done with Scan" any idea what might be wrong?

php DOES work, just this script doesn't work...

trouble more or less fixed

kvarnelis@drupal.org commented 3 March 2007 at 19:49

i'm not sure where the above code is supposed to go, but it was too complex for me and never worked.

http://drupal.org/node/89210

has a workable solution that can be pasted into a page.

Errors with punctuation in node body and titles

Comments

I believe this might have

I am having the same problem

These appear to be Unicode

Very strange...

Stranger still...

Do I have UTF-8 text stored in latin1 tables?

Drupal stores text in UTF-8

Bad Unicode Scanner

didn't work for me

trouble more or less fixed

New forum topics

News items

Our community

Documentation

Drupal code base

Governance of community