Hi all

I have had trouble with unusual characters in my articles - probably down to issues in my TinyMCE editor - and unfortunately I now have a whole series of errors in my large number of articles. What would be the most convenient way to search and replace eg "’" for an apostrophe?

I am using MySQL and PhpMyAdmin, but if there is a module that might assist, that'd be great.

Thanks :)

Comments

halfer’s picture

Hi, just pinging this thread. I am a new Drupaller, and could use some assistance :)

Is the best approach for me here to write a standalone PHP snippet that reads in records from the relevant Drupal table, modifies them, and writes them back?

halfer’s picture

Well, I went ahead and created a standalone snippet. Perhaps this might be of use to someone. It's not particularly good at dealing with db difficulties, but of course anyone is welcome to tidy it up. Recommend that users backup their node_revisions before use. Be aware that more specific searches should be run before less specific ones, lest the wrong thing be replaced!


// License - do what you will!
// 22 Oct 2006

// Settings
$host = 'localhost';
$user = '--';
$db = '--';
$pwd = '--';

// Connect to db
$err = false;
$link = mysql_connect($host, $user, $pwd) or $err = true;
if ($err) die('Could not connect: ' . mysql_error());
mysql_select_db($db) or $err = true;
if ($err) die('Could not select database');

// Set up search and replaces - this is an associative array with the
// key as the 'needle' to search for, and the value as the replacer to
// use if the 'haystack' contains a match.
//
// Items here can be deleted if required. These were used to replace special characters
// that became corrupted (probably down to not using UTF-8, and exacerbated
// by TinyMCE).
//
$s = array('’' => '’',         // Right-apostrophe (eg in I'm)
  '“' => '“',                  // Opening speech mark
  '–' => '—',                  // Long dash
  'â€' => '”',                         // Closing speech mark
  'é' => 'é',                       // e acute accent
  chr(226) . chr(128) . chr(153) => '’',          // Right-apostrophe again
  chr(226) . chr(128) . chr(147) => '—',          // Long dash again
  chr(226) . chr(128) . chr(156) => '“',	      // Opening speech mark
  chr(226) . chr(128) . chr(148) => '—',	      // M dash again
  chr(226) . chr(128) => '”',                     // Right speech mark
  chr(195) . chr(169) => 'é',                    // e acute again
  );

// Set up sql
$query = 'SELECT nid, vid, title, body, teaser FROM node_revisions';
$result = mysql_query($query) or die('Query failed: ' . mysql_error());

// Set up loop to read items
while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
  $b = $line['body'];
  $t = $line['teaser'];
  $n = $line['nid'];
  $v = $line['vid'];

  $c = 0;
  foreach ($s as $needle => $replace)
  {
    echo "$needle / ";
    $newB = str_replace($needle, $replace, $b);
    $newT = str_replace($needle, $replace, $t);
    $change = (($newB != $b) || ($newT != $t));
    if ($change)
    {
      $c++;
      $b = $newB;
      $t = $newT;
    }
  }

  if ($c > 0)
  {
    $b = mysql_real_escape_string($b);
    $t = mysql_real_escape_string($t);
    $sql = "UPDATE node_revisions SET body='$b', teaser='$t' WHERE nid=$n AND vid=$v";
    $ok = true;
    $r2 = mysql_query($sql) or $ok = false;

    echo ": ";

    if ($ok)
    {
      echo "$c change types successful in " . $line['title'] . ' (' . $line['nid'] . ")\n";
    }
    else
    {
      echo 'Update failed (' . mysql_error() . ')';
      die();
    }
  }
  else
  {
    echo "\n";
  }
}

// Close db
mysql_close($link);

willwade’s picture

I think its actually a collation problem rather than tinyMCE or someother module..
No idea why. Its just happened on mine... have a feeling its because the collation is all mixed - mine is in latin1_swedish - no idea why..

halfer’s picture

You're absolutely right, it was that. Should have reset my database to UTF-8, and after having to amend a lot of documents, including some hand correction, I've learnt my lesson ;)

My search-n-replace code should be useful for other things too, I hope - and perhaps one day this feature will be offered as a Drupal module. I know I'd find it useful.

kvarnelis@drupal.org’s picture

in my case, the characters were even more messed up. to this end, i've made some modifications.

in some cases (fantastico installs?) the password in your sites file will be scrambled. in that case make sure that the pwd is pasted from that file or it will not run.

<?php
// License - do what you will!
// 22 Oct 2006

// Settings
$host = '';
$user = '';
$db = '';
$pwd = '';

// Connect to db
$err = false;
$link = mysql_connect($host, $user, $pwd) or $err = true;
if ($err) die('Could not connect: ' . mysql_error());
mysql_select_db($db) or $err = true;
if ($err) die('Could not select database');

// Set up search and replaces - this is an associative array with the
// key as the 'needle' to search for, and the value as the replacer to
// use if the 'haystack' contains a match.
//
// Items here can be deleted if required. These were used to replace special characters
// that became corrupted (probably down to not using UTF-8, and exacerbated
// by TinyMCE).
//
$s = array('&acirc;&euro;&trade;' => '&rsquo;',         // Right-apostrophe (eg in I'm)
  '&acirc;&euro;&oelig;' => '&ldquo;',                  // Opening speech mark
  '&acirc;&euro;&ldquo;' => '&mdash;',                  // Long dash
  '&acirc;&euro;' => '&rdquo;',                         // Closing speech mark
  '&Atilde;&copy;' => '&eacute;',                       // e acute accent
  chr(226) . chr(128) . chr(153) => '&rsquo;',          // Right-apostrophe again
  chr(226) . chr(128) . chr(147) => '&mdash;',          // Long dash again
  chr(226) . chr(128) . chr(156) => '&ldquo;',          // Opening speech mark
  chr(226) . chr(128) . chr(148) => '&mdash;',          // M dash again
  chr(226) . chr(128) => '&rdquo;',                     // Right speech mark
  chr(195) . chr(169) => '&eacute;',                    // e acute again
  '’' => '&rsquo;',									// Right-apostrophe again
  '“' => '&ldquo;',									// Left-smart quote
  'â€' =>  '&rdquo;',									// Right-smart quote
  '…' => '&hellip;',									// ellipse
  'é' => '&eacute;',									// e acute accent
  'ï' => '&iuml;'									// i acute accent
  
  
  
  
  
  );

// Set up sql
$query = 'SELECT nid, vid, title, body, teaser FROM node_revisions';
$result = mysql_query($query) or die('Query failed: ' . mysql_error());

// Set up loop to read items
while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
  $b = $line['body'];
  $t = $line['teaser'];
  $n = $line['nid'];
  $v = $line['vid'];

  $c = 0;
  foreach ($s as $needle => $replace)
  {
    echo "$needle / ";
    $newB = str_replace($needle, $replace, $b);
    $newT = str_replace($needle, $replace, $t);
    $change = (($newB != $b) || ($newT != $t));
    if ($change)
    {
      $c++;
      $b = $newB;
      $t = $newT;
    }
  }

  if ($c > 0)
  {
    $b = mysql_real_escape_string($b);
    $t = mysql_real_escape_string($t);
    $sql = "UPDATE node_revisions SET body='$b', teaser='$t' WHERE nid=$n AND vid=$v";
    $ok = true;
    $r2 = mysql_query($sql) or $ok = false;

    echo ": ";

    if ($ok)
    {
      echo "$c change types successful in " . $line['title'] . ' (' . $line['nid'] . ")\n";
    }
    else
    {
      echo 'Update failed (' . mysql_error() . ')';
      die();
    }
  }
  else
  {
    echo "\n";
  }
}

// Close db
mysql_close($link);
?>
burnbright’s picture

After updating to the newest version of mysql, I ran this script to fix all of the whacked out characters that were inserted into my content. The script ran successfully, but it only appeared to replace the characters on five pages of my site (out of hundreds).

Any suggestions on why this could have happened and how to apply the script to my entire site?

Thanks.

dg

halfer’s picture

It might be that the problematic character sequences you are experiencing are not catered for in the scripts here. Try identifying each sequence of 'bad characters' and add in your own search/replace elements to deal with them. If you are unsure of the consequences, comment out the database write operations until you are confident that it will make good changes (and back up your database in any case).

waltWright’s picture

SQL script example can be found here

http://drupal.org/node/369961

this replaces ’ – — “ †¿ á ó ñ ň é ú Ã
with appropriate UTF8 replacements

I found that by SEARCHing from phpMyAdmin, one can find the affected tables and columns, then one can modify the script easily and run it from phpMyAdmin SQL tab.

dan90’s picture

I somehow didn't find this thread when i was running into charset issue on my sites. But for my jobs, which was combining different nodes with different content encodings, i found it much easier to do this from the command line using character set conversion utilities, such as the unix command iconv, and from php using the mb_convert_encoding function:
http://www.php.net/manual/en/function.mb-convert-encoding.php

Be careful though - I also got my character sets extra mangled by running mysqldump on a host (WestHost) whose command-line utilities are very unfriendly to many character sets - their version of sed clobbered some of the high-ascii characters. Bad news. Doing it in PHP avoids that.

possibly easiest are the character set conversion tools inside MySQL itself - http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html - these work if you are converting from tables all in one character set.. That wasn't the case for me, sadly.

herb’s picture

Thanks to all for the answers in this post. It helped me a lot.

A few issues I had that might help the next person with the same problem are:

1. The Right Smart Quote appears in this post missing the last character.
'”' => '"', is missing a third symbol that you need to find to get it to work.

2. If you are updating CCK nodes, the changes made to the database will not be reflected unless the CCK content cache is cleared. I used cache_clear_all('content:'.$n.':'.$n, 'cache_content'); // Clears the CCK cache so update works

Hope this helps.