When searching for a word in Finnish ("äiti") the following error occurs.

--

user warning: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' query: SELECT sid, SUM(score) score, SUM(completeness) completeness, SUM(total) total, n.title, nr.teaser FROM search_fuzzy_index f LEFT JOIN node n ON (n.nid = f.sid) LEFT JOIN node_revisions nr ON (n.vid = nr.vid) WHERE trigram = 'äi' OR trigram = '?it' OR trigram = 'iti' GROUP BY sid ORDER BY score DESC in [path in dev server]includes/database.mysql.inc on line 172.

--

Comments

BlakeLucchesi’s picture

I've tracked down the bug to be occuring when i use substr to split the words into ngrams, this doesn't seem to be working well with utf8 characters. I've investigated the truncate_utf8() function provided by drupal core but I don't think this will work because there is no offset parameter which I need to create ngrams throughout the word. I will post up more information as I find it.

agentrickard’s picture

Isn't the problem caused by the database storage format (LATIN1 vs. UTF-8) and not the code?

I noticed that the install file doesn't explicitly enforce UTF-8. Take, for example, this section from system.install:

      db_query("CREATE TABLE {boxes} (
        bid int NOT NULL auto_increment,
        body longtext,
        info varchar(128) NOT NULL default '',
        format int NOT NULL default '0',
        PRIMARY KEY (bid),
        UNIQUE KEY info (info)
      ) /*!40100 DEFAULT CHARACTER SET UTF8 */ ");

Against:

      db_query("CREATE TABLE search_fuzzy_index (
        id int(11) NOT NULL auto_increment,
        sid int(11) NOT NULL,
        trigram varchar(6) NOT NULL,
        completeness double NOT NULL,
        score decimal(8,2) NOT NULL,
        total decimal(8,2) NOT NULL,
        PRIMARY KEY  (`id`)
      )");

If the storage format is not specified, MySQL will use the database default.

BlakeLucchesi’s picture

I think it has do to with a bit of both, its not indexing the content right because when the code calls substr to split the word it doesn't handle the utf8 characters properly. Fortunately, I found this function on php.net and so I'll give this a try and also try out the database schema change you suggested. Thanks

function utf8_substr($str,$from,$len){
// utf8 substr
// yeap.lv
  return preg_replace('#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'.$from.'}'.
                       '((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'.$len.'}).*#s',
                       '$1',$str);
}
agentrickard’s picture

I had this problem once before. It was caused by running a JOIN across two tables where one was LATIN1 and the other was UTF8. So it may be that the function splitter is irrelevant. I don't think the LATIN table knows how to process UTF8 lookups.

BlakeLucchesi’s picture

Status: Active » Closed (fixed)

Support for UTF8 characters is now available. The issues were in a few places. First thing that was causing problems was that I had programmed the word/score collection array to use the words in the node as the array keys and a running count of the scores for the words as the values. This was fixed by having two separate arrays with matching keys. Before adding a new word to the words array, we search through the words and if a match is found we only update the score for that word. Doing this allows the size of the index to stay low and makes the completeness metric much more accurate per word.

Other things fixed in this update were support for utf8 string splitting using a modified substr function, and utf8 string length function to accurately count the number of characters in each string. I've also implemented the unwanted characters reguar expression from the search.module so that instead of filtering all non a-zA-Z0-9 characters from the words before indexing, we strip all unwanted utf8 characters.

Checkout the latest dev snapshot to download the fixed code.