Posted by Gábor Hojtsy on February 6, 2008 at 10:07am
| Project: | Drupal core |
| Version: | 6.9 |
| Component: | search.module |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Just upgraded drupal.hu to Drupal 6 to eat some dog food, and it looks like the search module is miserably misbehaving. I get boatloads of these *errors* in dblog:
Duplicate entry 'nevet-33-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevet', 33, 'node', 0.401937290034) - [***]/root/modules/search/search.module - 574. sor.
Comments
#1
search_wipe() is called before doing these inserts.
<?php
# line 567
search_wipe($sid, $type, TRUE);
// Insert cleaned up data into dataset
db_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);
// Insert results into search index
foreach ($results[0] as $word => $score) {
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
search_dirty($word);
}
?>
And search_wipe should delete all that stuff:
<?php/**
* Wipes a part of or the entire search index.
*
* @param $sid
* (optional) The SID of the item to wipe. If specified, $type must be passed
* too.
* @param $type
* (optional) The type of item to wipe.
*/
function search_wipe($sid = NULL, $type = NULL, $reindex = FALSE) {
if ($type == NULL && $sid == NULL) {
module_invoke_all('search', 'reset');
}
else {
db_query("DELETE FROM {search_dataset} WHERE sid = %d AND type = '%s'", $sid, $type);
db_query("DELETE FROM {search_index} WHERE sid = %d AND type = '%s'", $sid, $type);
// Don't remove links if re-indexing.
if (!$reindex) {
db_query("DELETE FROM {search_node_links} WHERE sid = %d AND type = '%s'", $sid, $type);
}
}
}
?>
Why isn't it doing that?
#2
Addendum: Also, the array is keyed by words, so those should be unique for a single re-indexing.
#3
Hm, now that you provided this info, I think it could easily be a collation problem. "word" is defined to be a simple varchar column, so whatever collides in the used collation will collide there as well.
#4
I get it: PHP's array keys collate differently, allowing multiple strings that would collide in MySQL's collation.
That's nasty. Is there any way to avoid having to search for each word before adding it? Or would it be simpler to force a binary collation on this column?
#5
To elaborate on that, depending on the UTF collation used, several chars might be the same for the index, like é and e for Hungarian. It is a bad MySQL bug which we encountered before (http://bugs.mysql.com/bug.php?id=22337). Are case-insensitive searches supported relying on MySQL's collations? Depending on the answer we can convert this field to binary or not.
#6
Hm, tested to set the word column to
varbinary(10)and then tovarchar(10) character set utf8 collate utf8_bin NOT NULL default ''and none of that worked to fix this. It might not be a collation issue after all. Hm.#7
Varbinary() would not be good anyway, since it does not work on the character level. So the binary collation was the only contender. With that in place, I got *even more* duplicate key errors for the indexing. Not promising. (Now I did revoke permission for anonymous and authenticated users to use the search, since I was wiping the search index on the SQL level by hand to be able to do these changes so it would have been a less then satisfactory experience).
#8
Unfortunately I can't reproduce this with a Lorem Ipsum on utf8_general_ci. The search indexer just runs right through without errors. Can you post the collation and specific node text that this is happening with?
Perhaps vardumping the $results array before inserting it would be helpful information, too.
#9
We have a similar issue in D5 which was marked 'fixed'. Not sure if this is a dupe of not. See http://drupal.org/node/143160
#10
Well, the older issue was fixed by adding a unique index on the table. So instead of having duplicate values in the table, we now get errors about trying to insert duplicate values.
#11
Hm, I looked into actual data causing this. For the error mentioned in the topic starter, it is http://drupal.hu/node/33 (as shown in the error message). The node and its comments contain "nevet" and "nevét", and this obviously collides with the collation used. My search_index table is:
mysql> show create table search_index;...
| search_index | CREATE TABLE `search_index` (
`word` varchar(50) NOT NULL default '',
`sid` int(10) unsigned NOT NULL default '0',
`type` varchar(16) default '',
`score` float default NULL,
UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
KEY `sid_type` (`sid`,`type`),
KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
...
So to try to reproduce, for example, just include "nevet" and "nevét" in a node and see what happens on its indexing. These two words are definitely not colliding in PHP arrays, but they are in MySQL's above setup, as Drupal uses it. This way, search module is unusable as-is with sites using accented chars.
#12
Yep. A Drupal 6 install (with search enabled) containing "nevet" and "nevét" in a node (and nothing else) will trigger a:
user warning: Duplicate entry 'nevét-3-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevét', 3, 'node', 1) in /web/test/modules/search/search.module on line 574.on running cron.
#13
So why won't the binary collation work? Is the unique key on word-sid-type is stripping high-bit characters regardless of collation?
#14
Is there any reason someone would want to index accented and unaccented characters separately with search module? Very late in the cycle for this but I'm wondering if something similar to accents module might be an option? Does varchar to binary have performance/space implications here?
#15
What happens to search if it is used in a non-Latin character set like Japanese or Hebrew?
But for that problem I would indeed suggest stripping out or replacing high-bit characters, and doing the same when processing a search query.
#16
I can also reproduce this error.
responding to #1, search_wipe is working fine. It wipes {search_index} for the sid/type combination, then starts inserting the word/sid/type values found, and if this list included the example text "nevet" and "nevét", the first word gets inserted, and the word creates the duplicate error.
So to summarize the above, php recognizes a difference between these two words, but mysql does not.
#17
How can PHP ensure uniqueness without messing up non-Latin alphabets?
#18
While the root cause is a tricky one to solve properly, I know a trivial solution to the error problem:
<?phpforeach ($results[0] as $word => $score) {
if ((float)db_result(db_query("SELECT score FROM {search_index} WHERE word = '%s' AND sid = %d AND type = '%s'", $word, $sid, $type)) < $score) {
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
search_dirty($word);
}
}
?>
This would choose only the word with the maximum score for each set of duplicates. If this sounds like a good solution, I'll make a patch.
#19
Arancaytar: if no better solution comes around, this would slow down the reindexing but solve the problem for foreign languages. If this way is chosen, code comments would be great to be added to describe why do we do this (since people might think that the wipe removes the word occurrences already).
Note that we are on track for an (again hopefully final) RC4 tonight (in around 13 hours), so it would be great to solve this as well.
#20
I know. :/ That's why I hate to be satisfied with such a primitive solution. But this is one of six criticals still in the queue, and if the next release candidate can downgrade it to a performance issue, that would be an improvement.
#21
fwiw I just did a very rudimentary test with some Japanese text and minimum length one character and didn't run into any issues - looks like it's a bug specific to particular accents and nothing more than that.
#22
Most accents used in our text (we use accents common to Central Europe) cause bugs. One indexing round with 100 nodes resulted in around 5 pages of errors in watchdog, that's why I turned the module off for now on the live site, and only test it on my private test site.
#23
Sorry particular accents should've been "accents on roman characters", cafe and café causes the same error with utf8_general_ci for example.
More on this bug/behaviour at the mysql forums, doesn't look all that helpful.
What about a strtr to strip accents before insertion? Non-trivial but probably more performant than #18. However I guess this could really mess things up in some languages.
As mentioned, accents module does this but I've not installed it, and first look at the code was five minutes ago.
#24
Well, "nevet" and "nevét" definitely means two different things. "nevet" == "laughs" or "that name" depending on the context :), while "nevét" == "her name" or "his name". I know from practice that most of the time, Hungarians are fine writing without accents, but if you remove accents, the results are sometimes funny/unexpected. Eg. "szár" means "stalk", while "szar" means "shit", so when you search for the first, you definitely don't want results on the second to appear.
It could be a fine decision that Drupal's built in search module will not be perfectly fitting for these rules. In fact, if you try http://www.google.com/search?q=szár you will see that both accented and non-accented results appear. So it would not be an unusual or (from user perspective) inherently unexpected path to take. For this to work, we would need to have a list of accented chars to replace with the non-accented versions.
#25
Also with many sites, even ones not using locale/content translate, we'll see "nevet" and "nevét" alongside "café" and "cafe", right here for example - so stripping diacritics makes sense to me for core.
Looked at utf8_decode() but of course that completely destroys non-latin characters, so useless.
Ideally this should happen with both indexing and searching - it currently does explicit searches for e and é with different result sets. If we strip when indexing, we have to do the same with searching. Google wasn't my friend for a nice clean function to do this unfortunately.
#26
@Gábor#7: this is clearly a problem related to collations. I ran into it several weeks ago will testing the migration of http://drupalfr.org to Drupal 6, but failed to report it due to severe time constraints.
One way to solve that bug is to set the
{search_index}.wordcolumn toutf8_bin_ci. I just validated on a test site that it solve the problem.But, with this would mean that we would differentiate between different versions of a word (accented/not accented, etc.).
@catch#25: In fact, collation is not an enemy, it should be our friend. The implementation of the collation is a difficult work, and moreover language-specific. The one-size-fits-all
'utf8_general_ci'is not optimal, but should works well for most latin based languages. Doing a language specific collation and steeming should be in our work plan for D7.In the meanwhile, because scores are additive, I suggest we do that:
<?php// The database will collate similar words (accented and non-accented forms, etc.),
// and the score is additive, so first add and then insert.
db_query("UPDATE {search_index} SET score = score + %d WHERE word = '%s' AND sid = '%d' AND type = '%s'", $score, $word, $sid, $type);
if (!db_affected_rows()) {
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
}
?>
Enclosed patch does exactly that.
#27
Applies, fixes the log warnings. Re-indexing a node containing "café cafe" will log errors without the patch, but not afterward.
A second review would be good.
#28
Patch removes the error. If I have two nodes like this:
1. café cafe
2. cafe
Searching for either café or cafe gives me the same result. However, the highlighting is still dependent on the specific string used.
This seems ok to me really, might even be a feature since it'd show results with lazy typing, but still differentiate in the nevet nevét case on display. Very nice!
#29
Looks like mine was the second review.
#30
I am testing this patch live on Drupal.hu now, thanks! Will commit later today if all looks fine.
#31
Did not solve it on my site. Looks like there is a problem on word length in the unique key as well:
Duplicate entry 'modositott-91-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('modositottam', 91, 'node', 0.56910543131) - .../modules/search/search.module - 578.
Duplicate entry 'informatik-108-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('informatika', 108, 'node', 4) - .../modules/search/search.module - 578.
...
See, that the key entry has a shorter string then the inserted word, so longer words can result in duplicates in the key. Anyone else can reproduce this problem? (Note that these words do not contain any non-English chars).
#32
@Gábor: It looks like a different issue, perhaps related to the upgrade path. Could you dump the schema of the
{search_index}table in your current database?The default one from my empty D6 installation is:
CREATE TABLE IF NOT EXISTS `search_index` (`word` varchar(50) NOT NULL default '',
`sid` int(10) unsigned NOT NULL default '0',
`type` varchar(16) default NULL,
`score` float default NULL,
UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
KEY `sid_type` (`sid`,`type`),
KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Is yours any different?
#33
Also cannot reproduce on fresh installation. I agree it's probably an update path problem where the unique key has to be reset to the full length of the field.
#34
Here's the key-adding code in system.install.
<?phpfunction system_update_6036() {
... // if mysql:
$ret[] = update_sql("ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY word_sid_type (word, sid, type)");
... // otherwise
db_add_unique_key($ret, 'search_index', 'word_sid_type', array('word', 'sid', 'type'));
?>
I was unable to find any relevant update on search_index that changed the length of the word field. So the cause is not yet identified.
Edit: Just for reference, my table structure is identical:
CREATE TABLE `search_index` (`word` varchar(50) NOT NULL default '',
`sid` int(10) unsigned NOT NULL default '0',
`type` varchar(16) default NULL,
`score` float default NULL,
UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
KEY `sid_type` (`sid`,`type`),
KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
#35
I can't reproduce this key length issue either on a clean install.
CREATE TABLE `search_index` (`word` varchar(50) NOT NULL default '',
`sid` int(10) unsigned NOT NULL default '0',
`type` varchar(16) default NULL,
`score` float default NULL,
UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
KEY `sid_type` (`sid`,`type`),
KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
My live D5 database which has been around since 4.5 also doesn't have any explicit key lengths - just checked in case there might be some sticking around:
CREATE TABLE `search_index` (`word` varchar(50) NOT NULL default '',
`sid` int(10) unsigned NOT NULL default '0',
`type` varchar(16) default NULL,
`fromsid` int(10) unsigned NOT NULL default '0',
`fromtype` varchar(16) default NULL,
`score` float default NULL,
KEY `sid_type` (`sid`,`type`),
KEY `from_sid_type` (`fromsid`,`fromtype`),
KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
#36
If Gabor's post in #11 is current, the only difference I see is the line:
`type` varchar(16) default '',
#37
Didn't notice that. Well, with VARCHAR(50) and no explicit key length, I have no idea why the key gets truncated to 10 characters. Weird.
#38
Hold on, I broke my own database while trying out fixes. I set it to varchar(10). Let me see a better test now with a correct table. Sorry for taking your time, I should have been more alert to what I change. Excuse me.
#39
Works perfectly on drupal.hu as well, superb. Thanks for the patch. Committed to 6.x and RTBC for 7.x.
#40
This is still missing for 7.x.
#41
I've committed this to CVS HEAD. Thanks guys.
#42
Automatically closed -- issue fixed for two weeks with no activity.
#43
I am seeing this problems with my 6.2 install. Has the fix that is discussed in this posting been applied to the 6.2 core?
#44
This patch went in before the release of Drupal 6.0. Please open another issue with a detailed description of your problems.
#45
...Need to open this ticket again because I don't believe this has been fixed. I have a new Drupal 6.2 install, and all was running fine for the past week. However, since a few days ago I'm getting numerous errors in my log, dozens of them, all logged within the same minute:
Duplicate entry 'personal-13-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('personal', 13, 'node', 0.949597315436) in /home/bcims/public_html/modules/search/search.module on line 579.I could definitely use some help in solving this. I was able to construct the template for Drupal 6.2 and am a novice at php and mysql. I can do some rudimentary work on the error but some assistance from more experienced programmers would be very appreciated.
There is a fellow who wrote this article on the problem:
http://cmsreport.com/node/1189#46
RajP. Please open a new issue rather than re-opening this old one. It's likely your duplicate is caused by a slightly different version of the bug so will need to be fixed in a different issue. A quick fix for this is to truncate your search tables and reindex (you can do this via the ui by disabling and uninstalling search module then re-enabling it).
#47
Is this one:
Duplicate entry '1107-node' for key 1 query: INSERT INTO search_dataset (sid, type, data, reindex) VALUES (1107, 'node', ' monter poszycia kadłubów samolotów zagranica referencja job1107 ocena oceń kiepsko ok dobra super extra twoja ocena brak ', 0) w pliku /home/sites/co.uk/public_html/modules/search/search.module, linia 571.Is related to this issue?
// Insert cleaned up data into datasetdb_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);
#48
Duplicates:
#143160: search_index has duplicate (sid, word, type, fromsid=0) entries
#143160: search_index has duplicate (sid, word, type, fromsid=0) entries
http://drupal.org/node/180032
#49
#218403: Duplicate entry errors in search indexer