search_index should have only one entry per (sid, word, type, fromsid=0). However, I'm finding duplicates on almost every large Drupal site I've checked. It's happening on d.o (Robert gave me an export of the search tables); it's happening on the NY Observer site; it's happening on the Media Consortium Crosslinks project (about 100k nodes); and I've seen it on several smaller sites.

Diagnose the problem with the following SQL, limit the query to sid's under 10000 just so the query doesn't run forever. If any rows are returned, then you've got the same problem.

mysql> select word, sid, type, count(*) from search_index where fromsid=0 and sid<10000 group by word, sid, type having count(*) > 1;

Can you confirm that this is an issue? Is there a flaw in my SQL or my logic? Does this happen on your site?

Why is this important?

  1. impacting search_total results, doubling the value of some words.
  2. impacts performance, as the search query needs to look at more records than necessary.
  3. I'm working on new MUCH faster search query for views_fastsearch that relies on unique (sid, word, type, fromsid=0) values in order to work (see below)

Why is this happening, and what do we do to fix it?

I don't see how this is happening. The old entries should be deleted just before inserting new ones. And as of 5.x, cron should lock itself to one process at a time. Is this only happening for older nodes (pre-5.x)? Does it happen when the cron job aborts abnormally? Should we add an index to the table to prevent this?

Also, what do we do to fix all of our existing sites. It goes away when re-indexing smaller sites, but re-indexing a large site is unpractical. And even if we re-index a larger site, it's quite possible that re-indexing will result in the same problems. The SQL to fix it is pretty easy, but VERY slow. Do we need a fixer module that uses cron?

I'm reporting against 6.x because that's where it's going to be fixed, but all of my test sites are either 5.x or 4.7.x.

This is what I'm working towards.

Here is the a simplification of the query I'm looking at using for faster searches (it has some variation when AND or exclusionary terms are used, the subjoin is needed by views_fastsearch, I'm not sure if it's needed by search.module):

mysql> SELECT n2.nid FROM node n2 LEFT JOIN (
 SELECT n.nid FROM node n LEFT join search_index x on n.nid=x.sid 
  WHERE x.fromsid=0 AND (x.word='bush' OR x.word='cheney' OR x.word='kerry' OR x.word='gore' OR x.word='reid') 
  GROUP BY n.nid HAVING COUNT(*) > 5) ns on ns.nid=n2.nid 
WHERE ns.nid IS NOT NULL;

It is my hope that the search improvements in views_fastsearch might one day make it into core.

Comments

Steven’s picture

On drupal.org, it seems only old nodes are affected. The last nid I can find with this problem is 81557. I also don't see any word that is repeated more than twice. This suggests that this was caused by cron running two instances at the same time. The last time we re-indexed d.o, we didn't yet have a cron semaphore, I think.

It could also have something to do with any of the other node indexing tweaks we made during development.

In any case, I think the fact that there is such a hard cut-off point for duplicates suggests that whatever the issue was, it has been addressed.

As for fixing the problem... we can split it up into a chunked update, it can take as long as is needed (however, note that SQL time is actually not counted towards the execution time limit). Obviously, we should find the fastest possible query to fix this problem. We could also consider using a table (in update.php, we can't use temporary tables): fill it up using one slow, mega-query (10-15 secs), and then process its content in batches until empty. This is not as hard as it sounds with the update.php framework.

Ideally, sites without the problem would just have one relatively fast query to look for duplicates first, after which they would skip all the hassle.

moshe weitzman’s picture

Thanks for working on this, folks. I agree that multiple simultaneous cron instances is a likely cause.

I reindexed the Observer site just recently and we still have dupes. I don't actually think this is solved in 5. I do recall having to delete the semaphore a few times when i saw the cron "cron has been running for an hour ..."

Couldn't we solve this by adding a UNIQUE index on the search_index table. Then we would enhance the indexer such that any cron that tries to add a dupe will fail to insert and then quit or move on (perhaps no code change is needed for this?).

m3avrck’s picture

subscribing

robertdouglass’s picture

Steven, I'm not so sure the problem is as limited as you describe. On my copy of the d.o. search index, I get these results:

mysql> select word, sid, `type`, count(*) from search_index group by word, sid, `type` having count(*) > 2 order by sid desc limit 100;
+--------------+--------+------+----------+
| word         | sid    | type | count(*) |
+--------------+--------+------+----------+
| viewbookmark | 128120 | node | 3        |
| ajax         | 128120 | node | 3        |
| style        | 128120 | node | 3        |
| clean        | 128116 | node | 6        |
| url          | 128116 | node | 6        |
| and          | 128116 | node | 6        |
| ii6          | 128116 | node | 6        |
| not          | 128109 | node | 3        |
| doe          | 128109 | node | 3        |
| integ        | 128109 | node | 3        |
| fromunixtim  | 128109 | node | 3        |
| function     | 128109 | node | 3        |
| exist        | 128109 | node | 3        |
| drupal       | 128104 | node | 3        |
| for          | 128104 | node | 3        |
| commerci     | 128104 | node | 3        |
| page         | 128095 | node | 5        |

With as many as 15 dupes for some word/sid/type combinations.

robertdouglass’s picture

that last query needs the fromsid = 0 clause. Please ignore.

douggreen’s picture

So, the d.o problems seem to stop on a certain date. Moshe’s response that he manually removed the semaphore explains a lot. I'm pretty sure I did the same thing on my big site. So, if the problem was semaphore related, then we don’t appear to have an ongoing problem.

We do, however, have a cleanup problem and possibly a need for monitoring. What's the Drupal way? I'd like to have something that did some monitoring and fixed problems as it finds them. I've added a simple patch to a custom module to do exactly that IRT the comment date problem. What do others think of a new custom module that basically monitors the search indexes and fixes them when things are wrong? If you like the module idea, please suggest module names as well. I'll write it.

moshe weitzman’s picture

We are deleting the semaphore becuase we know of no other way to get cron kickstarted. This is part of normal drupal maintainence. This is not bad behavior on admin part.

Any resistance to adding in a UNIQUE index? WHat will cron do when it comes across it? I would think that if it does is OK because this will only happen during a dual cron run situation.

douggreen’s picture

I'll run a test this weekend on what happens if we add the UNIQUE index and two crons are running. I suspect that one will just write a bunch of errors to the log, but I'll run the test and report back.

But before we can add the UNIQUE index, we need to make all of the values unique. A module or a script?

  • The module has the advantage that it can work in spurts using cron and can fix the problem slowly, and once done, create the UNIQUE index and set a variable saying things are done.
  • A script has the advantage of being easier to write, but has the disadvantage of requiring shell access and could take several days to complete.
  • I don't think that this can be done using the .install system because the update will simply take too long. All we can do in the .install is get the process started. And if that's all we're doing, I don't see a reason not to have a one time use module. The module can even disable itself when it's done!
robertdouglass’s picture

I'd say write it as a module for D5 so that people can start repairing their databases in advance of D6 being released. It's a pretty touchy topic, though. Maybe the module can repair in cron batches until it reaches a state where there are no dupes and then update the db with the UNIQUE. D6 could potentially ship with this module as an upgrade module (unprecedented, but it would work).

douggreen’s picture

I'm leaning against the UNIQUE key for the following reasons:

  1. The UNIQUE key on {sid, word, type, fromsid} will be quite large
  2. It will add extra processing (and time) to all INSERT's
  3. we think the problem is solved by the 5.x cron semaphore

I'd like to fix the existing data first and monitor whether we do indeed have an ongoing dup problem. If we do, we can always rerun the fix data routine again, and then either (a) try to fix the root cause, or (b) insert the UNIQUE key as a temporary solution.

moshe weitzman’s picture

I found a really fast way to fix this on MYSQL. Note the IGNORE keyword below:

ALTER TABLE IGNORE search_index ADD UNIQUE INDEX (sid, word, type, fromsid)

We do have an ongoing problem - the cron gets stuck and admin has to take corrective action. And then the index gets corrupted.

kbahey’s picture

I can confirm this on a 5.x site.

select word, sid, type, count(*) from search_index where fromsid=0 and sid<10000 group by word, sid, type having count(*) > 1;
+----------+-----+------+----------+
| word     | sid | type | count(*) |
+----------+-----+------+----------+
| almaden  | 196 | node |        2 |
| arabes   |  99 | node |        2 |
| forsk�l  | 215 | node |        2 |
| ice      | 195 | node |        2 |
| is�      | 195 | node |        2 |
| j�bir    | 195 | node |        2 |
| �beda    | 196 | node |        2 |
| �bercart | 456 | node |        2 |
+----------+-----+------+----------+
8 rows in set (5.11 sec)

As well as an older but much larger site, too much rows to post here:

17178 rows in set (26.16 sec)

douggreen’s picture

That's a lot simpler than my perl script, and a little faster too! We need to confirm that mysql 4.1 supports the IGNORE option.

What will happen when the sid, word, type, fromsid is unique, but the score is different?

Does update.php handle long running queries? Can we run this from the system.install if we know it can timeout? The php could die, but wouldn't the mysql engine continue to process to SQL? Not exactly a great UI, but does it get the job done?

The mysql syntax is:

mysql> ALTER IGNORE TABLE search_index ADD UNIQUE INDEX (sid, word, type, fromsid);
Query OK, 11114911 rows affected (29 min 47.09 sec)
Records: 11114911  Duplicates: 1505  Warnings: 0

Also, is something similar available for pgsql? The document is a little lacking.

ALTER TABLE search_index ADD CONSTRAINT UNIQUE (sid, word, type, fromsid);

Or will we need the perl script as a fall back for older instances of mysql and/or pgsql?

Steven’s picture

AFAIK the only way the cron semaphore can get stuck is if a fatal error occurs (one that prevents register_shutdown_function from working) or if the database connection is lost during cron.

Consider adding some logging to the cron.php to make sure it isn't just running very slowly (compared to your cron interval). In that case, resetting the semaphore would cause exactly this problem.

m3avrck’s picture

If you want to talk cron, talk here: http://drupal.org/node/87528 -- that will solve these cron related issues.

moshe weitzman’s picture

@Steven - given that sites can and do get into a duplicate state, are you ok with adding the proposed UNIQUE index into the schema and in an update function for existign sites?

Steven’s picture

Status: Active » Postponed (maintainer needs more info)

I'm not comfortable with adding any workaround without at least the cause being known, sorry. It could be a very silly bug, or it could be a large cron problem we haven't thought of yet. It needs more investigation, because whatever will be committed will be the last time anyone will look at this for a very long time.

moshe weitzman’s picture

fair enough, but adding this index will cause SQL insert errors and thus we will find the cause quicker. right now there are no errors to track down. we don't hacve ebough info to find the source.

owen barton’s picture

Subscribing

Adding the UNIQUE index as Moshe suggests looks good to me.

moshe weitzman’s picture

Status: Postponed (maintainer needs more info) » Active

AFAIK the only way the cron semaphore can get stuck is if a fatal error occurs (one that prevents register_shutdown_function from working) or if the database connection is lost during cron.

Well, those are two valid situations which can lead to this duplicate problem. This index will *help* resolve any lingering cron problem.

@Steven - please reconsider adding it.

wim leers’s picture

Subscribing.

ashearer’s picture

Search was returning duplicate results for certain nodes, and it turned out that there were 257 duplicate row pairs in search_index. This is a 5.x-only site.

I removed those, but it turned out that they weren't causing the duplicate results. search_dataset doesn't have a unique index either, and it had 6 duplicate rows. Removing those fixed the problem.

Here is the SQL code I used to remove duplicate rows from both search_index and search_dataset without rebuilding the entire index. (Have a backup handy before doing anything like this, of course.)

create table search_index_temp select distinct * from search_index;
select (select count(*) from search_index)
 - (select count(*) from search_index_temp) as index_duplicates_removed;
truncate search_index;
insert into search_index select * from search_index_temp;
drop table search_index_temp;

create table search_dataset_temp select distinct * from search_dataset;
select (select count(*) from search_dataset)
  -(select count(*) from search_dataset_temp) as dataset_duplicates_removed;
truncate search_dataset;
insert into search_dataset select * from search_dataset_temp;
drop table search_dataset_temp;
catch’s picture

subscribing.

douggreen’s picture

Or the much simpler SQL already pointed out in comment #11 above (admittedly may only work on MySql):

mysql>ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY sid_word_type (sid, word, type);
mysql>ALTER IGNORE TABLE {search_dataset} ADD UNIQUE KEY sid_type (sid, type);
hswong3i’s picture

Subscribing. I have not much idea about if pgsql support IGNORE, but it is for sure that Oracle don't. Seems IGNORE is a MySQL-specific SQL function, which we should escape from using it in order to increase the cross database compatibility?

catch’s picture

Status: Active » Fixed

99% sure this was fixed in the temp tables patch. Thanks again Doug!

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

asb’s picture

Version: 6.x-dev » 5.5
Category: bug » support
Status: Closed (fixed) » Closed (won't fix)

Hi,

mysql>ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY sid_word_type (sid, word, type);
mysql>ALTER IGNORE TABLE {search_dataset} ADD UNIQUE KEY sid_type (sid, type);

on my Drupal Database results in:

mysql> ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY sid_word_type (sid, word, type);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '{search_index} ADD UNIQUE KEY sid_word_type (sid, word, type)' at line 1
mysql> ALTER IGNORE TABLE {search_dataset} ADD UNIQUE KEY sid_type (sid, type);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '{search_dataset} ADD UNIQUE KEY sid_type (sid, type)' at line 1
mysql>

Instead, that seems to be working somehow:

mysql> ALTER IGNORE TABLE search_index ADD UNIQUE INDEX (sid, word, type, fromsid);

Results in:

ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IGNORE search_index ADD UNIQUE INDEX (sid, word, type, fromsid)' at line 1
mysql> ALTER IGNORE TABLE search_index ADD UNIQUE INDEX (sid, word, type, fromsid);
Query OK, 71179 rows affected (2.53 sec)
Records: 71179  Duplicates: 33  Warnings: 0

But

ALTER IGNORE TABLE search_dataset ADD UNIQUE INDEX (sid, word, type, fromsid);

results in an error, again:

mysql> ALTER IGNORE TABLE search_dataset ADD UNIQUE INDEX (sid, word, type, fromsid);
ERROR 1072 (42000): Key column 'word' doesn't exist in table

However, the SQL statement doesn't return duplicates anymore.

Altering search_index results in sporadic error messages from Drupal, e.g.:

user warning: Duplicate entry '802-schön-node-0' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('schön', 802, 'node', 0.77169749728) in /var/www/drupal/includes/database.mysql.inc on line 172.

However, my problem with Drupal's search module ist different (Queries from search module are running for hours, http://drupal.org/node/203806), but I also can confirm that the SQL statement

mysql> select word, sid, type, count(*) from search_index where fromsid=0 and sid<10000 group by word, sid, type having count(*) > 1;

returnes between a few hundred to a few thousand rows on all my five Drupal production sites.

Drupal 5.5, MySQL 5.0.32, PHP 5.2.0-8+etch7, Apache 2.2.3.

What am I doing wrong? Is altering the search tables not recommenaded in Drupal 5.5? Should I truncate the three search tables, and/or rebuild the index?

Thanks, -asb

catch’s picture

Version: 5.5 » 6.x-dev
Category: support » bug
Status: Closed (won't fix) » Closed (fixed)

Please don't hijack issues with support requests, you need to open a new issue for this.

gábor hojtsy’s picture

I'd suggest anyone interested to check out this issue: http://drupal.org/node/218403 which deals with a similar problem.