I've been looking into using Drupal with millions of nodes. One problem: the search indexer is slow. This tends to cause CPU quota errors on shared hosting every time cron runs.

One of the slow queries is in node_update_index():

SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;

Which does a full table scan on the {node} table, and also has the painful "Using temporary; Using filesort"

It's possible to fix this with two changes:

1. Adding an index to search_dataset to avoid the filesort
2. Ensuring that every node in the database has a corresponding row in search_dataset

Change #2 would have the disadvantage of an extra database write every time a node is created. That may or may not be an acceptable tradeoff to the core maintainers.

CommentFileSizeAuthor
#10 312393-node-update-index-slow.patch2.33 KBbrianV
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

robertDouglass’s picture

Sub.

geerlingguy’s picture

Version: 6.4 » 6.12

Subscribe... but I don't know if this will ever be fixed.

dgarciad’s picture

Hi

Could any of you give me more details on how to fix this problem? I have little experience with mysql and the following two actions are not clear enough for me.

>1. Adding an index to search_dataset to avoid the filesort
>2. Ensuring that every node in the database has a corresponding row in search_dataset

Thanks in advance

Wesley Tanaka’s picture

You can find instructions at
http://wtanaka.com/drupal/million-nodes-6
under the bug 312393 heading

That section header was previously mislabeled as bug 312939, but it's labeled correctly now.

dgarciad’s picture

Hi

A few days ago I I already tried to fix bug 312939, according to the instructions at http://wtanaka.com/drupal/million-nodes-6.

As I posted to the blog at that page, at step 3, I got the following error:

INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node

MySQL said:

#1062 - Duplicate entry '1-node' for key 1

I have no idea on what's wrong. Any clue?

I am really interested in solving this problem because, since a few weeks ago, the search capabilities of my system are blocked (more than 10,000 files attached. I also commented the problem at http://wtanaka.com/drupal/million-nodes-6).

I am considering other solutions different to core search (plus search files module, such as apache solr + solr attachment, but I have not been able to make it work correctly yet.

Thanks in advance and regards

Wesley Tanaka’s picture

I see it now and have answered you there: http://wtanaka.com/drupal/million-nodes-6#comment-5941

brianV’s picture

Version: 6.12 » 7.x-dev

Here is a first pass at a patch for D7. Sooner or later, the tests will fail, because it uses a hook_update_70XX with the same number as the patch in #336483: Performance: SELECT MAX(comment_count) FROM node_comment_statistics does full table scan.

brianV’s picture

Status: Active » Needs review
brianV’s picture

Title: node_update_index() slow with large numbers of nodes » Performance: node_update_index() slow with large numbers of nodes
Issue tags: +Performance

Just tagging appropriately

brianV’s picture

*facepalms*

... and I forgot to attach the actual patch.

Status: Needs review » Needs work

The last submitted patch failed testing.

jhodgdon’s picture

Adding every node to search_dataset is not OK, in my opinion.

Adding an index to the search_dataset table is an excellent idea. It needs to go into the schema as well as the update function, however.

jhodgdon’s picture

jhodgdon’s picture

The latest comments on that issue above suggest adding more than just this one index...

jhodgdon’s picture

Status: Needs work » Closed (duplicate)

Actually, these two issues are basically reporting problems on very similar queries, so I am going to close this as a duplicate of #312395: Queries on search admin and node indexing are slow for many-node sites and change the title of that one so it's a bit more broad.

Damien Tournoud’s picture

I like the patch in #10: because it adds every node in search_dataset when they are saved, no nodes will ever be indexed (except when explicitly marked for reindexing). That's a way to solve the problem, but I can't help suggesting a much more elegant way: disable the search module.

jhodgdon’s picture

No need to be snarky about it... To put it more clearly:

The patch in #10 has an error -- the reindex bit should have been set to a large number (I think?) rather than 0, or the nodes won't ever be indexed.