Download & Extend

Performance: node_update_index() slow with large numbers of nodes

Project:Drupal core
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed (duplicate)
Issue tags:Performance

Issue Summary

I've been looking into using Drupal with millions of nodes. One problem: the search indexer is slow. This tends to cause CPU quota errors on shared hosting every time cron runs.

One of the slow queries is in node_update_index():

SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;

Which does a full table scan on the {node} table, and also has the painful "Using temporary; Using filesort"

It's possible to fix this with two changes:

1. Adding an index to search_dataset to avoid the filesort
2. Ensuring that every node in the database has a corresponding row in search_dataset

Change #2 would have the disadvantage of an extra database write every time a node is created. That may or may not be an acceptable tradeoff to the core maintainers.

Comments

#1

Sub.

#2

Version:6.4» 6.12

Subscribe... but I don't know if this will ever be fixed.

#3

Hi

Could any of you give me more details on how to fix this problem? I have little experience with mysql and the following two actions are not clear enough for me.

>1. Adding an index to search_dataset to avoid the filesort
>2. Ensuring that every node in the database has a corresponding row in search_dataset

Thanks in advance

#4

You can find instructions at
http://wtanaka.com/drupal/million-nodes-6
under the bug 312393 heading

That section header was previously mislabeled as bug 312939, but it's labeled correctly now.

#5

Hi

A few days ago I I already tried to fix bug 312939, according to the instructions at http://wtanaka.com/drupal/million-nodes-6.

As I posted to the blog at that page, at step 3, I got the following error:

INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node

MySQL said:

#1062 - Duplicate entry '1-node' for key 1

I have no idea on what's wrong. Any clue?

I am really interested in solving this problem because, since a few weeks ago, the search capabilities of my system are blocked (more than 10,000 files attached. I also commented the problem at http://wtanaka.com/drupal/million-nodes-6).

I am considering other solutions different to core search (plus search files module, such as apache solr + solr attachment, but I have not been able to make it work correctly yet.

Thanks in advance and regards

#6

I see it now and have answered you there: http://wtanaka.com/drupal/million-nodes-6#comment-5941

#7

Version:6.12» 7.x-dev

Here is a first pass at a patch for D7. Sooner or later, the tests will fail, because it uses a hook_update_70XX with the same number as the patch in #336483: Performance: SELECT MAX(comment_count) FROM node_comment_statistics does full table scan.

#8

Status:active» needs review

#9

Title:node_update_index() slow with large numbers of nodes» Performance: node_update_index() slow with large numbers of nodes

Just tagging appropriately

#10

*facepalms*

... and I forgot to attach the actual patch.

AttachmentSizeStatusTest resultOperations
312393-node-update-index-slow.patch2.33 KBIdleFailed: 11864 passes, 12 fails, 12 exceptionsView details | Re-test

#11

Status:needs review» needs work

The last submitted patch failed testing.

#12

Adding every node to search_dataset is not OK, in my opinion.

Adding an index to the search_dataset table is an excellent idea. It needs to go into the schema as well as the update function, however.

#13

#14

The latest comments on that issue above suggest adding more than just this one index...

#15

Status:needs work» closed (duplicate)

Actually, these two issues are basically reporting problems on very similar queries, so I am going to close this as a duplicate of #312395: Queries on search admin and node indexing are slow for many-node sites and change the title of that one so it's a bit more broad.

#16

I like the patch in #10: because it adds every node in search_dataset when they are saved, no nodes will ever be indexed (except when explicitly marked for reindexing). That's a way to solve the problem, but I can't help suggesting a much more elegant way: disable the search module.

#17

No need to be snarky about it... To put it more clearly:

The patch in #10 has an error -- the reindex bit should have been set to a large number (I think?) rather than 0, or the nodes won't ever be indexed.

nobody click here