Performance: node_update_index() slow with large numbers of nodes [#312393]

I've been looking into using Drupal with millions of nodes. One problem: the search indexer is slow. This tends to cause CPU quota errors on shared hosting every time cron runs.

One of the slow queries is in node_update_index():

SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;

Which does a full table scan on the {node} table, and also has the painful "Using temporary; Using filesort"

It's possible to fix this with two changes:

1. Adding an index to search_dataset to avoid the filesort
2. Ensuring that every node in the database has a corresponding row in search_dataset

Change #2 would have the disadvantage of an extra database write every time a node is created. That may or may not be an acceptable tradeoff to the core maintainers.

Comment	File	Size	Author
#10	312393-node-update-index-slow.patch	2.33 KB	brianV

Comments

Comment #1

robertdouglass commented 25 September 2008 at 19:02

Sub.

Comment #2

geerlingguy commented 14 May 2009 at 15:00

Version:

6.4

» 6.12

Subscribe... but I don't know if this will ever be fixed.

Comment #3

dgarciad commented 26 May 2009 at 21:09

Could any of you give me more details on how to fix this problem? I have little experience with mysql and the following two actions are not clear enough for me.

>1. Adding an index to search_dataset to avoid the filesort
>2. Ensuring that every node in the database has a corresponding row in search_dataset

Thanks in advance

Comment #4

Wesley Tanaka commented 27 May 2009 at 10:56

You can find instructions at
http://wtanaka.com/drupal/million-nodes-6
under the bug 312393 heading

That section header was previously mislabeled as bug 312939, but it's labeled correctly now.

Comment #5

dgarciad commented 27 May 2009 at 14:24

A few days ago I I already tried to fix bug 312939, according to the instructions at http://wtanaka.com/drupal/million-nodes-6.

As I posted to the blog at that page, at step 3, I got the following error:

INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node

MySQL said:

#1062 - Duplicate entry '1-node' for key 1

I have no idea on what's wrong. Any clue?

I am really interested in solving this problem because, since a few weeks ago, the search capabilities of my system are blocked (more than 10,000 files attached. I also commented the problem at http://wtanaka.com/drupal/million-nodes-6).

I am considering other solutions different to core search (plus search files module, such as apache solr + solr attachment, but I have not been able to make it work correctly yet.

Thanks in advance and regards

Comment #6

Wesley Tanaka commented 3 June 2009 at 08:04

I see it now and have answered you there: http://wtanaka.com/drupal/million-nodes-6#comment-5941

Comment #7

brianV commented 29 July 2009 at 19:27

Version:

6.12

» 7.x-dev

Here is a first pass at a patch for D7. Sooner or later, the tests will fail, because it uses a hook_update_70XX with the same number as the patch in #336483: Performance: SELECT MAX(comment_count) FROM node_comment_statistics does full table scan.

Comment #8

brianV commented 29 July 2009 at 19:27

Status:

Active

» Needs review

Comment #9

brianV commented 29 July 2009 at 19:33

Title:	node_update_index() slow with large numbers of nodes	» Performance: node_update_index() slow with large numbers of nodes
Issue tags:		+Performance

Just tagging appropriately

Comment #10

brianV commented 29 July 2009 at 19:33

Status	File	Size
new	312393-node-update-index-slow.patch	2.33 KB

*facepalms*

... and I forgot to attach the actual patch.

Comment #11

29 July 2009 at 19:50

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #12

jhodgdon

she/her

English

commented 18 February 2010 at 22:00

Adding every node to search_dataset is not OK, in my opinion.

Adding an index to the search_dataset table is an excellent idea. It needs to go into the schema as well as the update function, however.

Comment #13

jhodgdon

she/her

English

commented 5 March 2010 at 18:46

See also #312395: Queries on search admin and node indexing are slow for many-node sites, which is a related issue.

Comment #14

jhodgdon

she/her

English

commented 6 March 2010 at 00:55

The latest comments on that issue above suggest adding more than just this one index...

Comment #15

jhodgdon

she/her

English

commented 6 March 2010 at 16:20

Status:

Needs work

» Closed (duplicate)

Actually, these two issues are basically reporting problems on very similar queries, so I am going to close this as a duplicate of #312395: Queries on search admin and node indexing are slow for many-node sites and change the title of that one so it's a bit more broad.

Comment #16

damien tournoud commented 6 March 2010 at 16:54

I like the patch in #10: because it adds every node in search_dataset when they are saved, no nodes will ever be indexed (except when explicitly marked for reindexing). That's a way to solve the problem, but I can't help suggesting a much more elegant way: disable the search module.

Comment #17

jhodgdon

she/her

English

commented 6 March 2010 at 17:31

No need to be snarky about it... To put it more clearly:

The patch in #10 has an error -- the reindex bit should have been set to a large number (I think?) rather than 0, or the nodes won't ever be indexed.

Performance: node_update_index() slow with large numbers of nodes

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

News items

Our community

Documentation

Drupal code base

Governance of community