Performance: node_update_index() slow with large numbers of nodes

Wesley Tanaka - September 23, 2008 - 18:17
Project:Drupal
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs work
Issue tags:Performance
Description

I've been looking into using Drupal with millions of nodes. One problem: the search indexer is slow. This tends to cause CPU quota errors on shared hosting every time cron runs.

One of the slow queries is in node_update_index():

SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;

Which does a full table scan on the {node} table, and also has the painful "Using temporary; Using filesort"

It's possible to fix this with two changes:

1. Adding an index to search_dataset to avoid the filesort
2. Ensuring that every node in the database has a corresponding row in search_dataset

Change #2 would have the disadvantage of an extra database write every time a node is created. That may or may not be an acceptable tradeoff to the core maintainers.

#1

robertDouglass - September 25, 2008 - 19:02

Sub.

#2

geerlingguy - May 14, 2009 - 15:00
Version:6.4» 6.12

Subscribe... but I don't know if this will ever be fixed.

#3

dgarciad - May 26, 2009 - 21:09

Hi

Could any of you give me more details on how to fix this problem? I have little experience with mysql and the following two actions are not clear enough for me.

>1. Adding an index to search_dataset to avoid the filesort
>2. Ensuring that every node in the database has a corresponding row in search_dataset

Thanks in advance

#4

Wesley Tanaka - May 27, 2009 - 10:56

You can find instructions at
http://wtanaka.com/drupal/million-nodes-6
under the bug 312393 heading

That section header was previously mislabeled as bug 312939, but it's labeled correctly now.

#5

dgarciad - May 27, 2009 - 14:24

Hi

A few days ago I I already tried to fix bug 312939, according to the instructions at http://wtanaka.com/drupal/million-nodes-6.

As I posted to the blog at that page, at step 3, I got the following error:

INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node

MySQL said:

#1062 - Duplicate entry '1-node' for key 1

I have no idea on what's wrong. Any clue?

I am really interested in solving this problem because, since a few weeks ago, the search capabilities of my system are blocked (more than 10,000 files attached. I also commented the problem at http://wtanaka.com/drupal/million-nodes-6).

I am considering other solutions different to core search (plus search files module, such as apache solr + solr attachment, but I have not been able to make it work correctly yet.

Thanks in advance and regards

#6

Wesley Tanaka - June 3, 2009 - 08:04

I see it now and have answered you there: http://wtanaka.com/drupal/million-nodes-6#comment-5941

#7

brianV - July 29, 2009 - 19:27
Version:6.12» 7.x-dev

Here is a first pass at a patch for D7. Sooner or later, the tests will fail, because it uses a hook_update_70XX with the same number as the patch in #336483: Performance: SELECT MAX(comment_count) FROM node_comment_statistics does full table scan.

#8

brianV - July 29, 2009 - 19:27
Status:active» needs review

#9

brianV - July 29, 2009 - 19:33
Title:node_update_index() slow with large numbers of nodes» Performance: node_update_index() slow with large numbers of nodes

Just tagging appropriately

#10

brianV - July 29, 2009 - 19:33

*facepalms*

... and I forgot to attach the actual patch.

AttachmentSize
312393-node-update-index-slow.patch 2.33 KB
Testbed results
312393-node-update-index-slow.patchfailedFailed: 11864 passes, 12 fails, 12 exceptions Detailed results

#11

System Message - July 29, 2009 - 19:50
Status:needs review» needs work

The last submitted patch failed testing.

 
 

Drupal is a registered trademark of Dries Buytaert.