I've been looking into using Drupal with millions of nodes. One problem: the search indexer is slow. This tends to cause CPU quota errors on shared hosting every time cron runs.
One of the slow queries is in node_update_index():
SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;
Which does a full table scan on the {node} table, and also has the painful "Using temporary; Using filesort"
It's possible to fix this with two changes:
1. Adding an index to search_dataset to avoid the filesort
2. Ensuring that every node in the database has a corresponding row in search_dataset
Change #2 would have the disadvantage of an extra database write every time a node is created. That may or may not be an acceptable tradeoff to the core maintainers.
Comment | File | Size | Author |
---|---|---|---|
#10 | 312393-node-update-index-slow.patch | 2.33 KB | brianV |
Comments
Comment #1
robertDouglass CreditAttribution: robertDouglass commentedSub.
Comment #2
geerlingguy CreditAttribution: geerlingguy commentedSubscribe... but I don't know if this will ever be fixed.
Comment #3
dgarciad CreditAttribution: dgarciad commentedHi
Could any of you give me more details on how to fix this problem? I have little experience with mysql and the following two actions are not clear enough for me.
>1. Adding an index to search_dataset to avoid the filesort
>2. Ensuring that every node in the database has a corresponding row in search_dataset
Thanks in advance
Comment #4
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedYou can find instructions at
http://wtanaka.com/drupal/million-nodes-6
under the bug 312393 heading
That section header was previously mislabeled as bug 312939, but it's labeled correctly now.
Comment #5
dgarciad CreditAttribution: dgarciad commentedHi
A few days ago I I already tried to fix bug 312939, according to the instructions at http://wtanaka.com/drupal/million-nodes-6.
As I posted to the blog at that page, at step 3, I got the following error:
INSERT INTO drupal_search_dataset( sid,
TYPE , reindex )
SELECT nid AS sid, 'node' AS
TYPE , 2147483647 AS reindex
FROM drupal_node
MySQL said:
#1062 - Duplicate entry '1-node' for key 1
I have no idea on what's wrong. Any clue?
I am really interested in solving this problem because, since a few weeks ago, the search capabilities of my system are blocked (more than 10,000 files attached. I also commented the problem at http://wtanaka.com/drupal/million-nodes-6).
I am considering other solutions different to core search (plus search files module, such as apache solr + solr attachment, but I have not been able to make it work correctly yet.
Thanks in advance and regards
Comment #6
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedI see it now and have answered you there: http://wtanaka.com/drupal/million-nodes-6#comment-5941
Comment #7
brianV CreditAttribution: brianV commentedHere is a first pass at a patch for D7. Sooner or later, the tests will fail, because it uses a hook_update_70XX with the same number as the patch in #336483: Performance: SELECT MAX(comment_count) FROM node_comment_statistics does full table scan.
Comment #8
brianV CreditAttribution: brianV commentedComment #9
brianV CreditAttribution: brianV commentedJust tagging appropriately
Comment #10
brianV CreditAttribution: brianV commented*facepalms*
... and I forgot to attach the actual patch.
Comment #12
jhodgdonAdding every node to search_dataset is not OK, in my opinion.
Adding an index to the search_dataset table is an excellent idea. It needs to go into the schema as well as the update function, however.
Comment #13
jhodgdonSee also #312395: Queries on search admin and node indexing are slow for many-node sites, which is a related issue.
Comment #14
jhodgdonThe latest comments on that issue above suggest adding more than just this one index...
Comment #15
jhodgdonActually, these two issues are basically reporting problems on very similar queries, so I am going to close this as a duplicate of #312395: Queries on search admin and node indexing are slow for many-node sites and change the title of that one so it's a bit more broad.
Comment #16
Damien Tournoud CreditAttribution: Damien Tournoud commentedI like the patch in #10: because it adds every node in search_dataset when they are saved, no nodes will ever be indexed (except when explicitly marked for reindexing). That's a way to solve the problem, but I can't help suggesting a much more elegant way: disable the search module.
Comment #17
jhodgdonNo need to be snarky about it... To put it more clearly:
The patch in #10 has an error -- the reindex bit should have been set to a large number (I think?) rather than 0, or the nodes won't ever be indexed.