Posted by Wesley Tanaka on September 23, 2008 at 6:21pm
Jump to:
| Project: | Drupal core |
| Version: | 7.x-dev |
| Component: | search.module |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
| Issue tags: | Administration, cron, Performance |
Issue Summary
I've been looking into using Drupal with millions of nodes. One problem is that the admin/settings/search page is very slow. This tends to cause CPU quota errors on shared hosting.
The query causing the problem is this one:
SELECT COUNT(*) FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND (d.sid IS NULL OR d.reindex <> 0)
which has roughly the same problems as the query in node_update_index(), and could be fixed with a similar approach.
Comments
#1
Sub.
#2
Have you tried converting your tables to INNODB ? Just wondering if there would be any performance gain doing that.
#3
I haven't tried INNODB
#4
subscribe
#5
This needs to be fixed in Drupal 7 first, then backported to Drupal 6.
#6
I looked into this for Drupal 7, where the path is now admin/config/search/settings -- just to see what it was doing and verify it's the same in D7... (it is):
In both D6 and D7, the respective path's page callback is search_admin_settings(), and the offending queries are called from code that looks like this in D7 (very similar in D6 except the active modules bit and it's hook_search($op = 'status') instead of hook_search_status()):
foreach(variable_get('search_active_modules', array('node', 'user')) as $module) {if ($status = module_invoke($module, 'search_status')) {
$remaining += $status['remaining'];
$total += $status['total'];
}
}
node_search_status() [D7] or node_search($op='status') [D6] contains the slow query.
#7
So there are a couple of things to consider here:
a) We'll probably be removing the WHERE -- see #239196: Indexing status shown on search settings page is incorrect
b) There is no index in the node table on field status -- there are just a couple of combined ones that include status and some other fields. Not sure why?
c) The only index in search.dataset is the primary key (combination of sid and type).
Maybe we could/should add indexes:
- status field for node
- sid, reindex, type for search_dataset (three different indexes)
I don't know enough about indexes and databases to know whether this would be a good idea or not?
#8
I've just closed #312393: Performance: node_update_index() slow with large numbers of nodes as a duplicate of this issue, because it's pointing out that nearly the same query that happens during node_update_index() is also slow. The query in question is:
SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 100;
This is the same query as above except that the one mentioned here for the Search admin page counts the entries, and this query finds the first N to index.
#9
adding tags
#10
Counting will ever be slow, even if you add an index, as long as the dataset you are counting is large.
Those query do two things: select (or count the number of) nodes marked as needing reindexing (in search_dataset), select (or count the number of) nodes that have never been indexed (ie. are not yet in search_dataset).
There is only one real way to improve that, it is to make sure that every node has an entry in search_dataset (as identified by Wesley Tanaka in the other issue). This way you can limit the query to one table, and you can add the indexes to make it reasonably fast. This is basically what the Apachesolr integration module does.
#11
Really, there is only one way to improve it? I was under the impression that adding indices to database tables generally improves speed, even if there are joins involved? But I'm not any kind of expert in database optimization...
#12
@jhodgdon: in the current situation, the table scan cannot be avoided: we ask the database to list all the rows in {node} that doesn't have an entry in {search_dataset}. The only way to do that is to scan {node}, lookup the value in {search_dataset} ([sid, type] is a primary key, so the database engine will use that) and return the row if no value is found. No index can improve the situation here. Adding a random index will only do one thing: reduce the insert performance.
#13
Subscribe.
#14
Damien Tournoud: Thanks for the explanation... I thought that maybe having those indexes might help with the WHERE part of the operation, but I see your point that the main slowdown is in the join-with-nulls.
So.... Back to the idea of having the node module add each node to the search dataset table when it's first created....
Let's see.
node_update_index() currently finds nodes either never added to {search_dataset} or with their {search_dataset}.reindex bit set to something non-zero, and orders them ascending by .reindex. search_node_insert() and other functions that call search_touch_node() to indicate "this node needs a reindex" are currently putting REQUEST_TIME in .reindex whenever a node is edited.
So if during node creation, the node was added to search_dataset with .reindex=1, that could indicate "new node, high priority for reindex". The way to do that would be to make a search_node_update() function -- i.e. use hook_node_insert().
We would also need to modify the query in node_search_status(), where it calculates $remaining, to remove the join and make it look for .reindex > 1 instead of NULL or non-zero.
That should work, I think?
#15
Here's a separate issue that I noticed while thinking about this issue:
#735154: search_touch_node() should not update if already touched
#16
One big performance improvement can be done if the ORDER BY criterias can be dropped.
At least the node.nid part isn't required from my perspective.
The reindex orderning itself makes sense.
#17