This query does not scale to large sets:
SELECT n.*, u.name FROM node n INNER JOIN users u ON n.uid = u.uid ORDER BY n.changed DESC LIMIT 0, 50;
50 rows in set (3 min 22.24 sec)
It is because it has to do the join and a sort - David Strauss can explain it much better. In any case, on a 2 core relatively modern machine and 4G of RAM, this query takes 3 minutes with 4 million nodes.
I suggest that we reduce the query to this:
SELECT * FROM node ORDER BY changed DESC LIMIT 0, 50;
50 rows in set (0.14 sec)
And then we grab the user name with 50 separate queries while building the table. This guarantees a relatively flat performance no matter what the size of the node table.
I've submitted this for D6 but chances are it affects D7.
| Comment | File | Size | Author |
|---|---|---|---|
| #50 | remove-author-sort.patch | 774 bytes | jbrown |
| #24 | node-indexes-D7.patch | 837 bytes | robertdouglass |
| #24 | node-indexes-D6.patch | 808 bytes | robertdouglass |
| #21 | node-indexes-D6.patch | 806 bytes | robertdouglass |
| #20 | node-indexes.patch | 835 bytes | robertdouglass |
Comments
Comment #1
robertdouglass commentedhere's a sample patch that follows the above suggestion and removes the 3 minute page load time I was experiencing.
Comment #2
gábor hojtsyIt would make sense to document this then in the code.
Comment #3
gerhard killesreiter commentedI get this explain on a copy of d.o, the query itself executes almost instantly.
Comment #4
robertdouglass commentedGerhard, have you added indexes that I don't seem to have?
Comment #5
robertdouglass commentedDoes EXPLAIN return different results depending on the size of memory available? ie. if I truncate my table to 100K rows, maybe I'd get the same EXPLAIN you do?
Comment #6
robertdouglass commentedSorry - I completely don't understand this. Why does your explain start from n as the first table and mine starts with u?
Comment #7
robertdouglass commentedComment #8
robertdouglass commentedmysqladmin Ver 8.41 Distrib 5.0.51a, for debian-linux-gnu on i486
Comment #9
Tresler commentedTook me a bit to find it online, Robert. Page 165 in this book http://www.scribd.com/doc/3380730/High-Performance-MySQL-Chapter-4 This is a case of the MySQL optimizer that is built into the MySQL preprocessor making 2 different decisions. I think that chapter (it's been a while since I read it' tells you how to override the MySQL optimizer to compare the same query /lookup/ on different architectures.
Comment #10
nicholasthompsonOn www.thingy-ma-jig.co.uk (although I dont even have 1,000 nodes or users) this is my EXPLAIN/Key results... I'm running DRUPAL-6-9...
MySQL Info...
Comment #11
misty3 commentedSubscribed.
Comment #12
gerhard killesreiter commentedA difference might be that d.o runs innoDB.
Comment #13
nicholasthompsonChanging it to a LEFT JOIN appears to be better
Comment #14
catchThis patch grabs the nids then does node_load_multiple() for the rest of the fields (because it needs the full node object for other reasons). #301902: Allow more users to see the node admin page
If a version of that goes into D7 at some point, then this should stay as D6.
Comment #15
robertdouglass commentedThis patch avoids duplicate user lookups.
Comment #16
Pedro Lozano commentedCould it be faster to store the first query results in a temporary table and do a join of that with the user table?
Comment #17
robertdouglass commentedIf using a temp table were faster MySQL would be doing it already (might be). Temp tables have earned a bad rep in Drupal.
Comment #18
david straussThis is not one of those cases, at least in the sense I've talked about. Cases where you split WHERE and ORDER BY criteria over a JOIN are impossible to fully optimize on MySQL because there's no data structure you can traverse to narrow your working set quickly.
On the contrary, this seems to be a case of MySQL's optimizer being not perfectly smart, though I can't blame it too much for not making this optimization.
For the case of
SELECT * FROM node ORDER BY changed DESC LIMIT 0, 50;, we have a fairly simple, well-optimized query type. MySQL can traverse the index on "changed", starting with the largest values, until it has 50 rows.For the case of
SELECT n.*, u.name FROM node n INNER JOIN users u ON n.uid = u.uid ORDER BY n.changed DESC LIMIT 0, 50;, we're dealing with something quite a bit more complex because INNER JOINs can increase or decrease the number of results rows returned for each row from the left-hand table. (No matching row in the right-hand table prunes the result row. Multiple matching rows in the right-hand table multiplies the result rows.)Because we're JOINing on the primary key of the users table, we can assume that the INNER JOIN won't increase the number of results in the query. MySQL may know this too, but it turns out not to matter because of the pruning problem.
Despite the fact that *we* know that every node author has a corresponding row in the users table, MySQL doesn't know that, nor could it from the schema alone. Without the knowledge that every node row has exactly one corresponding user row, MySQL could try fetching rows probabilistically.
But MySQL *will not* do this sort of probabilistic query execution plan:
The only alternative -- given that MySQL won't probabilistically pull rows and go back for more -- is for MySQL to get all the rows from the node table, perform the join, and then lop off the top 50.
This is why a LEFT JOIN could be much faster. With the knowledge that rows from the node table will *never* be pruned based on whether there's a user match (even though there always will be one in Drupal right now), MySQL can comfortably pull 50 rows from the node table, perform the LEFT JOIN against the users table, and return the results. There's no risk of having to go back and fetch more node rows. I'm not 100% sure MySQL makes this optimization, but the empirical results in #13 suggests that it does.
Other alternatives that could avoid hitting the INNER JOIN problem above:
* Selecting 50 node rows into a temporary table, and then joining the temp table against users. I think this is messy.
* Pulling just the node rows and fetching user records in separate queries. There's a patch above for this, but I don't like this approach because of the added latency of more queries.
* Fetching user records using correlated subqueries.
Comment #19
david straussAlso, temp tables aren't inherently bad. They're just *often* bad because they result in MySQL writing huge working set tables out to disk. If your tables are big enough that temp tables no longer fit in RAM, MySQL switches to using the disk, and it's like getting kicked when you're already down.
It's not terrible to use temp tables as small, fixed-size working sets, which is what one person is suggesting above.
Comment #20
robertdouglass commentedSo, once again, adding the right index solves this. Gerhard's EXPLAIN works better because Drupal.org is running tracker2, which apparently adds an index on {node} (uid, status, changed). Here's a patch for D7 that adds both the {node} language and the {node} (uid, status, changed) indexes. These two indexes make the content administration page nearly uniformly fast on my site with 100,000 nodes and my site with 4,000,000 nodes.
Comment #21
robertdouglass commentedHere's a D6 version.
Comment #22
gábor hojtsyIn both #21 and #20, this line does not seem to be right. It does not list actual field names:
db_add_index($ret, 'node', 'node_uid_status_changed', array('uid_status_changed'));Did you mean 'uid', 'status', 'changed'?
Comment #23
damien tournoud commentedThe content filter uses conditions on the following columns on the node table: promote, status, sticky, type and language. What is the minimal set of indexes we need to optimize all those queries?
Is there a way to compute that cleanly? Here is my gut feeling:
Comment #24
robertdouglass commentedOh, yes. Oops. Rerolled.
Here's the explain from above now with the new index:
Comment #25
robertdouglass commentedBut I'm not sure that this brings the performance I claimed. Note to self - turn query cache off when testing these things. I don't think the indexes help much. I haven't tried the temp table approach. The "many smaller queries" from #15 is the fastest option, imo.
Comment #26
robertdouglass commentedHere's the difference #15 makes:
Comment #27
david straussIt doesn't really make sense for an index on {node} (uid, status, changed) to help here, anyway. :-)
Such an index is useless unless uid is part of WHERE or ORDER BY criteria. MySQL cannot skip the first part of an index. Even the second part of the index wouldn't help because we're not filtering or sorting by published status. The third part of the index would be useful if MySQL could skip the first two parts, but we already have an index on the "changed" value.
The extra index on the node table for Drupal.org is not for Tracker 2. It was for an earlier attempt at a fast Tracker.
Comment #28
david strauss@robertDouglass How does a LEFT JOIN affect your results?
Comment #29
mikey_p commentedJust checking, but won't this break the tablesort in D7?
Comment #30
robertdouglass commented@David Strauss - LEFT JOIN didn't speed anything up. It appeared to slow things down at smaller table sizes, and was similar at larger. I haven't tried temporary tables yet.
Comment #31
robertdouglass commented@mikey_p: Hmm. Yes. You'd no longer be able to sort the table on user. I'd argue that getting the page to load in under 1 minute is more important than sorting on user (you only get the beginning or the end of the list of users... table sorting is only marginally useful, imo). This will definitely make my initial approach more difficult to get committed, though.
Comment #32
catchCan't see anyone actually using sort by user on this page to be honest. Did we add that in Drupal 7? If so having this load nice and snappy (and improving the filters) seems like a much bigger win.
Comment #33
david straussThis would be the "correlated subquery" option:
SELECT n.*, (SELECT u.name FROM users u WHERE u.uid = n.uid) AS name FROM node n ORDER BY n.changed DESC LIMIT 0, 50How does that test on your system?
Comment #34
mikey_p commentedTo be honest, I could care less about the tablesort on author as well, I was just throwing that out for consideration. I would much rather get the filters fixed to be able to do all kinds of nifty filters (such as a filter on author with author name as an autocomplete field) than worry about keeping the tablesort for every field.
Also keep in mind, to get the better filters this should be ported to use a dynamic query as soon as #299267: Add "Extender" support to SELECT query builder Assigned to: Crell lands.
Will porting this to a dynamic query affect the work done optimizing this query here?
Comment #35
andreiashu commentedsubscribing.
about #33: is mysql caching subqueries by default ?
Comment #36
david straussNo.
MySQL caches entire result sets for full (non-sub) queries. It does not care how you generated said results. I do not believe it performs any specific subquery-level caching.
Comment #37
nerkn commentedThis query locks tables and other queries wait. Wonderful life saver for me. After applying patch every thing is fine now.
Comment #38
killes@www.drop.org commentedChanging title and status (to get a retest).
Comment #39
catchWhy can't we do one query to get the nodes, loop to get uids, then a query to get the names based on those uids? That'd save the lots of little queries complaint above.
Comment #40
robertdouglass commentedI like catch's suggestion in #39. It seems like a sensible refinement of the approach in #15 which gave the best performance results. I'm changing the title again because it's not to be assumed that adding indexes is the right solution here.
Comment #41
david straussWhy aren't we using the correlated subquery option?
Comment #42
damien tournoud commentedOther question: can we compare MySQL version of Drupal.org and the one Robert is using? That's the only reasonable explanation that I have.
Comment #43
david strauss@Damien
For Drupal.org:
Server version: 5.0.70-log Gentoo Linux mysql-5.0.70-r1Comment #44
robertdouglass commentedsorry - that setup is no longer available to me.
Comment #45
noahterp commentedWhen testing on one server (Drupal 5.18, MySQL 5.1.x), adding a simple
WHERE n.changed > 0to that node.module line solved my problem. The original query would run ad-nauseam and simply lock-up MySQL. But adding that extraneousWHEREmust have altered how MySQL sorts the temp table, and it runs the query in a few hundred milliseconds.But, this solution worked on only one version of my site. Our development server copy with a different MySQL 5.0.x configuration was not affected -- although the original query only took 8-10 seconds there versus forever.
Comment #46
damien tournoud commentedOk, apparently the MySQL query planner is being really dumb in that case. Because this query uses a INNER JOIN on a primary key, the planner can choose to execute it starting from node or starting from users. The problem that some people here are facing is that it sometimes chooses to start from users.
This is apparently the case (looking at #4) when the users table has a very low cardinality (in Robert example, there are only 2 users). It looks like the planner is not taking the cost of the sort into account.
What explains #45 is that when you are adding a condition on the node table, the query becomes asymetric again, and the planner has no choice then to start from the node table.
The max_seeks_for_key MySQL configuration parameter could help the planner make a smarter choice, but it looks like it's time to open a bug against MySQL.
Comment #47
Tresler commentedBefore filing a mysql bug can we confirm that this is an optimizer problem by forcing the optimizer on a slow machine to see if it fixes the problem? To me that would confirm the issue as being in MySQL's pre-processor (supposedly better in MySQL 6...)
http://dev.mysql.com/doc/refman/5.0/en/index-hints.html
For that matter, I have no idea if index hints are built into the database abstraction layer? are they?
Comment #48
catchWe now do a straight query to get nids, then node_load_multiple() in node_admin_nodes() since each row of that table is now checked for node_access(). Is there anything else needed here?
Comment #49
jbrown commentedYes - you can still sort by author, which results in:
PDOException: SQLSTATE[42S22]: Column not found: 1054 Unknown column 'u.name' in 'order clause': SELECT n.nid AS nid FROM {node} n WHERE (n.type = :db_condition_placeholder_0) ORDER BY u.name ASC LIMIT 50 OFFSET 0; Array ( [:db_condition_placeholder_0] => page ) in PagerDefault->execute() (line 93 of /home/jonny/sites/drupal-7.x-dev/includes/pager.inc).Comment #50
jbrown commentedComment #51
catchOoh, nice find.
Comment #52
jbrown commentedcan we get this committed and out of the way?
Comment #53
webchickOk.