The current implementation of a pager query always runs TWO queries against actual data. MySQL allows you to grab the total number of rows that would have been returned if no LIMIT was present by adding the "SQL_CALC_FOUND_ROWS" statement in your SELECT query.
This causes MySQL to return the normal result set and store the full number of matching records in a variable, which can then easily be retrieved via the FOUND_ROWS() function.
Ie:
SELECT SQL_CALC_FOUND_ROWS nid, title from {node} LIMIT 500,10;
SELECT FOUND_ROWS();
As it stands, the database layer doesn't really allow us to make use of this sort of functionality, but it would be nice if it did. For large tables there is a potentially large performance gain, especially when many changes are made to the data being paged through, as the count query can't be cached in such a case.
I don't doubt that other databases have their own versions of these sorts of tricks, for various kinds of queries and I wonder if it might be worth modifying the database layer to be able to make use of them.
Comment | File | Size | Author |
---|---|---|---|
#56 | pager.inc_.txt | 3.98 KB | sinasalek |
#55 | query.inc_.txt | 2.99 KB | sinasalek |
#53 | database_hint_page_default_pluggeable.patch | 4.19 KB | sinasalek |
#50 | 785782-8.patch | 4.97 KB | cafuego |
#44 | 785782-7.patch | 8.92 KB | cafuego |
Comments
Comment #1
Crell CreditAttribution: Crell commentedSo I think this will be possible to do in D7. If you have a PagerDefault_mysql class, I am pretty sure that will get used automagically instead when running on MySQL. It can then do whatever alternate approach it wants to, as long as the API (which includes the globals) doesn't change.
I'm OK with doing that in D7 as long as it happens before we hit beta.
Comment #2
Damien Tournoud CreditAttribution: Damien Tournoud commentedRefocusing. Pagers are already database-specific in D7.
I'm not completely sure this is a good idea. Naranyan linked me to:
http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or...
Comment #3
cafuego CreditAttribution: cafuego commentedI'll see if I can verify whether that also holds for InnoDB tables. They do *not* keep track of he number of rows they contain like MyISAM does. The perfblog entry is from 2007, when InnoDB was not nearly as widely used as it is these days.
Comment #4
cafuego CreditAttribution: cafuego commentedI now have a workable Drupal 7.x-dev that includes SelectQuery_mysql and PagerDefault_mysql classes.
What I've done is added two protected array variables on the SelectQuery_mysql class, flags and comments. These have associated addComment() and addFlag() methods. In __toString() these are flattened out so that the query would read:
I've also overriden the countQuery method in PagerDefault_mysql as there should never be a separate count query
Then I added $this->setFlag('SQL_CALC_FOUND_ROWS'); in PagerDefault_mysql->execute() so that the flag is set only just as that pager query is about to be run. I no longer call $this->getCountQuery()->execute()->fetchField() but instead execute the query with flag first and then call "SELECT FOUND_ROWS()" and repeat 2 lines of the page numbering calculation.
That's all fine for MySQL.
I was told my a friend that Oracle also supports in-line /* comments */ in queries that that these can be used to affect the query plan. It works fine for SQLite too. I don't have a postgres instance, but I wouldn't be surprised if that supported it too.
So should I initially make a patch that implements the updated SelectQuery_mysql changes with only the query flags and after that perhaps add support for query comments to the base Query class? I realise it would involve changes to __toString() in all query subclasses, which is a bit of a pita I guess. I've not really with with OOPHP before, but would there be an equivalent to the java super() method in __toString() that would sort out the query comments on at the start of the $query variable before adding additional stuff in each subclass?
Comment #5
Crell CreditAttribution: Crell commentedThis issue is just for PagerDefault_mysql. Let's see a patch! No patch, no review.
There's a separate issue for comments: #785782: Support comments in built queries. It should be a separate patch as the two are unrelated to each other.
Comment #6
cafuego CreditAttribution: cafuego commentedYeah, the work was on my laptop, which was booting. Patch is attached now.
... or see http://github.com/cafuego/drupal/tree/flags
Comment #8
cafuego CreditAttribution: cafuego commentedArrgh. is the switch to git done yet? ;-)
Comment #9
cafuego CreditAttribution: cafuego commentedUpdate status to get updated patch tested.
Comment #11
cafuego CreditAttribution: cafuego commentedForgot to remove . from .= after moving the query comments to a different branch, so $query wasn't initialised. D'oh. Submitting updated patch for re-test.
Comment #13
cafuego CreditAttribution: cafuego commentedPoked it and prodded it. Now does correct page number calculations, but do need to run a second query if a user fiddles with $page in the query string.
Comment #14
BerdirDocumentation strings and style needs to be improved a bit. (Also in other places, "Win!" imho doesn't belong into comments that will be there for years :) )
Also, Is there maybe a nicer way to implement this, instead of simply assuming in PagerQuery_mysql that we have a SelectQuery_mysql instance which implements addFlag()?
Do other databases have flags too, or can we re-use something existing?
88 critical left. Go review some!
Comment #15
cafuego CreditAttribution: cafuego commentedMaybe, you tell me ;-)
I'm told Oracle can use inline /* comments */ to twiddle with server internals when doing a query. I have no idea about pgsql and sqlite, so offhand this would seem to be a mysql-specific feature (hence the mysql specific class overrides).
I've looked at SelectQuery::__toString() and I didn't notice any way to leverage what was in there to make this work.
Comment #16
cafuego CreditAttribution: cafuego commentedDo other databases have flags too, or can we re-use something existing?
Yes. Postgres and Oracle (I know Oracle isn't supported - yet) both use a comment-style syntax like:
Oracle: http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/sql_eleme...
PgSQL: http://www.enterprisedb.com/docs/en/8.3R2/perf/Postgres_Plus_Advanced_Se...
I note MySQL and Oracle allow hinting on INSERT as well, So perhaps there is a case to be made for adding a hints or flags variable to the base Query class even.
Comment #17
cafuego CreditAttribution: cafuego commentedI've added a hints string variable to the Query class and a non-chainable setHints() method to set it. This means that all derivative query classes can access the variable and implement hints. In this case I've added them for the MySQL query classes only and they're only used in the PagerDefault_mysql class.
There is no getHints() method, mainly because I can see no case in which a hook of any sort should change the hints string.
The PagerDefault_mysql class::execute() method adds the 'SQL_CALC_FOUND_ROWS' hint to its query, so it no longer needs to run a 'SELECT COUNT(*)' query. It does run 'FOUND_ROWS()', but that only returns a session variable that was set by the preceding SELECT query, it doesn't actually need to look at any data.
The only time at which the MySQL pager now needs to run two queries is when user manually sets the page variable in the query string to a larger number of pages than there is data for. In that case 'FOUND_ROWS()' still returns sane number, but I need to rerun the initial SELECT with a newly calculated range, so the final page of data is returned instead of a blank page.
Comments have been updated.
Comment #18
Crell CreditAttribution: Crell commentedWhy is "hints" a string rather than an array the way comments are? Shouldn't it also be an array?
Why are we only using hints on INSERT and SELECT queries? Does MySQL have no useful hints available on other query types?
The comment blocks in the MySQL pager query (which are awesomely complete) should not have an extra blank comment line before the code they describe.
Given the way the autoloader works I'm slightly concerned about putting the MySQL pager implementation in that file. I guess it's getting picked up by the registry autoloader rather than the DB autoloader? Are we sure that's not going to break somewhere?
Comment #19
cafuego CreditAttribution: cafuego commentedWhy is "hints" a string rather than an array the way comments are? Shouldn't it also be an array?
I was contemplating that. For MySQL hints are invariably one (maybe two) all-uppercase keywords. On PgSQL it seems they can be quite complex expressions, from what I gather and I'm not sure if concatenating more than one of them together woulnd't horribly break queries. By having a single string to set, I avoid potentially having that happen.
There's no way to accidentally add multiple hints, you have to explicitly add them all at the same time.
Why are we only using hints on INSERT and SELECT queries? Does MySQL have no useful hints available on other query types?
I thought I'd done MERGE as well... must've forgotten. It was my aim to stick them in *all* query types.
The comment blocks in the MySQL pager query (which are awesomely complete) should not have an extra blank comment line before the code they describe.
Easy fixed.
Given the way the autoloader works I'm slightly concerned about putting the MySQL pager implementation in that file. I guess it's getting picked up by the registry autoloader rather than the DB autoloader? Are we sure that's not going to break somewhere?
Uh. What's the autoloader? ;-)
Should I keep the DefaultPager_mysql class in the mysql/query file or somesuch?
Comment #20
Crell CreditAttribution: Crell commentedSince I cannot think of any case where you'd be loading an extender class without first loading Query, yeah, let's just put it in mysql/query.inc. The alternative is driver-specific autoloaders, which I don't want to deal with.
Regarding string vs. array for the hints, Hrm. Fair point. That should be documented then that only one "hint" is allowed due to lack of standardization or something. Probably in the docblock for the method, to note that calling it a second time will override rather than add to, the way it would with most other things.
I'd skip merge query hints right now. Merge queries are getting seriously screwy and it looks like they won't be single-query on anything soon: #715108: Make Merge queries more consistent and robust.
Comment #21
cafuego CreditAttribution: cafuego commentedSince I cannot think of any case where you'd be loading an extender class without first loading Query, yeah, let's just put it in mysql/query.inc. The alternative is driver-specific autoloaders, which I don't want to deal with.
Done.
Regarding string vs. array for the hints, Hrm. Fair point. That should be documented then that only one "hint" is allowed due to lack of standardization or something. Probably in the docblock for the method, to note that calling it a second time will override rather than add to, the way it would with most other things.
Done. That's also why the method is setHints() rather than addHint() :-)
I've added DeleteQuery_mysql, TruncateQuery_mysql and UpdateQuery_mysql. Left MergeQuery_mysql as it was in the previous version of this patch.
Comment #22
cafuego CreditAttribution: cafuego commentedPoke test bot.
Comment #24
cafuego CreditAttribution: cafuego commentedSince I cannot think of any case where you'd be loading an extender class without first loading Query, yeah, let's just put it in mysql/query.inc. The alternative is driver-specific autoloaders, which I don't want to deal with.
Turns out this is an issue in install.php. It loads query.inc but *not* pager.inc, which means that it can't find the PagerDefault class and thus exits with a lovely parse error:
Fatal error: Class 'PagerDefault' not found in /srv/www/.../includes/database/mysql/query.inc on line 413
Comment #25
Damien Tournoud CreditAttribution: Damien Tournoud commentedOk, I suggest we refocus this issue on getting the underlying infrastructure to do that done.
I'm -1 on changing the default way we do count on MySQL, especially not with a *lot* of testing. We know how the current solution scales, we don't know anything about SQL_CALC_FOUND_ROWS, except that it had a lot of issues in some older versions of MySQL.
Let's simply build the infrastructure we need to make that possible in a contrib driver.
Comment #26
cafuego CreditAttribution: cafuego commentedI'm -1 on changing the default way we do count on MySQL, especially not with a *lot* of testing. We know how the current solution scales, we don't know anything about SQL_CALC_FOUND_ROWS, except that it had a lot of issues in some older versions of MySQL.
I assume the issues you refer to aren't so much problems as supposed speed decreases over COUNT(*). All benchmarks I've seen are pretty old and only test performance on MyISAM, which has a distinct advantage when doing COUNT(*) without a WHERE clause. MyISAM tables store the number of rows they contain in a magic variable, which is simply read out. The moment a WHERE clause is added it has do a table scan.
I'm not convinced most pager queries in Drupal lack a WHERE clause. Even if they do, D7 will default to InnoDB, not MyISAM. That means it'll do a table scan each time anyway.
There is a potential performance issue where Drupal is run with MySQL servers on Linux, which use the default distro-provided configuration file. The issue there is not whether SQL_CALC_FOUND_ROWS is faster than COUNT(*) but the fact that no Linux distro I'm aware of actually ships MySQL with a default configuration file that does anything but simply enable InnoDB. They don't configure it at all.
I've spent some time writing a test script that does a set of pager queries on the node table. I've generated 5000+ nodes using Devel Generate and this script does a defined number of COUNT and SELECT queries to retrieve nids in 50 block increments as per the admin/content page.
Using my own fairly sensibly configured 5.0.87 I find that SQL_CALC_FOUND_ROWS + FOUND_ROWS() is approximately twice as fast as doing a COUNT(*) + SELECT. That's regardless of whether MyISAM or InnoDB is used.
It's also worth noting that this is on an otherwise idle machine where all data easily fits in RAM and the OS file cache. On a real-world production machine this is unlikely to be the case and I/O contention and other processes will likely push the figures higher.
I want to do some more testing on more realistic infrastructure and grab more numbers. I'll write that up as a blog post as well.
Comment #27
Crell CreditAttribution: Crell commentedWhich older versions? Remember that we're requiring MySQL 5 GA at minimum, so if SQL_CALC_FOUND_ROWS was buggy in MySQL 4.x, we don't care.
cafuego, a blog post with solid data sounds great. Please link it here when you have it ready.
Comment #28
Damien Tournoud CreditAttribution: Damien Tournoud commentedThat's exactly what I'm saying. We don't know. This needs to be field-tested (for example here, on drupal.org if someone makes a backport to D6) before considering getting it in, IMO.
Comment #29
Crell CreditAttribution: Crell commentedThen on what grounds are you saying it's buggy on some versions? Do you have links to MySQL bug tracker issues, blog posts, something to back up that statement? I'm not willing to just assume that MySQL functionality is broken until scientifically proven otherwise, despite MySQL's often unimpressive track record. ;-)
Comment #30
Damien Tournoud CreditAttribution: Damien Tournoud commentedOne data point is:
http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or...
Quite old (2007), but remember that 5.0.15 (the lowest version Drupal 7 supports) was released on October 19, 2005.
Comment #31
Crell CreditAttribution: Crell commentedInteresting. Then why is cafuego getting exact opposite results?
I really wish databases would be more cooperative and predictable... *sigh*
Comment #32
cafuego CreditAttribution: cafuego commentedThen why is cafuego getting exact opposite results?
Arjen says Alexey often runs test on pretty big hardware, with lots of cores and lots of RAM and no other workload during the test. He doesn't say in his blog post what those specs are, nor what MySQL version he has used. He's also tested with ORDER BY and I have not as yet done that.
Sofar the only tests I've run are on a table with 5000 rows on both an idle server with 2 cores and sufficient RAM and a single core VM under load. Under load I get up to a factor 2 speed increase. That is with the query cache enabled and set to *sane* values.
On the idle server I do get slightly more varying results. Not massive differences like Alexey, but on tiny datasets where statistical variance and client-server round trip is likely to play a role there's a small speed decrease or no difference.
By tiny datasets I mean 10 queries that run with LIMIT 0,50 through to LIMIT 450,50. Larger datasets are 200 consecutive queries of the same size, hopping through the table from rows 0 to 4950.
I'm running tests on 19 million rows today. That's 1.5GB of node table and converting that to and from InnoDB and MyISAM is taking a while between each run :-)
Comment #33
cafuego CreditAttribution: cafuego commentedRight, the results are in: http://cafuego.net/2010/05/26/fast-paging-real-world
Does it scale? Pretty well, akshelly ;-)
Comment #34
Damien Tournoud CreditAttribution: Damien Tournoud commentedInteresting results, but you are running a non-indexed query. As such, it's perfectly logical that doing two queries takes about twice the time of doing one. It's not exactly twice the time because the first query warm the data structures for the second one. Can you repeat your testing with a properly indexed query?
My guess is (in line of the findings of Alexey) that SQL_CALC_FOUND_ROWS is only useful on badly indexed queries. On properly indexed queries, it's probably a net loss, especially as the size of the dataset increase.
Another thing to consider: as soon as you enable the query cache, the count query gets cached for *every* pagination window. So properly indexed queries using the query cache will pay the additional cost several times.
All things considered, I'm pretty sure this change is a bad idea.
Comment #35
cafuego CreditAttribution: cafuego commentedI'm re-running the test after adding an index on node.language (I suggest that should be filed as bug against D7 anyway) and allowing the count(*) to be cached. I initially forgot to change offset increases by the same amount of old style and new style queries - new style were doing 40k increases, old style only 10k.
However, the initial results suggest there is no difference in performance once the query is properly indexed.
On MyISAM new style is consistently faster, but the we're talking small percentages that may as well be down to round trip latency than anything else.
InnoDB seems slower with new style over old style but that difference all but disappears when I run batches of more queries, so that would appear to be due to randomness in client/server latency too.
InnoDB appears slower than MyISAM again due to the initial data load, but after that first query it's as fast.
So I guess a case for actually using SQL_CALC_FOUND_ROWS is that it will make non-indexed queries (hello, CCK/Views) about twice as fast as otherwise whilst not affecting indexed queries either way.
Comment #37
Damien Tournoud CreditAttribution: Damien Tournoud commentedNot exactly, the real case (as described in #34) is:
Why the "generally" in the first point: because when the database engine doesn't have to do a sort (which, when not indexed, requires the whole dataset to be read and sorted, regardless of the limit) the optimizations we have in
countQuery()
(remove fields, expressions, sort) might make counting separately faster then scanning the whole dataset.Comment #38
Damien Tournoud CreditAttribution: Damien Tournoud commentedI suggest we stick to #25: add the underlying infrastructure so that we could experiment and implement more advanced drivers in contrib easily.
Comment #39
cafuego CreditAttribution: cafuego commentedYeah, but I'm not seeing that. With the query cache enabled and the index on language added I see no significant difference in speed between the old select and the new select.
The average time the queries take differs randomly between old faster and new faster on shorter runs (10-50) and pretty much goes to 0 on longer runs (100-200). The actual measured times in all those cases are 0.01 or 0 ... and that's the best resolution the mysql client will give me. I'm pretty sure that most of that time is latency, as opposed to anything significant in query speed.
Comment #40
cafuego CreditAttribution: cafuego commented... having said that ... there's no reason to implement hints first anyway ;-)
Attached is a version of that patch that does not include PagerDefault_mysql but that does implement hints on the existing MySQL and PgSQL query classes. I guess I should re-implement the remaining PgSQL classes in pgsql/query.inc as well?
Comment #42
cafuego CreditAttribution: cafuego commentedFix syntax error. Try again.
Comment #44
cafuego CreditAttribution: cafuego commentedOk, renamed patch file so firefox can't upload the wrong one again :-P
Comment #45
Crell CreditAttribution: Crell commentedWait, I'm confused now. I see Damien saying one thing, and cafuego's benchmarks saying "none of this makes any difference anyway". Sometimes. :-) Can someone clarify here? cafuego, I don't get your comment in #40...
Comment #46
cafuego CreditAttribution: cafuego commentedMy comment in #40 is regarding removing PageDefault_mysql from this patch and implementing just hints on the existing Query classes for now. The patch I attached above does just that, it adds support for hints to all existing Query classes. However, in the includes/databases/pgsql/query.inc file I've only added hints to the InsertQuery_pgsql class sofar; I've not overriden __toString() in UpdateQuery_pgsql and SelectQuery_pgsql yet.
As for the SQL_CALC_FOUND_ROWS method not making a difference ... it seems that when the pager queries use properly indexed rows, there is not (as far as I can determine) a measurable speed difference between using it or COUNT(*) in a separate countQuery. Where it does make a difference is in queries that use rows that aren't indexed (in a WHERE, JOIN, LIMIT or ORDER clause).
The content list is such a query, but I've submitted a patch to add an index to the {node}.language field. The user listing seems to only use indexed fields, so there will not be a speed gain there either.
Where there will be a potential speed gain is Views doing pager queries; I just had a look as a text field I added via fieldapi, and there's no index on the value column... so if this is included in a pager query the SQL_CALC_FOUND_ROWS would outperform the COUNT(*) one.
Comment #47
Crell CreditAttribution: Crell commentedWell given how frequently Views gets used, it makes sense to try and make that faster. However, this is also related there: #423888: Use subqueries for ->countQuery(), at least for MySQL. I'd rather get one of the Viewsies in here for a second opinion, though.
As long as there is no case where the new method ends up being slower, I'm OK with moving forward after we get some input from the Views folks.
Comment #48
retester2010 CreditAttribution: retester2010 commented#44: 785782-7.patch queued for re-testing.
Comment #50
cafuego CreditAttribution: cafuego commentedUpdated the patch to apply cleanly against the latest CVS.
How is the pager class going, can that be abstracted out yet, so I can make my poor server run less queries in the pager? ;-)
Comment #51
cafuego CreditAttribution: cafuego commentedComment #52
sinasalek CreditAttribution: sinasalek commentedFor enterprise applications with complex queries i think it's essential to be able to leverage each Database special capabilities to overcome performance issues. We're in such a situation and having this flexibility prevents the need to hack the core
Comment #53
sinasalek CreditAttribution: sinasalek commentedThe previous patch does not work for the latest version of Drupal.
Reroll for Drupal 7.32
Comment #55
sinasalek CreditAttribution: sinasalek commentedI also attached pager plugin for mysql driver,
However i do not know how it can replace the core pager, it does not seem to be doing anything.
Any help is apperiticated, i'm willing to make this land
Comment #56
sinasalek CreditAttribution: sinasalek commentedWrong file
Comment #57
dcam CreditAttribution: dcam commentedThis change will need to be made in Drupal 8 first. That's why the version was changed.
Comment #58
mgiffordJust unassigning so someone else can take this on.