Improve performance of the facetapi_get_*_date() callbacks [#1223592]

On a database with a large number of nodes the SELECT MAX($facet_field) and SELECT MIN($facet_field) queries in apachesolr_search_date_range() produce extremely slow queries.

The patch that follows address this.

Comment	File	Size	Author
#14	facetapi-get-date-performance-1223592-14.patch	1.23 KB	damontgomery
#2	1223592-6.x-1.x.patch	1.18 KB	elliotttf
#1	1223592-6.x-1.x.patch	234.67 KB	elliotttf
#1	1223592-6.x-2.x.patch	1.18 KB	elliotttf

Comments

Comment #1

elliotttf commented 19 July 2011 at 18:19

Status:

Active

» Needs review

Status	File	Size
new	1223592-6.x-2.x.patch	1.18 KB
new	1223592-6.x-1.x.patch	234.67 KB

Here are patches for the 6.x-1.x and 6.x-2.x branches. It appears this does not happen in 7.x-1.x.

Comment #2

elliotttf commented 19 July 2011 at 18:25

Status	File	Size
new	1223592-6.x-1.x.patch	1.18 KB

Sorry, 1.x patch was f'd. Here's a re-roll.

Comment #3

pwolanin commented 13 October 2011 at 00:43

Is that ORDER BY actually faster?

Better would be to cache the result possibly?

Comment #4

janusman commented 13 October 2011 at 21:30

Apparently, if there is no index, "it depends". See: http://stackoverflow.com/questions/426731/min-max-vs-order-by-and-limit

I think the problem here is that MySQL can't quickly answer this question because we either use the node_status index, or the node_created (or node_changed) index (see node_schema()).

A possible alternative:

Option A) drop the AND status = 1 from the query altogether. It wouldn't be such a great deal since we might already have cases where we're ignoring certain nodetypes, we have access control modules in place, etc.

Option B) Force the index to use. This is a short non-scientific benchmark of 4 variations on an installation with 75,000 nodes:

29ms: SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1
127ms: SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_status) WHERE status=1
40ms: SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1
140ms: SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_status) WHERE status=1 ORDER BY created DESC LIMIT 0,1

So, uhm, #1 seems to be the quickest on this non-scientific quick-and-dirty test!

However it's clear that picking the correct index does make a difference (in this case picking node_status is always slower, but that might depend on the site.. in my case 6% of nodes have status = 0 vs 94% = 1) Maybe someone can try to reproduce this =)

Option C) Any other suggesitons?

Comment #5

elliotttf commented 14 October 2011 at 18:22

Repeating your queries on a database with 392,397 nodes I got the following (non-scientific) results:

0.27 sec SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1;
0.68 sec SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_status) WHERE status=1;
0.00 sec SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1;
1.02 sec SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_status) WHERE status=1 ORDER BY created DESC LIMIT 0,1;

And as a control, here's a comparison of the original queries:

0.69 sec SELECT SQL_NO_CACHE MAX(created) FROM node WHERE status=1;
0.01 sec SELECT SQL_NO_CACHE created FROM node WHERE status=1 ORDER BY created DESC LIMIT 0,1;

Comment #6

janusman commented 16 October 2011 at 02:51

@eliotttf: yeah, after some testing I also came up with 0.000 seconds for #3! I had to switch between 3 differently-named copies of the databases, and the first time I used query #3 I came up with nonzero time (that's the time I posted in #4) but any subsequent queries came took a reported 0.000 seconds. So I'm guessing MySQL is somehow caching, either ignoring SQL_NO_CACHE... or that cache lives somewhere else...? Dunno =)

So... anyways... the MAX...ORDER BY pattern is sounding like a possible winner because it just works better in real-world tests? (And, perhaps, even better when we force the index to use?)

Do we need more research? I bet someone here knows a MySQL guru that can explain the 0.000-second weirdness, just to make sure we're not choosing something worse than the current query.

Comment #7

elliotttf commented 16 October 2011 at 12:41

Again, unscientific, but it looks like the MAX query isn't actually using a key and is doing a full table scan:

mysql> EXPLAIN SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1;
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
|  1 | SIMPLE      | node  | index | NULL          | node_created | 4       | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
1 row in set (0.06 sec)

mysql> EXPLAIN SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | node  | ALL  | NULL          | NULL | NULL    | NULL | 393569 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.11 sec)

I'm not too familiar with the innards of MySQL, but this result seems contrary to the answer supplied on Stack Overflow. It would be interesting to see if others are seeing the same kind of behavior, or as you say, getting someone in here that knows more about how MySQL does this kind of magic.

Comment #8

nick_vh

he/him

Ghent

commented 20 December 2011 at 19:01

Version:

6.x-2.x-dev

» 6.x-1.x-dev

The 6.x-2.x version has been deprecated. Moving this to 6.x-1.x-dev

Comment #9

20 December 2011 at 19:02

Status:

Needs review

» Needs work

The last submitted patch, 1223592-6.x-1.x.patch, failed testing.

Comment #10

nick_vh

he/him

Ghent

commented 18 January 2012 at 21:09

Project:	Apache Solr Search	» Facet API
Version:	6.x-1.x-dev	» 7.x-2.x-dev

Moving this to facetapi since facetapi_get_max_date and facetapi_get_min_date could use this!

Comment #11

cpliakas commented 19 January 2012 at 16:22

Category:

bug

» task

In agreement that this needs to improve, and I also agree that Facet API is probably the best candidate to help solve this. Since the code is working as is I am changing from a "bug report" to a "task", but I am definitely willing to get a viable solution into the 1.0 release.

Big thanks to janusman and elliotttf for the non-scientific benchmarks, which are still extremely helpful. The issue that I have with the SQL_NO_CACHE approach is that is is MySQL specific.

My opinion is that we should implement a more generic solution like the one proposed in #2 (if testing does prove it to be faster) and wrap a somewhat intelligent caching mechanism around it. For example, clear and re-cache the values in hook_node_*() implementations if a node is added with properties outside of the cached bounds.

In addition, I think that the Apache Solr Search Integration module should explore using it's own min / max callbacks getting the values from the StatsComponent. Coupled with somewhat intelligent caching this would take SQL out of the equation and eliminate the assumptions about which nodes are in the index. Caching would be more difficult, though, due to the delay of when documents are actually added to the index. Maybe setting a 15 or 30 minute time limit on the cache would eliminate most of the slow pages while making sure that un-sane values aren't kept for too long.

Thanks,
Chris

Comment #12

cpliakas commented 19 January 2012 at 16:34

Title:

Date range search produces slow queries

» Improve performance of the facetapi_get_*_date() callbacks

Changing title to better reflect the task as it related to Facet API.

Comment #13

msonnabaum commented 25 November 2013 at 19:32

Issue summary:

View changes

I was just about to post the exact same issue. Looking at a site that is getting crushed by these queries, and I came to the same conclusion that this should just be an ORDER BY changed LIMIT 1 instead of a MIN/MAX.

No idea why the optimizer isn't smart enough to figure that out on it's own, but I verified that with the current query, neither percona 5.5 or mariadb 5.5 pick the 'changed' index.

Since the fix is so simple, I'd say it's worth it to just go with the alternate query, especially considering this is the node table we're dealing with.

Comment #14

damontgomery commented 27 November 2013 at 00:05

Status	File	Size
new	facetapi-get-date-performance-1223592-14.patch	1.23 KB

Here is my attempt at #13 for D7.

Improve performance of the facetapi_get_*_date() callbacks