On a database with a large number of nodes the SELECT MAX($facet_field) and SELECT MIN($facet_field) queries in apachesolr_search_date_range() produce extremely slow queries.

The patch that follows address this.

Comments

elliotttf’s picture

Status: Active » Needs review
StatusFileSize
new1.18 KB
new234.67 KB

Here are patches for the 6.x-1.x and 6.x-2.x branches. It appears this does not happen in 7.x-1.x.

elliotttf’s picture

StatusFileSize
new1.18 KB

Sorry, 1.x patch was f'd. Here's a re-roll.

pwolanin’s picture

Is that ORDER BY actually faster?

Better would be to cache the result possibly?

janusman’s picture

Apparently, if there is no index, "it depends". See: http://stackoverflow.com/questions/426731/min-max-vs-order-by-and-limit

I think the problem here is that MySQL can't quickly answer this question because we either use the node_status index, or the node_created (or node_changed) index (see node_schema()).

A possible alternative:

Option A) drop the AND status = 1 from the query altogether. It wouldn't be such a great deal since we might already have cases where we're ignoring certain nodetypes, we have access control modules in place, etc.

Option B) Force the index to use. This is a short non-scientific benchmark of 4 variations on an installation with 75,000 nodes:

  1. 29ms: SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1
  2. 127ms: SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_status) WHERE status=1
  3. 40ms: SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1
  4. 140ms: SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_status) WHERE status=1 ORDER BY created DESC LIMIT 0,1

So, uhm, #1 seems to be the quickest on this non-scientific quick-and-dirty test!

However it's clear that picking the correct index does make a difference (in this case picking node_status is always slower, but that might depend on the site.. in my case 6% of nodes have status = 0 vs 94% = 1) Maybe someone can try to reproduce this =)

Option C) Any other suggesitons?

elliotttf’s picture

Repeating your queries on a database with 392,397 nodes I got the following (non-scientific) results:

  1. 0.27 sec SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1;
  2. 0.68 sec SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_status) WHERE status=1;
  3. 0.00 sec SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1;
  4. 1.02 sec SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_status) WHERE status=1 ORDER BY created DESC LIMIT 0,1;

And as a control, here's a comparison of the original queries:

  1. 0.69 sec SELECT SQL_NO_CACHE MAX(created) FROM node WHERE status=1;
  2. 0.01 sec SELECT SQL_NO_CACHE created FROM node WHERE status=1 ORDER BY created DESC LIMIT 0,1;
janusman’s picture

@eliotttf: yeah, after some testing I also came up with 0.000 seconds for #3! I had to switch between 3 differently-named copies of the databases, and the first time I used query #3 I came up with nonzero time (that's the time I posted in #4) but any subsequent queries came took a reported 0.000 seconds. So I'm guessing MySQL is somehow caching, either ignoring SQL_NO_CACHE... or that cache lives somewhere else...? Dunno =)

So... anyways... the MAX...ORDER BY pattern is sounding like a possible winner because it just works better in real-world tests? (And, perhaps, even better when we force the index to use?)

Do we need more research? I bet someone here knows a MySQL guru that can explain the 0.000-second weirdness, just to make sure we're not choosing something worse than the current query.

elliotttf’s picture

Again, unscientific, but it looks like the MAX query isn't actually using a key and is doing a full table scan:

mysql> EXPLAIN SELECT SQL_NO_CACHE created FROM node FORCE INDEX (node_created) WHERE status=1 ORDER BY created DESC LIMIT 0,1;
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
|  1 | SIMPLE      | node  | index | NULL          | node_created | 4       | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
1 row in set (0.06 sec)

mysql> EXPLAIN SELECT SQL_NO_CACHE MAX(created) FROM node FORCE INDEX (node_created) WHERE status=1;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | node  | ALL  | NULL          | NULL | NULL    | NULL | 393569 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.11 sec)

I'm not too familiar with the innards of MySQL, but this result seems contrary to the answer supplied on Stack Overflow. It would be interesting to see if others are seeing the same kind of behavior, or as you say, getting someone in here that knows more about how MySQL does this kind of magic.

nick_vh’s picture

Version: 6.x-2.x-dev » 6.x-1.x-dev

The 6.x-2.x version has been deprecated. Moving this to 6.x-1.x-dev

Status: Needs review » Needs work

The last submitted patch, 1223592-6.x-1.x.patch, failed testing.

nick_vh’s picture

Project: Apache Solr Search » Facet API
Version: 6.x-1.x-dev » 7.x-2.x-dev

Moving this to facetapi since facetapi_get_max_date and facetapi_get_min_date could use this!

cpliakas’s picture

Category: bug » task

In agreement that this needs to improve, and I also agree that Facet API is probably the best candidate to help solve this. Since the code is working as is I am changing from a "bug report" to a "task", but I am definitely willing to get a viable solution into the 1.0 release.

Big thanks to janusman and elliotttf for the non-scientific benchmarks, which are still extremely helpful. The issue that I have with the SQL_NO_CACHE approach is that is is MySQL specific.

My opinion is that we should implement a more generic solution like the one proposed in #2 (if testing does prove it to be faster) and wrap a somewhat intelligent caching mechanism around it. For example, clear and re-cache the values in hook_node_*() implementations if a node is added with properties outside of the cached bounds.

In addition, I think that the Apache Solr Search Integration module should explore using it's own min / max callbacks getting the values from the StatsComponent. Coupled with somewhat intelligent caching this would take SQL out of the equation and eliminate the assumptions about which nodes are in the index. Caching would be more difficult, though, due to the delay of when documents are actually added to the index. Maybe setting a 15 or 30 minute time limit on the cache would eliminate most of the slow pages while making sure that un-sane values aren't kept for too long.

Thanks,
Chris

cpliakas’s picture

Title: Date range search produces slow queries » Improve performance of the facetapi_get_*_date() callbacks

Changing title to better reflect the task as it related to Facet API.

msonnabaum’s picture

Issue summary: View changes

I was just about to post the exact same issue. Looking at a site that is getting crushed by these queries, and I came to the same conclusion that this should just be an ORDER BY changed LIMIT 1 instead of a MIN/MAX.

No idea why the optimizer isn't smart enough to figure that out on it's own, but I verified that with the current query, neither percona 5.5 or mariadb 5.5 pick the 'changed' index.

Since the fix is so simple, I'd say it's worth it to just go with the alternate query, especially considering this is the node table we're dealing with.

damontgomery’s picture

Here is my attempt at #13 for D7.