On a database with a large number of nodes the SELECT MAX($facet_field) and SELECT MIN($facet_field) queries in apachesolr_search_date_range() produce extremely slow queries.
The patch that follows address this.
| Comment | File | Size | Author |
|---|---|---|---|
| #14 | facetapi-get-date-performance-1223592-14.patch | 1.23 KB | damontgomery |
| #2 | 1223592-6.x-1.x.patch | 1.18 KB | elliotttf |
| #1 | 1223592-6.x-1.x.patch | 234.67 KB | elliotttf |
| #1 | 1223592-6.x-2.x.patch | 1.18 KB | elliotttf |
Comments
Comment #1
elliotttf commentedHere are patches for the 6.x-1.x and 6.x-2.x branches. It appears this does not happen in 7.x-1.x.
Comment #2
elliotttf commentedSorry, 1.x patch was f'd. Here's a re-roll.
Comment #3
pwolanin commentedIs that ORDER BY actually faster?
Better would be to cache the result possibly?
Comment #4
janusman commentedApparently, if there is no index, "it depends". See: http://stackoverflow.com/questions/426731/min-max-vs-order-by-and-limit
I think the problem here is that MySQL can't quickly answer this question because we either use the node_status index, or the node_created (or node_changed) index (see node_schema()).
A possible alternative:
Option A) drop the AND status = 1 from the query altogether. It wouldn't be such a great deal since we might already have cases where we're ignoring certain nodetypes, we have access control modules in place, etc.
Option B) Force the index to use. This is a short non-scientific benchmark of 4 variations on an installation with 75,000 nodes:
So, uhm, #1 seems to be the quickest on this non-scientific quick-and-dirty test!
However it's clear that picking the correct index does make a difference (in this case picking node_status is always slower, but that might depend on the site.. in my case 6% of nodes have status = 0 vs 94% = 1) Maybe someone can try to reproduce this =)
Option C) Any other suggesitons?
Comment #5
elliotttf commentedRepeating your queries on a database with 392,397 nodes I got the following (non-scientific) results:
And as a control, here's a comparison of the original queries:
Comment #6
janusman commented@eliotttf: yeah, after some testing I also came up with 0.000 seconds for #3! I had to switch between 3 differently-named copies of the databases, and the first time I used query #3 I came up with nonzero time (that's the time I posted in #4) but any subsequent queries came took a reported 0.000 seconds. So I'm guessing MySQL is somehow caching, either ignoring SQL_NO_CACHE... or that cache lives somewhere else...? Dunno =)
So... anyways... the MAX...ORDER BY pattern is sounding like a possible winner because it just works better in real-world tests? (And, perhaps, even better when we force the index to use?)
Do we need more research? I bet someone here knows a MySQL guru that can explain the 0.000-second weirdness, just to make sure we're not choosing something worse than the current query.
Comment #7
elliotttf commentedAgain, unscientific, but it looks like the MAX query isn't actually using a key and is doing a full table scan:
I'm not too familiar with the innards of MySQL, but this result seems contrary to the answer supplied on Stack Overflow. It would be interesting to see if others are seeing the same kind of behavior, or as you say, getting someone in here that knows more about how MySQL does this kind of magic.
Comment #8
nick_vhThe 6.x-2.x version has been deprecated. Moving this to 6.x-1.x-dev
Comment #10
nick_vhMoving this to facetapi since facetapi_get_max_date and facetapi_get_min_date could use this!
Comment #11
cpliakas commentedIn agreement that this needs to improve, and I also agree that Facet API is probably the best candidate to help solve this. Since the code is working as is I am changing from a "bug report" to a "task", but I am definitely willing to get a viable solution into the 1.0 release.
Big thanks to janusman and elliotttf for the non-scientific benchmarks, which are still extremely helpful. The issue that I have with the
SQL_NO_CACHEapproach is that is is MySQL specific.My opinion is that we should implement a more generic solution like the one proposed in #2 (if testing does prove it to be faster) and wrap a somewhat intelligent caching mechanism around it. For example, clear and re-cache the values in hook_node_*() implementations if a node is added with properties outside of the cached bounds.
In addition, I think that the Apache Solr Search Integration module should explore using it's own min / max callbacks getting the values from the StatsComponent. Coupled with somewhat intelligent caching this would take SQL out of the equation and eliminate the assumptions about which nodes are in the index. Caching would be more difficult, though, due to the delay of when documents are actually added to the index. Maybe setting a 15 or 30 minute time limit on the cache would eliminate most of the slow pages while making sure that un-sane values aren't kept for too long.
Thanks,
Chris
Comment #12
cpliakas commentedChanging title to better reflect the task as it related to Facet API.
Comment #13
msonnabaum commentedI was just about to post the exact same issue. Looking at a site that is getting crushed by these queries, and I came to the same conclusion that this should just be an ORDER BY changed LIMIT 1 instead of a MIN/MAX.
No idea why the optimizer isn't smart enough to figure that out on it's own, but I verified that with the current query, neither percona 5.5 or mariadb 5.5 pick the 'changed' index.
Since the fix is so simple, I'd say it's worth it to just go with the alternate query, especially considering this is the node table we're dealing with.
Comment #14
damontgomery commentedHere is my attempt at #13 for D7.