The problem we are trying to solve: block caching is not enabled by default in Drupal core, and block caching is always disabled when a node access module is enabled. This could result in lots of duplicate, expensive searches on Solr, especially for the MLT block if it's on a popular public page.
For a normal query, however, the overhead of caching would probably outweigh the savings, since Solr has its own caching schemes.
Proposed strategy:
Create a new cache table in apachesolr.install and in an update function, following the standard model.
Override public function search() from Service.php in Drupal_Apache_Solr_Service.php.
Add an optional final parameter with a default value of 0.
Construct a cache ID from the $queryString. Since we have a 255 char limit, hash it with sha1() and add another identifier, e.g. $cid = 'query:' . sha1($queryString);
If the apachesolr_nodeaccess module is enabled, the $queryString will have parameters containing all the information about the node access grants of the user - hence we can safely cache it using this as a cid since only a user with the exact same node access grants will see the cached result.
If the cache entry does not exist, run the query and cache the result before returning it.
Set the cache lifetime to something reasonable like ~4-5 min using a variable that is not exposed via the UI.
Add to apachesolr_cron a call to clear this cache, e.g. cache_clear_all(NULL, 'cache_apachesolr');
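The lookup-then-store flow in the steps above can be sketched roughly as follows. This is a minimal illustration and not the code from any of the attached patches. The function name `example_cached_search()`, the in-memory `$cache` array (standing in for the proposed cache table) and the `$run_query` callback are all hypothetical. In the module itself the storage would presumably go through Drupal's cache_get() and cache_set().

```php
<?php
// Hypothetical sketch: $cache is a plain array standing in for the proposed
// cache table, and $run_query stands in for the real Solr request.
function example_cached_search(array &$cache, $query_string, $run_query, $lifetime = 300, $now = NULL) {
  $now = ($now === NULL) ? time() : $now;
  // 'query:' plus a 40-character sha1 hex digest stays well under 255 chars.
  $cid = 'query:' . sha1($query_string);
  // Serve the stored result while it has not yet expired.
  if (isset($cache[$cid]) && $cache[$cid]['expire'] > $now) {
    return $cache[$cid]['data'];
  }
  // Cache miss or stale entry: run the query and store the result.
  $data = call_user_func($run_query, $query_string);
  $cache[$cid] = array('data' => $data, 'expire' => $now + $lifetime);
  return $data;
}

$cache = array();
$run_query = function ($q) { return "results for $q"; };
echo example_cached_search($cache, 'q=drupal', $run_query), "\n";
```

Because the cache ID is derived from the whole query string, two requests share an entry only if every parameter matches. The node access argument above relies on exactly this property.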
| Comment | File | Size | Author |
|---|---|---|---|
| #11 | query-caching-368245-11.patch | 9.68 KB | pwolanin |
| #10 | query-caching-368245-10.patch | 10.53 KB | pwolanin |
| #9 | wip-query-caching-368245-9.patch | 6.77 KB | pwolanin |
| #3 | query-caching-368245-3.patch | 7.9 KB | pwolanin |
| #1 | query-caching-368245-1.patch | 6.9 KB | pwolanin |
Comments
Comment #1
pwolanin commented
Comment #2
JacobSingh commented
Arguments against this and for caching on the Solr Server:
1. No DB hit
2. Don't have to add more size to the DB
3. Less code
Cons:
1. More Voodoo, less control of how long they cache for (AFAIK)
2. Still puts load on the server, and network IO
3. Although a Query Result cache does not take a lot of RAM, it uses some, and this will degrade performance on the server.
4. Thread management? If more requests are hitting Solr, even if they are small on load, they will take up threads.
Given these, I'd say we can hold off on a Drupal caching layer until we get a better sense of the performance implications of doing it in Solr. That being said, it's really not a big hit to do it in Drupal, and if the end result is 100% identical, what's the real objection?
Comment #3
pwolanin commented
Re-roll - looks anyhow like I missed the change to the mlt module. This patch uses the new cache table for the Luke cache, but does not enable any query caching by default. Per discussion with Robert, it might be useful, however, to have that facility available, especially for sites able to use a high-speed cache like memcache.
Comment #4
mikejoconnor commented
One of my drupal/apachesolr clients just received a LOT of traffic during the Super Bowl. I think their situation might give some insight into this discussion. To start with, here's a little about their infrastructure and configuration.
1 front-end web server & Solr server: 8 GB RAM, 2 quad-core Xeon processors, SATA hard drives (mirrored)
1 database server: 8 GB RAM, 2 quad-core Xeon processors, SATA hard drives (mirrored)
I've only made two changes to the Solr/Tomcat defaults: increasing the available RAM to 1 GB, and enabling term vectors.
During the peak hour (18:00-19:00) of the Super Bowl they received 96,000 page views, and almost all page views contain apachesolr mlt recommendations. The ratio of anonymous users (7,500) to logged-in users (120) during that time was definitely in our favor. Since normal page caching is enabled, our total number of mlt queries for the day (according to log files) only grew by 10%, even though our total page views grew by 250%. The major bottleneck was the database server, which was running with a CPU load of 7-9 during that hour. The web/Solr server had an average utilization of 3-4/8.
Overall I'm not thoroughly convinced that the caching would provide a net reduction in resources for this client. If the current caching strategy were in place, the Solr query would run, and the results would be cached to the db. Drupal would also cache the results, but as they are displayed, rather than as the results of the query. By the time the anonymous user's cache expires, the Solr cache would have expired as well. Another thing to consider is there are 480 increments of 5 minutes throughout the day. On this site, the top 10 pages have an average of 1500 pv/day. This would put our hit rate at 60%, assuming no other caching is taking place.
Overall, I don't think general query caching is the way to go. It seems that most text searches will vary enough that they wouldn't be worth caching, and the mlt caching would require some knowledge of the situation. In my opinion we should make sure that it is easy to call custom queries and cache the results manually.
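The hit-rate estimate in comment #4 can be checked with a little arithmetic. The sketch below is only a rough model. It assumes views are spread evenly over the day, at most one cache miss per lifetime window, and no other caching. The function name `estimated_hit_rate()` is hypothetical. Note that a 24-hour day contains 288 five-minute windows (480 would correspond to three-minute windows), so the lifetime is left as a parameter.

```php
<?php
// Upper bound on the hit rate for one page: at most one miss per cache window.
// Assumes evenly spread traffic; real traffic is burstier, so treat the
// result as an estimate only.
function estimated_hit_rate($views_per_day, $lifetime_minutes) {
  if ($views_per_day <= 0) {
    return 0.0;
  }
  $windows_per_day = (24 * 60) / $lifetime_minutes;
  $misses = min($views_per_day, $windows_per_day);
  return 1 - $misses / $views_per_day;
}

printf("%.1f%%\n", 100 * estimated_hit_rate(1500, 5)); // 288 windows: 80.8%
printf("%.1f%%\n", 100 * estimated_hit_rate(1500, 3)); // 480 windows: 68.0%
```

Under these assumptions a page with 1500 views per day would see a somewhat higher hit rate than the 60% quoted, though the point that less popular pages benefit far less still holds.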
Comment #5
robertdouglass commented
Also not to be ignored is the overhead of writing to the db. The suggested caching strategy would guarantee 1 db write per node on a site, every time cache_clear_all gets run. On Drupal.org, for example, where there are lots of people writing stuff, this means a very high rate of writing new caches. In addition to probably not hitting this cache particularly often, I suspect we'll do a bunch of db writing that has to be factored into the overall performance gain.
That said... nothing is as conclusive as benchmarking and experimentation. Acquia will be in a position to actually test this caching method on a large set of sites and get empirical proof that it works or doesn't work.
Comment #6
pwolanin commented
Robert - well, the main question is whether it's worth committing. The last patch will actually cause zero extra caching - existing cache_* calls will just go to a different table.
Comment #7
pwolanin commented
Comment #8
pwolanin commented
We should implement this so it only caches when block and page caching are off (e.g. for logged-in users with a node access module turned on).
Comment #9
pwolanin commented
probably not working yet, but starting to make progress.
Comment #10
pwolanin commented
maybe worth looking at
Comment #11
pwolanin commented
a little simpler
Comment #12
pwolanin commented
Damien suggests he has an alternate approach based on HTTP headers.
Comment #13
Scott Reynolds commented
So I committed something like this to the apachesolr_views project just now. I have mixed feelings on it. It uses the new Views caching plugins to write to cache_views_data. Do I think there are better approaches, like using Varnish (http://nnewton.org/node/8)? Ya, there probably are. But offering this isn't a bad thing either. Because it uses the cache API, it can use memcached.
Commit: http://drupal.org/cvs?commit=224746
Comment #14
pwolanin commented
Here's Damien's code from http://drupalbin.com/9269 for using Solr's If-Modified-Since header capability. We could (and possibly should) combine this with a minimum cache lifetime. Again, this is mainly relevant for MLT or other queries that might be made on every page.
Comment #15
robertdouglass commented
We're to the point with this one that clarity can only be produced through benchmarks.
Comment #16
shalinmangar commented
In my experience with Solr, the best way to scale Solr is to add an HTTP cache in front of it. Solr emits HTTP cache headers and supports conditional requests (If-Modified-Since / If-None-Match, answered with 304 Not Modified), so an HTTP cache knows when to expire the cache. YMMV of course.
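The revalidation step that such a cache performs can be sketched as follows. This is an illustration of the general If-Modified-Since mechanism only. It is not Solr's implementation and not Damien's code from comment #14. The function names `backend_respond()` and `cached_fetch()`, and the `$entry` array, are hypothetical stand-ins for the search backend and for one stored cache entry.

```php
<?php
// Hypothetical stand-in for the search backend: answers 304 when the index
// has not changed since the time given in If-Modified-Since.
function backend_respond($index_last_modified, $if_modified_since = NULL) {
  $since = ($if_modified_since === NULL) ? FALSE : strtotime($if_modified_since);
  if ($since !== FALSE && $index_last_modified <= $since) {
    return array('status' => 304, 'body' => NULL);
  }
  return array(
    'status' => 200,
    'body' => 'results as of ' . $index_last_modified,
    'last_modified' => gmdate('D, d M Y H:i:s', $index_last_modified) . ' GMT',
  );
}

// Cache side: revalidate the stored entry rather than refetching blindly.
function cached_fetch(array &$entry, $index_last_modified) {
  $header = isset($entry['last_modified']) ? $entry['last_modified'] : NULL;
  $response = backend_respond($index_last_modified, $header);
  if ($response['status'] == 200) {
    // First request, or the index changed: keep the fresh copy.
    $entry = $response;
  }
  // On 304 the stored body is still valid and is reused as-is.
  return $entry['body'];
}
```

The backend is still contacted on every request, but a 304 response carries no body, so the expensive part of the query can be skipped whenever the index is unchanged.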
Comment #17
pwolanin commented
This is a big part of what Damien's suggested code does, though essentially using Drupal instead of an HTTP cache.
Comment #18
jpmckinney commented
Comment #19
jpmckinney commented
Use an HTTP cache. The apachesolr module should not try to do HTTP caching/load balancing/etc. when there exists software specifically designed to address these issues.
Comment #20
rjbrown99 commented
Kicking a closed ticket with some additional details, in the event someone else finds this.
Archive of the post from comment #13 since the site is no longer available
HTTP caching in Solr
Solr and HTTP caching
Varnish: Backend conditional requests