Drupal's cache system has a number of problems. Several others have written extensively on the subject, so I won't expound here; instead, I've developed a fix that helps significantly for high-traffic sites that cannot afford locks on the cache table.
With many nodes and URLs, the Drupal cache table can become very large -- on the order of 400 MB or more. A table that size will not necessarily fit in memory, and it will necessarily take a nontrivial amount of time to clear. Even a simple "DELETE FROM cache WHERE expire < X" can lock a MyISAM table for 30-40 seconds once the table gets large enough. Thirty seconds of queued connections on a high-traffic site can be hard to recover from unless the hardware is severely overprovisioned.
The solution to the locking problem is to use a MySQL storage engine that does not take table-level locks, such as InnoDB, or another database entirely, such as Postgres.
With a very large table, however, InnoDB takes a very long time to do the table scan required for a wildcard delete.
The solution to that is to move the blob out of the cache table and into a separate table.
I optimized it further by not caching aliased data twice in the data table. In the current cache system, all URLs that generate the same content -- such as http://drupal.org/node/notfound1, http://drupal.org/node/notfound2, and http://drupal.org/node/notfound3 (or similar URLs with query strings appended that do not change the page output) -- store separate copies of that content in the cache table.
By keying a separate data table on the MD5 of the data, updates to the main cache table can be fast: the main cache table is reduced to a set of pointers, so rows can be removed without actually deleting data and locking the table. A cron job can then garbage-collect the data table during periods of low load.
My modified schema is as follows:
CREATE TABLE `cache` (
`cid` varchar(255) NOT NULL default '',
`data_md5` varchar(32) default NULL,
`expire` int(11) NOT NULL default '0',
`created` int(11) NOT NULL default '0',
`headers` text,
PRIMARY KEY (`cid`),
KEY `expire` (`expire`),
KEY `data_md5` (`data_md5`)
) ENGINE=InnoDB;
CREATE TABLE `cache_data` (
`data_md5` varchar(32) NOT NULL default '',
`data` longblob,
PRIMARY KEY (`data_md5`)
) ENGINE=MyISAM;
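With this split, the read path joins the two tables on data_md5, and expiry deletes only touch the small pointer table. A sketch of the queries against the schema above (the exact queries in the patch may differ, and 'some/url' is a hypothetical cid):

```sql
-- Read path: fetch a cached page by cid, joining to the shared blob.
SELECT d.data, c.created, c.headers
FROM cache c
JOIN cache_data d ON d.data_md5 = c.data_md5
WHERE c.cid = 'some/url';

-- Expiry deletes only small pointer rows; the blobs stay behind for
-- the garbage collector, so the lock is held only briefly.
DELETE FROM cache WHERE expire > 0 AND expire < UNIX_TIMESTAMP();
```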
The changed code is very non-invasive: one can still delete items from the cache table as before, and those deletes execute more quickly because the table is much smaller.
Let me know if there is any community interest in this patch. If there is no interest because of a lack of high traffic sites that have a lot of contention on the cache table, I will not post it.
Comments
Comment #1
jhenry commented: The less-than sign killed a paragraph in the original post; the missing information has been restored above.
Comment #2
moshe weitzman commented: interesting. recently we have been *reducing* our number of joins and here is a proposal to increase them. it makes sense when considering this delete operation. i guess the new join is not too expensive during selects because of the proper indexes.
Comment #3
jhenry commented: Some additional information: on an active site with 300k cache entries, I have about 210k cache_data entries, indicating that mangled URLs from bots or users, plus duplicate filter output, amount to about 40% duplicated or aliased data.
Also, a cron thread that garbage-collects unreferenced data asynchronously takes the bulk of the work of a cache clear out of the critical path and defers it to later, when it can be done in small chunks without long locks on the shared cache resource.
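A garbage collector along these lines could run from cron in small batches, removing only blobs that no pointer row references. This is a sketch; the batch size and the two-step shape are assumptions, not part of the posted patch:

```sql
-- Find a small batch of orphaned blobs. (MySQL's multi-table DELETE
-- syntax does not accept LIMIT, so candidates are selected first.)
SELECT d.data_md5
FROM cache_data d
LEFT JOIN cache c ON c.data_md5 = d.data_md5
WHERE c.data_md5 IS NULL
LIMIT 1000;

-- Then delete just that batch from the cron handler, repeating on the
-- next run until no orphans remain.
DELETE FROM cache_data WHERE data_md5 IN ('...md5s from the batch...');
```

Deleting in bounded batches keeps each lock on cache_data short, which is the point of moving this work off the request path.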
Comment #4
moshe weitzman commented: that tells me we should ensure that those mangled url entries never appear in the cache table to begin with. in fact, i hate that we return a page at all. i would rather return the web server's 404 and not consume more resources. i just did a test and we are actually adding rows to the cache table for both 403 and 404 pages. not good.
what is duplicate filter output? can you give some examples?
Comment #5
jhenry commented: Before I changed the cache table, I actually implemented another fix for the duplicate 404 entries, overriding REQUEST_URI to "not_found" in the not-found handler. That was months and months ago, though, and the cache-table change is the more generic solution, because it allows a custom 404 page to "guess" what the user was looking for and cache different content for those pages.
Potential filter-data duplication can occur when two pieces of (either identical or differing) text produce identical filter output. The old system would cache two rows; keyed on an MD5 of the data, only one row is stored.
Users can enter URLs such as drupal.org/url?x, drupal.org/url?y, and drupal.org/url?z and produce separate cache objects even though those objects contain the same data. You cannot control what users type into the address bar, or what an external link points at.
Removing aliases is an easy optimization, though; it falls out of solving the fundamental problem, which is having a longblob inline with small integer and varchar fields that must all be read (blob included) during a table scan.
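The write path is what makes aliased URLs collapse onto one blob: the blob is inserted keyed on its own MD5, so a second cid with identical output only adds a pointer row. A sketch assuming the two-table schema above (the literal page content is elided):

```sql
-- First 404 URL cached: one blob row plus one pointer row.
-- INSERT IGNORE makes re-inserting an existing blob a no-op.
INSERT IGNORE INTO cache_data (data_md5, data)
VALUES (MD5('...rendered 404 page...'), '...rendered 404 page...');
REPLACE INTO cache (cid, data_md5, expire, created)
VALUES ('node/notfound1', MD5('...rendered 404 page...'), 0, UNIX_TIMESTAMP());

-- Second aliased URL: the blob insert is ignored (duplicate primary
-- key), so only a new 32-byte pointer row is added.
INSERT IGNORE INTO cache_data (data_md5, data)
VALUES (MD5('...rendered 404 page...'), '...rendered 404 page...');
REPLACE INTO cache (cid, data_md5, expire, created)
VALUES ('node/notfound2', MD5('...rendered 404 page...'), 0, UNIX_TIMESTAMP());
```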
Comment #6
moshe weitzman commented: maybe i am naive, but why does the DB care about the BLOB for a query such as DELETE FROM cache WHERE cid LIKE 'menu:%'? in that case, only the small varchar 'cid' matters.
Comment #7
jhenry commented: Depending on the db engine, different amounts of the BLOB are stored inline with the other fields. For InnoDB, for instance, I believe it is the first 512 bytes of the blob, with the rest stored elsewhere.
On a table scan that cannot use an index, such as a prefix wildcard on InnoDB, the engine does a strided scan that touches much more memory (and disk) if there are an additional 512 bytes in each record that it has to read.
Also consider that the more memory a delete touches, the longer the query locks the table. If you can only fit 20 entries on a page and have to do 5k random seeks to update each of those records, it will take longer than if you can fit 100 entries on a page and only have to do 1k random seeks.
Comment #8
Anonymous (not verified) commented: I'm being bitten by this right now, so any development on this front is welcome. I can't believe there aren't more posts about this, and it makes me wonder if I'm just doing something wrong with our site... At around 1.5 million hits a month, I have to optimize the cache table about every hour to be safe; if it goes more than 6-10 hours without an optimize, the table grows to several gigs and a query on the cache table can take anywhere from 1 minute to 15 minutes...
Example:
# Query_time: 1095 Lock_time: 0 Rows_sent: 0 Rows_examined: 0
SELECT data, created, headers FROM cache WHERE cid = 'variables';
That's impressive. Is no one else having this problem, or is everyone just optimizing and repairing tables constantly as well? I'm definitely getting a lot of problems from search bots requesting all sorts of obscure URLs that return a 404 page, thus creating a cache entry for each... Grrrr.
Comment #9
killes@www.drop.org commented: debtman: here on drupal.org, the cache table hardly gets to 80000 entries before it is cleared again. You may want to support this patch: http://drupal.org/node/72290
Comment #10
moshe weitzman commented: one alternative to moving the BLOB to its own table is to do as i suggest at http://drupal.org/node/72290#comment-116426. read the whole issue. i don't know if that would be good enough. would be nice if you guys with busy sites could test that patch.
Comment #11
Anonymous (not verified) commented: I may try out that patch, but I think the more immediate problem is the 404s. I'm looking at the access log and cache table, and we've been getting hammered by search bots requesting all sorts of really weird URLs (/event/2006/people/node/event/calendar/event/etc/etc). I think the reason our cache is getting huge is mainly all of these 404s, as a new cache entry is created for every 404 page, and I'm getting thousands of them an hour as the site is hit by various bots.
Splitting the cache table is definitely a good idea, but for me the most immediate benefit would come from disabling caching of 404 error pages. I was going to go back to Apache 404 error handling, but with the rewrite rules that's pretty much impossible unless I get rid of clean URLs, which isn't really an option.
For the time being, I've modded bootstrap.inc so it doesn't cache any 404 pages (basically a strpos for the 404 page title in the page_set_cache function). I'll add that patch to my list of things to check out, but I'm running on a 4.6 codebase now, so I can't easily apply it, and I need to get things running smoothly before I can try things like that out.
Comment #12
Anonymous (not verified) commented: Alright, the more I think about this, jhenry's patch is a pretty good solution.
I've disabled caching of 404 pages, but this is not ideal, since each bot that hits my site forces a full page to be generated. That wastes resources. Ideally a single cached 404 page should be stored and then sent to anyone who hits a 404.
The other problem I'm having is that Drupal, by default, serves up a list of recent nodes for any URL that starts with /node. I don't know where the bots are getting these URLs, but I have cache entries for fun things like node/node/node/12345/person/node/ and so on. All of these URLs, in my opinion, should return a 404, but since Drupal by default sends a list of recent nodes for such a URL, each request is instead served a page of recent content. I'm getting thousands of these pages in the cache, all with the same content.
So now in order to make my site perform I've made two mods:
1) disable caching of 404 pages in bootstrap.inc
2) modify node.module so that /node/* returns a drupal_not_found() rather than node_page_default()
This will keep my MySQL server happy, with luck, but it also means that each time a bot hits a 404 page it has to be generated from scratch. Seeing as MySQL is what's causing me problems, I can live with that for now.
jhenry's patch, however, would address both of these issues. By keying on an md5 hash of the data and keeping a smaller pointer table, you could generate and cache the 404 page only once and have all subsequent entries point to it. You could do the same for node/* URLs: store a single cached listing of new content and have every URL that resolves to it return the same data.
I think it's a workable solution.
Comment #13
AltaVida commented: I agree that Drupal returning page data (a list of recent nodes) when it should return a 404 is a problem. Search engines will generate random URLs to 'test' your server and make sure it properly returns 404s when it should.
If the SE sees a page returned no matter what random URL it requests, your site will likely suffer a penalty (for being badly configured and/or serving possibly 'autogenerated' spam).
Comment #14
moshe weitzman commented: drupal is returning an http 404 code along with the page, so there is no penalty with search crawlers.
Comment #15
moshe weitzman commented: anyone up for rolling a patch for this? i think the consensus is growing that it is a good idea. we'll just need to benchmark to make sure.
Comment #16
jhenry commented: I'll roll a patch against current CVS as soon as I get a chance to port the changes from my tree. Note that this cache method requires asynchronous garbage collection in order to take costly deletes off the critical path, and I've not yet integrated my garbage collector into Drupal.
One other thing to consider: in my proposed schema above (and in my current working copy), I'm keeping the headers in the cache table. It would technically be better to keep those headers in another table too. The reason I put them there is that it is theoretically possible for Drupal to emit different headers for the same data object based on the URL; to do that, you'd need a conditional that called drupal_set_header() based on something in the URL. I think in 99.9% of cases it would be fine to store the headers along with the data object.
Does anyone have any thoughts on this?
Comment #17
moshe weitzman commented: i think that for now you can assume the same content always has the same headers... as for the garbage collection, i'd expect to see the call for that in the system_cron() function
Comment #18
Anonymous (not verified) commented: I'm not seeing how Drupal returns a 404 on such pages. Take a URL request of the form /node/node/12345.
In node_page this has no known op, so it falls through to the default, which calls node_page_default. node_page_default simply outputs a list of recent nodes. There's no 404, so there is a penalty for this, and it's worse than the plain 404 problem, since a potentially much larger data chunk is stored in the cache for these pages. But this is a separate issue that isn't really related to this patch...
Anyway, I'm running 4.6.5 in production, so patches don't help, but I can implement the same modifications and report on performance differences. Right now, with no caching of 404 pages and a 404 returned for node/* requests, performance is fine and my cache table hasn't gotten over 100 MB. But I'll undo that for testing this patch so we can get an idea of how much of an impact it can make, keeping in mind that with no modifications I was getting 2.1 GB of cache data and cache queries were typically taking 1-2 minutes, 5-15 minutes if I went days without an optimize on the table. We average 1.5 million hits a month and have around 6,000 nodes, so it should be a decent performance test.
Comment #19
dries commented: debtman: if you get '/event/2006/people/node/event/calendar/event/etc/etc' in your watchdog, then maybe you are still using Drupal 4.6? That shouldn't happen with Drupal 4.7 unless some module is generating invalid relative URLs. Check the referrer URL and make sure that all URLs are relative to the top-level domain (all relative URLs should start with a '/').
Comment #20
dries commented: Garbage collection might be a problem. Cron support is not a hard requirement at this point.
Comment #21
moshe weitzman commented: today, if you do not run cron, your watchdog table will grow indefinitely. at least these cache tables periodically get a big cache_clear_all() and will get mostly wiped.
furthermore, if you don't have cron you don't have search. lastly, one can always do a 'delete from cache' at any time in order to prune these tables.
i think the cron issues are not very worrisome. my .02
Comment #22
Anonymous (not verified) commented: Dries: as far as I can tell, nothing in Drupal is generating these URLs. I work for a large university, and the googlebot run by the university (one of those standalone enterprise search boxes you buy from Google) is calling them. There are no referrers in the log file, so as far as I can tell the search bot is just making them up. Yes, I'm running 4.6, but I don't think that's the issue. I have 1800 hits today alone for the URL 'node/counseling/node/counseling/counseling'...
Regardless of whether anyone else is bitten by this, I don't think it's proper behavior for Drupal to return anything but a 404 for URLs like these. This isn't really related to this patch, though, so I opened another issue about it: http://drupal.org/node/74347
Comment #23
moshe weitzman commented: and now i've submitted a patch. please help review it at http://drupal.org/node/74347
Comment #24
magico commented: According to http://drupal.org/node/72290, the cache system was split to improve performance.
I think it's safe to mark this issue as fixed.