I was reviewing the Drupal caching code, and it looks as though Drupal will never really be able to scale under serious load. While it is capable of caching page data in the database and returning it to a user, serving that cached copy still requires an Apache process to fork and start the PHP interpreter. Even with a PHP opcode cache, you can't get the kind of performance available from static object caching in a reverse proxy such as Squid.

Some of the scalability claims listed on this site mention being able to push 60-100 hits per minute with Drupal. I need 300 hits/second on frequently accessed static objects before I can consider Drupal as a platform, and that's not a hard number to obtain. In a couple of initial tests, I was able to get 38 hits/second out of Drupal with caching enabled.

My proposal is to add a cache manager that interoperates with Squid, allowing an administrator to mark certain objects with headers that permit them to be cached by a Squid proxy.

The current code does something like:

if ($cache) {
  /* Cache hit: serve the cached page, done. */
}
else {
  /* Cache miss: send headers that forbid any downstream caching. */
  header("Expires: Sun, 19 Nov 1978 05:00:00 GMT");
  header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
  header("Cache-Control: no-store, no-cache, must-revalidate");
  header("Cache-Control: post-check=0, pre-check=0", false);
  header("Pragma: no-cache");
}

/* finish processing */

A better way to integrate with Squid would be to add another condition after the cache-hit check.

if ($cache = page_get_cache()) {
  // Same as existing: serve the cached page.
  exit();
}
else if (variable_get('squid_cache', 0) && $seconds = page_squid_cache()) {
  // Allow Squid to cache this page for $seconds seconds.
  header("Expires: " . gmdate("D, d M Y H:i:s", time() + $seconds) . " GMT");
  // ... plus whatever other headers are necessary to enable caching,
  // e.g. "Cache-Control: public".
  // Note: no exit() call; fall through and process the page.
  // A cache miss from Squid could also be tracked here for statistics.
}
else {
  header("Expires: Sun, 19 Nov 1978 05:00:00 GMT");
  header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
  header("Cache-Control: no-store, no-cache, must-revalidate");
  header("Cache-Control: post-check=0, pre-check=0", false);
  header("Pragma: no-cache");
}

In the above code, page_squid_cache() would check whether the request URI appears in a table of cacheable pages, tunable by an administrator. So if heavy load were anticipated on http://www.example.com/node/13536, "node/13536" could be registered with the Squid cache manager, causing that object to be cached by Squid and boosting performance on that single object to roughly 2000 req/sec on modest hardware.
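To make that concrete, here is a minimal sketch of what page_squid_cache() might look like. The {squid_cache} table and its columns are hypothetical; nothing like this exists in core, and a real patch would want caching of the lookup itself:

```php
/**
 * Hypothetical lookup: returns the cache lifetime in seconds for the
 * current page if an administrator has marked it cacheable, 0 otherwise.
 */
function page_squid_cache() {
  // $_GET['q'] holds the internal path, e.g. "node/13536".
  $path = $_GET['q'];
  $result = db_query("SELECT lifetime FROM {squid_cache} WHERE path = '%s'", $path);
  if ($row = db_fetch_object($result)) {
    return (int) $row->lifetime;
  }
  return 0;
}
```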

This solution doesn't change any aspect of Drupal and doesn't cause a performance hit on the existing cache architecture. If for whatever reason there is a miss in the Squid cache but a hit in the Drupal cache, the cached object (with cacheable headers) will still be returned to the client.

Thoughts?

Comments

adrian’s picture

But adding Squid proxies outside the Drupal installation is a very valid way to improve performance.

If you roll a patch for this functionality and post it as an issue for Drupal core, it will be considered for addition; if not, documenting such an enhancement in the handbook would be the right step to take.

We're always looking to improve performance, but at the moment we are more concerned about performance issues in page generation for logged-in users, and about the granularity of our caching. The vast majority of sites built with Drupal do not have access to Squid proxies, and as such, no functionality to directly support them is in core.
--
The future is so Bryght, I have to wear shades.

jhenry’s picture

It's almost essential.

My point is basically that the current setup nullifies any attempt to use an HTTP accelerator proxy. If no one has started work on this already, I'll roll a patch to add the functionality. It would even help users whose ISPs run such caches locally to mark objects as eligible for caching, so I don't think the effort to set appropriate headers is wasted.

walkah’s picture

As with the read-only databases, I'm more than willing to help with / review / test and offer my karma to said patches.

What may turn out to be an ideal approach is to introduce some cache hooks, so that various approaches can be plugged in on the sites that require them and ignored on smaller sites where they don't make sense (thinking ahead to what an "object is cacheable" UI might look like)...
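As an illustration of the hook idea, a pluggable "is this cacheable" check could look something like this. Both the hook name (hook_cacheable) and the squid module are made-up examples, not an existing Drupal API:

```php
/**
 * Hypothetical hook_cacheable() implementation in a squid module:
 * return a cache lifetime in seconds for the path, or 0 to decline.
 */
function squid_cacheable($path) {
  // A real module would consult its own table of admin-selected paths.
  return $path == 'node/13536' ? 3600 : 0;
}

/**
 * In the page-serving code, ask every module implementing the hook
 * and take the longest lifetime any of them offers.
 */
function page_cache_lifetime($path) {
  $lifetime = 0;
  foreach (module_implements('cacheable') as $module) {
    $lifetime = max($lifetime, module_invoke($module, 'cacheable', $path));
  }
  return $lifetime;
}
```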

--
James Walker :: http://walkah.net/

el777’s picture

I'm not an expert in web caching, but I think we could use the browser cache for registered users.
According to RFC 2616, if we set the header

Cache-Control: private

the client's own cache may store the response, but shared caches in the middle may not.
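A sketch of what that might look like for logged-in users; the $user global matches Drupal's convention, but the max-age value here is just an assumption, and anything session-dependent on the page would need care:

```php
global $user;
if ($user->uid) {
  // Registered user: let only the browser's private cache store the
  // page, never shared proxies (RFC 2616, section 14.9.1).
  header("Cache-Control: private, max-age=60");
}
```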

killes@www.drop.org’s picture

Some people are investigating using memcached with Drupal. They are mainly interested in using it for logged-in users, but I guess it could be used for anonymous users too. This will, of course, still need the PHP interpreter to start up.
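For reference, a minimal sketch of serving a page out of memcached via the PECL Memcache extension. The key scheme and the generate_page() placeholder are made up for illustration; this is not what the experimental patches actually do:

```php
$memcache = new Memcache();
$memcache->connect('localhost', 11211);

// Look up the rendered page by its path; on a miss, fall back to
// normal page generation and store the result for 60 seconds.
$key = 'page:' . $_GET['q'];
$page = $memcache->get($key);
if ($page === FALSE) {
  $page = generate_page();  // placeholder for Drupal's page building
  $memcache->set($key, $page, 0, 60);
}
print $page;
```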
--
My Drupal services