Hi All,

I noticed that Dries said here:

As of September 1st, we will start preparing for the next major release by focusing on performance, usability and stability.

Of course usability and stability are very important, but I am particularly interested in what performance improvements are being looked at, and how applicable and accessible they are to the wider Drupal community (i.e. are they platform independent, are they easy to set up, etc.)? Very large traffic sites such as LiveJournal, addons.mozilla.org, Slashdot and Wikipedia all use memcached, which from the articles I have read seems to have removed almost all of their database load. Memcached seems very easy to set up, and has a server and client available for both *nix and Win32, making it very accessible.

There might still be issues with how applicable a memcached solution is for Drupal users on shared hosting services that do not offer a memcached server, but with VPS options becoming more affordable all the time there are alternative hosting solutions. The performance gains offered by memcached should be especially relevant for small to medium sites that, due to budgets (or lack of them), are restricted to a single VPS for the web server and MySQL. There must be very significant gains for large sites as well, if websites as large as the ones mentioned above are convinced enough to implement it in their environments. It is probably not perfect, but I would think it is still a whole lot more accessible and applicable to the wider Drupal community than something like reverse proxies, which I have seen put forward several times in dev list discussions.

Would love to hear some opinions on the pros and cons of implementing the memcached API in Drupal core.

Comments

killes@www.drop.org’s picture

Version: 4.7.0 » x.y.z

features go into cvs.

I think making the Drupal caching layer more modular should be high on the agenda. I am even willing to work on it. :p

brashquido’s picture

I can't code, but am willing to put in time testing. I am totally unfamiliar with the current caching layer in Drupal. How big a job would it be to make it more modular?

chx’s picture

guys, I have session memcached in my sandbox.

I also run cache in memcached. Path aliases. Menu. These are not yet released, but will be.

I have no idea of how to roll it into core. We should be cautious with providing hooks because there are very time critical parts of Drupal.

brashquido’s picture

Just wondering what you mean by this;

there are very time critical parts of Drupal

One thing I suppose I would also like to know is what are the potential dangers of implementing memcached into Drupal?

chx’s picture

I meant "these are". Dangers are slowdown for non-memcached site, obviously.

Arkimedes’s picture

Version: x.y.z » 4.7.1
Component: database system » other
Status: Active » Needs review
Attachment (new): 12.28 KB

I've written a patch to bootstrap.inc to add support for using memcached. It uses the PECL memcache-2.0.4 package to talk to memcached servers. To keep code changes to a minimum I only changed the implementation of the cache_* functions, plus enough code to configure the connection. I preserved the database caching code as the default method for people not using memcached. This should remove the db queries for the cached content. It supports multiple memcached servers. See the attached patch.

Arkimedes’s picture

Component: other » database system
Attachment (new): 901 bytes

You also need to adjust the settings.php file for your site(s). This patch works for the default site and can be adjusted for any site wanting to start using memcached.

chx’s picture

Status: Needs review » Needs work

Drupal does not use 'class'. Otherwise, your code looks good. Could you please examine the sessions cache code in my sandbox and roll it into this? Also, there is a problem with wildcard cache wipes in memcached: memcached cannot handle them. The solution is simply to start a separate memcached for each kind of wildcard-wiped cached data and flush it. In modules, a grep cache_clear *module |grep TRUE reveals that only menu is such a beast. Checking *.inc will show the same. Therefore, you need a separate menu memcached. I am actually running such a setup: one session memcached, one data memcached, one menu memcached.

killes@www.drop.org’s picture

Version: 4.7.1 » x.y.z

chx refers to this:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/chx/session-m...

moving to cvs, as it is a new feature.

Arkimedes’s picture

Is it possible to create a generic object in PHP? Something I can add data and headers members to, in order to maintain compatibility with the database object return type? I only created the class to be compatible with that db object; otherwise it looks like there will need to be some changes to any code using cache_get() so we can use any type of caching scheme. Ideally we would just set and return a serialized string and let the calling code convert it into usable data. I'd also like to re-evaluate the use of the $headers variable in the page caching. Ideally the page caching code could join those two values into $data. It's not a problem for the database layer, but I had to set two values in memcached. I'd like to only need one for any given key.
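
To illustrate the "one value per key" idea, here is a minimal sketch, written in Python rather than PHP and not taken from the patch. The names pack_entry and unpack_entry are hypothetical, JSON stands in for whatever serializer is used, and a plain dict stands in for the memcached server.

```python
import json
from types import SimpleNamespace


def pack_entry(data, headers=""):
    """Serialize the page body and its headers into one string,
    so a single cache key holds both (hypothetical helper)."""
    return json.dumps({"data": data, "headers": headers})


def unpack_entry(raw):
    """Rebuild a generic object exposing .data and .headers,
    mirroring the shape of the row the database method returns."""
    fields = json.loads(raw)
    return SimpleNamespace(data=fields["data"], headers=fields["headers"])


# A dict stands in for the memcached server in this sketch.
store = {}
store["page:/node/1"] = pack_entry("<html>...</html>", "Content-Type: text/html")

entry = unpack_entry(store["page:/node/1"])
print(entry.headers)
```

The calling code still sees an object with data and headers members, while the cache backend only ever stores and fetches one string per key.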

I still need to look at the session code. Is it possible to just add calls to the current caching code, or does it need to be handled separately? I'll also look into the wildcard caching.

I'd also like to see some benchmark numbers if someone has a setup to do the testing.

Arkimedes’s picture

Status: Needs work » Needs review
Attachment (new): 13.65 KB

Ok, I removed the class and just create a stdClass() for the return object. I also defined CACHE_MISS for the return type of cache_get() for cache misses to help document the code. I updated the configuration based on someone's _drupal_memcache_init function from another thread. With that comes an option for persistent connections. I haven't looked at using the compression code yet. This patch is based on 1.99 in cvs.

I've got thoughts on wildcard caches so I'll look at that next. Then sessions.

Has anyone tried making the cache table in the database a memory type table to see what effect that has on performance?

Arkimedes’s picture

Attachment (new): 13.65 KB

Here's the patch to settings.php v 1.28 for the cleaner configuration.

Arkimedes’s picture

Attachment (new): 1008 bytes

Hmmm, attached the wrong file. Try this one.

Arkimedes’s picture

Attachment (new): 17.76 KB

Ok, I worked on the wildcard flushes with memcached. I don't like the idea of having to run
multiple memcached servers each with their own purpose. Everything should fit into one
cache server or several load balanced servers. It's easier to maintain and configure one memcached
server than it is several each with different purposes.

I also found a side-effect in the database caching for wildcard flushes. The wildcard query places
% wildcard characters on both sides of the key. This means if the key is found as a substring of
another key, this query will prematurely flush it as well. That results in a cache miss the next
time it's needed, which is only minor, but still a side-effect that should be avoided.
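
The substring side-effect is easy to reproduce. The following is a small sketch using Python's built-in sqlite3 as a stand-in for the MySQL cache table; the table layout and the key "submenu:1" are made up for illustration.

```python
import sqlite3

# In-memory table standing in for Drupal's cache table (layout is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (cid TEXT PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO cache VALUES (?, ?)",
    [("menu:1", "a"), ("menu:2", "b"), ("submenu:1", "c"), ("filter:3", "d")],
)


def matching(pattern):
    """Return the cache ids a wildcard flush with this LIKE pattern would hit."""
    rows = conn.execute(
        "SELECT cid FROM cache WHERE cid LIKE ? ORDER BY cid", (pattern,)
    )
    return [row[0] for row in rows]


both_sides = matching("%menu:%")   # wildcard on both sides of the key
prefix_only = matching("menu:%")   # wildcard only after the key

print(both_sides)
print(prefix_only)
```

With the wildcard on both sides, the unrelated "submenu:1" entry is flushed along with the menu entries; anchoring the pattern at the start of the key avoids that.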

The wildcard flush basically removes a set of entries from the cache in one call. This flush occurs
at a specific time, which means any matching values cached prior to that time cause a cache miss
on their next query. That will result in rebuilding those items and caching them with newer values.

This command
grep -rn cache_clear * | grep -i TRUE
should result in a list of all locations (with line numbers) that currently use wildcard caching. From
what I could tell in cvs, that's only 'menu:' and 'filter:' settings.

I decided to deprecate wildcard flushes as such. Instead I set a variable with the time()
when either menu or filter items are rebuilt. When calling cache_get() I then pass
this variable as a new optional second parameter. If the time the item was cached is lower than the
passed in value, the cache_get() will result in a cache miss. So the only changes to
implement this for other wildcard flushes is to store a variable with the last updated time and pass it
into the cache_get() function. The details are all handled in the code. The default
value is 0 for the second parameter, and as long as time() never returns negative
values, the cached items in memcached will always cause a cache hit until memcached expires them.
The database method also checks this new parameter and I got rid of the wildcard db query to
get rid of its side-effect.
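
As a rough illustration of the scheme described above (not the patch itself), here is a Python sketch. The dict stands in for memcached or the cache table, the value of CACHE_MISS is an assumption, and the explicit now argument exists only to make the example deterministic.

```python
import time

CACHE_MISS = None  # assumed sentinel; the patch defines its own CACHE_MISS value

# key -> (created timestamp, value); stand-in for memcached or the cache table
_cache = {}


def cache_set(key, value, now=None):
    """Store a value together with the time it was cached."""
    created = time.time() if now is None else now
    _cache[key] = (created, value)


def cache_get(key, min_created=0):
    """Return the cached value, or CACHE_MISS if the entry is absent
    or was cached before min_created (the last rebuild time)."""
    entry = _cache.get(key)
    if entry is None:
        return CACHE_MISS
    created, value = entry
    if created < min_created:
        return CACHE_MISS
    return value


cache_set("menu:1", "old tree", now=100)
menu_rebuilt_at = 200                      # e.g. the stored last-rebuild time

stale = cache_get("menu:1", menu_rebuilt_at)   # cached before the rebuild
cache_set("menu:1", "new tree", now=201)
fresh = cache_get("menu:1", menu_rebuilt_at)   # cached after the rebuild
default = cache_get("menu:1")                  # second parameter defaults to 0
```

No entries are deleted at rebuild time; anything cached before the stored timestamp is simply treated as a miss and overwritten the next time it is needed.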

Attached is a combined patch of all the files changed. It's made against cvs HEAD as of a little
earlier today. It replaces the previous diffs, including the settings changes which are now included.

I still need to work on sessions code.

brashquido’s picture

I've applied the patch against the Drupal CVS and have uncommented the memcached settings in the settings.php file. Dumb question, but how do you tell if it is working? I'm working totally from within a Win32 environment, and this is what I have done;

1) Download and install pecl4win memcached.dll, loaded into php.ini and restarted IIS.

2) Installed memcached server for Win32 and have verified the server is running and the ports match up between the memcached server, client and Drupal.

3) Viewing phpinfo() I can see that memcache is loaded, but the "Active persistent connections" counter doesn't get off 0 when browsing the site, and the memcached server memory usage is not moving. I have set 'persist' => true in settings.php.

Should I have to enable Drupal cache for this to work (I've tried it on and off)? Any other ideas for quickly checking that this patch is actually making Drupal use the memcached server?

brashquido’s picture

Cancel that. I failed to change $cache_mode from CACHE_METHOD_DATABASE to CACHE_METHOD_MEMCACHED in settings.php. Now when I view phpinfo() I can see Active persistent connections for the memcached client has incremented.

brashquido’s picture

Could someone explain in layman's terms the difference between the CACHE_METHOD_NONE, CACHE_METHOD_DATABASE and CACHE_METHOD_MEMCACHED settings? Also, what is the best sort of environment to test in, and should having the Drupal cache turned on make a difference?

This could be an issue with the memcached server for Win32, as it is still in the very early stages of development, but when I ran a simulated load of 100 concurrent users on my server, first with CACHE_METHOD_DATABASE set and then with CACHE_METHOD_MEMCACHED, I had a drop in successful HTTP requests of about 75%. That is, with memcached enabled there was a 75% reduction in requests for the same time period. I also noticed there was a noticeable gain (around 10%) when I enabled the caching system in Drupal on top of that, with all three settings above. The test script was pretty simple and just comprised the 4 GET requests that make up the front page of a default Drupal install, so it might not be the best setup for testing memcached. I still have to examine the data tomorrow to see where the bottleneck is, but I thought I'd ask, as at face value my results seem to be the complete opposite of what I was expecting (which makes me think I've stuffed something up).

Arkimedes’s picture

I'm using 2 memcached servers running on loopback. I've added the -vv option to
enable verbose logs, so I just check the logs for the various commands. You'll see what is
being requested and what it does in response. Check if the windows port has a similar option.

CACHE_METHOD_NONE disables all of the caching in the cache_* methods. It's useful for
testing the worst case scenario and setting up a baseline, as all cache_get calls result in a
cache miss.

CACHE_METHOD_DATABASE is the original implementation for Drupal. It's there as default
for people not using memcached.

CACHE_METHOD_MEMCACHED is what I added to support the memcached servers. The goal
is to reduce the overall database usage by caching values in memory (via memcached).
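
A toy dispatcher may make the three modes easier to picture. This is a Python sketch, not the PHP from the patch: the numeric constant values are assumptions, the mode is passed as a parameter instead of being read from settings.php, and two dicts stand in for the cache table and the memcached server.

```python
# Constant names come from the patch; the numeric values are assumptions.
CACHE_METHOD_NONE = 0
CACHE_METHOD_DATABASE = 1
CACHE_METHOD_MEMCACHED = 2

database_table = {}    # stand-in for the cache table in the database
memcached_store = {}   # stand-in for the memcached server


def cache_set(mode, key, value):
    """Store a value in whichever backend the mode selects."""
    if mode == CACHE_METHOD_DATABASE:
        database_table[key] = value
    elif mode == CACHE_METHOD_MEMCACHED:
        memcached_store[key] = value
    # CACHE_METHOD_NONE: nothing is stored at all


def cache_get(mode, key):
    """Fetch a value from the selected backend; None means a cache miss."""
    if mode == CACHE_METHOD_DATABASE:
        return database_table.get(key)
    if mode == CACHE_METHOD_MEMCACHED:
        return memcached_store.get(key)
    return None  # CACHE_METHOD_NONE: every lookup is a miss


for mode in (CACHE_METHOD_NONE, CACHE_METHOD_DATABASE, CACHE_METHOD_MEMCACHED):
    cache_set(mode, "variables", "serialized settings")
    print(mode, cache_get(mode, "variables"))
```

With CACHE_METHOD_NONE the value is never found, which is what makes it useful as a worst-case baseline.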

Things to try include turning off the persistent connections, maybe that has some effect on the
windows port. When it is off, every page request will create, use, and destroy a connection to
the memcached server. You'll see it in the verbose logs. If you happen to have a unix machine
available on your network, you could try running the memcached server there and setting the correct
IP and port in your configuration. Also create a page from the sample code here
http://us2.php.net/manual/en/function.memcache-getextendedstats.php
and check the output to see if connection and requests are getting through to the server.
Lastly, make sure the configured IP and port point to the correct values for the win32 port
of memcached. Maybe the defaults are different from the standard 127.0.0.1:11211.

I'll try to run some benchmarks here and see what I can come up with.

Arkimedes’s picture

There's a typo in my combined patch. It's in the bootstrap.inc file in the cache_get function.

In two places I assign a value to $cache->header but both should actually read $cache->headers
to match the output of the database query in the database method.

Arkimedes’s picture

I have some questions about sessions as used by Drupal.

First, how important is it for the session data to be persistent? The current code reads the
data from a database. chx's sandbox version for memcached puts everything in memory
using memcached. If the memcached server gets flushed or restarted, then all the data
in it is gone, which means any stored session data is lost.

Will losing the session data cause users to be logged out and thus require them to login
again and create a new session?

Next, in sess_read() the $user global is assigned a value based on the database query, but
the function only returns the session value. It seems like the setting of the $user global
is a side effect of the function, is it intentional? Is this where the $user data first gets populated
or is it only loaded to access the session value?

I don't want to implement memcached support for sessions without knowing what is expected.

dries’s picture

We need to load the $user object early on in the bootstrap process, so we're doing that in sess_read(). Once we loaded the user object, we can determine whether we're dealing with an anonymous user, what permissions he has, etc. Session handling and user handling go somewhat hand in hand.

dries’s picture

Code looks good. The one bit I don't like is the variable_set('menu-cache-time', time()); stuff.

Also, why do we need a CACHE_METHOD_NONE? Is that a developer feature?

chx’s picture

IMNSHO we should move cache_* into another .inc file, and make the filename a variable so that you can override it from settings.php. This will make SQL, memcached, Berkeley DB, file caching all possible. Not necessarily all core.
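
The general shape of such a pluggable layer might look like the sketch below. It is Python rather than PHP, and where the proposal selects an include file by a variable in settings.php, this sketch selects a class by name from a settings dict. The names DatabaseCache, MemoryCache, BACKENDS, load_cache and the "cache_backend" setting are all hypothetical.

```python
class DatabaseCache:
    """Stand-in for the default SQL-backed cache."""
    name = "database"

    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def set(self, key, value):
        self.rows[key] = value


class MemoryCache(DatabaseCache):
    """Stand-in for an alternative backend such as memcached;
    it reuses the same get/set interface."""
    name = "memory"


# Registry of available backends; a contrib module could add its own entry.
BACKENDS = {"database": DatabaseCache, "memory": MemoryCache}


def load_cache(settings):
    """Pick the cache implementation named in the site settings,
    falling back to the database backend."""
    backend_name = settings.get("cache_backend", "database")
    return BACKENDS[backend_name]()


default_cache = load_cache({})
custom_cache = load_cache({"cache_backend": "memory"})
custom_cache.set("menu:1", "tree")
print(default_cache.name, custom_cache.name, custom_cache.get("menu:1"))
```

Callers only depend on get and set, so swapping the backend is a one-line settings change and alternative implementations do not need to live in core.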

Arkimedes’s picture

A few more observations on adding support for memcached. The main performance boost seems to
come from the fact that we reduce database calls by pulling data from a memory cache instead of
the database. As long as we go through the cache_* functions, any site set to CACHE_METHOD_DATABASE
will hit the database for the cached value. A cache miss usually means another database call to get
the data we are looking for. With memcached a cache miss means we contact the memcached server
followed by the original database call. A cache hit means we avoid that original database call.
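
The cost of a hit versus a miss can be sketched by counting round trips. This Python example is only an illustration under stated assumptions: the dict stands in for memcached, load_from_database is a hypothetical placeholder for the original query, and both the memcached lookup and the subsequent store are counted as one round trip each.

```python
calls = {"database": 0, "memcached": 0}
memcached = {}  # stand-in for the memcached server


def load_from_database(key):
    """Placeholder for the original (expensive) database query."""
    calls["database"] += 1
    return "value for " + key


def get_with_memcached(key):
    """Read-through lookup: try memcached first, fall back to the database."""
    calls["memcached"] += 1              # the lookup itself
    if key in memcached:
        return memcached[key]            # hit: the database call is avoided
    value = load_from_database(key)      # miss: original database call
    calls["memcached"] += 1              # storing the value for next time
    memcached[key] = value
    return value


first = get_with_memcached("node:1")     # miss
second = get_with_memcached("node:1")    # hit
print(calls)
```

The miss costs two memcached round trips plus the database call; the hit costs a single memcached round trip and no database call.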

There are plenty of items we could cache in memory by migrating to cache_* functions and using
memcached, but that may mean extra database calls for sites using CACHE_METHOD_DATABASE.
It doesn't seem fair to cause the extra load on the database cache just so memcached users can reduce
the overall database usage. Should we look at using memcached without going through the cache_*
functions, or is there another way to get good support for memcached without causing more load
with the database cache?

The current patch saves a few database calls when using memcached. The biggest savings is from
using the page cache feature for anonymous users, but I think there is more we can do using
memcached.

I'm using this command before and after a benchmark run to get the number of db queries made.

sh -c "mysql -e status;" | grep Threads

If you know how many page requests were made, you can figure out how many database calls
there are per page request, on average.
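
The arithmetic can be sketched as follows, assuming the matched line carries a "Questions: N" counter as the mysql client's status output normally does. The function names and the sample numbers below are made up for illustration.

```python
import re


def questions(status_line):
    """Extract the Questions counter from a mysql status line."""
    match = re.search(r"Questions:\s*(\d+)", status_line)
    if match is None:
        raise ValueError("no Questions counter in: " + status_line)
    return int(match.group(1))


def queries_per_page(before, after, page_requests):
    """Average number of database queries per page request
    between two status snapshots."""
    return (questions(after) - questions(before)) / page_requests


# Illustrative snapshots taken before and after a 100-request benchmark run.
before = "Threads: 1  Questions: 1200  Slow queries: 0"
after = "Threads: 1  Questions: 6200  Slow queries: 0"

average = queries_per_page(before, after, 100)
print(average)
```

Comparing this average between the database and memcached cache methods shows how many queries per page the cache backend actually saves.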

I think most benchmark programs are going to request anonymous pages, here's a sample
command line to use with CURL. It should use the user credentials for the user whose cookie
is passed in. Just update the random characters with the value of the correct cookie. Check
your browser on how to view cookies.

curl http://127.0.0.1/site/ -o /dev/null -s --cookie PHPSESSID=6krpfo8rlt8oiu2122951nil02

With this you can monitor database usage along with the output of your benchmark runs.

chx’s picture

I vote on moving this to contrib. I have the patch http://drupal.org/node/67675 to make it possible.

dries’s picture

Arkimedes: could you check chx's patch? Because there are potentially a dozen caching mechanisms (file caching, memcached, etc.), it seems like a really good idea to make a good, generic API that allows everyone to plug in their favorite caching mechanism.

I'd like to see chx's patch go in first, and then we can revisit this one? Does that make sense to you?

moshe weitzman’s picture

Status: Needs review » Active

the cache.inc patch referenced above has been committed. i suggest starting a new issue for a memcache backend that uses this pluggable interface.

moshe weitzman’s picture

Status: Active » Closed (fixed)

memcached should be in contrib