I'm wondering about the practical outer limits of what Drupal can handle. I have a client who currently has a site with about 40k regular users, and wants to blog-enable them. I'm reasonably confident that I can handle the user detail sync stuff, and have a few ideas about MySQL changes that might be needed, but I'm wondering what I've not thought about.

Is anyone out there running a site of that sort of scale with drupal? If so, got any words of wisdom to share?

Comments

al’s picture

Provided the database can handle it (which PostgreSQL/MySQL easily can), having an awful lot of users in itself shouldn't cause any problems. The user table is indexed, so logins should scale nicely.

The real problem is the number of nodes. People have encountered problems with having huge numbers of nodes for the book module (it builds a tree for navigation, which isn't currently very efficient) but this is being looked at.

Should be fine, but let us know, right?

--
Al Maw
almaw.com

Malach’s picture

Assuming the client goes ahead with it (we're still talking about things), I'll be sure to let y'all know how it goes.

bertboerland’s picture

i was wondering if someone would be able to make a module to get the smsc info or mail from sms into drupal. it seems that this is, at least in europe where both blogging and using sms are rather popular, a way to generate a lot of kickback fees for the mobile teenagers.

in dutch, i blogged this idea some time ago, see willy dobbe

--
groets

bertb

--
groets
bert boerland

slack’s picture

I don't know whether you've already worked this one out yourself, but most mobile phone carriers provide an SMS-to-Email gateway, which you can use to make moblogging work on Drupal websites. That is, use the normal email-to-drupal module from the Downloads section on this website to handle the email once it comes in from the gateway.

ricard@puntbarra.com’s picture

Not everyone can make it work...

moshe weitzman’s picture

see scott's paper.

since that paper, marco and jeremy have added a filecache such that drupal can serve up files without any database calls at all during extreme distress. with this enhancement, drupal scalability exceeds other limiting factors like bandwidth, i/o, etc.

Malach’s picture

That, right there, is some damned worthwhile content.

marco’s picture

My last bottleneck was caused by Apache+PHP, which takes a lot of memory per process. I solved it by using Squid as an accelerator that serves static content only (not drupal pages, not even cached pages). I also use lingerd with Apache, which further decreases the number of concurrent Apache processes.
And of course a php code cache like php-accelerator.

--
Marco

shane’s picture

Just some general notes I'd make, in regards to dealing with large scale sites. I'm guessing that if you've been admining a site with so many users already, you probably know a lot of this ... but, just in case ...

I highly recommend using Server Load Balancing systems in front of your web servers, and in front of your DB servers.

By implementing an SLB (with a failover partner) in front of the web servers, you can scale the number of web servers you need to handle the load. This means you can use a bunch of lower end platforms that might otherwise be useless. Just give them a lower weight so they get used a lot less than your bigger hardware platforms. Keeping the DocumentRoot directories in sync can be done in a number of ways: either NFS mounted volumes with a single master, or, preferably, a "gold master" that rsyncs the content to each of the individual web servers.
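As a rough illustration of sending less traffic to the lower end boxes, here is a small Python sketch of a weighted rotation. The server names and weights are invented for the example, and a real SLB will have its own (usually smarter) scheduling; this only shows the proportions.

```python
import itertools
from collections import Counter


def weighted_rotation(weights):
    """Cycle through server names, each appearing in proportion to its weight."""
    pool = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(pool)


# Hypothetical farm: two big boxes and one old machine that gets a small share.
servers = {"big1": 4, "big2": 4, "old1": 1}

rotation = weighted_rotation(servers)
counts = Counter(next(rotation) for _ in range(900))
```

Out of every nine requests, each big box takes four and the old machine takes one.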

You can then handle maintenance tasks much more easily. Take a dead, dying, or otherwise ailing box out of the SLB mix, fix, upgrade, maintain - or whatever. Then when done - put it back in.

On the DB side, it gets a bit more complicated. You have to deal with the actual DB content, locking, etc. One way to do it is simply to have one big beefy DB, then a second which is only used in a standby emergency mode. If the main DB dies, the SLB redirects queries to the standby. Keeping the DB tables in sync is done using Network Attached Storage (NAS - i.e. an NFS mounted volume, Network Appliance Filer, or any other lower end type of NAS unit), or more appropriately (obviously if money is less of an issue) a Storage Area Network. In this case, there's only one DB store.

The other method is to simply back up the main database, move it over, and restore it to the standby DB - doing this on a regular schedule. Not perfect, because you'll have a window of lost data in the event of a failover.

I know that MySQL is beginning to implement some features that may allow multiple DB front-end transaction systems querying and modifying a single DB back-end store (I think through the InnoDB type tables). This may be used such that you can have multiple DB front-ends, reading and writing to the same DB table space behind them, on a NAS or SAN type storage. I don't have direct experience with this though.

One benefit to the SLB front-end process is you can quickly and easily scale your capacity. If you find that 1 web server isn't enough, stick in another. You need more? Add a 3rd or 4th. You can use lower end hardware to add capacity.

In both of these cases, since session persistence is important, you'll want to ensure your SLB units are set up to maintain connections from the client to the same server throughout the client's session. Almost all hardware based SLB units (F5 [my favorite - but very expensive], Cisco, Nortel/Alteon, Array Networks, etc...) easily do this through a configuration parameter. I'm assuming that software based SLB systems available for Linux and other OSs can do this also.

If a server dies in the middle of a session with a client, then the client would have to log in again and reestablish their connections. They may lose an amount of work/time/data, but it only affects the handful of users connected to that specific server when it died. Other users are not affected.
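A minimal Python sketch of that persistence behaviour, assuming a simple in-memory table that maps each client to a server. `StickyBalancer` and the client/server names are hypothetical, and real SLB units do this in their own way; the sketch only shows that a dead server disturbs its own clients and nobody else.

```python
class StickyBalancer:
    """Hypothetical balancer that pins each client to one server."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.assigned = {}  # client id -> server name
        self._next = 0      # round-robin position for new clients

    def route(self, client):
        server = self.assigned.get(client)
        if server is None:
            # New (or orphaned) client: pick the next server and remember it.
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            self.assigned[client] = server
        return server

    def remove_server(self, dead):
        """Take a server out of the mix; return the clients that lost their session."""
        self.servers.remove(dead)
        dropped = [c for c, s in self.assigned.items() if s == dead]
        for client in dropped:
            del self.assigned[client]
        return dropped


clients = ["alice", "bob", "carol", "dave"]
lb = StickyBalancer(["web1", "web2", "web3"])

before = {c: lb.route(c) for c in clients}
lost = lb.remove_server("web2")           # web2 dies mid-session
after = {c: lb.route(c) for c in clients}
```

Only the clients that were pinned to the dead server end up in `lost` and get routed somewhere new; everyone else stays where they were.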

Sorry if this is useless info you already know - but it's an excellent mechanism to handle volume, capacity, and fault tolerance in a site.

moshe weitzman’s picture

excellent overview, shane.

one note for drupal readers - drupal overrides the standard php session handlers and instead stores its session information in the database. thus it is no problem if users bounce between web servers during a session.
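a minimal sketch of why that works, in python with an in-memory sqlite database standing in for the shared database. the `sessions` table, its columns, and the function names are made up for illustration and are not drupal's actual schema or code.

```python
import sqlite3

# One shared database; every web server talks to the same session table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (sid TEXT PRIMARY KEY, data TEXT)")


def write_session(conn, sid, data):
    conn.execute(
        "INSERT OR REPLACE INTO sessions (sid, data) VALUES (?, ?)", (sid, data)
    )
    conn.commit()


def read_session(conn, sid):
    row = conn.execute(
        "SELECT data FROM sessions WHERE sid = ?", (sid,)
    ).fetchone()
    return row[0] if row else None


def web_server(name, conn, sid, login=None):
    """Stand-in for a request handled by one web server in the farm."""
    if login is not None:
        write_session(conn, sid, login)
    user = read_session(conn, sid)
    return "%s: %s" % (name, user or "anonymous")


r1 = web_server("web1", db, "abc123", login="malach")  # user logs in via web1
r2 = web_server("web2", db, "abc123")                  # next request lands on web2
```

the second request is handled by a different server but still sees the logged-in user, because the session lives in the database rather than on either web server.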

shane’s picture

Weitzman - I'm curious then ... I occasionally see page loads under Drupal that have "phpsesson=NNNNN ..." or similar (can't recall exact details right now) URLs getting loaded. If all session data is stored in the DB - is this URL passing of that session info? Normally isn't this supposed to occur via Cookies (assuming I've got them enabled - which I do). TIA.

-Shane

marco’s picture

this is php testing whether your cookies are enabled or not. it rewrites every internal url, adding "phpsession=xxx"; if your browser sends a phpsession cookie (with the same value), it means you have cookies.

--
Marco

bertboerland’s picture

does that mean you could have a farm of webservers running without a hardware load balancer in front of it, just by using, for example, multiple A records in the dns?

it seems that if *all* session info is stored in the database, the webserver doesn't need "stateful" session information anymore.

am i correct on this one?

--
groets

bertb

--
groets
bert boerland

Malach’s picture

Most of that, I already know - (or, more accurately, I have resources which know - one of my business partners has done a lot of work on high traffic/high availability sites), but that's a nice summary of things, and still very useful.

Thanks

Anonymous’s picture

I don't know if this helps, but I know that Drupal can work on both PostgreSQL and MySQL based systems, and that the current version of PostgreSQL has a clustering project associated with it to create easy DB clusters. So if you're willing to convert your databases to PostgreSQL, then that may help a bit.

shane’s picture

That's very interesting - I'm glad to see PostgreSQL going a favorable route. Anyone know of any similar Clustering projects for MySQL - that'd be the holy grail (in my MySQL centric-view-of-the-world-opinion!). If either PostgreSQL or MySQL could correctly tackle the clustering issue - they'd pretty much sew up the low end market 100% and seriously start eroding into the middle-end market space.

-Shane

szczym’s picture

wikipedia folks have similar setup
http://meta.wikipedia.org/wiki/Server

phpdude’s picture

good post