We are running a Drupal 4.6.3 site on two (!) modern dual-Xeon front-end servers (Apache 2.x, 2 GB RAM each) and one modern dual-Xeon database server (MySQL 4.x, 4 GB RAM).

Running one Drupal community on two front-end servers is probably quite unusual. We are using a hardware load balancer that alternates incoming users between the two servers. Both servers have access to the same Drupal and user files through a shared disk array that both have mounted. This solution works flawlessly and makes Drupal truly scalable, at least as far as the front end is concerned. So far so good.

The database server, with its dual Xeons and 4 GB of RAM, should be able to handle anything we throw at it, but this is not the case. When we reach about 50 authenticated users or 25 page impressions per minute, the server locks up with slow queries that take hundreds of seconds to complete. This is despite the processing power and the fact that the server can keep the entire database in RAM. We were expecting more from this hardware.

The community has roughly 40,000 nodes and 60,000 comments.

The problem is SQL queries that scan through the node_access table (we are using og.module, which has access control) and the comments table without using an index. With tens of thousands of rows in these tables, scanning through them frequently overloads even this hardware. With only a few users these queries take a few seconds, but once we hit some 50 authenticated users they start piling up and getting queued, and each query starts taking over 100 seconds.

Access control is resource-intensive for obvious reasons, and this seems to be a common problem for large communities even on powerful hardware, whether using taxonomy access, node privacy by role or organic groups:

http://lists.drupal.org/archives/drupal-devel/2005-02/msg00378.html

Moshe Weitzman helped us tweak the node access query in Drupal 4.5.x as discussed in the February posting (link above) but now we are facing these slow queries again with Drupal 4.6.x.

And it is not only access control: queries that scan through the comments table, with its tens of thousands of rows, also put a serious load on the database server. Databases are not my specialty, but I was told that part of the problem is that the queries scan through these large tables without making use of an index, and making use of an index is of paramount importance in MySQL.
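To make the index point concrete, here is a hypothetical illustration against a stock Drupal 4.6 comments table (the index name is invented, and whether this exact index helps depends on the actual query mix):

```sql
-- EXPLAIN shows the access plan; "type: ALL" means a full table scan.
EXPLAIN SELECT c.nid, c.subject
FROM comments c
WHERE c.status = 0
ORDER BY c.timestamp DESC
LIMIT 10;

-- An index on the filtered column lets MySQL avoid scanning every row:
ALTER TABLE comments ADD INDEX comments_status (status);
```

After adding an index, running the same EXPLAIN again should show it in the possible_keys and key columns.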

I would like to hear general comments on using Drupal with large databases and how to keep queries quick, as well as comments more specifically about access control queries. I would also like to know more about using (and not using) indexes in queries.

I am well aware of the benefits of caching and I would like to get into things like page element caching, caching to the file system and so on, but here I have tried to focus specifically on queries and their impact on large Drupal sites.

Comments

andre75’s picture

Very interesting. Sorry I can't offer a solution, but I am very interested in the responses you are going to get.
What happens to your sessions table with that many users? Isn't that table growing like crazy?

Andre

-------------------------------------------------
http://www.opentravelinfo.com
http://www.aguntherphotography.com

dumell’s picture

No, the sessions table is not giving us any trouble; it only has about 500 rows right now. We don't actually have that many users: 50 simultaneous users is not that much. The fact that we are using two front-end servers is not causing any trouble either.

moshe weitzman’s picture

hi dumell. it would be helpful if you posted the slow queries here. that would give developers something to chew on.

if you are not using private groups, you could click the button to 'disable access control'. og still functions perfectly in this state, and there is no performance penalty for using a node access module.

if you are using private groups, then you really should share the queries here, or on the devel mail list so that we can find a better solution.

the 4.7 version of og will not write a record in the node_access table for every subscription (i.e. the og_uid realm is going away) so that alone will greatly trim down the size of this table and speed up the queries. the HEAD version of og is almost in this state now. i expect to finish that off in the next week.

dumell’s picture

I'll get back to the slow queries, I don't have a copy of any right now.

Unfortunately we do need private groups, and I understand that this makes everything a lot more complex.

In 4.5 we eventually managed to speed up Drupal significantly by tweaking the query, can something like that be done in 4.6 as well? It will probably take quite a while before we get around to upgrading to 4.7.

A thought that came to mind was this: most requests are for new nodes. In our case, I think 95% or more of all requests are for content created during the last month, while content has been produced quite steadily for about 12 months. In other words, about 95% of the queries are aimed at less than 10% of the content.

As a general idea, might it make sense to have something like a node table and a node_archived table, where a cron job would move content from node to node_archived after a period specified by the admin? Node queries would go to the node table first and, only if the node is not found there, to node_archived. There would be a performance hit whenever a node is not found in the first search, but potentially quite a performance gain for most queries if the node table is significantly smaller and most queries find their node there. Or would this all be lost in the added complexity?

I know that many of you have put a lot of time into developing Drupal and thought about these things a lot more than me so it would be interesting to hear some sober thoughts about these performance issues.

dumell’s picture

Here are a few examples where we told MySQL to "explain" the queries just before the database hangs:

http://web.dumell.net/temp/slow_explanes.txt

And here are two examples of queries that are slowing us down to a standstill:

# Query_time: 11 Lock_time: 0 Rows_sent: 10 Rows_examined: 106993
SELECT DISTINCT(c.nid), c.* FROM comments c INNER JOIN node_access na ON na.nid = c.nid WHERE (na.grant_view = 1 AND CONCAT(na.realm, na.gid) IN ('all0','og_uid536','og_group0','og_group21312')) AND c.status = 0 ORDER BY c.timestamp DESC LIMIT 0, 10;

# Query_time: 11 Lock_time: 0 Rows_sent: 5 Rows_examined: 55837
SELECT DISTINCT(n.nid), n.title, u.uid, u.name FROM node_counter s INNER JOIN node_access na ON na.nid = n.nid INNER JOIN node n ON s.nid = n.nid INNER JOIN users u ON n.uid = u.uid WHERE (na.grant_view = 1 AND CONCAT(na.realm, na.gid) IN ('all0','og_uid536','og_group0','og_group21312')) AND s.daycount <> '0' AND n.status = 1 ORDER BY s.daycount DESC LIMIT 0, 5;

peterx’s picture

Run the following SQL in phpmyadmin.
alter table node_access add index grant_view_index (grant_view)

Your SQL "where (na.grant_view = 1" will speed up if there is an index on column grant_view. Create the index using phpmyadmin and measure the result. There may be other indexes needed.

realm and gid are part of the node_access primary index but are used independently in concat(na.realm, na.gid). The best approach would be to build a field, realm_gid, that contains realm and gid already concatenated, index the new field, and replace the concat with realm_gid. If you cannot do that, then replace the following SQL.
CONCAT(na.realm, na.gid) IN ('all0','og_uid536','og_group0','og_group21312')

Use something like the following SQL.
((na.gid = 0 and na.realm = 'all')
or (na.gid = 536 and na.realm = 'og_uid')
or (na.gid = 0 and na.realm = 'og_group')
or (na.gid = 21312 and na.realm = 'og_group'))

You may have to add separate indexes for gid and realm.
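For the record, those indexes could be created with something like the following (index names are ours; a single compound index may serve the rewritten WHERE clause better than two separate ones):

```sql
-- Separate indexes on the two columns:
ALTER TABLE node_access ADD INDEX node_access_gid (gid);
ALTER TABLE node_access ADD INDEX node_access_realm (realm);

-- Or one compound index covering both columns used together:
ALTER TABLE node_access ADD INDEX node_access_gid_realm (gid, realm);
```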

How many rows are in the node_access table?

http://petermoulding.com/web_architect

dumell’s picture

Thank you for your reply. We will try it out and see where it takes us. Right now we have 31935 rows in the node_access table.

moshe weitzman’s picture

these seem like sound suggestions. i am working with dumell on this problem. if these suggestions work, i'll cook up a patch for drupal core.

peterx’s picture

Replace the following SQL at about line 1898 in node.module. A similar change is needed at line 1942. The full detail is at http://drupal.org/node/36429.

$sql = 'SELECT COUNT(*) FROM {node_access} WHERE (nid = 0 OR nid = %d) AND CONCAT(realm, gid) IN (';
$grants = array();
foreach (node_access_grants($op, $uid) as $realm => $gids) {
  foreach ($gids as $gid) {
    $grants[] = "'". $realm . $gid ."'";
  }
}
$sql .= implode(',', $grants) .') AND grant_'. $op .' = 1';

Use the following code.

$grants = array();
foreach (node_access_grants($op, $uid) as $realm => $gids) {
  foreach ($gids as $gid) {
    $grants[] = "(gid = " . $gid . " and realm = '" . $realm . "')";
  }
}
$sql = 'SELECT COUNT(*) FROM {node_access} WHERE (nid = 0 OR nid = %d) AND (' . implode(' or ', $grants) . ') AND grant_'. $op .' = 1';

http://petermoulding.com/web_architect

dumell’s picture

I made the changes to node.module (although the line numbers were slightly off; we are using v 1.485.2.13 2005/08/11).

No error messages and Drupal was noticeably more responsive, the difference was immediately evident.

We ran an EXPLAIN in MySQL, and now the queries are making use of the index. This is an important improvement.

Our weekly usage peak will be in about 3 hours and then we will know more. I'll let you know how it works out.

Here is a copy of the explain:

http://web.dumell.net/temp/quick_explanes.txt

dumell’s picture

A user can create a group and post in it, but when this person opens the group page again, it says "No posts in this group."

This is the same message a user gets when trying to open a group in which he or she is not a member. Obviously, we are not getting the correct node access information.

As user 1 I can see the postings in all groups.

With the old node.module, we did not have this problem.

The Drupal log also shows I am getting these errors when I try to open a group:

Column: 'gid' in where clause is ambiguous query: SELECT DISTINCT(n.nid), n.title, n.uid, u.name, n.created, l.comment_count, l.last_comment_timestamp FROM node n INNER JOIN node_access na ON na.nid = n.nid INNER JOIN node_access og_na ON n.nid = og_na.nid INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid WHERE (na.grant_view = 1 AND ((gid = 0 and realm = 'all') or (gid = 83 and realm = 'og_uid') or (gid = 0 and realm = 'og_group') or (gid = 3962 and realm = 'og_group') or (gid = 24705 and realm = 'og_group'))) AND og_na.gid = 3962 AND og_na.realm = 'og_group' AND n.status=1 AND n.type IN ('story') ORDER BY n.sticky DESC, l.last_comment_timestamp DESC LIMIT 0, 20 in /drupal/includes/database.mysql.inc on line 66.

And the other error message I keep getting:

Column: 'gid' in where clause is ambiguous query: SELECT COUNT(*) FROM node n INNER JOIN node_access na ON na.nid = n.nid INNER JOIN node_access og_na ON n.nid = og_na.nid INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid WHERE (na.grant_view = 1 AND ((gid = 0 and realm = 'all') or (gid = 83 and realm = 'og_uid') or (gid = 0 and realm = 'og_group') or (gid = 3962 and realm = 'og_group') or (gid = 24705 and realm = 'og_group'))) AND og_na.gid = 3962 AND og_na.realm = 'og_group' AND n.status=1 AND n.type IN ('story') in /drupal/includes/database.mysql.inc on line 66.

peterx’s picture

In the first query:
AND ((gid = 0 and realm = 'all') or (gid = 83 and realm = 'og_uid')))
should be:
AND ((na.gid = 0 and na.realm = 'all') or (na.gid = 83 and na.realm = 'og_uid')))

The node_access table is used twice, once as na and once as og_na, which means the fields are defined twice and have to have the table prefix.

The second query requires a similar change. I cannot find "SELECT COUNT(*) FROM {node}" in my test Drupal. Which module contains the query?

petermoulding.com/web_architect

dumell’s picture

So what do I do about that first query? I'm sorry, it's not obvious to me :)

The second query is on line 1421 in og.module (version 1.53.2.30 2005/07/16).

peterx’s picture

My suggestions for changes are based on what I interpret from the error messages. Some SQL is assembled from fragments built by several pieces of code, which means I may not find the SQL by searching for it in the source.

In the case of node.module, at least one piece of SQL is partially generated through a function named _node_access_where_sql() and the function lets you provide a prefix for field names. My first suggestion, in http://drupal.org/node/36429, for changing the SQL in _node_access_where_sql() excluded the field name prefix. I rewrote the suggestion in http://drupal.org/node/36429.

When you access a field in SQL, you have to prefix the field name with the table name if the field is in more than one table. You can also include a table twice by joining it twice. In the SQL INNER JOIN node_access na ON na.nid = n.nid INNER JOIN node_access og_na ON n.nid = og_na.nid, the table node_access is included twice: on the first inclusion it is renamed na and on the second it is renamed og_na. That means the fields gid and realm are in the SQL twice, so you have to refer to na.gid or og_na.gid and na.realm or og_na.realm.

The code that generates the SQL has to include the na. prefix. Function _node_access_where_sql() supplies a default na in variable $node_access_alias. To use $node_access_alias, I had to replace:
$grants[] = "(gid = " . $gid . " and realm = '" . $realm ."')";
with:

$grants[] = "(" . $node_access_alias . ".gid = " . $gid
	. " and " . $node_access_alias . ".realm = '" . $realm ."')";

You probably need something similar in og.module.

I downloaded og.module,v 1.53.2.30 2005/07/16 and looked at line 1421. At line 1414 in function og_block_new() I found the following SQL.

$sql = "SELECT COUNT(*) FROM {node} n INNER JOIN {og} og ON n.nid = og.nid
 WHERE og.directory=1 AND n.type = 'og' AND n.status = 1";

At line 1418 in the same function I found the following SQL.

$sql = "SELECT n.nid, n.title FROM {node} n INNER JOIN {og} og ON n.nid = og.nid
 WHERE n.status = 1 AND n.type = 'og' AND og.directory=1 ORDER BY nid DESC";

The SQL does not match the error message. Is your og.module modified? Was your og.module modified before you started this post? What was in your module at the start? What is in your module now?

petermoulding.com/web_architect

dumell’s picture

We used to have a modified og.module (back in the Drupal 4.5 days) but since 4.6 we have tried to avoid all kinds of modifications to the modules. I am quite sure we have not modified the og.module but I'll check it (I don't have access to the server right now).

The only unusual thing that comes to mind is that we are using the Drupal multi-site support.

I'll think this through and get back to it tomorrow.

It was very encouraging to see the queries making use of indexes, and the database did not lock up this time. On the other hand, we only got about 1 page impression per second, so no massive load, probably because some of the pages were inaccessible due to the problem with node access.

But if we can get working queries that are this efficient, then we will probably have solved the problem.

dumell’s picture

I changed node.module according to the rewrite in http://drupal.org/node/36429 and now everything seems to work. I've only tested it for 30 seconds, but it feels fast, I can see the nodes I should be able to see, and I don't see nodes I am not supposed to see. And there are no error messages in the log. It looks like the rewrite fixed both errors that appeared in our log after the first change to node.module.

This is looking so good that I will leave this module in use and let's see what the end users have to say. Hopefully they will not notice any other change than a generally faster community.

On Monday afternoon (GMT+2) we will see how it works under stress. I'll let you know how it turns out.

A big thank you to everyone and especially to peterx.

dumell’s picture

We have now been using the modified node.module for four days without problems. Today we peaked at about one request per second and the site still felt fast and responsive. With the standard 4.6.3 node.module our MySQL server would not have been able to handle these numbers.

Right now we have 33,000 rows in "node_access", 21,000 in "node", 57,000 in "comments" and 4,000 in "og".

Unfortunately, we have lost customers because of our technical problems, so we did not get as many users as previously on Mondays, but even with this number, the benefits of the modifications to the node.module are obvious.

No one has complained about any problems with access rights or anything else during these four days. As far as I can tell, the modifications work and they greatly improve Drupal's performance when using a per node access control mechanism.

dries’s picture

Dumell, are you using the exact patch from http://drupal.org/node/36429? Could you update that issue? Thanks.

dumell’s picture

No, we are not using the patch; I made the changes before the patch was available. But I'll switch to the patch and check it out as soon as possible, and make a comment in /36429.

There might, however, be a problem (with the pre-patch changes at least) that I noticed today: the site where we are using the updated node.module no longer indexes nodes for search.module. The problem might be unrelated, but I'll switch back to the standard 4.6.3 node.module for a while to see whether it is related to our changes. I'll get back to this.

handelaar’s picture

The comments table doesn't have a useful index on it. I got a *serious* boost by adding an index on "nid, cid".
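For reference, the compound index described here could be created with something like this (the index name is arbitrary):

```sql
ALTER TABLE comments ADD INDEX comments_nid_cid (nid, cid);
```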

Welcome to the club: not many of us have this particular problem, which is probably why it never got addressed in the releases. I have about a quarter of a million comments to deal with. The display of pages *with* comments was being bottlenecked by the table scan. The new index made that go away.

I got an additional speed boost by not having Drupal load the entire url_alias table into memory. See http://drupal.org/node/22035 for details. (It's now fixed in CVS.)

peterx’s picture

The following changes could be useful for large sites depending on their traffic balance. The effect might be greater with compound indexes similar to your nid,cid.

function comment_block

'SELECT c.nid, c.* FROM {comments} c WHERE c.status = 0 ORDER BY c.timestamp DESC'
If this SQL is used on every page of your site, then try an index on column comments.status.

functions comment_edit, comment_reply, and comment_preview

The SQL in these functions could benefit from an index on uid because a few users editing the comments table can slow down the whole site.

function comment_post

The duplicate check in this function could be slow because the first search is for the unindexed pid. Either index pid or change WHERE pid = %d AND nid = %d to WHERE nid = %d AND pid = %d (and change the matching parameters).
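A sketch of the first alternative (the index name is ours):

```sql
-- Index the parent-comment column used by the duplicate check.
ALTER TABLE comments ADD INDEX comments_pid (pid);
```

The second alternative is the query change described above, testing the indexed nid column first.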

Reverse the tests on the following SQL to use the index on nid before the scan for LIKE. Some versions of MySQL can change the search sequence for you but you cannot be sure if it is using the fastest access all the time unless you regularly run MySQL reports.
WHERE thread LIKE '%s.%%' AND nid = %d", $parent->thread, $edit['nid']
WHERE nid = %d and thread LIKE '%s.%%'", $edit['nid'], $parent->thread

petermoulding.com/web_architect

dries’s picture

Peter, any chance you can turn that into a patch? That would be much appreciated.

bslade’s picture

The site I was supporting (running Drupal 4.5.x) was having all sorts of cache collision errors and performance issues. Some of the errors were annoying, e.g. we kept getting cache collisions even though Drupal caching was disabled. Also, because throttling kept kicking in, there were random "hangs" (slow responses) as perceived by the users.

That site eventually went to a file based cache (see dfcg) to get dramatic performance improvements with very few errors.

There are downsides to this sort of cache. You can't put user-specific content on the front page or on the nodes linked from the front page, but you get way faster, more consistent performance using a lot less server horsepower.

PublicMailbox@benslade.com
"It's the mark of an educated mind to be moved by statistics"
Oscar Wilde

dumell’s picture

We have another community besides the one I have mentioned here so far, with more traffic, but it demands much less of the servers since it has less content, doesn't make use of node access control, and most of its users are anonymous and are therefore served from the built-in Drupal cache.

This site typically generates about 5000 page impressions per hour, peaking at over 100 requests per minute, and probably something like 90% of the page impressions go to anonymous visitors; in other words, caching works well.

It would be nice to have a cache that works like the current built-in Drupal cache, but stores the nodes in files instead of in the database and serves the cached pages only to anonymous users. It would probably not increase performance that much, but it would slightly decrease the load on the database server and, most importantly, it would keep serving anonymous users even if the database server goes down.

I know this has been done in previous Drupal versions and it did not make it into core. Still, the idea of a file-based cache sounds very interesting to me and I will certainly give this solution a try. Does anyone here know why the old implementation (in 4.4, I think it was) didn't become popular and why it was abandoned? Are there some obvious drawbacks compared to caching the nodes in a database?

How does the dfcg solution mentioned above work with regard to anonymous/authenticated users? Does it serve cached pages only to anonymous users, or to all users? A quick look at the documentation tells me it uses a rewrite rule to direct all users to the cached files. And judging by the documentation, I guess it is not multi-site (as in 4.6) compatible?

bslade’s picture

The dfcg (Drupal File Cache Generator) serves cached pages to all users, for the front page and the nodes that are linked to from the front page.

So logged-in users see the same front page as anonymous users. You can see what it looks like at www.afterdowningstreet.org. Note that if you add an "&" after the URL, it bypasses the file caching.

Ben in DC

zaczek’s picture

One way is to hack the common.inc functions that serve the cache from the cache table. I introduced a function that saves a serialized result of the cache query as a file. Then, instead of querying the database each time to find the cache, I first check whether a file with the result (not older than 5 minutes) exists on disk. If it does, it is read, unserialized and served as if it were an SQL query result.

The name of the file is an MD5 hash of the SQL query string - so no problems finding the right file.

In order to cache logged-in sessions, I have hacked the function that writes the cache URI, adding the current user ID at the end.

I have seen a tremendous improvement (several seconds on a page refresh at peak times) with this hack.
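The actual patch is not shown above, but the idea can be sketched roughly like this (the function names and cache directory are invented for illustration; the real hack lives inside common.inc's cache functions):

```php
<?php
// Rough sketch of a file-based query cache keyed by the MD5 of the SQL
// string, with a 5-minute time-to-live. All names here are hypothetical,
// not from Drupal core.

function file_cache_path($sql, $dir = '/tmp/drupal_file_cache') {
  if (!is_dir($dir)) {
    mkdir($dir, 0700, TRUE);
  }
  // The file name is the MD5 hash of the SQL query string, so finding
  // the right file for a given query is trivial.
  return $dir .'/'. md5($sql) .'.cache';
}

function file_cache_get($sql, $ttl = 300) {
  $file = file_cache_path($sql);
  // Serve from disk only if the file exists and is fresher than $ttl seconds.
  if (file_exists($file) && (time() - filemtime($file)) < $ttl) {
    return unserialize(file_get_contents($file));
  }
  return FALSE;
}

function file_cache_set($sql, $result) {
  // Store the serialized query result; a later get unserializes it and
  // serves it as if it were a fresh SQL result.
  file_put_contents(file_cache_path($sql), serialize($result));
}
```

To cache pages for logged-in users, the cache key would also need the current user ID appended, as described above.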