Dear friends,

I am really new to Drupal, but I think I discovered a critical issue for Drupal. In the past I developed pgAdmin 1, PostgreSQL GUI. My website has 600.000 visitors every month. The database server is PostgreSQL 8.3 with full logging turned on to monitor query times. We are using Drupal 6.2 and PHP5.

Today we were flooded very easily. My server was so slow that I had to reboot it remotely on a rescue system and analyse logs. Which happens very-very rarely !

I looked at the log and noticed the anormal duration of some SQL queries. Also, I monitored Apache server-status and logs and found access to the same Drupal pages.

My Drupal configuration is configured without cache, because PostgreSQL is usually faster than any existing cache * ever. My log usually shows execution times below 1 millisecond. For example, my PHPBB board never reaches times over 5 millisecond. Even for 500.000 messages and thousands of connected people. I manage to set a relatively high shared memory.

Now ... here is how the attackers flooded Drupal :
* They connected to the same drupal pages,
* PostgreSQL served the pages in 10 milliseconds,
* PostgreSQL and Apache began a fight to access shared memory,
* With Apache forking, PostgreSQL could not keep Drupal tables in shared memory.
* PostgreSQL became slower and served pages in 100 milliseconds because there was no indexes on tables.
* SWAP was filled up and I could not connect to sshd.

Okay, after understanding the scenario I enabled cache data in Drupal ...
but this is not a correct solution to serve very high load servers.

The best, the fastest solution is to use PostgreSQL configured with proper indexes, especialy in JOINS.

Possible solution to boost Drupal :
* Make a test installation with thousands of pages and users (possibly this site) under PostgreSQL.
* Configure PostgreSQL with full logging ON, including query execution times.
* Surf on the site and copy-paste the queries in pgAdmin graphical client.
* Run explain analyse to understand query plans.
* Add indexes where it is needed.
* Do not allow queries running in more than 5 milliseconds. Most queries should run below 1 millisecond.

That's it. Quite a large work, but in the end Drupal could be 10 times faster.

Do not hesitate to answer this email.
I can help.

Kind regards,
Jean-Michel Pouré

Comments

grub3’s picture

You could also limit the number of Apache server or buy more RAM. But this is not a scenario for high load servers, with thousands of online users. We should manage to run every Druap query below 1 milisecond, with long queries at 5 milliseconds. And this is enough to work well without cache. PHP cache eats a lot of memory and you are never sure that pages are real.

grub3’s picture

By the way, the attackers only needed to call different pages to flood Drupal. Do built-in Drupal protection against flood was not working. Most of the time, 500 people+ are connected to my side. So this kind of issue happens on high-load servers.

Now I am going to give you explainations in SQL :

damien tournoud’s picture

Priority: Critical » Normal
Status: Active » Postponed (maintainer needs more info)

How could that be a critical *Drupal* bug? Looks more like a failure of the administrator of the server, and there are so much information missing.

Where to start: Which queries are slow? Which indexes are missing?

What about limiting the number of apache requests (that's really 101 of server administration because your server resources, even quite high, are necessarily limited!)?

grub3’s picture

Priority: Normal » Critical
Status: Postponed (maintainer needs more info) » Active

Example :

2008-06-30 13:15:36 CEST LOG: instruction : SELECT s.lid, t.translation, s.version FROM locales_source s LEFT JOIN locales_target t ON s.lid = t.lid AND t.language = 'fr' WHERE s.source = 'A blog entry is a single post to an online journal, or blog.' AND s.textgroup = 'default'
2008-06-30 13:15:36 CEST LOG: durée : 2.213 ms

This is a very long duration for such a query.

Let's analyse the query:

EXPLAIN ANALYSE
SELECT s.lid, t.translation, s.version
FROM locales_source s
LEFT JOIN locales_target t ON s.lid = t.lid AND t.language = 'fr'
WHERE s.source = 'A blog entry is a single post to an online journal, or blog.' AND s.textgroup = 'default'

Nested Loop Left Join (cost=0.00..158.92 rows=1 width=68) (actual time=0.838..1.135 rows=1 loops=1)
-> Seq Scan on locales_source s (cost=0.00..150.64 rows=1 width=8) (actual time=0.818..1.114 rows=1 loops=1)
Filter: ((source = 'A blog entry is a single post to an online journal, or blog.'::text) AND ((textgroup)::text = 'default'::text))
-> Index Scan using locales_target_lid_idx on locales_target t (cost=0.00..8.27 rows=1 width=64) (actual time=0.018..0.019 rows=1 loops=1)
Index Cond: (s.lid = t.lid)
Filter: ((t.language)::text = 'fr'::text)
Total runtime: 1.191 ms
(7 lignes)

This query includes a sequential scan: the exact problem which leads to high SWAP and slow servers, because it needs to run on disc.

Looking at schema, it appens that there is no index on target language.
Also, source.lid is not indexed AND IT SHOULD as it used in a JOIN.

So let's add the index:

CREATE INDEX locales_target_language_idx
ON locales_target (language);

CREATE INDEX local_source_lid_idx
ON locales_source (lid);

Now analyse the query again:

EXPLAIN ANALYSE
SELECT s.lid, t.translation, s.version
FROM locales_source s
LEFT JOIN locales_target t ON s.lid = t.lid AND t.language = 'fr'
WHERE s.source = 'A blog entry is a single post to an online journal, or blog.' AND s.textgroup = 'default'
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..16.55 rows=1 width=68) (actual time=0.098..0.100 rows=1 loops=1)
-> Index Scan using local_source_lid_idx on locales_source s (cost=0.00..8.27 rows=1 width=8) (actual time=0.088..0.088 rows=1 loops=1)
Index Cond: (source = 'A blog entry is a single post to an online journal, or blog.'::text)
Filter: ((textgroup)::text = 'default'::text)
-> Index Scan using locales_target_lid_idx on locales_target t (cost=0.00..8.27 rows=1 width=64) (actual time=0.007..0.008 rows=1 loops=1)
Index Cond: (s.lid = t.lid)
Filter: ((t.language)::text = 'fr'::text)
Total runtime: 0.129 ms
(8 lignes)

Oh man, this is 10 times faster.

Okay, Drupal needs a complete analysis of all queries.
You cannot write a fast tool if you keep on using MySQL, which can only be compared to ... access for Windows.

Again, I garantee that Drupal can be 10 times faster (at least) using PostgreSQL with correct indexing.

damien tournoud’s picture

This issue is becoming funnier by the minute.

I can guarantee you that Drupal queries are optimized very carefully and that indexes are in place where there are needed.

I will not comment on your opinion about the relative performance of MySQL and PostgreSQL, especially when the discussion is on such a trivial JOIN between {locales_source} and {locales_target} (which is nothing more than: a filter on the indexed locales_source.source column, followed by a JOIN on the indexed locales_target.lid column and a linear filter between a few results on locales_target.language).

I can assure you that there is an index on {locales_source}.source. If it's not there on your installation, this is either (1) because something got wrong on your particular infrastructure or (2) because there is a bug in the Drupal PostgreSQL layer.

The Drupal PostgreSQL layer does require more testing, and you are very welcome to join in the testing effort. A good starting point is the PostgreSQL group on groups.drupal.org.

But please, quit the patronizing tone. If it only affect your little corner of the world, that can't be a critical bug.

grub3’s picture

Priority: Critical » Normal
Status: Active » Postponed (maintainer needs more info)

Sorry, this was :

CREATE INDEX local_source_lid_idx
ON locales_source
USING btree
(lid);

Total runtime is driven from 1.191 ms to 0.129 ms.

damien tournoud’s picture

CREATE INDEX local_source_lid_idx
ON locales_source
USING btree
(lid);

There I don't understand: lid is a primary key for {locales_source}. How could that column need an index?

grub3’s picture

You are right, only CREATE INDEX locales_target_language_idx
ON locales_target (language);
is needed.

damien tournoud’s picture

Priority: Normal » Critical
Status: Postponed (maintainer needs more info) » Active

This one too is using sequential scans: SELECT data, created, headers, expire, serialized FROM cache WHERE cid = 'locale:fr';

Exactly the same: cid is the primary key for the table, it should not require any other index.

grub3’s picture

grub3’s picture

Priority: Critical » Normal
Status: Active » Postponed (maintainer needs more info)

You are right.
I will get back.

damien tournoud’s picture

@jmpoure#8: the primary key of {locales_target} is (language, lid, plural), so the condition lid = xx AND language = xx can use the index of the primary key.

It looks like for some reason your installation lost all (or most of) its indexes.

grub3’s picture

drupal=# explain analyse
drupal-# SELECT dst FROM url_alias WHERE src = 'shoutbox/js/view' AND language IN('fr', '') ORDER BY language DESC;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=7.79..7.79 rows=1 width=52) (actual time=0.141..0.141 rows=0 loops=1)
Sort Key: language
Sort Method: quicksort Memory: 25kB
-> Seq Scan on url_alias (cost=0.00..7.78 rows=1 width=52) (actual time=0.114..0.114 rows=0 loops=1)
Filter: (((language)::text = ANY (('{fr,""}'::character varying[])::text[])) AND ((src)::text = 'shoutbox/js/view'::text))
Total runtime: 0.161 ms
(6 lignes)

We should:

CREATE INDEX url_alias_language_idx
ON url_alias
USING btree
(language);

Which is 40% faster:

drupal=# explain analyse
SELECT dst FROM url_alias WHERE src = 'shoutbox/js/view' AND language IN('fr', '') ORDER BY language DESC;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=7.79..7.79 rows=1 width=52) (actual time=0.102..0.102 rows=0 loops=1)
Sort Key: language
Sort Method: quicksort Memory: 25kB
-> Seq Scan on url_alias (cost=0.00..7.78 rows=1 width=52) (actual time=0.096..0.096 rows=0 loops=1)
Filter: (((language)::text = ANY (('{fr,""}'::character varying[])::text[])) AND ((src)::text = 'shoutbox/js/view'::text))
Total runtime: 0.123 ms
(6 lignes)

Sort is done in memory, not disc.
Do you agree? I see plenty of these things and would like to report.

moshe weitzman’s picture

You might have good information to share, but your bug report is awful. Please be specific about what queries are slow, or what indexes are needed.

damien tournoud’s picture

@jmpoure#13: needed indexes are already there on url_alias. I see no difference between the two EXPLAIN ANALYSE, except that your system was loaded differently between the two tests :)

grub3’s picture

You are right. The last index was not needed as PostgreSQL will sort in memory anyway.
I can only state that under PostgreSQL some indexes are not declared whereas they are under MySQL.

Sorry for this bad bug report.
I will try to find long running queries and report back.

I still suspect that some queries run slowly.

grub3’s picture

drupal=# explain analyse
drupal-# SELECT s.lid, t.translation, s.version
drupal-# FROM locales_source s
drupal-# LEFT JOIN locales_target t ON s.lid = t.lid AND t.language = 'fr'
drupal-# WHERE s.source = 'Login or register to post shouts'
drupal-# AND s.textgroup = 'default';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..158.92 rows=1 width=68) (actual time=0.671..1.123 rows=1 loops=1)
-> Seq Scan on locales_source s (cost=0.00..150.64 rows=1 width=8) (actual time=0.647..1.098 rows=1 loops=1)
Filter: ((source = 'Login or register to post shouts'::text) AND ((textgroup)::text = 'default'::text))
-> Index Scan using locales_target_lid_idx on locales_target t (cost=0.00..8.27 rows=1 width=64) (actual time=0.023..0.023 rows=0 loops=1)
Index Cond: (s.lid = t.lid)
Filter: ((t.language)::text = 'fr'::text)
Total runtime: 1.157 ms
(7 lignes)

Missing index on textgroup:

CREATE INDEX locales_source_textgroup_idx
ON locales_source
USING btree
(textgroup);

PostgreSQL only indexes the first 255 character, no need to index only 30:
CREATE INDEX locales_source_source2_idx
ON locales_source
USING btree
(source);

Result:
drupal=# explain analyse
SELECT s.lid, t.translation, s.version
FROM locales_source s
LEFT JOIN locales_target t ON s.lid = t.lid AND t.language = 'fr'
WHERE s.source = 'Login or register to post shouts'
AND s.textgroup = 'default';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..16.55 rows=1 width=68) (actual time=0.108..0.109 rows=1 loops=1)
-> Index Scan using locales_source_source2_idx on locales_source s (cost=0.00..8.27 rows=1 width=8) (actual time=0.100..0.101 rows=1 loops=1)
Index Cond: (source = 'Login or register to post shouts'::text)
Filter: ((textgroup)::text = 'default'::text)
-> Index Scan using locales_target_lid_idx on locales_target t (cost=0.00..8.27 rows=1 width=64) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: (s.lid = t.lid)
Filter: ((t.language)::text = 'fr'::text)
Total runtime: 0.142 ms
(8 lignes)

10 times faster.

I wonder why only index the first 30 characters.
Again, this may be a special PostgreSQL issue.

Kind regards, JMP

grub3’s picture

drupal=# explain analyse
drupal-# SELECT * FROM menu_router
drupal-# WHERE tab_root = 'node/%'
drupal-# ORDER BY weight, title;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Sort (cost=44.74..44.75 rows=7 width=338) (actual time=0.182..0.183 rows=7 loops=1)
Sort Key: weight, title
Sort Method: quicksort Memory: 26kB
-> Seq Scan on menu_router (cost=0.00..44.64 rows=7 width=338) (actual time=0.043..0.166 rows=7 loops=1)
Filter: ((tab_root)::text = 'node/%'::text)
Total runtime: 0.212 ms
(6 lignes)

Let's create an index:
CREATE INDEX menu_router_tab_root_idx
ON menu_router
USING btree
(tab_root);

drupal=# explain analyse
SELECT * FROM menu_router
WHERE tab_root = 'node/%'
ORDER BY weight, title;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=23.71..23.72 rows=7 width=338) (actual time=0.063..0.066 rows=7 loops=1)
Sort Key: weight, title
Sort Method: quicksort Memory: 26kB
-> Bitmap Heap Scan on menu_router (cost=4.30..23.61 rows=7 width=338) (actual time=0.043..0.047 rows=7 loops=1)
Recheck Cond: ((tab_root)::text = 'node/%'::text)
-> Bitmap Index Scan on menu_router_tab_root_idx (cost=0.00..4.30 rows=7 width=0) (actual time=0.037..0.037 rows=7 loops=1)
Index Cond: ((tab_root)::text = 'node/%'::text)
Total runtime: 0.101 ms
(8 lignes)

Twice as fast. That's a good catch. Maybe it is that the index is not set in PostgreSQL.

grub3’s picture

drupal=# explain analyse
drupal-# SELECT u.*, s.*
drupal-# FROM users u
drupal-# INNER JOIN sessions s
drupal-# ON u.uid = s.uid
drupal-# WHERE s.sid = '82daa413ca5d677e920b5374bd1exxxx'
drupal-# ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..1298.30 rows=1 width=530) (actual time=5.255..5.370 rows=1 loops=1)
Join Filter: (u.uid = (s.uid)::integer)
-> Seq Scan on sessions s (cost=0.00..1295.62 rows=1 width=273) (actual time=5.231..5.339 rows=1 loops=1)
Filter: ((sid)::text = '82daa413ca5d677e920b5374bd1exxxx'::text)
-> Seq Scan on users u (cost=0.00..2.30 rows=30 width=257) (actual time=0.003..0.009 rows=30 loops=1)
Total runtime: 5.413 ms
(6 lignes)

An index is missing on my system on :
CREATE INDEX sessions_sid_idx
ON sessions
USING btree
(sid);

drupal=# explain analyse
SELECT u.*, s.*
FROM users u
INNER JOIN sessions s
ON u.uid = s.uid
WHERE s.sid = '82daa413ca5d677e920b5374bd1exxxx'
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=8.29..10.71 rows=1 width=530) (actual time=0.110..0.116 rows=1 loops=1)
Hash Cond: (u.uid = (s.uid)::integer)
-> Seq Scan on users u (cost=0.00..2.30 rows=30 width=257) (actual time=0.006..0.013 rows=30 loops=1)
-> Hash (cost=8.27..8.27 rows=1 width=273) (actual time=0.073..0.073 rows=1 loops=1)
-> Index Scan using sessions_sid_idx on sessions s (cost=0.00..8.27 rows=1 width=273) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: ((sid)::text = '82daa413ca5d677e920b5374bd1exxxx'::text)
Total runtime: 0.157 ms
(7 lignes)

30 times faster, good catch if it is not a PostgreSQL problem. Please confirm this index is missing.

Crell’s picture

@jmpoure: Thank you for taking an interest in database performance. It is definitely an area where Drupal needs more fine tuning.

That said, the best way to address it is to file individual bite-sized issues for each query/index you're trying to optimize. Be sure to include a proper patch for it (including the update hook as appropriate), along with an explanation of each. We'll probably need to apply it to Drupal 7 first and then backport, as that's the usual routine.

grub3’s picture

Thanks, I will follow the usal way.

grub3’s picture

Another very interesting on, for fun:

drupal=# EXPLAIN ANALYSE
drupal-# SELECT iid FROM aggregator_item
drupal-# WHERE fid = 11 AND guid = 'http://www.liberation.fr/actualite/sports/335893.FR.php?rss=true&xtor=RS...'
drupal-# ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on aggregator_item (cost=28.01..918.47 rows=1 width=4) (actual time=1.917..1.968 rows=1 loops=1)
Recheck Cond: (fid = 11)
Filter: ((guid)::text = 'http://www.liberation.fr/actualite/sports/335893.FR.php?rss=true&xtor=RS...'::text)
-> Bitmap Index Scan on aggregator_item_fid_idx (cost=0.00..28.01 rows=1035 width=0) (actual time=0.222..0.222 rows=1036 loops=1)
Index Cond: (fid = 11)
Total runtime: 2.012 ms
(6 lignes)

Add an index:
CREATE INDEX aggregator_item_guid_idx
ON aggregator_item
USING btree
(guid);

drupal=# EXPLAIN ANALYSE
SELECT iid FROM aggregator_item
WHERE fid = 11 AND guid = 'http://www.liberation.fr/actualite/sports/335893.FR.php?rss=true&xtor=RS...'
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on aggregator_item (cost=4.27..12.00 rows=1 width=4) (actual time=0.093..0.094 rows=1 loops=1)
Recheck Cond: ((guid)::text = 'http://www.liberation.fr/actualite/sports/335893.FR.php?rss=true&xtor=RS...'::text)
Filter: (fid = 11)
-> Bitmap Index Scan on aggregator_item_guid_idx (cost=0.00..4.27 rows=2 width=0) (actual time=0.084..0.084 rows=1 loops=1)
Index Cond: ((guid)::text = 'http://www.liberation.fr/actualite/sports/335893.FR.php?rss=true&xtor=RS...'::text)
Total runtime: 0.113 ms
(6 lignes)

20 times faster.
I will submit patches.
This way I will check if this a PostgreSQL schema issue or not.
I can only advice people to use PostgreSQL + pgAdmin graphical interface to monitor and debug queries.

Anonymous’s picture

Status: Postponed (maintainer needs more info) » Fixed

Based on #21 and #22 I'm marking this fixed since other issues will be opened to provide patches.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.