On my 500.000 post forum, I read this query on PostgreSQL logs:

2009-08-25 14:28:30 CEST LOG: durée : 720.124 ms, instruction : SELECT n.nid, r.tid, n.title, n.type, n.sticky, u.name, u.uid, n.created AS timestamp, n.comment AS comment_mode, l.last_comment_timestamp, IF(l.last_comment_uid != 0, cu.name, l.last_comment_name) AS last_comment_name, l.last_comment_uid, l.comment_count AS num_comments, f.tid AS forum_tid FROM node_comment_statistics l INNER JOIN (SELECT DISTINCT ON (nid) * FROM node) n ON n.nid = l.nid INNER JOIN users cu ON l.last_comment_uid = cu.uid INNER JOIN term_node r ON n.vid = r.vid INNER JOIN users u ON n.uid = u.uid INNER JOIN forum f ON n.vid = f.vid WHERE n.status = 1 AND r.tid = 9 ORDER BY n.sticky DESC, l.last_comment_timestamp DESC, n.created DESC LIMIT 50 OFFSET 0

The problem comes from these lines in forum.module:

$sql = db_rewrite_sql("SELECT n.nid, r.tid, n.title, n.type, n.sticky, u.name, u.uid, n.created AS timestamp, n.comment AS comment_mode, l.last_comment_timestamp, IF(l.last_comment_uid != 0, cu.name, l.last_comment_name) AS last_comment_name, l.last_comment_uid, l.comment_count AS num_comments, f.tid AS forum_tid FROM {node_comment_statistics} l INNER JOIN {node} n ON n.nid = l.nid INNER JOIN {users} cu ON l.last_comment_uid = cu.uid INNER JOIN {term_node} r ON n.vid = r.vid INNER JOIN {users} u ON n.uid = u.uid INNER JOIN {forum} f ON n.vid = f.vid WHERE n.status = 1 AND r.tid = %d");

Again, the problem comes from :
(SELECT DISTINCT ON (nid) * FROM node)

There seems to be design flows in db_rewrite_sql

You should NEVER ***EVER*** begin each transaction containing node table with a complete sequential index on distinct.

I am not proposing a patch and I am asking in which direction to go:
* either fix db_rewrite_sql
* OR remove db_rewrite_sql in this precise query

CommentFileSizeAuthor
#34 left-join-fix.diff1.83 KBgrub3

Comments

damien tournoud’s picture

Status: Active » Closed (won't fix)
grub3’s picture

Status: Closed (won't fix) » Needs review

Dear Damien, please reopen.
This grave issue needs review from the community.

This is such an important issue at Drupal that it could double the speed of the whole framework.
So let us leave this issue open until core developers fix this either locally or in the database abstraction layer.

damien tournoud’s picture

Status: Needs review » Postponed (maintainer needs more info)

There is probably something broken with your installation. There is no reason for db_rewrite_sql() do to anything if you don't have access control enabled, and I see no {node_access} table joined in the query you copied.

Could you please dump the values of $join, $where, $distinct inside db_rewrite_sql() for this particular query on your installation?

grub3’s picture

This is line 574 from forum.module

$sql = db_rewrite_sql("SELECT n.nid, r.tid, n.title, n.type, n.sticky, u.name, u.uid, n.created AS timestamp, n.comment AS comment_mode, l.last_comment_timestamp, IF(l.last_comment_uid != 0, cu.name, l.last_comment_name) AS last_comment_name, l.last_comment_uid, l.comment_count AS num_comments, f.tid AS forum_tid FROM {node_comment_statistics} l INNER JOIN {node} n ON n.nid = l.nid INNER JOIN {users} cu ON l.last_comment_uid = cu.uid INNER JOIN {term_node} r ON n.vid = r.vid INNER JOIN {users} u ON n.uid = u.uid INNER JOIN {forum} f ON n.vid = f.vid WHERE n.status = 1 AND r.tid = %d");

I don't understand where to look further.
Thanks for your help, I wait further.

Damien, I upgrade to Drupal-dev meanwhile to make sure we use the same code base.

damien tournoud’s picture

@jmpoure: I know where it is, please dump the runtime values of $join, $where and $distinct in your actual Drupal site.

grub3’s picture

Okay Damien, I install dev module or display these values by hand.

grub3’s picture

Stay tuned, I am learning the devel module.

damien tournoud’s picture

In db_rewrite_sql(), just after the call to _db_rewrite_sql().

grub3’s picture

Yes, I read _db_rewrite_sql() and I think it is a flow in design to let a computer run DISTINCT automaticaly when most of the time the query produces distinct results. I installed dev module, I output the information you nee. This is a cool module.

grub3’s picture

Look for example, this is my forum:

2380.57
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 24 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
1100.26
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 27 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
639.03
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 50 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

These queries are natural distinct queries, no need to force them distinct:

SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 24 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
executes in 62 ms not 2380.57

Wait, I give you the needed information.

damien tournoud’s picture

Something is wrong on your system, and forces a distinct in those queries. This is clearly not by design.

damien tournoud’s picture

Oh, I see what could happen. Please visit your site using an user that doesn't have the 'administer node' permission, and please copy/paste the resulting queries here.

grub3’s picture

Okay. Wait.

grub3’s picture

Disconnecting and browsing under Anonymous account gives instant results.

On the converse using a simple user returns very long queries
I need to wait 10 seconds to display the forum list.

Executed 361 queries in 11516.66 milliseconds. Queries taking longer than 30 ms and queries executed more than once, are highlighted. Page execution time was 11681.42 ms.
ms
#
where
query
2339.48
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 AND tn.tid = 24 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
1075.42
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 AND tn.tid = 27 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
712.09
1
forum_get_forums
SELECT r.tid, COUNT(n.nid) AS topic_count, SUM(l.comment_count) AS comment_count FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN term_node r ON n.vid = r.vid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 )GROUP BY r.tid
625.64
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 AND tn.tid = 50 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
585.69
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 AND tn.tid = 17 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
203.41
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ( n.status = 1 AND tn.tid = 22 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
[...]

0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.24
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.23
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.23
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
0.2
113
acl_node_grants
SELECT acl_id FROM acl_user WHERE uid = 4284
Memory usage:
Memory used at: devel_init()=0.67 MB, devel_shutdown()=8.28 MB.
grub3’s picture

Dear Damien, I am upgrading to dev to make sure we are using the same code base.

grub3’s picture

After upgrading to Drupal-dev, browsing under anonymous returns long queries too:

Executed 300 queries in 10848.98 milliseconds. Queries taking longer than 30 ms and queries executed more than once, are highlighted. Page execution time was 11063.85 ms.
ms
#
where
query
2470.37
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM (SELECT DISTINCT ON (nid) * FROM node) n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ((n.type NOT IN ('project_project', 'project_issue'))) AND ( n.status = 1 AND tn.tid = 24 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
grub3’s picture

Okay, I am lost. Could you tell me exactly how to output the resquested debug information.

damien tournoud’s picture

Well, the DISTINCT is legit here, but it is not at the correct place. This query should be:

SELECT DISTINCT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ((n.type NOT IN ('project_project', 'project_issue'))) AND ( n.status = 1 AND tn.tid = 24 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1
grub3’s picture

Dear Damien, I remove the contributed access module and it works again:

Executed 331 queries in 2289.42 milliseconds. Queries taking longer than 30 ms and queries executed more than once, are highlighted. Page execution time was 2773.9 ms.
ms
#
where
query
401.46
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 3 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
389.69
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 2 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
372.71
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 4 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

Three seconds instead of 12.

grub3’s picture

But I still do not understand why results are not cached. I would save a tremendous time on normal middle-size forums. My cache is set to 3 minutes. It is enough. Large forums should use a kind of cache to display.

grub3’s picture

Well, the DISTINCT is legit here, but it is not at the correct place. This query should be:

SELECT DISTINCT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid INNER JOIN node_access na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ((n.type NOT IN ('project_project', 'project_issue'))) AND ( n.status = 1 AND tn.tid = 24 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1

{node] carries distinct values.
The following query starts processing on {node} and does only INNER JOINS.
Therefore there is no need to force DISTINCT values.

SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid 
FROM node n 
INNER JOIN users u1 ON n.uid = u1.uid 
INNER JOIN term_node tn ON n.vid = tn.vid 
INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid 
INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid 
INNER JOIN node_access na ON na.nid = n.nid 
WHERE (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all'))) AND ((n.type NOT IN ('project_project', 'project_issue'))) AND ( n.status = 1 AND tn.tid = 24 )ORDER BY ncs.last_comment_timestamp DESC LIMIT 1

So Drupal code is exact.

Again, when developers understand the logic of DISTINCT, there is no need to add it to SQL.
Adding a DISTINCT on large tables can destroy a query when not needed.

damien tournoud’s picture

@jmpoure: you don't understand how DISTINCT and JOINS work.

SELECT t1.id FROM table1 t1 INNER JOIN table2 t2 ON t2.id = t1.id

^ This query *will* return duplicate entries, even if id is a primary key for table1.

That's exactly what happens here, and why the DISTINCT is needed.

grub3’s picture

Okay, I understand:

SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 2 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

Is a complex query returning no result. It is trying to fetch the latest comment on the database container. Therefore it can only result in nothing.

In this case, the cache does not seem to record the result.
Thefore Drupal runs the query, which gets very long answers because of the huge database.

Right?

grub3’s picture

@jmpoure: you don't understand how DISTINCT and JOINS work.
SELECT t1.id FROM table1 t1 INNER JOIN table2 t2 ON t2.id = t1.id
^ This query *will* return duplicate entries, even if id is a primary key for table1.
That's exactly what happens here, and why the DISTINCT is needed.

Dear Damien.
I am afraid no.

In your case, t1.id is a primary key, thus unique.
In SQL92, primary key implies unique Index.

SELECT t1.id FROM table1 t1
produces a unique set of t1.id because it is unique by defaut in the database.

When people wrote SQL, they set primary keys and constraint to make sure that records were unique from the entry of the database.
The database make sure t1.id is unique on entry.

Then you agree that this is exactly the same as:
SELECT DISTINCT t1.id FROM table1 t1

Then the INNER JOINS produces unique output, by defaut, because it is only inclusive.

Thefore:


Is exactly the same as:
<code>SELECT DISTINCT t1.id 
FROM table1 t1 
INNER JOIN table2 t2 ON t2.id = t1.id

Do we agree?

Now in the upper query, last_comment_timestamp and last_comment_name are unique.
A single user cannot enter two comments on the same exact Unix time.

Thefore, the query is unique.
Note that you can force uniqueness setting a double unique index.

Setting unique indexes is the way to make sure tables produces distinct results.

grub3’s picture

Dear Damien, I tested setting a constraint on a 700.000 entries comment table and some uid,timestamp are not unique. So you are right for the query, but wrong in general about INNER JOINs. But I guess we agree. So you are right to say Drupal should run the DISTINCT clause. But why do we have duplicates?

damien tournoud’s picture

Thefore:

Is exactly the same as:

SELECT DISTINCT t1.id
FROM table1 t1
INNER JOIN table2 t2 ON t2.id = t1.id

Do we agree?

No. DISTINCT applies on the *result* of the query, after the query has been processed.

SELECT t1.id FROM table1 t1 INNER JOIN table2 t2 ON t2.id = t1.id

Can return several rows of the t2 table per row of the t1 table (if the relation is 1-N). It would in that case return duplicate values of t1.id. Adding a DISTINCT clause would remove those duplicate values.

grub3’s picture

SELECT COUNT (DISTINCT (uid, timestamp)) FROM comments
647021

SELECT COUNT (*) FROM comments
647224

grub3’s picture

SELECT t1.id 
FROM table1 t1 
INNER JOIN table2 t2 ON t2.id = t1.id

is the same as

SELECT DISTINCT t1.id 
FROM table1 t1 
INNER JOIN table2 t2 ON t2.id = t1.id

when t1.Id is unique or primary key

If the join is performed on a non unique index, then it may produce duplicates.
So I agree with you.

damien tournoud’s picture

INNER JOINS are 1-1 relationship.

No.

grub3’s picture

Okay. Do you know why some of my empty SQL results are not cached?
It seems like queries returning empty results are not cached. Right?

For example:

ms
#
where
query
401.46
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n INNER JOIN users u1 ON n.uid = u1.uid INNER JOIN term_node tn ON n.vid = tn.vid INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 3 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

grub3’s picture

This is quite bad, because without this Drupal would be lightning speed.
I will enquire more ... and get back.

Many thanks Damien for help and YES you were right.

The ACL module needs some fixing, urgently.

grub3’s picture

Dear Damien, I wrote this :

Devel should warn that SQL execution times are from cache
http://drupal.org/node/559158

This is especially with queries returning natural DISTINCT VALUES and that have a DISTINCT in the query.
It make very long queries and people don't notice it Devel module.

If you would like to comment, you are welcome.

grub3’s picture

Okay, I think I found the issue debugging the SQL query server side.

The query:

SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid
FROM node n
INNER JOIN users u1 ON n.uid = u1.uid
INNER JOIN term_node tn ON n.vid = tn.vid
INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid
INNER JOIN users u2 ON ncs.last_comment_uid=u2.uid
WHERE n.status = 1 AND tn.tid = 3
ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

400 ms
should be replaced with:

SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid
FROM node n
LEFT JOIN users u1 ON n.uid = u1.uid
LEFT JOIN term_node tn ON n.vid = tn.vid
LEFT JOIN node_comment_statistics ncs ON n.nid = ncs.nid
LEFT JOIN users u2 ON ncs.last_comment_uid=u2.uid
WHERE n.status = 1 AND tn.tid = 3
ORDER BY ncs.last_comment_timestamp DESC
LIMIT 1 OFFSET 0

63 ms.

When a query needs to kwno whether there is a record or not (LIMIT 1 OFFSET 0),
INNER JOINS should be replaced with LEFT JOINS.

Internaly, databases treat INNER JOINS like LEFT JOINS with a DISTINCT CLAUSE.
On large databases like mine this ends-up with a nested loop and DISTINCT sort.

This can be nightmare for a database system.

grub3’s picture

StatusFileSize
new1.83 KB

Bingo:

Executed 299 queries in 1144.19 milliseconds. Queries taking longer than 30 ms and queries executed more than once, are highlighted. Page execution time was 1284.1 ms.
ms
#
where
query
215.43
1
forum_get_forums
SELECT r.tid, COUNT(n.nid) AS topic_count, SUM(l.comment_count) AS comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN term_node r ON n.vid = r.vid WHERE n.status = 1 GROUP BY r.tid
141.1
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n LEFT JOIN users u1 ON n.uid = u1.uid LEFT JOIN term_node tn ON n.vid = tn.vid LEFT JOIN node_comment_statistics ncs ON n.nid = ncs.nid LEFT JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 8 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0
140.13
1
forum_get_forums
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid FROM node n LEFT JOIN users u1 ON n.uid = u1.uid LEFT JOIN term_node tn ON n.vid = tn.vid LEFT JOIN node_comment_statistics ncs ON n.nid = ncs.nid LEFT JOIN users u2 ON ncs.last_comment_uid=u2.uid WHERE n.status = 1 AND tn.tid = 9 ORDER BY ncs.last_comment_timestamp DESC LIMIT 1 OFFSET 0

Excution time divided by two for the forum code !!!
So funny.

Patch attached, please verify under MySQL using Devel module.
Under PostgreSQL, the result is quite impressive.
MySQL internals also treat INNER JOINS like LEFT JOINS with a distinct clause, so this should result in the same gain.

I am going to open a dedicated issue thread for database and start hunting for LIMIT 1 OFFSET 0 queries.
Your guys should leave MySQL, which is not a very serious database and start using PostgreSQL exclusively for devel.

grub3’s picture

Status: Postponed (maintainer needs more info) » Needs review

Patch needs review on a large forum under MySQL.
Otherwize if you know database programming apply safely.

josh waihi’s picture

Status: Needs review » Needs work

Your guys should leave MySQL, which is not a very serious database and start using PostgreSQL exclusively for devel.

You're talking to PostgreSQL people ;) But keep in mind that Drupal is a application that prides itself in being flexible enough to run on several different OS and databases. Therefore, testing and debugging is needed by all environments, only the ones that get tested stick around. So thanks for sticking with PostgreSQL!

I can't really say much on this subject as I don't enjoy the forum section of Drupal - I'm sure there is a lot of room to improve it in both functionality and performance. In regards to this patch, I really don't think Drupal 6 could actually commit it because its changing SQL in a big way - you can configure PostgreSQL however to optimize how it scans, DBAs do this quite often as the size of there applications grow. You can use EXPLAIN and EXPLAIN ANALYZE at the beginning of queries to find the bottle necks and tweak PostgreSQL accordingly to see if it helps at all.

nnewton’s picture

While the relationship between LEFT and INNER join for performance is true and should be considered, its not as clear cut as presented here. A "DISTINCT", although thats not quite what is going on at least for MySQL is not always a performance killer if its well indexed. To that point, an INNER JOIN gives the optimizer more indexes to choose from in the JOIN order than does a LEFT JOIN. This can be a much larger performance "win" than the LEFT joins lack of DISTINCT. Also, if your not careful just blindly replacing INNER JOINs with LEFT JOINs can kill performance depending on the join order and the query, as there are added padded nulls going on here. That can work to your favour, as in your case above, or not as the case maybe.

"Your guys should leave MySQL, which is not a very serious database and start using PostgreSQL exclusively for devel."

Also, please leave this sort of thing at the door.

grub3’s picture

I fixed the explanation about LEFT JOINs vs INNER JOINS here:
Replace INNER JOINS with LEFT JOINS whenever possible

INNER JOINs are processed as LEFT JOINs and then the parser looks for NULL values to remove them.

On large tables, an INNER JOIN has more chances to result in a sequential scan, either in memory or on disc. When SORT BY is used, it may even be worse.

On the converse, a LEFT JOIN has more chances to be optimized by the parser, as the database will not bother for removing NULLs thus will be able to access data directly. This leads to performance result ranging from 1 to 5 or more.

This is true for PostgreSQL and MySQL and many other databases.
LEFT JOINs are known to be faster than INNER JOINs in all databases systems.

Sorry, there is no link with DISTINCT.
I wrote that texte during the night and was quite tired.

I hope that you have enough information now.

Please apply the patch safely.

damien tournoud’s picture

Status: Needs work » Postponed (maintainer needs more info)

#38 makes very little sense.

Could you please run an EXPLAIN on those two queries and report the result? It's probably that the query planner is making the wrong choice here, for some reason. Let's try to understand it, instead of jumping to far-fetched conclusions.

nnewton’s picture

At least for mysql I believe you are confused as to how it executes these joins. The looping algorithm shouldn't have to discard nulls really, the block looping algorithm...possibly, but I don't really think its likely to impact much. The queries in #33 execute in .20 seconds and .21 seconds respectively on one of my servers, with the inner join winning. That was on a database of 500,000 nodes.

grub3’s picture

Thanks I am in contact with PostgreSQL hackers and we enquire on that query.

It seems that LEFT JOINs in MySQL do not execute as fast they could.
According to PostgreSQL developers, this is a know issue.

In PostgreSQL, I can execute this query in 63ms. Maybe drive it further down.
v.s. 200ms under MySQL (see previous post)

So the best is to let us enquire and I will report back later on.

My plan:

EXPLAIN ANALYSE
SELECT ncs.last_comment_timestamp, IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name, ncs.last_comment_uid
FROM node n
LEFT JOIN users u1 ON n.uid = u1.uid
LEFT JOIN term_node tn ON n.vid = tn.vid
LEFT JOIN node_comment_statistics ncs ON n.nid = ncs.nid
LEFT JOIN users u2 ON ncs.last_comment_uid=u2.uid
WHERE n.status = 1 AND tn.tid = 3
ORDER BY ncs.last_comment_timestamp DESC
LIMIT 1 OFFSET 0 
"Limit  (cost=1156.57..1156.57 rows=1 width=17) (actual time=0.020..0.020 rows=0 loops=1)"
"  ->  Sort  (cost=1156.57..1156.85 rows=112 width=17) (actual time=0.019..0.019 rows=0 loops=1)"
"        Sort Key: ncs.last_comment_timestamp"
"        Sort Method:  quicksort  Memory: 25kB"
"        ->  Nested Loop Left Join  (cost=5.13..1156.01 rows=112 width=17) (actual time=0.014..0.014 rows=0 loops=1)"
"              ->  Nested Loop Left Join  (cost=5.13..1123.24 rows=112 width=21) (actual time=0.014..0.014 rows=0 loops=1)"
"                    ->  Nested Loop Left Join  (cost=5.13..1090.76 rows=112 width=13) (actual time=0.014..0.014 rows=0 loops=1)"
"                          ->  Nested Loop  (cost=5.13..1054.66 rows=112 width=8) (actual time=0.014..0.014 rows=0 loops=1)"
"                                ->  Bitmap Heap Scan on term_node tn  (cost=5.13..246.19 rows=112 width=4) (actual time=0.013..0.013 rows=0 loops=1)"
"                                      Recheck Cond: ((tid)::integer = 3)"
"                                      ->  Bitmap Index Scan on term_node_tid_idx  (cost=0.00..5.10 rows=112 width=0) (actual time=0.012..0.012 rows=0 loops=1)"
"                                            Index Cond: ((tid)::integer = 3)"
"                                ->  Index Scan using node_vid_idx on node n  (cost=0.00..7.21 rows=1 width=12) (never executed)"
"                                      Index Cond: ((n.vid)::integer = (tn.vid)::integer)"
"                                      Filter: (n.status = 1)"
"                          ->  Index Scan using node_comment_statistics_pkey on node_comment_statistics ncs  (cost=0.00..0.31 rows=1 width=13) (never executed)"
"                                Index Cond: (n.nid = (ncs.nid)::integer)"
"                    ->  Index Scan using users_pkey on users u2  (cost=0.00..0.28 rows=1 width=12) (never executed)"
"                          Index Cond: (ncs.last_comment_uid = u2.uid)"
"              ->  Index Scan using users_pkey on users u1  (cost=0.00..0.28 rows=1 width=4) (never executed)"
"                    Index Cond: (n.uid = u1.uid)"
"Total runtime: 0.089 ms"

As you can notice, a cast happens which forces nested loop.
If we can avoid this cast, the query may execute in 20 ms or less.
We are enquiring on that issue.

I will get back to you.
This is extremely interesting to be able to execute queries on large installs to analyse plans and results.

Kind regards,
Jean-Michel

grub3’s picture

Status: Postponed (maintainer needs more info) » Closed (won't fix)

I closed this issue.

The planner does not have enough information to find the right query plan.
Eventually, the query planner may even rewrite the query itself.

My database is running on a 650.000 nodes install.
But there are no real connections from the internet.

I will only be able to review this issue when going live.

Reference: http://drupal.org/node/562180