How to write efficient table JOINs

This page is based on an e-mail posted by Craig Courtney on 6/21/2003 to the drupal-devel mailing list.

There are 3 kinds of joins: INNER, LEFT OUTER, and RIGHT OUTER. Each requires an ON clause to let the RDBMS know what fields to use when joining the tables. For each join there are two tables: the LEFT table and the RIGHT table. The syntax is as follows:

{left table} (INNER | LEFT OUTER | RIGHT OUTER) JOIN {right table} ON (join criteria)

An INNER JOIN returns only those rows from the LEFT table having a matching row in the RIGHT table based on the join criteria.

A LEFT OUTER JOIN returns all rows from the LEFT table even if no matching rows where found in the RIGHT table. Any values selected out of the RIGHT table will be NULL for those rows where no matching row is found in the RIGHT table.

A RIGHT OUTER JOIN works exactly the same as a LEFT OUTER JOIN but reversing the direction. So it would return all rows in the RIGHT table regardless of matching rows in the LEFT table.

It is recommended that you not use RIGHT OUTER JOIN since a query can always be rewritten to use LEFT OUTER JOIN which tends to be more portable and easier to read.

If there are multiple rows in one table that match one row in the other table, the join will return that same row many times.

For example:

Table A: tid, name; 1, 'Linux'; 2, 'Debian'

Table B: fid, tid, message; 1, 1, 'Very Cool'; 2, 1, 'What an example'

Query 1: SELECT a.name, b.message FROM a INNER JOIN b ON a.tid = b.tid

Result 1:: Linux, Very Cool; Linux, What an example

Query 2: SELECT a.name, b.message FROM a LEFT JOIN b ON a.tid = b.tid

Result 2:: Linux, Very Cool; Linux, What an example; Debian, NULL

Visual Examples
codinghorror.com example

khankennels.com example

Hope that helps in reading some of the queries.

Comments

More efficiency

jaxxed commented 1 June 2011 at 13:58

This topic could use more discussion about efficiency, as per it's title.

Here is a first draft on

jaxxed commented 17 September 2012 at 08:04

Here is a first draft on comments of DB efficiency. Please comment:

Un-mentioned JOIN TYPE:

CROSS JOIN: a join formed by linking every item form the left with every item from the right. This is very rarely used, and therefore not common knowledge. Note that I am not sure if all DB backends support it.

Efficient Operations:

FILTERING AS ON vs WHERE

Join using ON options are supposedly more efficient that joins with conditions as WHERE parameters. It is possible to replicate ON functionality via WHERE implementation, but it is much slower.
Imagine a DB server trying to join two tables. If the join condition is specified as a WHERE condition, then the DB may CROSS JOIN (unoin) all results into memory, and then remove the combinations that don't pass the filter. An ON condition can be processed during that first CROSS JOIN. This results in smaller memory use, and fewer passes over the dataset.
That said, it is not necessary to move filters from the WHERE to ON, as most DBs will optimize WHERE conditions into the operations (perhaps even the WHERE join arguments are optimized for you.) The reason seems to be that in the DB server query optimization process, the server doesn't get to decide the priority of ON operations, and has to perform them when during the JOIN, unlike WHERE filters. If a JOIN ON filter is not related to both tables, and doesn't reduce the table data set, then it might not make sense to perform it as an ON filter.

BAD: SELECT * FROM node AS n LEFT JOIN user AS u WHERE n.uid=u.uid AND u.status=1
BETTER: SELECT * FROM node AS n LEFT JOIN user AS u ON n.uid=u.uid WHERE u.status=1
NOT NECESSARILY BETTER: SELECT * FROM node AS n LEFT JOIN user AS u ON n.uid=u.uid AND u.status=1

Boolean and Integer opertions

For obvious memory and processor reasons, integers and booleans go much faster than doubleand string (VARCHAR) values, an math/string expressions.
If you can optimize a query with an expensive operation, by adding a cheap operation - do it; it may seem like more work for the server, but it is actually saving work if the cheap process is performed first.

The Least Efficient Operations:

SUBSELECTS

Subselects, while potentially usefull, are the biggest potential for long queries that I've seen. The slow sub-select queries occur when the sub-select operates using an argument from the super-select. Filtering sub-selects on super-select arguments means that the sub-select is repeated for every row/tuple in the superset -- bad bad bad.
There are some cases where you can't avoid this (arguable,) but if you do this on front end operations then you are a bad person.
In most cases, sub-selects can be rewritten as JOINS, even UPDATES (some DBs don't allow JOINS in UPDATES :( .)

Here is a SUBSELECT example: SELECT * FROM node AS n WHERE uid IN ( SELECT uid FROM user AS u WHERE n.created>u.login )
Here is the JOIN equicalent: SELECT * FROM node AS n LEFT JOIN users AS u ON n.created>u.login

Note that after reading my examples, I noticed that they don't really make sense as Drupal operations, althought they do make the point.

EXPRESSIONS AS JOINS/FILTERS ??

It used to be that JOINING on expressions such as mathematic or string operations would seriously affect perfomance. In particular using such an operation as a JOIN ON argument, and then filtering by the same argument would seriously slow things down - but over the last 5-10 years things changed.
Most DBs do a very good job optimizing queries, and when they notice repeated operations, the operations are not repeated, but kept between operations.
An example of a serious mathematical operation is a Geolocation distance operation. Such an operation has numerous trigonometric operations that are necessary for curved surfaces in both dimensions, and a zero center in the middle. It is ugly and hard to read, but nececssary for accurate distance calculations (in the real-world doing a pretend flat projection and square box calculation is much simpler and easier - hence those really poor classical maps that make the North Pole point into 1000s of kms long.) This operation performed on large sets of location data is actually quite fast! One reason is that many of the trigonometric operations are repeated, and so the values are kept, and not performed redundantly.

HERE IS AN EXAMPLE OF A GEOGRAPHICAL DISTANCE CALCULATION (in miles)(non-Drupal):
Avg(3958.75 * ACos(Sin(@Latitude/57.2958) * Sin(City.Latitude/57.2958) + Cos(@Latitude/57.2958) * Cos(City.Latitude/57.2958) *
Cos(City.Longitude/57.2958 - @Longitude/57.2958)))
http://www.viawindowslive.com/Articles/VirtualEarth/Freereversegeocoding.aspx

Despite the impressive efficiency of modern DBs in such operations, one should try to minimize them.

An effective way of reducing cost for this geographic calculation is to first filter on other paramters such as Country, Province, City (plural) etc, which the server can use to reduce the set, before calculating distance.

SQL SOURCES

I only mention this because I've heard of many cases where people use Drupal Views to create SQL, and then use that SQL in other places. Please note that in general, views produces rather inefficient and poorly aliased SQL code. Cases to avoid are where Groupings are used.
For example, in one case, I had a views feed produce an http timeount, and cause major MySQL problems (60 second queries that used full CPU,) where a full data substitute managed to run in 10-15 seconds with much less (but still significant) CPU use.
The Drupal SQL system (especially in DRP7) is not hard to learn.
This doesn't mean that you should avoid using Views - I use it in every project - but pay attention to the queries, and diverge away from views if needed.
I haven't had much opportunity to stress test the Drp7 fields implementation.

[edit: 1 - emphasis and tried to take it easier on views.]

what about full outer join?

xorcus commented 25 June 2012 at 12:51

I wouldn't fully agree with the statement "There are 3 kinds of joins: INNER, LEFT OUTER, and RIGHT OUTER."
There is also the full outer join returning all rows from both tables (left and right), filling in nulls on the other side, when there's no matching row