PostgreSQL requires that, when a GROUP BY clause is present, all fields must either be in aggregate functions or in the GROUP BY clause. This means that when joining a table that has multiple entries for each row in the base table, it is currently not possible to return only one result for each row of the base table.

This can easily be fixed by adding an aggregate function to PostgreSQL that returns the first item of a field in a grouped row. MySQL is not as strict, so an aggregate function is not necessary for MySQL. Patch to follow.

Comments

darren oh’s picture

Status: Active » Needs review
StatusFileSize
new2.42 KB

Patch attached.

merlinofchaos’s picture

I'm going to need a reroll against the 3.x branch as well, it appears. This affects query.inc so the patches will be different.

darren oh’s picture

Re-roll. I haven't tested the 3.x branch.

merlinofchaos’s picture

Status: Needs review » Fixed

Hmm. I think I must've half committed this in the 3.x branch already. That was lame of me. Anyway, committed to 2.x and 3.x branches.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

josh waihi’s picture

Priority: Normal » Critical
Status: Closed (fixed) » Needs review
StatusFileSize
new628 bytes

I can't say I agree with the whole idea because not only does it look ugly but I question its scale-ability, which is a huge factor for PostgreSQL sites. If we're going to do pgsql specific stuff, I'd rather add non-aggregate fields to the group by clause rather than use first(). Running tests I found that this improved performance by 185%. For the meantime, when other tables with nid are included, this patch fails so attached is the fix for it.

darren oh’s picture

Status: Needs review » Closed (fixed)

Please refer to issue 607418.

darren oh’s picture

Status: Closed (fixed) » Needs review
StatusFileSize
new719 bytes

Sorry, that issue deals with a different problem. In answer to Josh's comment, adding more fields to the GROUP BY clause could result in duplicate records. We have to use aggregates for Distinct to work. The FIRST() function simply does explicitly what MySQL does implicitly.

I've modified the patch to use the base field alias, as is done in Views 3. Should be ready to commit.

merlinofchaos’s picture

Note that Josh is concerned that FIRST() causes major performance problems. To which I am sympathetic, but I also feel pulled between terrible forces on this one.

darren oh’s picture

We may be able to eliminate the use of FIRST by adding a DISTINCT ON clause. By including every GROUP BY field in the DISTINCT ON clause, we would match the behavior of MySQL.

josh waihi’s picture

I know this comment isn't going to go down well, but, is there a way that we could wrangle MySQL to perform more like PostgreSQL in this case? I thought that MySQL can handle PostgreSQL syntax fine in this case... SELECT DISTINCT(field_1), other_fields FROM {table} GROUP BY other_fields Is that different to SELECT DISTINCT(field_1), other_fields FROM {table} in MySQL?

If not, I'd like to look at the DISTINCT ON idea, though from the documentation, it didn't seem like it was what we're looking for. Hopefully I'm wrong.

@merlinofchaos can you commit #8 please?

YK85’s picture

I have come across some issues with distinct.
I was wondering if there has been further development in this area?

Thanks!

grub3’s picture

A possible solution would be to use SELECT DISTINCT ON (foo, bar) ON as proposed by Josh Berkus
Reference: http://www.postgresql.org/docs/8.4/static/queries-select-lists.html

Here expression is an arbitrary value expression that is evaluated for all rows. A set of rows for which all the expressions are equal are considered duplicates, and only the first row of the set is kept in the output. Note that the "first row" of a set is unpredictable unless the query is sorted on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. (DISTINCT ON processing occurs after ORDER BY sorting.)

Could you give an example of the original SQL query working in MySQL and I will do my best to translate it for PostgreSQL using pgAdmin3.

By the way, making extensive use of DISTINCT hits the database performance a lot, because it may provoque sequential scans to differentiate data. Drupal make heavy use of DISTINCT everywhere, because people have the impression that it has no impact on performance. But in database with 100.000 records, it can have a huge impact.

Many times, DISTINCT can be replaced with LEFT JOINs which are much faster, but I need to read your SQL to tell.

In Views, the user clicks on "Distinct" to choose whether results should be distinct. In some cases, when using views, I had the impression that the LEFT ROW was the node and that we could make a LEFT JOIN without asking the user whether it was distinct or not.

Please not this is my first post on Views, so I have NO KNOWLEDGE of any kind on this matter, only SQL.

Kind regards,
Jean-Michel

josh waihi’s picture

Status: Needs review » Active

@jmpc, this only applies to situations where the user determines that the query must be DISTINCT. Views won't make a query distinct unless the user tells it too. So for example, Joins are generally used where possible.

So our concern here is how Views deals with query building on PostgreSQL when a query must be distinct. View has a working solution at the moment but its crap in terms of performance.

Both Josh Berkus and Darren Oh (#8) suggest using DISTINCT ON which I am fore over the user of FIRST. What we need is a patch that does this and doesn't break MySQL.

robby.smith’s picture

subscribing

Letharion’s picture

Assigned: Unassigned » merlinofchaos
Priority: Critical » Major

"Low performance" bugs aren't critical.
@merlin
This one is getting rather old and could use some dev love :)

dawehner’s picture

Status: Active » Needs work

And additional we need a patch for views 3.x

merlinofchaos’s picture

Assigned: merlinofchaos » Unassigned
Status: Needs work » Closed (won't fix)

I wasn't really the person working on this, and the DISTINCT thing is kind of an ugly stepchild in Views. It's known to be a hack to try and deal with some issues. It's totally broken in 7.x and probably always will be at this point, so I don't think I am particularly interested in this issue.

If someone wants to take up the cause and fix this, they're welcome to, but for now, I'm just going to close this.