PostgreSQL fixes

plj - September 23, 2009 - 15:04
Project:Spam
Version:6.x-1.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs work
Description

There is a PostgreSQL problem in duplicate filter's mega-SQL that searches for blacklisted IPs; it complains about missing GROUP BY fields, and won't list anything.

I've attached a patch that fixes that, and also includes another fix, where Spam attempts to update spam_tracker as if content_id were a string ('%s' modifier), even though it is an integer. This causes an “invalid input syntax for integer” SQL error on Postgres, if the variable is empty.

AttachmentSize
spam_psql.patch2.47 KB

#1

gnassar - November 9, 2009 - 18:53
Priority:critical» normal
Status:needs review» needs work

(note: edited from my original for clarity/readability)

Thank you for your patch.

----

On the second fix, shouldn't an error be coming back if content_id is empty? Or, to ask another way, when is it the case that it is valid for content_id to be empty? Maybe my memory is rusty on the code.

----

On the first fix, you point out a SQL statement that may have seriously needed some looking at. Thank you.
I think, however, your fix doesn't work -- it gets rid of the error, but it breaks the purpose of the line:

sql = "SELECT * FROM (SELECT DISTINCT x.hostname, x.timestamp, COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d ORDER BY timestamp DESC) AS x GROUP BY x.hostname, x.timestamp) AS y WHERE y.count >= %d";

will, since the chance is nearly 0% that anything will have an identical timestamp to something else, make everything have count = 1 and thereby effectively turn off the blacklist (while slowing everything down with an extra GROUP BY that matches everything individually).

The original:

sql = "SELECT * FROM (SELECT DISTINCT x.hostname, x.timestamp, COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d ORDER BY timestamp DESC) AS x GROUP BY x.hostname) AS y WHERE y.count >= %d";

is better fixed by adding a MAX aggregate function to the timestamp (and removing the now-redundant ORDER BY and probably never actually necessary DISTINCT clauses), fixing your problem while returning the same thing as the MySQL would (which I believe your code would not):

sql = "SELECT * FROM (SELECT x.hostname, MAX(x.timestamp), COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d) AS x GROUP BY x.hostname) AS y WHERE y.count >= %d";

Then the inner SQL simplifies to one SELECT clause, so the final statement would be:

sql = "SELECT * FROM (SELECT hostname, MAX(timestamp), COUNT(hostname) AS count FROM {spam_tracker} WHERE score > %d GROUP BY hostname) AS y WHERE y.count >= %d";

That final SQL string should work in both MySQL and Postgres, and additionally execute a good deal faster than the first version.

 
 

Drupal is a registered trademark of Dries Buytaert.