PostgreSQL fixes
plj - September 23, 2009 - 15:04
| Project: | Spam |
| Version: | 6.x-1.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
Jump to:
Description
There is a PostgreSQL problem in duplicate filter's mega-SQL that searches for blacklisted IPs; it complains about missing GROUP BY fields, and won't list anything.
I've attached a patch that fixes that, and also includes another fix, where Spam attempts to update spam_tracker as if content_id were a string ('%s' modifier), even though it is an integer. This causes an “invalid input syntax for integer” SQL error on Postgres, if the variable is empty.
| Attachment | Size |
|---|---|
| spam_psql.patch | 2.47 KB |

#1
(note: edited from my original for clarity/readability)
Thank you for your patch.
----
On the second fix, shouldn't an error be coming back if content_id is empty? Or, to ask another way, when is it the case that it is valid for content_id to be empty? Maybe my memory is rusty on the code.
----
On the first fix, you point out a SQL statement that may have seriously needed some looking at. Thank you.
I think, however, your fix doesn't work -- it gets rid of the error, but it breaks the purpose of the line:
sql = "SELECT * FROM (SELECT DISTINCT x.hostname, x.timestamp, COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d ORDER BY timestamp DESC) AS x GROUP BY x.hostname, x.timestamp) AS y WHERE y.count >= %d";will, since the chance is nearly 0% that anything will have an identical timestamp to something else, make everything have count = 1 and thereby effectively turn off the blacklist (while slowing everything down with an extra GROUP BY that matches everything individually).
The original:
sql = "SELECT * FROM (SELECT DISTINCT x.hostname, x.timestamp, COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d ORDER BY timestamp DESC) AS x GROUP BY x.hostname) AS y WHERE y.count >= %d";is better fixed by adding a MAX aggregate function to the timestamp (and removing the now-redundant ORDER BY and probably never actually necessary DISTINCT clauses), fixing your problem while returning the same thing as the MySQL would (which I believe your code would not):
sql = "SELECT * FROM (SELECT x.hostname, MAX(x.timestamp), COUNT(x.hostname) AS count FROM (SELECT timestamp, hostname FROM {spam_tracker} WHERE score > %d) AS x GROUP BY x.hostname) AS y WHERE y.count >= %d";Then the inner SQL simplifies to one SELECT clause, so the final statement would be:
sql = "SELECT * FROM (SELECT hostname, MAX(timestamp), COUNT(hostname) AS count FROM {spam_tracker} WHERE score > %d GROUP BY hostname) AS y WHERE y.count >= %d";That final SQL string should work in both MySQL and Postgres, and additionally execute a good deal faster than the first version.