Over the past week we've been hit with a lot of spam - more spam than we typically deal with. I believe it's to the point where it's ridiculous we don't have any type of automated system in place to help with the problem, and it's wearing on me and possibly others.

In #1293186: Spam - meta: better spam-combating suggestions we've tossed around ideas about what to do, but the bottom line is that we should try something and if it doesn't work, change it.

I propose we try http://drupal.org/project/spam on this site.

To be determined: Security Implications and Performance

Comments

dman’s picture

+10
This month has been really dirty

Gerhard Killesreiter’s picture

see http://drupal.org/node/1293186#comment-5478346

Should we mark this as a duplicate?

klonos’s picture

I think not. That issue over there is more of a generic discussion and its goal shifts all the time with a lot of different opinions being expressed every now and then. This one here is specific and to the point.

cweagans’s picture

I think Spam.module fits our needs pretty well - a filter that learns seems to be the ideal tool for us. Looking at some of the filters that come with the spam module, I think there are some pretty cool tools that we could use effectively:

Included filters:
Bayesian filter - auto-learns, performing statistical analysis on the words in new content
Custom filter - regexp/plain text matching.
URL limiter - auto-learns spammer websites and blocks content linking to these URLs
SURBL - blacklist of URLs that commonly occur in spam (Third Party).
Node age filter - treats comments on old content as likely spam
Duplicate filter - blocks duplicate posts and bans associated IPs
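
To make the "auto-learns" part of the Bayesian filter concrete, here is a minimal naive Bayes sketch in Python. It is not spam.module's actual code; the class name `TinyBayes`, the tokenizer, and the Laplace smoothing are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    """Hypothetical two-class (spam/ham) naive Bayes word filter."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one piece of content already labelled 'spam' or 'ham'."""
        self.docs[label] += 1
        self.words[label].update(tokenize(text))

    def spam_probability(self, text):
        """Return an estimate in (0, 1) that the text is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) or 1
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed class prior and per-word likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            total_words = sum(self.words[label].values())
            for word in tokenize(text):
                score += math.log((self.words[label][word] + 1) / (total_words + vocab))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Every post that a moderator confirms as spam (or clears as legitimate) would be passed to `train`, which is how a filter of this kind gets better over time.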

I'd like to look at using that custom filter to find posts that contain 'rel="dofollow"' and automatically mark them as spam. I wish we had a database of all the spam that has been posted on Drupal.org - it'd be an unpleasant task, but we could analyze that and figure out some of the common patterns. Perhaps we could even train the Bayesian filter with it.
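
A rough Python sketch of what such a custom regexp rule could look like. The pattern and the function name `looks_like_dofollow_spam` are assumptions for illustration, not spam.module configuration syntax.

```python
import re

# Assumed pattern: matches rel="dofollow", rel='dofollow' or rel=dofollow,
# ignoring case and optional whitespace around the equals sign.
DOFOLLOW = re.compile(r"""rel\s*=\s*["']?\s*dofollow""", re.IGNORECASE)


def looks_like_dofollow_spam(body):
    """Return True if the post body contains a rel="dofollow" attribute."""
    return DOFOLLOW.search(body) is not None
```

A rule this narrow should produce few false positives, but it only catches one pattern, so it would complement the learning filters rather than replace them.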

cweagans’s picture

Another thing that we could add is a flood control module. That is, the more posts a given user makes within a given timeframe, the more likely it is that each subsequent post is spam.
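
A minimal sketch of that idea in Python, assuming a sliding time window and a fixed number of points per earlier post. The class name `FloodScore` and both default values are hypothetical, not taken from any existing module.

```python
from collections import defaultdict, deque


class FloodScore:
    """Hypothetical flood scoring: more recent posts by a user means a higher score."""

    def __init__(self, window_seconds=600, points_per_post=10):
        self.window = window_seconds
        self.points_per_post = points_per_post
        self.recent = defaultdict(deque)  # user id -> timestamps of recent posts

    def record_post(self, user_id, timestamp):
        """Record a post and return its flood score.

        The score is points_per_post for every earlier post by the same
        user that is still inside the time window.
        """
        posts = self.recent[user_id]
        # Forget posts that have fallen out of the window.
        while posts and timestamp - posts[0] >= self.window:
            posts.popleft()
        score = len(posts) * self.points_per_post
        posts.append(timestamp)
        return score
```

The returned score would be one input among several, so that a burst of posts raises suspicion without blocking anyone on its own.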

klonos’s picture

Yeah, but there should be a way for legitimate users (TBD) to be excluded so that we don't accidentally block them if they happen to post a series of successive comments from time to time.

Instead of having to calculate such things on the fly for each new comment, I believe it would be wise performance-wise (no pun intended) to simply check for a specific role assigned to the user posting the comment(s). We already have flag deployed in d.o, so another way to do this would be to auto-assign a "newcomer" flag to new user accounts and have it so that only comments from such accounts are checked by this flood control routine. That'd save us some CPU time and decrease the possibility of false positives.

The thought behind my proposal is that it is highly unlikely for a legitimate user like you or me to suddenly start spamming. So, why even bother checking those accounts in the first place? The only thing we need to figure out is what account properties define a "legitimate user" and how to flag those accounts as such.
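
Roughly what I have in mind, as a Python sketch. The `Account` class, the "trusted" role name and the set-based roles/flags are made up for illustration; only the "newcomer" flag name comes from the proposal above.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical stand-in for a user account."""
    name: str
    roles: set = field(default_factory=set)
    flags: set = field(default_factory=set)


TRUSTED_ROLES = {"trusted"}   # assumed role name, to be decided
NEWCOMER_FLAG = "newcomer"    # flag auto-assigned to new accounts


def needs_flood_check(account):
    """Only newcomer accounts without a trusted role go through flood control."""
    if account.roles & TRUSTED_ROLES:
        return False
    return NEWCOMER_FLAG in account.flags
```

The check is a couple of set lookups per comment, which is the CPU saving argued for above.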

cweagans’s picture

How about we focus on the functionality first, and optimize later where needed? We can sit here and talk forever about what needs checked where, but in the end, spam.module will solve more problems than it creates for us. Let's just get it deployed, get the filters trained, and then we can start talking about skipping spam checks for users with a certain role or length of membership or whatever.

silverwing’s picture

@klonos - I would have any roles (including 'vetted git user') bypass the filters.

As for flag.module, there's a JOIN argument that freaks out killes and others a bit, so using it any more on d.o without more investigation into its performance implications probably won't happen.

@cweagans - killes mentioned a database table that he'd be worried about before deployment, so I'm looking into that.

klonos’s picture

Yes, definitely! By no means was my comment intended as a show-stopper argument. Let's deploy now and tweak as we go.

andypost’s picture

With the Flag module already installed, I suppose it would be much easier to add a flag for reporting spam.

cweagans’s picture

Status: Active » Closed (won't fix)

The idea is to reduce the amount of manual intervention required for taking care of spam. We want to automate it. If people flag things as spam, somebody still has to go through and review the flagged content. In addition, the testing issue was marked as won't fix, so I'm going to won't fix this as well: Spam module doesn't seem to do what we want (or it wasn't configured properly or something). Spamicide was mentioned as a possible solution.

klonos’s picture

Status: Closed (won't fix) » Postponed

...from #226678-53: Add a "Report spam/abuse" link to forum/issue comments (next to the "edit" & "reply" links).:

...Right now, we're going to install Mollom. ... This is a temporary solution and will only be used until there is a working port of spam.module and report_spam.module for Drupal 7, at which point we'll start using those.

So this is not a wontfix but postponed on: #1063524: Port spam module to Drupal 7 and #1714302: Port Report Spam module to Drupal 7 I guess. Right?

cweagans’s picture

yep

klonos’s picture

...that's a relief. Thanx.

klonos’s picture

When we get back to this task, perhaps we should consider implementing a way to stop spammers from creating an account on d.o in the first place. That should considerably reduce the amount of work spam.module would need to do. Such a solution exists and has a 7.x version too: http://drupal.org/project/spambot (it uses www.stopforumspam.com)

killes@www.drop.org’s picture

We've had bad experiences with IP-based blocks when we had the http:bl module enabled. Some countries are only connected to the net through a smallish number of external IPs.

klonos’s picture

Hmm, I wasn't aware of that situation. Still, we can use an external service and, instead of blocking registration completely for blacklisted IPs, simply give newly created user accounts a certain number of "spaminess" points to begin with. Besides, from what I see, http://www.stopforumspam.com/ doesn't log only IPs, but the combination of IP, username and email used to register. The spambot module for its part takes that into account:

Checks (username, email, ip address) data against the www.stopforumspam.com blacklist. Blacklisting can be based on either of email, username or IP address (with configurable thresholds).

That should be safe enough I guess.
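
A sketch of the "points instead of blocking" idea in Python. The function name, the per-field hit counts, and the threshold and point values are all assumptions for illustration; this is not the spambot module's code or the stopforumspam.com API.

```python
def starting_spaminess(hits, thresholds, points):
    """Return starting spaminess points for a new account.

    hits:       how often each field was seen on the blacklist,
                e.g. {"email": 3, "username": 0, "ip": 12}
    thresholds: minimum hit count per field before it counts at all
    points:     points added when a field reaches its threshold
    """
    score = 0
    for name in ("email", "username", "ip"):
        threshold = thresholds.get(name)
        if threshold is not None and hits.get(name, 0) >= threshold:
            score += points.get(name, 0)
    return score
```

Setting a high threshold and few points for "ip" would address the shared-IP concern above, while a blacklisted email could still weigh heavily.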

killes@www.drop.org’s picture

The issue is that we then need to send our users' email addresses to a 3rd party service, which is one of the issues with Mollom.

klonos’s picture

Yes, I know, but thankfully http://www.stopforumspam.com/, besides offering an API to check things against their db, also provides the db data in various downloadable formats!! No need to send out any data at all - just set a cron job to download their hourly/daily ip/email/username files and store them locally. This way I guess the check will be faster too. Perhaps in return for the benefits of using their data we should implement a way to send back to http://www.stopforumspam.com/ only the data of registered users that are indeed deemed spammers by our Bayesian filter or by manual cleanup. Alternatively we could consider donating a certain amount each year ;)
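
A sketch of the local check in Python, assuming each downloaded list has been reduced to plain text with one entry per line. The real download formats may differ, and both function names are hypothetical.

```python
def parse_blacklist(lines):
    """Build a lookup set from an iterable of lines (assumed one entry per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def local_matches(ip, email, username, ips, emails, usernames):
    """Check registration data against local sets; nothing leaves the server."""
    return {
        "ip": ip.strip().lower() in ips,
        "email": email.strip().lower() in emails,
        "username": username.strip().lower() in usernames,
    }
```

The lines would come from the locally stored files, e.g. `with open(path) as fh: ips = parse_blacklist(fh)`, where `path` is wherever the cron job saved the download. Set membership tests are constant time on average, which is where the speed-up over a remote API call would come from.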

anarcat’s picture

For the record, I have had numerous problems with spam.module on my blog, I can't imagine installing this on something of the scale of Drupal.org. Spam.module does a lot of things, and it's not always clear which part marks which post as spam. And to get an idea why, you need to crank up debugging which will yield too much data here (see #1118442: Trace option in spam module just shows blank page for a discussion about this).

So some caveat... I am not sure spam.module even works anymore... Right now my situation is that I am on the verge of disabling it because it's marking *everything* as spam right now...

killes@www.drop.org’s picture

Status: Postponed » Closed (won't fix)

I think we are reasonably happy with honeypot & co.