Spam handling usability issues [#11991]

I applaud the effort in this module, I really do, honest ... but in its present form, I wonder if the module is too awkward to be useful against typical real-world spam attacks.

Here's why:

Three days in a row, my site has been assailed by spammers bent on flogging anatomical enhancement aids; the spam module offers low protection, slow defense, and awkward recovery.

1. Low protection: I can add the anatomical keyphrase into the db manually, as in
INSERT INTO spam_tokens VALUES("Levitra","1","0","99","1097334991"); ... this is naively copied and modified from the example training file, but what I learned by adding rules this way is that one instance is not enough -- the next day, 28 more comments, all of them with the noted key phrase. It seems that without a bulk-training mode on a sample with repeated examples, it could take a long, long time before my site sees any useful protection.
2. Slow defense: Related to (1) and maybe even the same thing, there's no facility for the use case of escalating a key-phrase; when I get these anatomical enhancement ads flooding in today, I can't ban one and be protected from any more; I have to ban a lot of them and even then only see results statistically. When the spammers fling dozens of crap balls at you, statistical is not good enough.
3. Awkward recovery: Ok, let's work through the scenario of 28 enhancement ads ... how do you (a) train the filter on them and (b) remove them? The answer appears to be: one at a time. This is a big show-stopper for any practical use of spam.module as it is today -- one at a time you must click the comment, scroll down, click Mark as Spam, go back to the comment and Delete, then go back to the list of comments and do the same for the other 27.

I didn't. What I did was mark the first two as spam, then fire up mysql at the command line and issue a command somewhat like delete from comments where comment like '%anatomic-item enlargement%';... it works, but it's not the sort of solution we can recommend to the average webmaster, plus you have to purge your cache to be completely rid of them.

My own experience shows us spam protection is urgent for Drupal; without some sort of defense, we might as well rename the project The Levitra Portal ;) -- I think the module still has merit as a talking point, as an example and the beginning of a framework.

So much to code, so little time: If I had the time and means, here's what I would add to spam module to get it quasi production ready as quickly as possible:

1. Mark as Spam: this should be one of the options on the admin/comments, and it should be a batch operation via checkboxes -- 28 clicks to mark them, or maybe, as with webmail programs, show one or two dozen at a time with a javascript-hooked button to Select All and Invert Selection, and then action buttons for the selected set to mark the batch as spam and/or delete the works. Unpublish is a first step, but I really don't want to be paying for hosting space for 365x28 viagra adverts.
2. Two Stage Filtering: Spamassassin uses this approach and it's a very good method -- it is true that some spam is very clever and, while obvious to human readers, totally defies pattern-matching, and for this we need the Bayesian analysis. But it is also true that some spam, like the two dozen per day I've collected this week, has obvious keywords and key patterns that could be legitimate, but which my blog can live without. Instant disqualification. Zap. Those that pass the array of perl regex filters then go on to the more slippery Bayesian test.

I realize that both extensions are non-trivial, one of them may even require new hooks in the comment module -- I didn't say the fix would be easy, only that without them, I don't think Drupal can work out well for those who aren't comfortable with direct SQL access (plus, once I do my SQL delete, I suppose I must also flush my cache too, right?) -- we do need to face this scourge of blog-spam and offer whatever help we can to bring spam.module up to production environment requirements asap.

Comments?

Comment	File	Size	Author
#3	blacklist.txt	21.33 KB	garym@teledyn.com

Comments

Comment #1

jeremy commented 24 October 2004 at 02:32

Hi Gary,

Thanks, I appreciate the feedback. As you can see, this module is fairly young and still has plenty of room for improvement. I'm glad to get your suggestions -- keep them coming. :)

1. Low protection: I can add the anatomical keyphrase into the db manually, as in
INSERT INTO spam_tokens VALUES("Levitra","1","0","99","1097334991"); ... this is naively copied and modified from the example training file, but what I learned by adding rules this way is that one instance is not enough -- the next day, 28 more comments, all of them with the noted key phrase. It seems that without a bulk-training mode on a sample with repeated examples, it could take a long long time before my site sees any useful protection.

It does take time to train a Bayesian filter. This is always true, including with popular tools such as Spamassassin. Generally the filter needs to see a few hundred spam or more before it's able to start automatically catching spam. As for "key phrases", that's not really how this module works. Instead, it breaks content into "words", then finds the 15 most "interesting" words and sees if they appear more often in spam content or non-spam content. Based on that, it decides whether it should mark the new content as spam or non-spam.
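The word-scoring approach described above can be sketched roughly as follows. This is Python rather than the module's PHP, and it is only a minimal illustration under stated assumptions: the function names, the toy token counts, the 0.01/0.99 clamping and the way the probabilities are combined are hypothetical and are not taken from spam.module itself. Only the "15 most interesting words" idea and the default of 40% for unknown words come from this thread.

```python
import re
from math import prod


def tokenize(text):
    """Split content into lowercase 'words' (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham, unknown=0.4):
    """Estimate how spam-like a word is from how often it was seen in each corpus."""
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s + h == 0:
        return unknown  # never seen before: fall back to the configured default
    spam_rate = s / max(n_spam, 1)
    ham_rate = h / max(n_ham, 1)
    p = spam_rate / (spam_rate + ham_rate)
    return min(0.99, max(0.01, p))  # assumed clamp so no single word is absolute


def spam_score(text, spam_counts, ham_counts, n_spam, n_ham, max_words=15, unknown=0.4):
    """Combine the most 'interesting' words into one score between 0 and 1."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham, unknown)
        for t in set(tokenize(text))
    ]
    if not probs:
        return unknown
    # "Interesting" here means farthest from the neutral value 0.5.
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:max_words]
    spam_like = prod(top)
    ham_like = prod(1 - p for p in top)
    return spam_like / (spam_like + ham_like)
```

With counts gathered from, say, 20 spam and 40 non-spam comments, a comment made up mostly of words seen only in spam scores close to 1, and one made up of words seen only in normal comments scores close to 0. With an empty training set every word falls back to the unknown default, which is why an untrained filter catches nothing.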

I intend to introduce a Markovian tokenizer some time soon, at which time the module will begin to recognize phrases in addition to words. When time permits.

2. Slow defense: Related to (1) and maybe even the same thing, there's no facility for the use case of escalating a key-phrase; when I get these anatomical enhancement ads flooding in today, I can't ban one and be protected from any more; I have to ban a lot of them and even then only see results statistically. When the spammers fling dozens of crap balls at you, statistical is not good enough.

In theory, if you train the filter with each new spam content, it should eventually "learn" enough spam "words" and start marking a high percentage of them correctly. (I would hope >95%). Proper and consistent training is key, however.

3. Awkward recovery: Ok, let's work through the scenario of 28 enhancement ads ... how do you (a) train the filter on them and (b) remove them? The answer appears to be: one at a time. This is a big show-stopper for any practical use of spam.module as it is today -- one at a time you must click the comment, scroll down, click Mark as Spam, go back to the comment and Delete, then go back to the list of comments and do the same for the other 27.

True, there is a fair amount of manual effort involved. I'll give this some thought. More comments on this below.

I didn't. What I did was mark the first two as spam, then fire up mysql at the command line and issue a command somewhat like delete from comments where comment like '%anatomic-item enlargement%';... it works, but it's not the sort of solution we can recommend to the average webmaster, plus you have to purge your cache to be completely rid of them.

This is a bad idea. You want to train the filter with _all_ spam that you see. If you only train it with a few of the spam comments that you see, it will take longer to learn. Even if all the comments are identical, it's in your best interests to train with each and every one of them that the module gets wrong. (The default is "TOE" mode, meaning "Train On Error". Any time the spam filter incorrectly marks content, you need to teach it what it should have done. That's how it learns.)

I repeat: always correct the module when it makes mistakes. If it doesn't recognize a spam comment as spam, you need to correct it each and every time or it will continue to make mistakes.
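To make "Train On Error" concrete, here is a small Python sketch of the idea: the token counts change only when the filter's guess disagrees with the admin's verdict. The class name, the vote-counting classifier and the whitespace tokenizer are all hypothetical stand-ins chosen to keep the example short; the real module uses Bayesian scoring and needs far more than one example before it starts catching things.

```python
from collections import Counter


class TrainOnErrorFilter:
    """Toy filter that only learns from content it classified wrongly."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.corrections = 0

    def looks_like_spam(self, text):
        """Crude stand-in classifier: do spam-leaning words outnumber the others?"""
        words = text.lower().split()
        spam_votes = sum(1 for w in words if self.spam_tokens[w] > self.ham_tokens[w])
        ham_votes = sum(1 for w in words if self.ham_tokens[w] > self.spam_tokens[w])
        return spam_votes > ham_votes

    def review(self, text, is_spam):
        """Record the admin's verdict; train only when the filter got it wrong."""
        if self.looks_like_spam(text) == is_spam:
            return False  # the filter was right, so there is nothing to correct
        target = self.spam_tokens if is_spam else self.ham_tokens
        target.update(text.lower().split())
        self.corrections += 1
        return True
```

The point of the sketch is the `review` method: every missed spam that goes uncorrected (for example because it was deleted straight from the database) is a training opportunity the filter never sees.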

1. Mark as Spam: this should be one of the options on the admin/comments, and it should be a batch operation via checkboxes -- 28 clicks to mark them, or maybe, as with webmail programs, show one or two dozen at a time with a javascript-hooked button to Select All and Invert Selection, and then action buttons for the selected set to mark the batch as spam and/or delete the works. Unpublish is a first step, but I really don't want to be paying for hosting space for 365x28 viagra adverts.

I did not do this for one simple reason: it requires patches to the Drupal core. I have patched Drupal core many times over the years, but have found that I simply don't have time to keep up with the patches each time a new version is released. Of course, if there was interest from Dries et al in getting the spam filtering functionality into core itself, then many improvements could happen.

That said, I will consider the possibility of introducing some patches for the 4.5 tarball to allow mass-marking messages. In your case, it sounds like this is essential for the module to be usable. And it would not be difficult.

As for deleting spam content, I think this is a bad idea. The reason being that I only started working on this module a little over a month ago, and it will go through some major changes in the coming months. If you delete your spam, then you'll either not be able to upgrade when a new version comes out (in which I change the tokenizer logic to recognize phrases in addition to just "words"), or you'll have to "forget" everything you've taught it so far and start over from the beginning. I firmly believe that at this time deleting spam content is not in your best interests.

2. Two Stage Filtering: Spamassassin uses this approach and it's a very good method -- it is true that some spam is very clever and while obvious to human readers, totally defies pattern-matching, and for this we need the Bayesian analysis. But it is also true that some spam, like the two dozen per day I've collected this week, have obvious keywords and key patterns that could be legitimate, but which my blog can live without. Instant disqualification. Zap. Those that pass the array of perl regex filters, those then go on to the more slippery Bayesian test.

I like this idea quite a bit. It is a simple enough change, though I'm not currently sure on the user interface. In any case, I'll add this functionality at some point in the relatively near future. (I'll have limited spare time for the next couple of weeks to months) Ideally I'd like to make it so you scroll through phrases the module has seen and select the ones that immediately qualify content as being spam. However for now it will be simpler to have the admin manually enter such words/phrases.
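As a rough picture of how the two stages would fit together, here is a Python sketch (not the module's PHP). The function name, the return shape and the 0.8 threshold expressed as a fraction are assumptions for illustration; `bayes_score` stands for whatever statistical scorer is plugged in.

```python
import re


def two_stage_check(text, custom_patterns, bayes_score, threshold=0.8):
    """Return (is_spam, reason): hard regex rules first, statistical score second."""
    # Stage 1: any admin-entered pattern that matches disqualifies the content at once.
    for pattern in custom_patterns:
        if pattern.search(text):
            return True, f"custom filter: {pattern.pattern}"
    # Stage 2: only content that passed every pattern reaches the Bayesian test.
    score = bayes_score(text)
    return score >= threshold, f"bayesian score {score:.2f}"


# Hypothetical usage with a single admin-entered pattern and a dummy scorer.
patterns = [re.compile(r"\blevitra\b", re.IGNORECASE)]
print(two_stage_check("Buy LEVITRA today", patterns, lambda text: 0.0))
```

Because the first stage returns early, a pattern hit costs nothing statistically and works from the very first spam, which is the "instant disqualification" asked for above.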

I realize that both extensions are non-trivial, one of them may even require new hooks in the comment module -- I didn't say the fix would be easy, only that without them, I don't think Drupal can work out well for those who aren't comfortable with direct SQL access (plus, once I do my SQL delete, I suppose I must also flush my cache too, right?) -- we do need to face this scourge of blog-spam and offer whatever help we can to bring spam.module up to production environment requirements asap.

If you want to help improve this module, something you can do is to save all of your spam. And after you have a lot of it (hundreds to thousands), dump it all to a text file and mail the dumpfile to me. That would greatly help me to improve the tokenizer logic, and to better tune the Bayesian filter.

Comments?

I have some suggestions for things you might try. First, enable 'Advanced configuration' of the module. Review the options there, as it may help to do some tuning. The module is currently tuned based on some papers I read about filtering spam email. But spam comments are quite a bit different, and obviously the tuning will have to change. (That's why I'd like to get a dump of all your spam -- it will help me to get the default tuning right.)

Some things you might want to tune: If you have a lot of shorter spam comments, drop the number of words examined from 15 down to 10 or even 5. That could result in a lot of false positives, but it should help the module respond more quickly. You may also want to increase the 'assign unknown token probability' from 40 to something much higher -- this will cause the filter to assume words it hasn't seen before are spam, rather than non-spam. Again, this could result in an increase in false positives, but it will also allow the filter to catch more spam. Perhaps try setting it to 60, or even 70?

Finally, you could lower the default threshold from 80, to maybe 60 or 50.

Of course, when making any or all of these changes you will want to watch carefully for false positives. And anything the spam filter incorrectly marks as spam, be _sure_ to tell it really wasn't spam. If you tune too far, you'll find that instead of always clicking 'mark as spam', you'll always be clicking 'mark as not spam'. The key is to test and find that happy medium...
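To see why these three knobs matter for short comments, consider a toy calculation in Python. It assumes a four-word comment with one word the filter has learned (given a made-up spam probability of 0.9) and three words it has never seen, and it assumes the individual probabilities are combined in the usual naive-Bayes way. The module's settings are percentages (40, 80); the sketch uses fractions (0.4, 0.8), and `max_words` is assumed to be at least 1.

```python
from math import prod


def combined_score(token_probs, max_words=15):
    """Combine the max_words most 'interesting' probabilities into one score."""
    top = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:max_words]
    spam_like = prod(top)
    ham_like = prod(1 - p for p in top)
    return spam_like / (spam_like + ham_like)


def short_comment_score(unknown, max_words=15):
    """One learned spam word (0.9, hypothetical) plus three never-seen words."""
    return combined_score([0.9] + [unknown] * 3, max_words)


for unknown in (0.4, 0.6):
    print(unknown, round(short_comment_score(unknown), 3))
```

With the unknown-word probability at 0.4 the comment scores about 0.73, which is under a threshold of 0.8 but over a threshold of 0.6. Raising the unknown-word probability to 0.6 lifts the score to about 0.97, and examining only the single most interesting word gives 0.9. In each case a legitimate comment full of new words is pushed toward the spam side too, which is the false-positive risk mentioned above.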

Comment #2

jeremy commented 24 October 2004 at 03:56

Assigned: Unassigned » jeremy

That said, I will consider the possibility of introducing some patches for the 4.5 tarball to allow mass-marking messages. In your case, it sounds like this is essential for the module to be usable. And it would not be difficult.

Okay, patches are available in the new "optional" directory. Apply the patch against the Drupal 4.5.0 comment module and it will give you the ability to mass-mark comments as spam or not spam. (It also adds the ability to mass-delete, mass-publish, and mass-unpublish comments).

(Note: You will also need an updated spam.module, as I had to fix an existing function, and introduce a new one...)

In the future, please file unique feature requests as separate issues.

Comment #3

garym@teledyn.com commented 24 October 2004 at 17:19

Status	File	Size
new	blacklist.txt	21.33 KB

Of course, if there was interest from Dries et al in getting the spam filtering functionality into core itself, then many improvements could happen ...

I am doing my utmost best to thrust this issue forth at every opportunity ... to the point of sounding like a broken record (CDs skip too ;) because I think this is a big gaping hole that prevents any mass adoption of Drupal for non-technical and non-fringe (read "secure through obscurity") situations ... and it's currently the biggest thorn in my side having now moved all my websites into Drupal.

Because I've been outspoken on spam since the early 90's, and because I was one of the first to positively identify robot comment spam, I'm on their hate-lists and thus become an excellent example target. I also have a stern conviction against blaming every anonymous commenter and I'm a strong advocate of open communications that welcomes participation, so I have a high need for a solution at least as workable as Jay Allen's mt-blacklist.

As for keeping the spams, sorry, but that just isn't practical: I have set my site temporarily to suspect everyone and throw all unidentified comments into the approval queue, a situation I want to reverse asap, but when the spams outnumber real comments 200:1, I don't have time to page through long lists of search-engine, viagra and cigarette ads until I get to the one or two true comments; as much as I realize you could use the data, I have to delete them in order to suss the signal from the noise.

What I can do is attach my mt-blacklist list of perl expressions; I hope it helps.

Comment #4

garym@teledyn.com commented 24 October 2004 at 17:30

Title: Spam handling usability issues » In defense of patterns

Just a note on the Bayesian method's need to have a certain threshold of phrase-length to test: Spammers know this. In my email spam, a large percentage now pad their spam message with a paragraph of innocuous random words to sway the Bayesian filters from disqualifying them. Also, for example, the cheap cigarette vendors' spams are short statements of fact where nearly every word is an HTML A link, and the only really identifying feature of the spam is the domain name of the URL.

You may notice a lot of perfectly legitimate patterns in my blacklist.txt ... I know, I'm banning a lot of perfectly fine comments and I've even been stung once or twice myself trying to enter a comment and wondering why I'm banning myself, but when you spend hours every single day scraping shite from your website, after a while you just say to hell with it and ban whole categories of topics.

I hate spam.

Comment #5

garym@teledyn.com commented 24 October 2004 at 17:33

Title: In defense of patterns » Spam handling usability issues

Hmmmm ... not a feature request, just a thought: What about using the blogger-api with a link item that says "report this to Jeremy ..."

Comment #6

garym@teledyn.com commented 24 October 2004 at 17:59

the comment.module patch works wonderfully well on the CVS, but I just thought I'd let you know that I had to remove the spam.module in order to see the modules page -- could it be your patch on that code hasn't been checked into HEAD?

Comment #7

garym@teledyn.com commented 24 October 2004 at 18:21

disregard that last one -- I get the same behaviour with v 1.17 as v 1.9.2.8, the admin/modules page bombs out, and sure enough, both versions are identical except for the version number.

Comment #8

jeremy commented 25 October 2004 at 02:12

the comment.module patch works wonderfully well on the CVS, but I just thought I'd let you know that I had to remove the spam.module in order to see the modules page

I'm unable to duplicate this. I've got a clean 4.5.0 tarball installation with the comment.module patch applied, and the spam.module installed, and everything works fine. You're saying if you move the spam.module out of the modules directory, then 'admin/modules' displays a list of all available modules. But if you move the spam.module into the directory, then you just get a blank page at 'admin/modules'?

I suspect that you've got something funky in your database. Perhaps do a sanity check on your 'system' table?

Anyway, I had a spare hour today and I'm 75% of the way done with implementing regex filtering in addition to the Bayesian filtering. I'm hoping to have time to finish tonight before I have to call it a night.

Comment #9

jeremy commented 25 October 2004 at 03:42

Okay, grab the latest spam.module. You'll need to update your database as well, creating the 'spam_custom' table. Then go to "administer > spam > custom filters" and you can add all the items from your blacklist. Remember to add the beginning and ending /'s, and add an "i" at the end if the matching should be case insensitive. For example, from your blacklist:

     \b(annunci|canzoni|cartoni|donne|scarica|sesso|sonnerie)([\w\d_.-]+)+\.\w{2,3}\b

Should be entered as a custom filter as:

     /\b(annunci|canzoni|cartoni|donne|scarica|sesso|sonnerie)([\w\d_.-]+)+\.\w{2,3}\b/i

At some point I will probably enhance this functionality to validate regex strings. Other suggestions on the usability, etc, are of course very welcome.
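For what validating such strings might look like, here is a sketch in Python. It is not the module's code: the function name and the set of accepted flags are assumptions, and Python's `re` syntax differs from PHP's PCRE in places, so a pattern accepted here is not guaranteed to behave identically in the module.

```python
import re

# Assumed subset of trailing modifiers, mapped to Python's equivalents.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}


def parse_custom_filter(entry):
    """Parse '/pattern/flags' into a compiled regex; raise ValueError if malformed."""
    if len(entry) < 2 or not entry.startswith("/"):
        raise ValueError("filter must start with '/'")
    end = entry.rfind("/")
    if end == 0:
        raise ValueError("filter must have a closing '/'")
    body, flag_text = entry[1:end], entry[end + 1:]
    flags = 0
    for ch in flag_text:
        if ch not in FLAG_MAP:
            raise ValueError(f"unsupported flag: {ch!r}")
        flags |= FLAG_MAP[ch]
    try:
        return re.compile(body, flags)
    except re.error as exc:
        raise ValueError(f"invalid regex: {exc}") from exc


entry = r"/\b(annunci|canzoni|cartoni|donne|scarica|sesso|sonnerie)([\w\d_.-]+)+\.\w{2,3}\b/i"
pattern = parse_custom_filter(entry)
print(bool(pattern.search("visit SESSOgratis.com today")))
```

A bare blacklist line without the surrounding slashes is rejected with a clear message instead of silently never matching, which is the mistake the instructions above warn against.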

Perhaps a default blacklist could be shipped with the module, though I'm not sure if this would be locale specific.

I'm marking this feature request fixed as both of your requests are now implemented. If you have further problems with "admin > modules" not displaying, please open a separate support request or bug report.
