In the past couple of months, I've begun to notice the occasional posting of 'comment spam' on my website. These have tended to include a short string of nonsense "mad-lib" style text, followed by a large number of offsite links. I currently use the tracker module to at least glance at every comment left on my website, so I eventually find this spam and manually delete it. However, as the rate of this comment spam has increased, I've been looking for a better way to deal with it.

Not wanting to re-invent the wheel, I began by looking at SpamAssassin and other free anti-spam tools. I had hoped to integrate one of these tools into Drupal, letting it do the actual work of deciding whether or not a given comment was spam. With further research, I found that this wasn't very workable, as these anti-spam tools tended to be very mail-centric, looking at more than just the body of the email. Instead, I read up on using Bayesian logic, and ultimately decided it would be best to write a simple Bayesian filter in PHP.

Spam module
Thus, last weekend the spam.module was born. Based on Paul Graham's papers on the subject, it breaks each new comment into words, finding those that are the "most interesting", which means those that are most likely spam or most likely not spam. Using the selected words, it calculates the probability of whether or not the new comment is spam. For comments that have a high probability of being spam, special actions can be taken (such as preventing the comment from being viewed).
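To make the idea concrete, here is a minimal sketch in Python (not the module's PHP code) of picking the "most interesting" words and combining them into one score. The combining formula and the limit of 15 words follow Paul Graham's "A Plan for Spam"; the function names, the naive tokenizer, and the assumption that the module uses the same limit are illustrative only.

```python
import re
from math import prod

UNKNOWN_WORD_PROBABILITY = 0.4  # default for unseen words, as described in this article


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def most_interesting(tokens, probabilities, limit=15):
    """Return the probabilities of the tokens farthest from the neutral 0.5."""
    unique = sorted(set(tokens))
    scored = [probabilities.get(t, UNKNOWN_WORD_PROBABILITY) for t in unique]
    return sorted(scored, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def spam_probability(text, probabilities, limit=15):
    """Combine the most interesting word probabilities into one spam score."""
    picked = most_interesting(tokenize(text), probabilities, limit)
    if not picked:
        return UNKNOWN_WORD_PROBABILITY
    spam = prod(picked)                  # chance all picked words indicate spam
    ham = prod(1.0 - p for p in picked)  # chance all picked words indicate non-spam
    return spam / (spam + ham)
```

Note that a comment made up entirely of unseen words scores below 0.5 under this formula, which matches the behavior described below: an untrained filter treats everything as non-spam.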

Bayesian logic, how it works
In the beginning, the module does not know what is spam and what is not spam. In the default configuration, it will assume that all words it has not seen before have a 40% chance of being spam. Based on this assumption, all comments it sees will be considered non-spam. It is up to the site administrators to "teach" the module when a comment was actually spam, by clicking a link at the bottom of the comment that reads "mark as spam". At this point, the occurrence of all words in the spam comment will be counted and stored in a "spam" database table. Should the module mistakenly label a valid comment as spam, the administrator will need to click "mark as not spam", and the words from that comment will be counted and stored in a separate "nonspam" table.

When the module sees that either the "spam" or "non-spam" tables have changed, it will recalculate the probability of every word in these tables being spam or non-spam, storing these probabilities in a third "spam-probability" database table. It is against this third table that all new comments are weighed to determine whether or not they are most likely spam.
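The recalculation step can be sketched as follows. This is a Python illustration using the per-word formula from Paul Graham's "A Plan for Spam" (non-spam counts doubled to bias against false positives, words seen fewer than five times skipped, results clamped to 0.01..0.99). Whether the module uses exactly these constants is an assumption, and the function and parameter names are hypothetical.

```python
from collections import Counter


def word_probabilities(spam_counts, nonspam_counts, spam_total, nonspam_total, min_seen=5):
    """Derive a per-word spam probability from the two count tables.

    spam_total and nonspam_total are the number of comments trained as
    spam and non-spam respectively.
    """
    probabilities = {}
    for word in set(spam_counts) | set(nonspam_counts):
        bad = spam_counts.get(word, 0)
        good = 2 * nonspam_counts.get(word, 0)  # doubled, per Graham
        if good + bad < min_seen:
            continue  # too rare to trust; lookups fall back to the 40% default
        bad_ratio = min(1.0, bad / max(spam_total, 1))
        good_ratio = min(1.0, good / max(nonspam_total, 1))
        p = bad_ratio / (good_ratio + bad_ratio)
        probabilities[word] = max(0.01, min(0.99, p))
    return probabilities
```

A word seen only in spam ends up at 0.99, a word seen only in valid comments at 0.01, and common words that appear in both land somewhere in between.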

By default, the module operates in TOE mode, or Train On Error mode. That is, if it correctly labels a comment as spam or non-spam, it doesn't learn from the comment. Only if it makes a mistake which an admin then manually corrects will it add the comment's words into the appropriate database table for calculating new probabilities. Alternatively, the module can operate in TEFT mode, or Train Everything mode. This is also known as "auto-learn" mode, as it will store away words from all comments that it sees to try to fine-tune its probability tables. While both modes are supported, research seems to favor TOE as ultimately being more reliable.
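The difference between the two modes can be sketched like this. It is a Python illustration only; the class and method names are hypothetical, and the choice to "unlearn" a wrong TEFT verdict when an admin corrects it is an assumption rather than something the module is known to do.

```python
from collections import Counter


class Trainer:
    """Keeps spam and non-spam word counts under TOE or TEFT training."""

    def __init__(self, mode="TOE"):
        if mode not in ("TOE", "TEFT"):
            raise ValueError("mode must be 'TOE' or 'TEFT'")
        self.mode = mode
        self.spam_counts = Counter()
        self.nonspam_counts = Counter()

    def _table(self, is_spam):
        return self.spam_counts if is_spam else self.nonspam_counts

    def observe(self, tokens, predicted_spam):
        """Called for every classified comment; only TEFT learns here."""
        if self.mode == "TEFT":
            self._table(predicted_spam).update(tokens)

    def correct(self, tokens, actually_spam):
        """Called when an admin overrides the filter's verdict."""
        if self.mode == "TEFT":
            # Assumption: undo what was auto-learned from the wrong verdict.
            wrong = self._table(not actually_spam)
            wrong.subtract(tokens)
            for word in [w for w, n in wrong.items() if n <= 0]:
                del wrong[word]
        self._table(actually_spam).update(tokens)
```

In TOE mode the tables only change inside correct(), so they grow slowly but contain nothing the filter merely guessed at.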

As spam comments are still relatively rare, it will probably take a very long time to train the Bayesian logic to properly catch most spam. It is doubtful that you have a large pool of spam-comments to draw from, meaning you will simply have to train as you go (another reason that TOE mode is well-suited). That said, it is probably only a matter of time before spam comments become a regular nuisance, at which time training the filter won't take nearly as long.

Plans for the module, looking for feedback
At this time, the spam.module is in early development. In particular, while the underlying logic is believed to be fully functional, it has not been optimized for best performance, the administrative interface used to control the module is still rough, and the module doesn't actually take any actions when it detects spam. That said, none of these are difficult obstacles, so I expect to have a fully functional and useful module in the near future.

The purpose of this article is to generate some feedback. I have a number of ideas, but would be interested in hearing more from others that have also thought about this problem, or have perhaps already begun to try and tackle the problem.

For example, what is the best action to take when the module detects spam? It could mail the administrator to let him/her know what has been posted to his/her site. It could prevent the spam from being posted. It could let the spam be posted, but prevent it from being displayed. It could operate silently, or tell the offender that his/her comment appears to be spam. It could save the poster's IP address and blacklist it preventing the user from leaving more contents, or even from accessing the website. It could provide all of these options and/or others, allowing the administrator to choose the best action.

As for detecting spam, the method currently used is simple Bayesian logic. It does not try to be intelligent by looking for certain "tell tale phrases". Personally I think going that route is a losing battle, as it's only a matter of time until the "tell tale phrases" would be changed. However, I do intend to explore using Markovian logic, looking at multiple words together in addition to looking at each word individually.
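As a rough illustration of looking at multiple words together, the Python sketch below emits each word plus each pair of adjacent words as tokens. This is a much simplified stand-in for a real Markovian tokenizer (which may use longer windows and weighting), and the function name is hypothetical.

```python
def tokens_with_pairs(words):
    """Return the single words followed by every pair of adjacent words."""
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs
```

For example, the words "buy", "cheap", "pills" also yield the tokens "buy cheap" and "cheap pills", so a phrase can earn its own probability independent of its individual words.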

I've also received spam forum and story postings on my web site. Thus, I intend to expand the module to also scan newly submitted nodes. However, it may not be necessary/appropriate to scan all nodes, so I'm thinking to again make this configurable. For example, you could enable spam filtering on 'comments' and 'forum posts', but not on 'stories' or 'book pages'.

Large Drupal sites may not be focused on only one topic. This could make it more difficult to properly train the Bayesian filter, as what is appropriate in one of the site's forums may not be at all proper in another one of the site's forums. To handle this, I'm considering the possibility of allowing the administrator to break his/her site into multiple logical sections, allowing each their own Bayesian databases. However, I've not decided if the increased complexity is worth the gain.

The actual interface for marking comments as spam or not-spam also needs a lot of work. The current implementation makes comments feel far too cluttered. Also, the administrative spam overview page should probably be reworked to provide more functionality.

Finally, I'm open to ideas on how to improve the Bayesian logic itself. For example, the method for tokenizing content should probably be smarter, especially when dealing with html markup, IP addresses, quoted content, numerical ranges, and monetary units. The function for measuring the spam probability of content currently accesses the database once for every token and needs to be optimized, probably by reading larger chunks from the database at a time. If TEFT mode is used, a mechanism to prevent learning and relearning the same spam or non spam over and over will also need to be developed to prevent word prejudice. Additionally, the algorithm that determines the probability of a given word being spam could also be improved.
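One way to avoid a database round trip per token is to fetch all of a comment's tokens in a single query (or a few chunked queries). The Python sketch below uses an in-memory SQLite database purely for illustration; the spam_tokens table name is borrowed from the discussion further down, while its columns, the chunk size, and the function name are assumptions.

```python
import sqlite3


def lookup_probabilities(conn, tokens, default=0.4, chunk_size=500):
    """Fetch probabilities for all tokens in a few queries instead of one each."""
    unique = sorted(set(tokens))
    found = {}
    # Chunk the lookups so the number of bound parameters stays modest.
    for start in range(0, len(unique), chunk_size):
        chunk = unique[start:start + chunk_size]
        placeholders = ",".join("?" for _ in chunk)
        query = (
            "SELECT token, probability FROM spam_tokens "
            f"WHERE token IN ({placeholders})"
        )
        found.update(conn.execute(query, chunk).fetchall())
    # Tokens missing from the table fall back to the unknown-word default.
    return {token: found.get(token, default) for token in unique}
```

De-duplicating the tokens first also means a word repeated fifty times in a comment is looked up only once.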

Using the module
All this said, the module is currently being tested on my website, KernelTrap.org. In the 24 hours it's been installed, I've already fed it two true spam comments. For the first time I actually look forward to spam, wanting to see how the module will perform. Of course, patience is necessary, as it will probably have to see a couple hundred spam comments before it's able to actually recognize them on its own. I best be careful what I wish for.

Comments

joshuajabbour’s picture

Has anyone looked into how Wordpress[1] does their comment spam filtering? I'd heard plenty of raves from people who use it who have been thoroughly satisfied. I don't have the time to look into it right now, but will soon. Wordpress is GPL I believe, so we could implement their methods fairly* easily.

Comment spamming on Drupal will happen eventually; it's just a matter of time. So I heartily encourage a way to filter out spam. I will gladly help out in coming up with a solution...

[1] http://www.wordpress.org

Jeremy’s picture

It appears that they use a combination of (1) comment moderation, (2) link limiting, and (3) keyword blacklisting.

exo’s picture

I know nothing about code but having used WordPress....and comment filtering in particular, I am really impressed. It looks so simple yet working to my satisfaction. For example, under 'Comment Moderation' you can #1) Hold a comment in the queue if it contains X number or more links. (A common characteristic of comment spam is a large number of hyperlinks.) and then there's a blacklist. Here's how Wordpress describes it #2) "When a comment contains any of these words in its content, name, URL, e-mail, or IP, it will be held in the moderation queue. One word or IP per line. It will match inside words, so "press" will match "WordPress"."
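The two rules described above are simple enough to sketch. The Python below is an illustration of the behavior as quoted, not WordPress's actual code; the function name, the comment field names, and the way links are counted (occurrences of http:// or https://) are assumptions.

```python
import re

LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def needs_moderation(comment, max_links=2, blacklist=()):
    """Return True if a comment should be held in the moderation queue."""
    body = comment.get("body", "")
    # Rule 1: hold comments containing max_links or more links.
    if len(LINK_PATTERN.findall(body)) >= max_links:
        return True
    # Rule 2: substring match against content, name, URL, e-mail, and IP,
    # so "press" matches "WordPress".
    fields = ("body", "name", "url", "email", "ip")
    haystack = " ".join(str(comment.get(f, "")) for f in fields).lower()
    return any(word.lower() in haystack for word in blacklist if word)
```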

I have a few sites using Drupal 5.x... I WISH a comment filtering feature with such simplicity were available in Drupal.

Thanks.

Exo
Code Dummy!!
http://www.onechinapolicy.com

Michelle’s picture

Anyway, the spam module does that.

Michelle

--------------------------------------
See my Drupal articles and tutorials or come check out life in the Coulee Region.

exo’s picture

Ha ha thanks for the pointer Michelle. You can see I am late in the game. Well, I am very new to drupal..trying to catch up.

Thanks again.

Exo
Code Dummy!!

Michelle’s picture

No problem. Welcome to Drupal. And watch the dates on posts... We don't delete anything around here. ;)

Michelle

--------------------------------------
See my Drupal articles and tutorials or come check out life in the Coulee Region.

axel’s picture

Spam filtering is the most obvious use of Bayesian logic (and other such algorithms). But what about broader uses? For example, a user posts a topic to the forums and the filter decides which forum to place it in. It could also auto-assign appropriate taxonomy terms to user-posted nodes. The filter would then become a smart "automoderator" for the site. These are crude thoughts, I know...

--
Axel,
Russian Drupal Community

Jeremy’s picture

I agree that Bayesian logic could be used for many additional purposes. I had originally considered making spam.module more generic so it could be used for other purposes as you've described, however I ultimately decided to keep it focused on spam so it could do a better job.

That said, I very much like your idea of using Bayesian logic to auto-categorize forum posts. I would find that very useful.

Jeremy’s picture

Giving the current implementation more thought, I realized that using three separate token database tables was wrong. I greatly simplified and optimized the logic by combining them together into one table.

Bèr Kessels’s picture

Hi,

I think the problem of learning could be addressed by adding very simple SQL import and export of the table data. We could then collect the database in a central place, so that people can import it and use it on their own site. Of course this would only be the records marked as spam, for the "non spam" records are very site specific.

Furthermore, you might be able to use taxonomy and taxonomy descriptions to have the filters look for sections and categories on the site. This might solve the "differing content" issue. What you could do is, for example, give the taxonomy descriptions very high weights of non spam. You might also be able to use a taxonomy id number in the data, so that the filters will be able to choose more appropriate content to compare the new content against.

[Ber | Drupal Services webschuur.com]

Jeremy’s picture

> I think the problem of learning could be addressed by
> adding very simple SQL import and export of the table
> data.

Yes, this is probably true for most generic spam. Actually, I'm trying to collect such data now for the purpose of tuning the Bayesian logic. If you get any spam comments on your Drupal powered website, I'd like to receive a copy.

Mail your comment spam to commentspam@kerneltrap.org.

Be sure that the words 'comment spam' are in the subject. Make the first line of the email read 'subject: ' followed by the comment's actual subject. In the second line and beyond, paste in the actual spam comment.

I will first use this data for fine tuning the Bayesian logic. I will also make a spam_tokens dump available for anyone interested. I want raw data and not a database dump as I need to also test the tokenizer logic.

(I can set up a website to collect this data if people prefer.)

Gunnar Langemark@www.langemark.com’s picture

The "natural" development path would probably go in the direction of NOT maintaining a central database, but rather by distribution via XML through "close" partners in a network of sites. That way a list of "typical spam expressions" could dissipate through the net without depending on a centrally governed db.
I think this would also be in the same spirit as Drupal.
Let Cron run the update, and perhaps have a central db of sites that offer "callers" (if you choose to be a part of the network you allow 5-10-20 sites to call you, and you call a select few other sites which offer the service to you).
Naturally this scheme is vulnerable to spammers who somehow manage to access this db, and thus get access to enemy "intelligence". If we make this a closed circuit, it could work.

Just my 2c

Dropping in from Langemarks Cafe.

Jeremy’s picture

Yes, I've given such an idea some thought, but for now it's more complicated than I'm looking to implement. I'd like to get the Bayesian logic and resulting actions fully functional first.

rosen-1’s picture

Apple's Mail.app program uses LSA for spam filtering. More info on LSA is available at this site:

http://lsa.colorado.edu/

It'd be nice to use this technique, which is a "Bayesian" method, for comment spam filtering, as it has proven to be more accurate than SpamAssassin-style Bayesian methods. If you need help implementing this, let me know.

pgp: http://www.cs.uchicago.edu/~ido/pgp

Jeremy’s picture

Sounds interesting, I'll read more about it.

> it has proven to be more accurate than SpamAssassin-style
> Bayesian methods

Can you point me to this research? I've found write-ups explaining why it should be better than Bayesian filtering, and why it could be better than Bayesian filtering, but also that spam still gets through. So far I've not come across any scientific studies.

In the near future I plan to move to a Markovian tokenizer (looking at phrases instead of just words) which should help quite a bit. Beyond that, we'll see what's necessary.

rosen-1’s picture

The Markovian tokenizer is actually a good idea, but the tokenizer is just part of the spam filter. You still need a good reactive system.

As far as spam filtering goes, the reason spam would still get through with LSA is because it requires a stricter spam/ham set -- that is, if something is spam and you do not mark it so, but rather delete it, and the system assumes that anything not marked as spam is ham (good email), then you are creating a situation in which LSA will allow spam through.

I think none of these models are optimal because none of them work without cooperation and configuration from an informed party, so Joe Schmoe who wants to start his blog will have to hear a spiel about how he must always mark spam as spam and not delete it before doing so.

It's all in the training. :)

Jeremy’s picture

After a final round of updates, I've added all functionality currently intended and marked the spam module for release as a 4.4 module. To see what's new, review the changelog.

If you give the module a try and find a bug, please first check that nobody else has reported the same bug, then file a bug report. If you feel the module is missing important functionality, please first check that nobody else has requested the feature, then file a feature request.

Next up: a 4.5 release of this module.

Jeremy’s picture

The spam module has been ported to Drupal 4.5.

javanaut’s picture

I expect this module to be central to the operation of any sizeable public site in the not-too-distant future. Since my site accepts public posts via email, being able to detect spam is increasingly important to me. I will be installing this soon.

narres’s picture

For those who are not deeply technical:
http://www.seo-blog.org/407_mesothelioma_lung_cancer/categories/607_gene...
explains some spam handling too.

http://www.goodkeywords.com/ gives some (as the name says) good keywords to set up an initial lex.

I love the spam.module. Thanks a lot.

Thomas Narres
Keep the sunny side up

freyquency’s picture

i'd like to see it integrate somehow like FOAF and the drupal login too. It would be nice if there were some sort of way to combine the results from many sites and pull that information through the filter. I can understand wanting to get the logistics down and fine tune it before any work like that went underway. I'm excited to try it out.

Jeremy’s picture

I'm happy to report that on the 37th spam comment posted to KernelTrap the Bayesian logic caught its first spam comment in the wild. The comment had a rating of 93, and was indeed spam. Of course, I expect to have to train it with 200-300 before I see it consistently catching spam... Still, this is a milestone, proving that the Bayesian logic can work. :)

BTW: Since writing the above article, the spam module has gained two additional mechanisms for catching spam: 1) regex-based custom filters (word/phrase/pattern matching), and 2) filtering comments posted from known email spammer IP addresses. These latter two features are only available in the 4.5+ version of the module.

mike3k’s picture

It would be great if there was a way to automatically disable comments or make them read-only on an entry after a certain length of time with no activity. I've seen a MT plugin which does exactly that.

I find that almost all comment spam is posted on entries several months old or with very low node numbers rather than current entries. Just doing this would cut down on the amount of comment spam.

I'm using the spam module at both macmegasite & worldbeatplanet and it's still missing most spams which I have to flag manually.

Both sites were hit with several comment spams promoting online gambling. I just manually disabled comments for all older entries on both sites and blocked the IP address range the spam came from.

--
Mike Cohen, http://www.mcdevzone.com/

Jeremy’s picture

> I'm using the spam module at both macmegasite & worldbeatplanet and it's still missing most spams which I have to flag manually.

If you're using Drupal 4.5, get the latest 4.5 version of the spam module. I added some new features in the past couple of days that should help a lot.

javanaut’s picture

slashdot article here:
http://it.slashdot.org/article.pl?sid=05/01/19/0516246&from=rss

Google's proposal:
http://www.google.com/googleblog/2005/01/preventing-comment-spam.html

Maybe we could apply this to anonymous content?

grendel’s picture

Adding the rel="nofollow" attribute to the <a> tag *should* not be too difficult an undertaking; I assume it is a matter of making a filter.

I would like to see you have granular control of where this is applied. For instance, only to non-registered users, or certain groups of trusted people. Everyone else would get it tacked on to their tags. This, I'm sure, would be a bit more work.
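A filter along these lines might look like the following Python sketch (an illustration of the idea, not the Drupal filter patch itself). The regex approach is a simplification that a real HTML parser would handle more robustly, and the function name and the trusted flag are hypothetical.

```python
import re

A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(html, trusted=False):
    """Add rel="nofollow" to <a> tags unless the poster is trusted."""
    if trusted:
        return html

    def rewrite(match):
        attrs = match.group(1)
        # Leave tags alone if they already carry a rel attribute.
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)
        return f'<a{attrs} rel="nofollow">'

    return A_TAG.sub(rewrite, html)
```

The trusted flag stands in for the granular control mentioned above: the caller decides, per user or per role, whether the links are rewritten at all.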

I *really* need to learn how to start developing modules for drupal.
--
eric();

>:-@ | Photography

chx’s picture

filter patch is at http://drupal.org/node/15847

--
Drupal development: making the world better, one patch at a time. | A bedroom without a teddy is like a face without a smile.

grendel’s picture

nifty, just saw that (upgrading to 4.5.2 today, so im going through everything).

How do i implement the patch?
--
eric();

>:-@ | Photography

bradrice’s picture

After I install the module, I get errors when I try to enable the module. I am running Drupal 5.1.

Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to allocate 14592 bytes) in /Library/WebServer/WebSites/bradrice/bradrice.com/drupal/sites/all/modules/spam/spam.module on line 1577

Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to allocate 3565 bytes) in /Library/WebServer/WebSites/bradrice/bradrice.com/drupal/includes/database.mysql.inc on line 400

bradrice

Michelle’s picture

First off, please don't bump up 3 year old threads to post support requests on them.

Second, memory errors and suggestions for fixing them are covered in the troubleshooting guide in the drupal.org handbook.

Michelle

--------------------------------------
See my Drupal articles and tutorials or come check out life in the Coulee Region.

bradrice’s picture

Sorry and thanks. The author's site doesn't have a support area.

bradrice

Michelle’s picture

Module support requests belong in the issue queue. http://drupal.org/node/add/project_issue/spam/support

[Edit] After I hit submit, I remembered what the problem you were having is. That isn't an issue with the spam module; you need to up your memory limit. So don't submit it to the spam module issue queue. But, in general, that's where module support goes.

Michelle

--------------------------------------
See my Drupal articles and tutorials or come check out life in the Coulee Region.

bradrice’s picture

Thanks for the information. I'm still new to Drupal and the website has a lot of information, and I'm still not familiar with the layout of this site. I'll try to be more thoughtful as to where I put my posts.

bradrice

Michelle’s picture

It takes time to find your way around here. Sorry if my first post came off harsh. I'm often terse when juggling a baby and unable to type well.

Michelle

--------------------------------------
See my Drupal articles and tutorials or come check out life in the Coulee Region.