Spam module

Last modified: August 8, 2009 - 00:29

This is the documentation page for the Spam module. Documentation is being collected in issue #455066 (Documentation needed) and is placed here as each section is completed.

Intro

The Spam module currently deals with spam comments, nodes, and users. Unlike most other spam modules for Drupal, the Spam module works as a standalone plugin and doesn't rely on a third party for processing spam (though one of the optional filter modules does, and any other filter modules created through the Spam API may as well). You get to control what happens and how it happens.

Installation

Spam module installation is straightforward:

  1. Download the latest version of Spam from the project page.
  2. Extract it, and move the Spam folder to <drupal directory>/sites/all/modules/ or <drupal directory>/sites/default/modules/.
  3. In your Drupal site, go to Administer > Site building > Modules, and enable the Spam module and all the filter modules that you want to use.
  4. Check the settings in Administer > Site configuration > Spam, and you're done.

Using the Spam module

All new content is passed through all of the enabled Spam filters. It is given a score between 0 and 99, where a higher score implies that the message is more spammy. If the score is above the threshold, the content will be marked as spam and, depending on your configuration, will be unpublished, sent to the spam queue, or deleted.

If content (nodes, comments, users) is marked as spam, it will show up on the Administer > Content management > Spam page. From here, you can choose to "Mark as not spam", which will set that item's score to zero (and publish it if it was unpublished). You can also choose to publish or unpublish the spam.

Feedback

The Feedback tab allows you to see any comments that users have left. For example, if a user's comment is wrongly marked as spam (false positive), they can alert you to this, and you can take appropriate action, e.g. adjust your filters. You can also set the comment that has feedback as "not spam" from here, which can save you some time checking comments in the main spam list for false positives.

Comments

For comment spam, there is also an extra tab on Administer > Content management > Comments, called Spam. This tab allows you to delete comments and view them directly.

Scoring

Before content has been filtered, it starts with a score of 0. Once it passes through the spam filters, it is assigned a score between 1 and 99: 1 means most likely not spam, and 99 means most likely spam. The number is actually a probability, so 1 is a 1% chance of being spam, and 99 is a 99% chance of being spam.

For each filter the content passes through, a score is assigned and multiplied by the filter's gain, so if the filter has a gain of 250, the content is given a score between 0 and 250. If a filter gives a score of 0, the filter is ignored. All other scores are then added together, the gains of the applied filters are added together, and the final score is expressed as (sum of filter scores) / (sum of filter gains).

The final score is then checked against the threshold, and the content is marked as spam or not-spam.
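
As a rough illustration of the arithmetic described in this section, here is a minimal Python sketch. It is not the module's code (the module itself is written in PHP); the function names, the example gains and the threshold are assumptions made for the example. It reads the description as a gain-weighted average: each filter's raw score is weighted by its gain, filters that return 0 are skipped, and the weighted sum is divided by the sum of the gains of the filters that were applied. The exact scaling of the intermediate values inside the module may differ.

```python
def combined_score(results):
    """Combine per-filter results into one final score.

    `results` is a list of (raw_score, gain) pairs: raw_score is the filter's
    own 0-99 score and gain is the weight configured for that filter.
    Both names are hypothetical and not taken from the module.
    """
    # A filter that returns 0 (or has a gain of 0) is ignored entirely.
    applied = [(score, gain) for score, gain in results if score > 0 and gain > 0]
    total_gain = sum(gain for _, gain in applied)
    if total_gain == 0:
        return 0  # no filter had an opinion, so the starting score of 0 stands
    weighted_sum = sum(score * gain for score, gain in applied)
    return round(weighted_sum / total_gain)


def is_spam(score, threshold):
    # Assumption: "above the threshold" is read as strictly greater than.
    return score > threshold


# Three filters: one confident with a high gain, one mild, one that returned 0.
results = [(90, 250), (10, 100), (0, 100)]
final = combined_score(results)
print(final, is_spam(final, 86))
```

In this example the filter with a gain of 250 pulls the result towards its own score of 90, but the second filter's low score keeps the final value (67) under a threshold of 86, so the content would not be marked as spam.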

Mark as spam/not-spam

At the bottom of every content item of a type that's being checked for spam, you should see a link that says either "mark as spam" or "mark as not-spam", depending on how the content is currently marked. This allows you to quickly spot legitimate content (not-spam, or "ham") that was wrongly marked as spam (a false positive), or spam that was missed, and correct it. The Bayesian filter learns when you change the score of a content item, which will reduce incorrect assessments later on.

Configuration

Configuration of the Spam module happens at Administer > Site configuration > Spam. Here you can set which content types are sent through the spam filters, and how they are handled.

Content to filter

By default, comments are filtered. If you have any open publishing on your site, you may want to filter other content types. Keep in mind that spam processing will add slightly to the load on your server, so there's no point filtering content that only trusted users can create anyway.

You can also filter users, but this is usually more problematic than filtering normal content, because users don't have a lot of text to match against. Using User Profiles may give the filters more to work with, i.e. if you have an "about me" box for each user, the ones with links to viagra are probably spam.

Actions

Here you can decide how content marked as spam is handled, and what message to send to users if their content is marked as spam. It's definitely recommended to NOT silently prevent spam from being posted until you're sure that your filter setup is working well.

Advanced Configuration

Threshold: The threshold decides what score is needed before a comment is marked as spam. With a higher threshold, less spam will get caught; with a lower threshold, you risk more false positives. A good rule of thumb is probably to leave it fairly high to start with (~80-85), and then gradually bring it down as the Bayesian filter starts learning what is and what isn't appropriate to your site.

Log level: Decides how much information is logged. "Important" is the default, and it only provides information about things that aren't working (errors). "Debug" is useful for working out what gain levels to use on your filters (I recommend turning the debug level on and watching a couple of comments go through just to see how it works).

Discard spam logs older than: How long to keep spam logs for.

Filters

Filter Overview

On the filters overview page, you can see the currently available filters, sorted by weight. Like elsewhere in Drupal, a lower weight (more negative) means that the filters float higher, which means that they get applied earlier.

Gain

The Gain variable for a filter decides how much of an impact the filter will have on the final score. If the filter has 0 gain, it won't be applied at all. If it has a higher gain than other filters, it will have more impact on the final score. This can be useful if you find that one filter is working better than others on your site.

For example, when you start training the Bayesian filter, you should set the gain very low as the filter will make a lot of mistakes. As the Bayesian filter becomes more trained and thus more accurate, you can then increase the gain on the filter to make its decisions have more impact on the overall spam score.

The Bayesian filter

The Bayesian filter does statistical analysis on content, learning from the spam and non-spam that it sees to determine the likelihood that new content is or is not spam. The filter starts out knowing nothing, and has to be trained every time it makes a mistake. This is done by marking spam content on your site as spam when you see it. Each word of the spam content will be remembered and assigned a probability. The more often a word shows up in spam content, the higher the probability that future content with the same word is also spam. As most comment spam contains links back to the spammer's websites (e.g. to sell Prozac), the Bayesian filter provides a special option to quickly learn and block content that contains links to known spammer websites.
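
To make the idea of per-word probabilities more concrete, here is a small, self-contained Python sketch of a textbook naive Bayesian word filter. It is not the module's implementation: the class name `TinyBayes`, the tokenizer, the clamping to 0.01-0.99 and the neutral value for unseen words are all assumptions chosen for illustration.

```python
import re
from collections import Counter


class TinyBayes:
    """Toy word-based Bayesian classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.spam_words = Counter()  # word -> number of spam items containing it
        self.ham_words = Counter()   # word -> number of non-spam items containing it

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per content item.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(self.tokens(text))

    def word_probability(self, word):
        spam, ham = self.spam_words[word], self.ham_words[word]
        if spam + ham == 0:
            return 0.5  # never seen: no evidence either way
        # Clamp so a single word can never make the result absolutely certain.
        return min(0.99, max(0.01, spam / (spam + ham)))

    def score(self, text):
        # Combine the word probabilities, assuming the words are independent.
        spam_product, ham_product = 1.0, 1.0
        for word in self.tokens(text):
            p = self.word_probability(word)
            spam_product *= p
            ham_product *= 1.0 - p
        probability = spam_product / (spam_product + ham_product)
        return max(1, min(99, round(probability * 100)))


bayes = TinyBayes()
bayes.train("buy cheap prozac now", is_spam=True)
bayes.train("my dog loves long walks", is_spam=False)
print(bayes.score("cheap prozac"), bayes.score("dog walks"), bayes.score("hello there"))
```

With only two training items the sketch is very sure of itself about words it has seen and completely neutral about everything else, which is one way to see why a freshly installed filter needs training before its verdicts can be trusted.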

The Custom Filter

The custom filtering functionality can blacklist, whitelist or greylist based on the matching of words, phrases and regular expressions. For example, a custom filter can be defined to always mark content as spam if it contains the word 'Viagra'. Or, a custom filter can be defined to increase the probability that content is spam if it matches the case insensitive regular expression /free/i.
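
The two example rules above can be tried out with a short Python sketch. The rule list and the action labels ("spam", "probably spam") are hypothetical stand-ins for whatever the module's settings form offers; only the two patterns come from the text. Python's `re.IGNORECASE` flag plays the role of the `i` modifier in /free/i.

```python
import re

# (pattern, action) pairs; the action labels are illustrative only.
RULES = [
    (re.compile(r"Viagra"), "spam"),                        # plain word match
    (re.compile(r"free", re.IGNORECASE), "probably spam"),  # same as /free/i
]


def matching_actions(text, rules=RULES):
    """Return the action of every rule whose pattern occurs in the text."""
    return [action for pattern, action in rules if pattern.search(text)]


print(matching_actions("Viagra for free"))
print(matching_actions("Get FREE pills"))
print(matching_actions("Freedom of speech"))
```

Note that the last call also matches: an unanchored pattern such as /free/i matches inside longer words like "Freedom", which is worth keeping in mind when deciding how strongly a rule should count.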

The URL filter

The spam module can also limit the total number of URLs allowed in comments and other content, as well as the number of times the same URL can be repeated in the same content. These limits can be different for comments and for other types of content. For example, if the module is set to only allow the same exact URL to appear in a comment twice, if "http://kerneltrap.org/" shows up in the same comment three or more times, the comment will be considered spam.
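
The two limits described above can be sketched in a few lines of Python. This is only an illustration of the counting logic: the function name, the parameter names and the simple URL pattern are assumptions, and the module's own URL detection is likely to be more careful.

```python
import re
from collections import Counter

# Deliberately simple: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def exceeds_url_limits(text, max_total, max_repeat):
    """Return True if the text has too many URLs or repeats one URL too often."""
    urls = URL_PATTERN.findall(text)
    counts = Counter(urls)
    too_many = len(urls) > max_total
    too_repeated = max(counts.values(), default=0) > max_repeat
    return too_many or too_repeated


comment = "See http://kerneltrap.org/ and http://kerneltrap.org/ and http://kerneltrap.org/ again"
print(exceeds_url_limits(comment, max_total=10, max_repeat=2))
```

With the same URL allowed at most twice, the third occurrence of "http://kerneltrap.org/" pushes the comment over the limit, matching the example in the text.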

The SURBL filter

The SURBL filter is currently the only filter that uses a third-party service. SURBL filters check the body of content items for URLs that are commonly found in spam.
See http://en.wikipedia.org/wiki/SURBL or http://www.surbl.org for more info.

The Node Age filter


The Duplicate Filter

The Duplicate filter allows you to decide how many times the same content can be posted to the site - spammers often simply cut and paste content, so this can be a good way to catch them. If you select 2 for the threshold, then every duplicate after the first will be marked as spam.

Starting off: Custom whitelist

naught101 - August 8, 2009 - 06:01

I've found that it's hard to find a good line between legitimate content and spam. This is always going to be a problem, but you can make it easier on yourself by adding a custom white-list.

This only really works if your site is based around a single topic. For example, if your site is about dogs, you might add a custom filter, using a regex that looks like:
(doberman|chihuahua|poodle|\bdog\b|\bpets\b)
The "\b" is important - it matches the start or end of a word. If you don't have it, your regex will match things like "carpets" (not that I've seen spam for carpets - yet).
Set the filter to "mark as not spam", or "mark as probably not spam".
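
If you want to check what a white-list regex like the one above will and won't match before adding it, a quick test in Python is one option. The pattern is the one from this comment; the case-insensitive flag and the sample sentences are assumptions added for the test.

```python
import re

# The white-list pattern from above; IGNORECASE is an added assumption.
WHITELIST = re.compile(r"(doberman|chihuahua|poodle|\bdog\b|\bpets\b)", re.IGNORECASE)

samples = [
    "My poodle hates baths",     # matches "poodle"
    "Is a dog allowed inside?",  # matches the whole word "dog"
    "Cheap carpets for sale",    # no match: "pets" is inside "carpets"
    "I love dogs",               # no match: \bdog\b needs a word boundary after "dog"
]

for text in samples:
    print(bool(WHITELIST.search(text)), text)
```

The last sample shows the flip side of "\b": with a boundary on both sides, "dog" no longer matches the plural "dogs", so you may want to list plurals explicitly or write the alternative as \bdogs?\b.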

This will give you some breathing space: you can make your other filters harsher (preventing more spam) while reducing false positives, and quickly teach the Bayesian filter new words (other words that aren't white-listed but are common in legitimate content will get better scores in future checks).

 
 

Drupal is a registered trademark of Dries Buytaert.