This is the documentation page for the Spam module. Docs are being collected in issue #455066 (Usage and API documentation needed) and are placed here when a section is completed.

Intro

The Spam module currently deals with spam comments, nodes, and users. Unlike most other spam modules for Drupal, the Spam module works as a standalone plugin, and doesn't rely on a third party for processing spam (though one of the optional filter modules does, and any other filter modules created through the Spam API may as well). You get to control what happens and how it happens.

Installation

Spam module installation is straightforward:

  1. Download the latest version of Spam from the project page.
  2. Extract it, and move the Spam folder to <drupal directory>/sites/all/modules/ or <drupal directory>/sites/default/modules/
  3. In your Drupal site, go to Administer » Site Building » Modules, and enable the Spam module and all the filter modules that you want to use.
  4. Tweak the settings in Administer » Site configuration » Spam, and you're done.

Using the Spam module

New content can be passed through the enabled Spam filters. Those filters assign a score between 0 and 99, where a higher score implies that the content is more spammy.

Content with a score equal to or greater than the Spam threshold value is marked as spam and, depending on your configuration, is unpublished, sent to the spam queue, or deleted.

Content (nodes, comments, users) marked as spam shows up on the Administer » Content management » Spam page. From here, you can choose to "Mark as not spam", which will set that item's score to zero (and publish it if it was unpublished). You can also choose to publish or unpublish the spam.

Feedback

The Feedback tab lets you see comments from your users along with their arguments for why their content isn't spam. When a user's comment is wrongly marked as spam (a false positive), they can alert you via the feedback form, and you can take appropriate action, e.g. adjust your filters. You can also mark comments that have feedback as "not spam" from that administrative screen, which saves you time.

Comments

For comment spam, there is also an extra tab called Spam on Administer » Content management » Comments. This tab allows you to delete comments and view them directly.

Scoring

Before content has been filtered, it starts with a score of 0. Once it passes through the spam filters, it is assigned a score between 1 and 99, where 1 means most likely not spam and 99 means most likely spam. The number is actually a probability, so 1 is a 1% chance of being spam, and 99 is a 99% chance of being spam.

For each filter the content passes through, a score is assigned, and multiplied by the filter's gain - so if the filter has a gain of 250, the content is given a score between 0 and 250. If a filter gives a score of 0, the filter is ignored. All other scores are then added together, and all the gains for the applied filters are added together, and the final score is expressed as Sum of filter scores / Sum of filter gains.

For those who can read LaTeX:

IsSpam_{probability} = \frac{\sum_{f=1}^{n} Percent_f \cdot Gain_f}{\sum_{f=1}^{n} Gain_f}

The final score is then checked against the Spam threshold, and the content is marked as spam (score greater than or equal to the threshold) or not-spam (score below the threshold).
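The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the module's code: the names combined_score and is_spam and the (percent, gain) pairs are assumptions made for this example.

```python
def combined_score(results):
    """Combine per-filter results into one spam probability, in percent.

    `results` is a list of (percent, gain) pairs, one per filter.
    Filters that returned a score of 0 are ignored, as described above.
    """
    applied = [(percent, gain) for percent, gain in results if percent > 0]
    total_gain = sum(gain for _, gain in applied)
    if total_gain == 0:
        return 0  # no filter expressed an opinion
    weighted = sum(percent * gain for percent, gain in applied)
    return weighted / total_gain


def is_spam(score, threshold):
    """Content is spam when its score reaches the threshold."""
    return score >= threshold


# Hypothetical results: one strong filter, one weak filter, one with no opinion.
example = [(99, 250), (20, 100), (0, 100)]
final = combined_score(example)
```

In this example the third filter is dropped, so the final score is (99 × 250 + 20 × 100) / (250 + 100), which is below a threshold of 80.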

Mark as spam/not-spam

At the bottom of every content item of a type that's being checked for spam, there is a link saying either Mark as spam or Mark as not-spam, depending on how the content is currently marked. The links let you quickly correct the spam filters' work by marking content as spam, or as not-spam ("ham") when it was a false positive. In particular, the Bayesian filter learns from the content items you correct manually, which reduces incorrect assessments later on.

Configuration

Configuration of the Spam module happens at Administer » Site configuration » Spam. Here you can set what content types are sent through the spam filters, and how they are handled.

Each filter may also offer some detailed settings.

Content to filter

By default, comments are filtered. If you have open publishing on your site, you may want to filter those content types. Keep in mind that spam processing adds slightly to the load on your server (in speed and database space), so there's no point filtering content that only trusted users can create.

You can also filter users, but this is usually more problematic than normal content, because users don't have a lot of text to match against. Using User Profiles may give the filters more to work with: for example, if you have an "about me" box for each user, the ones with links to Viagra are probably spam.

Actions

Here you decide how content marked as spam is handled, and what message to send to users if their content is marked as spam. It's definitely recommended to NOT silently prevent spam from being posted until you're sure that your filter setup is working well.

Advanced Configuration

Spam threshold: The Spam threshold decides what score is needed before content is marked as spam. With a higher threshold, less spam gets caught; with a lower threshold, you risk more false positives. A good rule of thumb is probably to leave it fairly high to start with (~80-85), and then gradually bring it down as the Bayesian filter starts learning what is and what isn't appropriate to your site.

Log level: Decides what amount of information is logged.

  • Disable to not log anything.
  • Important (default) provides information about things that aren't working (errors).
  • Verbose is useful for working out what gain levels to use on your filters (I recommend turning the debug level on and watching a couple of comments go through just to see how it works.)
  • Debug is mainly for developers; it generates far more log entries than you will normally want.

Discard spam logs older than: How long to keep logs about Spam for.

Filters

Filter Overview

On the filters overview page, you see the currently available filters, sorted by weight.

Filter Weight

Like elsewhere in Drupal, a lower weight (more negative) means that the filters float higher - which means that they get applied earlier.
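As a small illustration of that ordering, the sketch below sorts a hypothetical list of filters by weight. The filter names and weight values are made up for the example.

```python
# Hypothetical filters with Drupal-style weights (lower runs earlier).
filters = [
    {"name": "bayesian", "weight": 0},
    {"name": "custom", "weight": -10},
    {"name": "url", "weight": 5},
]

# Sort ascending by weight: the most negative weight floats to the top.
ordered = sorted(filters, key=lambda f: f["weight"])
order = [f["name"] for f in ordered]
```

Here the custom filter, with the lowest weight, is applied first and the URL filter last.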

Filter Gain

The Gain variable for a filter decides how much of an impact the filter will have on the final score. A filter with a gain of 0 has no effect (i.e. setting the gain to 0 has nearly the same effect as disabling that filter in the Modules administration screen).

Filters with a higher gain than other filters have more impact on the final score. This can be useful if you find that one filter is working better than others on your site.

For example, when you start training the Bayesian filter, you should set its gain very low as the filter will make a lot of mistakes. As the Bayesian filter becomes more trained and thus more accurate, you can increase the gain on the filter to give its decisions more impact on the overall spam score.

Bayesian filter

The Bayesian filter does statistical analysis on content, learning from the spam and non-spam that it sees to determine the likelihood that new content is or is not spam. The filter starts out knowing nothing, and has to be trained every time it makes a mistake. This is done by marking spam content on your site as spam when you see it. Each word of the spam content will be remembered and assigned a spam probability. The more often a word shows up in spam content, the higher the probability that future content with the same word is also spam. As most comment spam contains links back to the spammer's websites (e.g. to sell Prozac), the Bayesian filter provides a special option to quickly learn and block content that contains links to known spammer websites.
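The idea of learning per-word spam probabilities can be illustrated with a toy classifier. This is a deliberately simplified sketch, not the module's algorithm: the class name TinyBayes is hypothetical, and it combines word probabilities with a plain average rather than a true Bayesian combination.

```python
import re
from collections import Counter


class TinyBayes:
    """Toy word-frequency classifier; not the Spam module's implementation."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    @staticmethod
    def tokens(text):
        # Lowercased words; each distinct word counts once per content item.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        # Marking content as spam or not-spam updates the word counts.
        target = self.spam_words if is_spam else self.ham_words
        target.update(self.tokens(text))

    def word_probability(self, word):
        """Share of sightings of `word` that were in spam (None if unseen)."""
        spam = self.spam_words[word]
        ham = self.ham_words[word]
        if spam + ham == 0:
            return None
        return spam / (spam + ham)

    def score(self, text):
        """Average the known word probabilities and clamp to 1..99."""
        probs = [self.word_probability(w) for w in self.tokens(text)]
        probs = [p for p in probs if p is not None]
        if not probs:
            return 0  # nothing learned yet: no opinion
        percent = round(100 * sum(probs) / len(probs))
        return min(99, max(1, percent))


bayes = TinyBayes()
bayes.train("cheap prozac online", is_spam=True)
bayes.train("nice article thanks", is_spam=False)
```

After these two training steps, content made up of words seen only in spam scores 99, content made up of words seen only in legitimate posts scores 1, and content with no known words gets a score of 0, so the filter would be ignored for it.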

Custom filter

The Custom filter is used to blacklist, whitelist or greylist based on the matching of words, phrases and regular expressions.

For example, a Custom filter can be defined to always mark content as spam if it contains the word 'Viagra' or '[url='.

Similarly, the Custom filter can be defined to increase the probability that content is spam if it matches the case-insensitive regular expression /free/i.
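The two examples above might look like this when expressed as pattern rules in Python. The rule list, the action labels and the match_rules function are assumptions for illustration; in the module itself, such rules are set up through the Custom filter's settings.

```python
import re

# Hypothetical rules: (compiled pattern, action to take on a match).
RULES = [
    (re.compile(re.escape("Viagra")), "spam"),
    (re.compile(re.escape("[url=")), "spam"),
    # Equivalent of /free/i: matches "free" in any letter case.
    (re.compile(r"free", re.IGNORECASE), "probably spam"),
]


def match_rules(text):
    """Return the actions of all rules whose pattern occurs in `text`."""
    return [action for pattern, action in RULES if pattern.search(text)]
```

Note that re.escape is needed for '[url=' because an unescaped '[' would start a character class in a regular expression.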

URL filter

The URL filter is used to limit the total number of URLs allowed in comments and other content, as well as the number of times the same URL can be repeated in the same content. These limits can be different for comments and for other types of content. For example, if the module is set to allow the exact same URL to appear in a comment at most twice, and http://kerneltrap.org/ shows up in the same comment three or more times, the comment will be considered spam.
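A rough sketch of both limits in Python follows. The URL pattern is intentionally simple, and the function and parameter names (url_limits_exceeded, max_total, max_repeat) are assumptions for this example, not the module's settings.

```python
import re
from collections import Counter

# Simplified URL pattern: http(s):// followed by non-space characters.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_limits_exceeded(text, max_total, max_repeat):
    """True if `text` has too many URLs or repeats one URL too often."""
    urls = URL_RE.findall(text)
    counts = Counter(urls)
    too_many = len(urls) > max_total
    too_repeated = any(n > max_repeat for n in counts.values())
    return too_many or too_repeated


comment = (
    "See http://kerneltrap.org/ and http://kerneltrap.org/ "
    "and again http://kerneltrap.org/"
)
```

With max_repeat set to 2, this comment exceeds the limit because the same URL appears three times.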

SURBL filter

The SURBL filter is currently the only filter that uses a third-party service. It checks the body of content items for URLs that are commonly found in spam.

See http://en.wikipedia.org/wiki/SURBL or http://www.surbl.org for detailed information.

Node Age filter

The Node Age filter allows you to specify the age of nodes, in weeks (from 1 to 24), that you consider old content and really old content. The filter then assigns a probability (60% to 99%) that comments on such nodes are spam.
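The behaviour can be pictured with the following sketch. It assumes a separate probability for each of the two age limits; the function name and all parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def node_age_score(node_created, now, old_weeks, old_percent,
                   really_old_weeks, really_old_percent):
    """Spam probability (percent) for a comment, based on its node's age."""
    age = now - node_created
    if age >= timedelta(weeks=really_old_weeks):
        return really_old_percent
    if age >= timedelta(weeks=old_weeks):
        return old_percent
    return 0  # node is recent: this filter has no opinion


now = datetime(2024, 1, 1)
# Hypothetical settings: old after 4 weeks (60%), really old after 12 weeks (99%).
settings = dict(old_weeks=4, old_percent=60,
                really_old_weeks=12, really_old_percent=99)
score_10_weeks = node_age_score(now - timedelta(weeks=10), now, **settings)
```

With these settings, a comment on a ten-week-old node gets a score of 60, and a comment on a node older than twelve weeks gets 99.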

Duplicate filter

The Duplicate filter allows you to decide how many times the same content can be posted to the site - spammers often simply cut and paste content, so this can be a good way to catch them. If you select 2 for the threshold, then every duplicate after the first will be marked as spam.
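A minimal sketch of duplicate counting is shown below. It assumes that a posting is flagged once the same text has been seen more than threshold times; the module may count differently. The DuplicateTracker class and the whitespace and case normalisation are also assumptions of this example.

```python
import hashlib
from collections import Counter


class DuplicateTracker:
    """Counts identical postings and flags those beyond the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.seen = Counter()

    def check(self, text):
        # Normalise whitespace and case so trivial edits still count.
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        self.seen[digest] += 1
        # Assumption: flag once the count exceeds the threshold.
        return self.seen[digest] > self.threshold


tracker = DuplicateTracker(threshold=2)
results = [tracker.check(t) for t in ["Buy now", "buy  NOW", "Buy now", "Hello"]]
```

Storing a hash rather than the full text keeps the lookup table small, at the cost of only catching exact (normalised) copies.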

Comments

I've found that it's hard to find a good line between legitimate content and spam. This is always going to be a problem, but you can make it easier on yourself by adding a custom white-list.

This only really works if your site is based around a single topic. For example, if your site is about dogs, you might add a custom filter, using a regex that looks like:
(doberman|chihuahua|poodle|\bdog\b|\bpets\b)
The "\b" is important - it matches the start or end of a word. If you don't have it, your regex will match things like "carpets" (not that I've seen spam for carpets - yet).
Set the filter to "mark as not spam", or "mark as probably not spam".
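To see the effect of "\b", the regex above can be tried out in Python. The is_on_topic helper and the case-insensitive flag are additions for this example only.

```python
import re

# The white-list regex from above; IGNORECASE is an assumption of this sketch.
WHITELIST = re.compile(
    r"(doberman|chihuahua|poodle|\bdog\b|\bpets\b)", re.IGNORECASE
)


def is_on_topic(text):
    """True if the text contains any white-listed word."""
    return WHITELIST.search(text) is not None
```

With this pattern, "My dog loves walks" matches, while "Cheap carpets for sale" does not, because "pets" inside "carpets" has no word boundary in front of it. Words without "\b", such as "poodle", also match inside longer words like "poodles".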

This will give you some breathing space: you can make your other filters harsher (preventing more spam) while reducing false positives, and quickly teach the Bayesian filter new words (other words that aren't white-listed, but are common in legitimate content, will get better scores in future checks).