Documentation sorely needed, especially for advanced functions.

Gain is not explained at all. What does it do, and how does it work?
After a quick look at the logs, it looks like gain is simply a multiplier: each filter's score is multiplied by its gain and added to the total score, and the gain is added to the total gain factor (but only if the score is > 0). Then the total score is divided by the total gain factor.

Is this right? If others can correct this, or add more notes, I am happy to start a docs page.
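
A minimal sketch of the calculation as described in this post, written in Python for illustration rather than taken from the module's PHP source. The function name `combine_scores` and the `(score, gain)` pair layout are assumptions, as is the choice to return 0 when no filter produces a score.

```python
def combine_scores(results):
    """Gain-weighted average of filter scores, as described in this post.

    `results` is a list of (score, gain) pairs, one per filter. The name
    and the data layout are hypothetical, not the module's API.
    """
    total_score = 0.0
    total_gain = 0.0
    for score, gain in results:
        if score > 0:  # a filter that returns 0 is left out entirely
            total_score += score * gain
            total_gain += gain
    if total_gain == 0:
        return 0.0  # assumption: nothing scored, so the content keeps a score of 0
    return total_score / total_gain


# Two filters with equal gain, plus one that returned 0 and is ignored.
print(combine_scores([(80, 100), (20, 100), (0, 100)]))  # 50.0
```

Note that the third filter does not pull the average down: its gain is never added to the divisor.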

Comments

tamhas’s picture

Another point needing documentation is what categories mean in the reports. E.g., our current statistics show:
prevented comment spam | 195 | 17 hours 12 min ago
marked comment as spam | 335 | 9 hours 54 min ago
manually marked comment as spam | 70 | 3 days 18 hours ago
marked comment as not spam | 1 | 17 hours 55 min ago
manually marked comment as not spam | 1 | 17 hours 55 min ago

Now, I get the last one, since we just got the not spam link working. Is the one above it a summary which might include other types besides manual and, if so, what else? Likewise for manually marked as spam versus marked? I would be surprised if we had actually manually marked 70, but I suppose it is possible, especially since the number has changed little in the last few weeks when the module has been working pretty well. But then what is prevented? How is it prevented? Is this both blocked IPs and known spam sites? If the latter, it might make sense, I guess. Right now we only have one blocked IP, although we have had 3 at one point. How did they get unblocked?

gnassar’s picture

Better documentation is absolutely needed. I spoke some to Jeremy about this, and neither of us probably has time to sit down and write out complete docs all at once, but what we can do is use this issue as a scratchpad for user-submitted documentation to collect and put in the handbook. Please post your experiences once you figure out something you were having particular trouble with, and we can start building up a crowdsourced set of docs. Jeremy and I will also be contributing some chunks of documentation as we have time. Thank you!

gnassar’s picture

Handbook page is now at http://drupal.org/node/498092

naught101’s picture

Do you just want people to post here, or is it OK for me to make a start at the Doc page? I'm happy to write an introduction to all the features.

gnassar’s picture

Hey, it's a free Internet. Do what you like. :-)

Seriously, if you want to go straight to the Doc page with a feature intro, that's fine. It's all freely editable anyway, so we can edit there and discuss here if we need to. In my mind, putting docs in this thread was more for folks who would document a small part of something, but who wouldn't have time to handle an entire page. Perhaps editing there and discussing here is better.

tamhas’s picture

If one adds new content to the doc page, it would be nice to get at least a comment here so that one's attention would be brought to it.

naught101’s picture

ok, first go at it is up. I plan to come back and add some stuff about filters and gain.

I'm sort of thinking that it might be a good idea to move most of the detailed filter stuff from the project page into the Docs page, so maybe I'll just use that as a basis.

gnassar’s picture

Probably so. That makes sense.

Thank you!

naught101’s picture

OK, I put some stuff up about scoring and gain, and copied the filter stuff across. I didn't edit it yet, and there are a couple of filters missing. Would be good if people could check over the scoring and gain section. I'm happy to add some more on the other filters.

Some questions:

Weights on filters obviously decide which filters get applied first, but does this have any real effect? Don't all the filters get applied anyway?

I said that the filters are only applied if the score is not zero - that's how it appears in the logs. So a score of 0 for one filter doesn't lower the final score (but a score of 1 does). Is that right? Seems odd.

Regular expressions for the custom filter are PCREs, right?

The DSBL list has been replaced by SURBL?

jeremy’s picture

Thanks for all your effort on stabilizing this module on Drupal 6, and on documenting it! This is wonderful to see!

> Weights on filters obviously decide which filters get applied first, but does this have
> any real effect? Don't all the filters get applied anyway?

I believe currently all filters do get applied anyway, correct. However, the intention is to add an option so that any filter could mark given content as spam if over a certain threshold, bypassing the rest of the filters. The idea was to make this an optional per-filter setting. (For example, if the link filter finds 100+ links, and after it runs it finds a spam score >90, it may want to "short circuit" the filtering system and just mark the content as spam rather than passing it along to be tokenized and run through the Bayesian filter, etc.)
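
The short-circuit idea could look roughly like the Python sketch below. This is an illustration of a planned option, not existing module behaviour; `run_filters`, the dictionary keys, and the two stub filters are all hypothetical.

```python
def run_filters(filters, content):
    """Run filters in weight order, stopping early if one short-circuits.

    Each filter is a dict with hypothetical keys: "name", "weight",
    "check" (a function returning a score from 0 to 99), and an optional
    "short_circuit" threshold. Returns (results, stopped_early).
    """
    results = []
    for f in sorted(filters, key=lambda f: f["weight"]):
        score = f["check"](content)
        results.append((f["name"], score))
        threshold = f.get("short_circuit")
        if threshold is not None and score > threshold:
            return results, True  # skip every remaining (heavier) filter
    return results, False


def link_filter(content):
    # Stub: treat 100 or more links as near-certain spam.
    return 99 if content.count("http://") >= 100 else 0


def bayesian_stub(content):
    # Stub standing in for the expensive tokenize-and-score step.
    return 50


filters = [
    {"name": "bayesian", "weight": 10, "check": bayesian_stub},
    {"name": "links", "weight": 0, "check": link_filter, "short_circuit": 90},
]

print(run_filters(filters, "http://x " * 120))  # ([('links', 99)], True)
```

With an option like this, the weight would matter: cheap filters placed first could spare the cost of the later ones.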

> I said that the filters are only applied if the score is not zero - that's how it appears in the logs. So a score of
> 0 for one filter doesn't lower the final score (but a score of 1 does). Is that right? Seems odd.

That is correct and by design. There are some filters that can really only determine "this content is spam", but they can't determine "this content is not spam". In the former case, they return "99". In the latter case they return "0" to avoid affecting the spam score.

> Regular expressions for the custom filter are PCREs, right?

That is correct.

> The DSBL list has been replaced by SURBL?

I don't know the answer to that. I don't see an answer here, either, but perhaps what you're looking for is there:
http://en.wikipedia.org/wiki/Distributed_Sender_Blackhole_List
http://en.wikipedia.org/wiki/SURBL

jeremy’s picture

Quick feedback on the documentation so far:

> "The Spam module deals with spam comments, nodes, and users."

The goal is to make the API generic enough to block other content, too. One goal has been to allow any form that is submitted to be run through the spam filter. Perhaps just add the word "currently" before "deals"?

> All content starts with a score of 0 (definitely not spam). If it passes through
> the spam filters, it gets re-assigned a score between 0 and 99, where 99
> means "definitely spam".

"Before content has been filtered it starts with a score of 0. Once it passes through the spam filters, it gets assigned a score between 1 and 99, where 1 means most likely not spam and 99 means most likely spam. The number is actually a probability, so 1 is a 1% chance of being spam, and 99 is a 99% chance of being spam."

> The Gain variable for a filter decides how much of an impact the filter will have on
> the final score. If the filter has 0 gain, it won't be applied at all. If it has a higher
> gain than other filters, it will have more impact on the final score. This can be
> useful if you find that one filter is working better than others on your site.

For example, when you start training the Bayesian filter, you should set the gain very low as the filter will make a lot of mistakes. As the Bayesian filter becomes more trained and thus more accurate, you can then increase the gain on the filter to make its decisions have more impact on the overall spam score.

> The Node Age filter

Older content is less likely to receive comments than new content. The "Node Age Filter" allows you to increase the likelihood that comments are spam when posted on older site content.

> DSBL list

Where are you getting this from? Are you meaning the SURBL module?

naught101’s picture

>Where are you getting this from? Are you meaning the SURBL module?
http://drupal.org/project/spam <- second last paragraph. That's why I was confused.

naught101’s picture

Added your bits Jeremy, and added some basic duplicate and SURBL info.

jeremy’s picture

Ah, looks like the description of the spam module needs to be updated! I must have written that a very long time ago, as I'd forgotten the DSBL had ever been supported!

gnassar’s picture

Good call! Made that edit to the description page.

naught101’s picture

I removed the DSBL section from the handbook. Saving it below for reference.

Personally, I think the module front page is WAY too full. Most of the filter stuff could be removed and replaced with a link to the handbook pages (it's mostly the same stuff, although there are more filters listed on the handbook page). This would also mean that the docs wouldn't have to be updated in two places at once for every new feature.

It could also be good to reduce the features list a bit, and add some information on installation/setup.

I also think the fact that this module doesn't rely on third parties (like Mollom and Akismet) is a major selling point, and should be on the front page.

I was also considering splitting the current handbook up a bit - like moving the Filters section to its own page. Is that a good idea?

----------------

DSBL list

The fourth tool for detecting spam is to look up the poster's IP address in the Distributed Server Boycott List (http://dsbl.org/). If the address is listed, it is known to come from an untrusted email server such as an open relay and is marked as spam. The theory is that most comment-spammers are also email spammers.
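
For reference, DNS-based blocklists are conventionally queried by reversing the IPv4 octets and appending the list's zone. The Python sketch below only builds that query name and makes no network call; the zone `dnsbl.example` is a placeholder, and nothing in this thread shows how the module itself performed the lookup.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a blocklist lookup would resolve for an IPv4 address.

    `zone` is the blocklist's DNS zone; "dnsbl.example" below is a
    placeholder, not a real service.
    """
    # IPv4Address() validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(dnsbl_query_name("192.0.2.10", "dnsbl.example"))  # 10.2.0.192.dnsbl.example
```

By convention, a listed address resolves to an A record in 127.0.0.0/8, and an unlisted one returns no record.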

jeremy’s picture

Yes, I agree that there's a lot that can be done! Feel free to shrink/clean up the text on the project page, making it more clear and precise. Also, feel free to break the handbook documentation into multiple pages if there's enough text to warrant it.

naught101’s picture

Jeremy: I can't edit the front page. If you want to give me access, I'd be happy to.

jeremy’s picture

What do I need to do to give you access?

naught101’s picture

Can't find much in the docs, but I think it comes with CVS access. I have a CVS account, and am happy to help co-maintain as well, but I'll leave that up to you.

gnassar’s picture

I cleaned up some of the front page, per your recommendations. I do agree -- that page was pretty busy. And no required third-party dependence is probably a big selling point.

I condensed the feature list a bit too. But adding an "installation" section works against your fundamental premise, I think. That is definitely something that belongs in a handbook page. The module front page is for descriptions, not documentation.

I also left the filter stuff in. It probably needs paring down and compacting, but again, the module front page is for descriptions; it seems appropriate to have (brief!) descriptions of what the module can do there. (And it bothered me less to leave them in since the filter descriptions are effectively "below the fold," in journalism terms, and won't really detract much from the meat of the front page.) The handbook page(s) can go into more detail on each of the filters.

On a side note, it would seem to me that documentation issues (like this one) are appropriate places to suggest any specific text changes by posting the suggested alternate text if you like. After all, that's effectively a "patch" for the docs.

naught101’s picture

I still think it needs a massive prune. The best thing would be to have a list of the main 3-4 filters, plus a "more filters" entry, each of which links to the handbook page (I already put some anchors in the docs page; I can go back and put more in).

Might be worth putting in a screen shot too. The more looked-after the front page looks, the more likely people are to try the module.

jeremy’s picture

Category: feature » task

Before we can prune everything from the front page, the handbook pages need to be updated. Do you have handbook access, naught101?

naught101’s picture

Mostly the pruning I was talking about was removing the (already duplicated) information on the filters.

Here's my suggestion for a module page. I re-ordered the "features" section so that the stuff that's most important to non-techie users is up top, and reduced the filters info to a list with links to the appropriate handbook anchor (all headers in the handbook now have anchors).

The Spam module provides numerous tools to auto-detect and deal with spam content that is posted to your site, without having to rely on third-party services.

The Spam module provides a trainable Bayesian filter, detection of content posted from open email relays, flagging content with an excessive amount of links, and the ability to create custom filters.

Usage:

The Handbook page contains information on installation and configuration, as well as comprehensive information on the available filters.

Features:

  • Can be used completely independently of any third-party service.
  • Automatically learns and blocks spammer URLs and IPs.
  • Detects repeated postings of identical content, or content containing too many links.
  • Can notify the user and/or administrator that content was determined to be spam, preventing confusion over why their content doesn't show up.
  • Provides 'report as spam' links allowing users to easily help detect spam.
  • Provides comprehensive logging to offer an understanding as to how and why content is determined to be or not to be spam.
  • Language-independent: Automatically learns to detect spam in any language using Bayesian logic.
  • Supports the creation of custom filters using powerful regular expressions.
  • Written in PHP specifically for Drupal.
  • Highly configurable and extendable (includes hooks for writing custom filters).

Filters:

All of the filters below are completely customisable, and can be weighted (using gain) and ordered according to user preferences.

jeremy’s picture

Thanks -- I've updated the project page, based quite a bit on your feedback. I've also taken a first stab at creating INSTALL and README files which are part of the recent 1.0 release.

Patches to clean up the README are welcome -- there may also be info in there that should be added to the handbook page.

Steve Dondley’s picture

I can't figure out the formula for generating the total score. I have the node age filter turned on. No matter what I set the gain to for that filter, when I submit a new comment to an old node, the score that appears in the spam_tracker.score column in the database is always the same.

Steve Dondley’s picture

I've been playing around with the gain settings. I've got 3 filters enabled now. Changing the gain settings doesn't seem to have any effect whatsoever on the score in spam_tracker.score. I made a comment with all filters set to 100 and then made the same comment on the same thread with filters set to 10. There was no difference.

Steve Dondley’s picture

I set logs to verbose. Here's the output for a comment that was made:

comment 79 01/02/2010 - 11:11pm final average(72)
comment 79 01/02/2010 - 11:11pm Bayesian filter: total(45.3) redirect() gain(250)
comment 79 01/02/2010 - 11:11pm total(680) count(15) probability(45.3)
comment 79 01/02/2010 - 11:11pm URL filter: total(99) redirect() gain(250)
comment 79 01/02/2010 - 11:11pm found spam url(nyt.com) probability(99)
comment 79 01/02/2010 - 11:11pm inserting
comment 0 01/02/2010 - 11:11pm final average(70)
comment 0 01/02/2010 - 11:11pm Bayesian filter: total(40.9) redirect() gain(250)
comment 0 01/02/2010 - 11:11pm total(614) count(15) probability(40.9)
comment 0 01/02/2010 - 11:11pm URL filter: total(99) redirect() gain(250)
comment 0 01/02/2010 - 11:11pm found spam url(nyt.com) probability(99)

So both the Bayesian filter and the URL filter had gains set to 250.

The URL filter found a URL that was previously flagged as spam, with a probability of 99.

I have no idea how these numbers work together to arrive at the "final average."

I'm not very familiar with the Bayesian filter formula. If someone can explain this, it might help me develop some accurate documentation on how the gain is factored in. That would be great.
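
The figures in the log above are consistent with a gain-weighted average of the per-filter scores. The Python sketch below is a check against the logged numbers, not the module's code; `final_average` is a hypothetical name, and rounding to the nearest integer is an assumption that happens to match both logged results.

```python
def final_average(results):
    """Gain-weighted average over (score, gain) pairs, skipping zero scores."""
    scored = [(score, gain) for score, gain in results if score > 0]
    total_gain = sum(gain for _, gain in scored)
    if total_gain == 0:
        return 0.0
    return sum(score * gain for score, gain in scored) / total_gain


# comment 79: Bayesian 45.3 and URL filter 99, both with gain 250
print(round(final_average([(45.3, 250), (99, 250)])))  # 72

# comment 0: Bayesian 40.9 and URL filter 99, both with gain 250
print(round(final_average([(40.9, 250), (99, 250)])))  # 70

# The Bayesian probability itself matches total / count from the log.
print(round(680 / 15, 1), round(614 / 15, 1))  # 45.3 40.9
```

Because both gains are equal here, the result is just the plain mean of the two scores, which is why changing every gain to the same new value leaves the final score unchanged.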

chrisshattuck’s picture

One more question about how the scoring works, and whether the interface is off in some way.

So, you can set the gain on a filter to be 250, but you can only set the threshold up to 99. The docs say that the gain is a multiplier, but then it says that it adds them together. So, if it's an addition equation, then the gain should only go up to 99. If it's a multiplier, then I haven't found any clear documentation on what exactly is being multiplied.

Any tips?

gnassar’s picture

I think you guys are all missing the "then divide by the total gain" part at the end of the description.

Gains are relative to each other. If you have 3 filters set to gain 10, their results are each multiplied by 10, added together, and then the total is divided by 30. If all 3 are set to gain 100, their results are multiplied by 100, added together, and then the total is divided by 300. You would get the same result either way.

But if you have two filters, one set to 5 and one set to 10, the latter would have twice as much "weight" in the final result as the former. (Same as if they were set to 100 and 200, or 1 and 2.)
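
A quick way to see this, sketched in Python with made-up scores (the `weighted` helper is hypothetical and, for simplicity, assumes every filter returned a score above 0):

```python
def weighted(scores, gains):
    """Gain-weighted average of filter scores."""
    return sum(s * g for s, g in zip(scores, gains)) / sum(gains)


scores = [90, 30, 60]

# Scaling every gain by the same factor changes nothing.
print(weighted(scores, [10, 10, 10]))     # 60.0
print(weighted(scores, [100, 100, 100]))  # 60.0

# Only the ratio between gains matters: 5:10, 100:200 and 1:2 agree.
print(weighted([90, 30], [5, 10]))     # 50.0
print(weighted([90, 30], [100, 200]))  # 50.0
print(weighted([90, 30], [1, 2]))      # 50.0
```

In the last three calls the second filter counts twice as much as the first, pulling the result from the plain mean of 60 down to 50.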

gnassar’s picture

Title: Documentation needed » Usage and API documentation needed

Just clarifying the title to note that we wanted to move API documentation to this ticket a while back.