Documentation needed
naught101 - May 6, 2009 - 09:29
| Project: | Spam |
| Version: | HEAD |
| Component: | Documentation |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
Documentation sorely needed, especially for advanced functions.
Gain is not explained at all: what does it do, and how does it work?
After a quick look at the logs, it looks like gain is simply a multiplier: each filter's score is multiplied by its gain and added to the total score, and the gain is added to the total gain factor (but only if the score is > 0). Then the total score is divided by the total gain factor.
Is this right? If others can correct this, or add more notes, I am happy to start a docs page.
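If that reading of the logs is right, the calculation is a gain-weighted average over the filters that returned a score above 0. A minimal sketch in Python, assuming each filter reports a (score, gain) pair; the function and variable names are hypothetical and not taken from the module's code:

```python
from typing import Iterable, Tuple


def combined_score(results: Iterable[Tuple[float, float]]) -> float:
    """Gain-weighted average of filter scores, as inferred from the logs.

    `results` holds one (score, gain) pair per filter. Filters that scored
    0 or less are skipped, so they add nothing to either total.
    """
    total_score = 0.0
    total_gain = 0.0
    for score, gain in results:
        if score > 0:
            total_score += score * gain
            total_gain += gain
    # Assumption: with no contributing filters the content keeps a score of 0.
    return total_score / total_gain if total_gain else 0.0


# Two contributing filters, plus one that returned 0 and is therefore ignored.
print(combined_score([(90, 200), (30, 100), (0, 150)]))
```

Here the third filter's gain of 150 never enters the divisor, so the result depends only on the first two filters.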

#1
Another point needing documentation is what categories mean in the reports. E.g., our current statistics show:
prevented comment spam 195 17 hours 12 min ago
marked comment as spam 335 9 hours 54 min ago
manually marked comment as spam 70 3 days 18 hours ago
marked comment as not spam 1 17 hours 55 min ago
manually marked comment as not spam 1 17 hours 55 min ago
Now, I get the last one since we just got the not spam link working. Is the one above it a summary which might include other types besides manual and, if so, what else? Likewise for manually marked as spam versus marked? I would be surprised if we had actually manually marked 70, but I suppose it is possible, especially since the number has changed little in the last few weeks when the module has been working pretty well. But then what is prevented? How is it prevented? Is this both blocked IPs and known spam sites? If the latter, it might make sense, I guess. Right now we only have one blocked IP, although we had 3 at one point. How did they get unblocked?
#2
Better documentation is absolutely needed. I spoke some to Jeremy about this, and neither of us probably has time to sit down and write out complete docs all at once, but what we can do is use this issue as a scratchpad for user-submitted documentation to collect and put in the handbook. Please post your experiences once you figure out something you were having particular trouble with, and we can start building up a crowdsourced set of docs. Jeremy and I will also be contributing some chunks of documentation as we have time. Thank you!
#3
Handbook page is now at http://drupal.org/node/498092
#4
Do you just want people to post here, or is it OK for me to make a start on the Doc page? I'm happy to write an introduction to all the features.
#5
Hey, it's a free Internet. Do what you like. :-)
Seriously, if you want to go straight to the Doc page with a feature intro, that's fine. It's all freely editable anyway, so we can edit there and discuss here if we need to. In my mind, putting docs in this thread was more for folks who would document a small part of something, but who wouldn't have time to handle an entire page. Perhaps editing there and discussing here is better.
#6
If one adds new content to the doc page, it would be nice to get at least a comment here so that one's attention would be brought to it.
#7
ok, first go at it is up. I plan to come back and add some stuff about filters and gain.
I'm sort of thinking that it might be a good idea to move most of the detailed filter stuff from the project page into the Docs page, so maybe I'll just use that as a basis.
#8
Probably so. That makes sense.
Thank you!
#9
OK, I put some stuff up about scoring and gain, and copied the filter stuff across. I haven't edited it yet, and there are a couple of filters missing. It would be good if people could check over the scoring and gain section. I'm happy to add some more on the other filters.
Some questions:
Weights on filters obviously decide which filters get applied first, but does this have any real effect? Don't all the filters get applied anyway?
I said that the filters are only applied if the score is not zero - that's how it appears in the logs. So a score of 0 for one filter doesn't lower the final score (but a score of 1 does). Is that right? Seems odd.
Regular expressions for the custom filter are PCREs, right?
The DSBL list has been replaced by SURBL?
#10
Thanks for all your effort on stabilizing this module on Drupal 6, and on documenting it! This is wonderful to see!
> Weights on filters obviously decide which filters get applied first, but does this have
> any real effect? Don't all the filters get applied anyway?
I believe currently all filters do get applied anyway, correct. However, the intention is to add an option so that any filter could mark given content as spam if over a certain threshold, bypassing the rest of the filters. The idea was to make this an optional per-filter setting. (For example, if the link filter finds 100+ links, and after it runs it finds a spam score >90, it may want to "short circuit" the filtering system and just mark the content as spam rather than passing it along to be tokenized and run through the Bayesian filter, etc.)
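The proposed behaviour could look roughly like the sketch below. It only illustrates the idea and is not existing module code: the `Filter` class, its `short_circuit_at` field, and the two example filters are hypothetical. It also assumes that lower weights run first and that the running score is a gain-weighted average of the scores above 0.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Filter:
    name: str
    weight: int                        # assumption: lower weight runs first
    gain: float
    score: Callable[[str], float]      # 0 means "no opinion", otherwise 1-99
    short_circuit_at: Optional[float] = None  # hypothetical per-filter option


def run_filters(content: str, filters: List[Filter]) -> float:
    """Apply filters in weight order, stopping early if a threshold is passed."""
    total_score = 0.0
    total_gain = 0.0
    running = 0.0
    for f in sorted(filters, key=lambda f: f.weight):
        s = f.score(content)
        if s > 0:
            total_score += s * f.gain
            total_gain += f.gain
        running = total_score / total_gain if total_gain else 0.0
        # Skip the remaining (possibly expensive) filters once this filter's
        # threshold is exceeded.
        if f.short_circuit_at is not None and running > f.short_circuit_at:
            break
    return running


calls = []  # records which expensive filters actually ran


def link_score(text: str) -> float:
    return 99 if text.count("http://") >= 100 else 0


def bayes_score(text: str) -> float:
    calls.append("bayesian")
    return 50  # stand-in for a real Bayesian classifier


filters = [
    Filter("bayesian", weight=10, gain=100, score=bayes_score),
    Filter("links", weight=0, gain=100, score=link_score, short_circuit_at=90),
]
result = run_filters("http://spam.example " * 120, filters)
print(result, calls)
```

With such an option, weights would matter: a cheap filter with a low weight could spare the content a trip through the more expensive ones.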
> I said that the filters are only applied if the score is not zero - that's how it appears in the logs. So a score of
> 0 for one filter doesn't lower the final score (but a score of 1 does). Is that right? Seems odd.
That is correct and by design. There are some filters that can really only determine "this content is spam", but they can't determine "this content is not spam". In the former case, they return "99". In the latter case they return "0" to avoid affecting the spam score.
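A sketch of such a one-sided filter in Python, for illustration only. The function name, the blacklist contents, and the substring match are hypothetical; only the 99/0 return convention comes from the explanation above.

```python
def blacklist_score(text: str, blacklisted_domains: set) -> int:
    """Return 99 on a blacklist hit, or 0 to signal "no opinion"."""
    lowered = text.lower()
    if any(domain in lowered for domain in blacklisted_domains):
        return 99   # positive evidence: the content is spam
    return 0        # no evidence either way, so stay out of the total


blacklist = {"spam.example"}
spam_result = blacklist_score("Buy now at http://spam.example/pills", blacklist)
clean_result = blacklist_score("Thanks, this patch fixed it for me.", blacklist)
print(spam_result, clean_result)
```

Returning a low score such as 1 instead of 0 would be counted as evidence that the content is clean and would pull the overall score down, which is more than a filter like this can actually claim.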
> Regular expressions for the custom filter are PCREs, right?
That is correct.
> The DSBL list has been replaced by SURBL?
I don't know the answer to that. I don't see an answer here, either, but perhaps what you're looking for is there:
http://en.wikipedia.org/wiki/Distributed_Sender_Blackhole_List
http://en.wikipedia.org/wiki/SURBL
#11
Quick feedback on the documentation so far:
> "The Spam module deals with spam comments, nodes, and users."
The goal is to make the API generic enough to block other content, too. One goal has been to allow any form that is submitted to be run through the spam filter. Perhaps just add the word "currently" before "deals"?
> All content starts with a score of 0 (definitely not spam). If it passes through
> the spam filters, it gets re-assigned a score between 0 and 99, where 99
> means "definitely spam".
"Before content has been filtered it starts with a score of 0. Once it passes through the spam filters, it gets assigned a score between 1 and 99, 1 means most likely not spam, and 99 means most likely spam. The number is actually a probability, so 1 is a 1% chance of being spam, and 99 is a 99% chance of being spam."
> The Gain variable for a filter decides how much of an impact the filter will have on
> the final score. If the filter has 0 gain, it won't be applied at all. If it has a higher
> gain than other filters, it will have more impact on the final score. This can be
> useful if you find that one filter is working better than others on your site.
For example, when you start training the Bayesian filter, you should set the gain very low as the filter will make a lot of mistakes. As the Bayesian filter becomes more trained and thus more accurate, you can then increase the gain on the filter to make its decisions have more impact on the overall spam score.
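As a rough worked example, assume the combined score is the gain-weighted average of the contributing filters, as discussed earlier in this issue. The scores and gain values below are invented for illustration. Suppose a well-tested filter returns 20 with a gain of 100, and a barely trained Bayesian filter wrongly returns 90:

```latex
S = \frac{\sum_i g_i s_i}{\sum_i g_i}

% Bayesian gain kept low (10): the mistake barely moves the total
S = \frac{100 \cdot 20 + 10 \cdot 90}{100 + 10} = \frac{2900}{110} \approx 26.4

% Bayesian gain set high (200): the same mistake dominates the total
S = \frac{100 \cdot 20 + 200 \cdot 90}{100 + 200} = \frac{20000}{300} \approx 66.7
```

The filter's output is the same in both cases; only the gain decides how far it pulls the final score.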
> The Node Age filter
Older content is less likely to receive comments than new content. The "Node Age Filter" allows you to increase the likelihood that comments are spam when posted on older site content.
> DSBL list
Where are you getting this from? Do you mean the SURBL module?
#12
>Where are you getting this from? Are you meaning the SURBL module?
http://drupal.org/project/spam <- second last paragraph. That's why I was confused.
#13
Added your bits Jeremy, and added some basic duplicate and SURBL info.
#14
Ah, looks like the description of the spam module needs to be updated! I must have written that a very long time ago, as I'd forgotten the DSBL had ever been supported!
#15
Good call! Made that edit to the description page.
#16
I removed the DSBL section from the handbook. Saving it below for reference.
Personally, I think the module front page is WAY too full. Most of the filter stuff could be removed and replaced with a link to the handbook pages (it's mostly the same stuff, although there are more filters listed on the handbook page). This would also mean that the docs wouldn't have to be updated in two places at once for every new feature.
It could also be good to reduce the features list a bit, and add some information on installation/setup.
I also think the fact that this module doesn't rely on third parties (like Mollom and Akismet) is a major selling point, and should be on the front page.
I was also considering splitting the current handbook up a bit - like moving the Filters section to its own page. Is that a good idea?
----------------
DSBL list
The fourth tool for detecting spam is to look up the poster's IP address in the Distributed Server Boycott List (http://dsbl.org/). If the address is listed, it is known to come from an untrusted email server such as an open relay and is marked as spam. The theory is that most comment-spammers are also email spammers.
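For reference, DNS-based blocklists are conventionally queried by reversing the octets of the IPv4 address and looking the result up under the list's DNS zone; an answer means the address is listed. A minimal Python sketch of that convention, not the module's code: the zone name `dnsbl.example.org` is a placeholder and the function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 1.2.3.4 -> 4.3.2.1.<zone>."""
    addr = ipaddress.IPv4Address(ip)   # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blocklist zone answers for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such record: the address is not on the list (or DNS failed).
        return False


query_name = dnsbl_query_name("192.0.2.25", "dnsbl.example.org")
print(query_name)
```

Only `dnsbl_query_name` is exercised here; `is_listed` needs network access and a real blocklist zone to return anything meaningful.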
#17
Yes, I agree that there's a lot that can be done! Feel free to shrink and clean up the text on the project page, making it clearer and more precise. Also, feel free to break the handbook documentation into multiple pages if there's enough text to warrant it.
#18
Jeremy: I can't edit the front page. If you want to give me access, I'd be happy to.
#19
What do I need to do to give you access?
#20
Can't find much in the docs, but I think it comes with cvs access. I have a cvs account, and am happy to help co-maintain as well, but I'll leave that up to you.
#21
I cleaned up some of the front page, per your recommendations. I do agree -- that page was pretty busy. And no required third-party dependence is probably a big selling point.
I condensed the feature list a bit too. But adding an "installation" section works against your fundamental premise, I think. That is definitely something that belongs in a handbook page. The module front page is for descriptions, not documentation.
I also left the filter stuff in. It probably needs paring down and compacting, but again, the module front page is for descriptions; it seems appropriate to have (brief!) descriptions of what the module can do there. (And it bothered me less to leave them in since the filter descriptions are effectively "below the fold," in journalism terms, and won't really detract much from the meat of the front page.) The handbook page(s) can go into more detail on each of the filters.
On a side note, it would seem to me that documentation issues (like this one) are appropriate places to suggest any specific text changes by posting the suggested alternate text if you like. After all, that's effectively a "patch" for the docs.
#22
I still think it needs a massive prune. The best thing would be to have a list of the main 3-4 filters and a "more filters" entry, each of which links to the handbook page (I already put some anchors in the docs page; I can go back and put more in).
Might be worth putting in a screen shot too. The more looked-after the front page looks, the more likely people are to try the module.
#23
Before we can prune everything from the front page, the handbook pages need to be updated. Do you have handbook access, naught101?
#24
Mostly the pruning I was talking about was removing the (already duplicated) information on the filters.
Here's my suggestion for a module page. I re-ordered the "features" section so that the stuff that's most important to non-techie users is up top, and reduced the filters info to a list with links to the appropriate handbook anchor (all headers in the handbook now have anchors).
#25
Thanks -- I've updated the project page, based quite a bit on your feedback. I've also taken a first stab at creating INSTALL and README files which are part of the recent 1.0 release.
Patches to clean up the README are welcome -- there may also be info in there that should be added to the handbook page.