Currently the module breaks the text up into words, or tokens. Some of these should be ignored, such as "the," "and," and "I." The closest I can currently come to ignoring them is to set their probability to 50 (no "drift"). On one of my sites, these ended up being given a probability of 1, which grossly skewed the likelihood of the post being considered spam - which it was.
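To see why a single token rated 1 skews things so badly, here is a hedged sketch (not the spam module's actual code) of a Graham-style combination of per-token probabilities on the module's 1-99 integer scale. The function name is mine, for illustration only:

```php
<?php
// Illustrative only: combine per-token spam probabilities (1-99 scale)
// the way a naive Bayesian filter does, by multiplying the "spam" and
// "not spam" likelihoods and normalizing.
function combine(array $probs): float {
  $spam = 1.0;
  $ham  = 1.0;
  foreach ($probs as $p) {
    $spam *= $p / 100;      // likelihood contribution toward "spam"
    $ham  *= 1 - $p / 100;  // likelihood contribution toward "not spam"
  }
  return 100 * $spam / ($spam + $ham);
}

// Three clearly spammy tokens alone score well above 50:
printf("%.1f\n", combine([90, 90, 90]));
// Add one noise word stored at probability 1 and the combined score
// drops sharply, even though the post is still spam:
printf("%.1f\n", combine([90, 90, 90, 1]));
```

A token at 50 multiplies both sides equally and so contributes nothing, which is why 50 is the closest thing to "ignore" today; a token at 1 pulls hard toward "not spam."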
I was thinking that we could change "spam_bayesian_filter" to something like this (untested):
$p = db_fetch_object(db_query("SELECT tid, probability FROM {spam_tokens} WHERE token = '%s'", $token));
if (!$p->tid) {
  $p->probability = variable_get('spam_default_probability', 40);
}
else {
  // Allow a token to have zero probability, which means ignore it.
  if ($p->probability == 0) {
    continue;
  }
}
We could then create tokens (using my add-on?) with a probability of zero (0) so that those words could be ignored. This would probably be easier than maintaining a hard-coded list of words to be ignored - a list that would, of course, also have to be translated.
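As a sanity check on the proposed logic, here is a runnable sketch with a plain array standing in for the {spam_tokens} table, so it works outside Drupal. The function name, the sample tokens, and the table array are all mine; the default of 40 follows the snippet above:

```php
<?php
// Stand-in for the {spam_tokens} table: token => probability.
// A probability of 0 means "ignore this token entirely".
$spam_tokens = [
  'viagra' => 99,
  'the'    => 0,
];

// Hypothetical helper mirroring the proposed per-token loop.
function token_probabilities(array $tokens, array $table): array {
  $probs = [];
  foreach ($tokens as $token) {
    if (!isset($table[$token])) {
      // Unseen token: fall back to the assumed default of 40.
      $probs[] = 40;
      continue;
    }
    if ($table[$token] == 0) {
      // Zero-probability noise word: skip it, as in the patch.
      continue;
    }
    $probs[] = $table[$token];
  }
  return $probs;
}

// 'the' contributes nothing; 'viagra' keeps 99; 'hello' gets the default.
print_r(token_probabilities(['the', 'viagra', 'hello'], $spam_tokens));
```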
What do you think?
Comments
Comment #1
jeremy commented:
Yes, I agree with this idea. However, a hard-coded array would only be useful in that particular language. This needs to be possible in an international way. I'm open to suggestions. I probably won't implement this myself until the new version, but would accept a quality patch.
Comment #2
nancydru:
Actually I had in mind the spam_tokens module rather than an array.
Comment #3
jeremy commented:
I obviously didn't read your original idea well enough, as you had already explained this. I apologize; I was trying to get through the entire issue queue quickly.
Yes, exporting this functionality to an external module works for me. Yes, I'd be willing to merge the suggested changes into the core module.
A couple of issues:
The next step would be a patch to the core spam module for properly handling tokens that have a probability of 0. I believe your module is already capable of adding tokens with a probability of 0.
Comment #4
nancydru:
1. I'll look into doing a patch to include that and the above suggested change. (And I won't change any whitespace.)
2. I don't understand this one. Do you mean the capability to have a list somewhere that can be read (or written) to create a token list? If so, why limit it to probability = 0?
My module currently won't add a 0, but it's a minor change to one if statement. I'd probably add a comment about it too.
Comment #5
jeremy commented:
Because "probability=0" tokens are special. They are manually added "noise" words. My guess is that we could quickly come up with a decent-sized list of such words that everyone would want to use: "a", "the", "it", etc. The import/export idea is not a requirement; it's just a suggestion to make the functionality more useful.
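Purely as an illustration of that suggestion, here is a possible starter list of English noise words mapped to the special probability 0, plus a minimal one-word-per-line export/import round trip. The word list and the text format are assumptions of mine, not the module's actual format:

```php
<?php
// Hypothetical starter list: each noise word maps to probability 0,
// meaning "ignore this token".
$noise_words = array_fill_keys(
  ['a', 'an', 'the', 'and', 'or', 'it', 'of', 'to', 'in', 'is', 'i'],
  0
);

// Export: one word per line; probability 0 is implied for every entry.
$exported = implode("\n", array_keys($noise_words));

// Import: parse the same text back into token => 0 pairs, skipping
// blank lines.
$reimported = array_fill_keys(
  array_filter(array_map('trim', explode("\n", $exported))),
  0
);
```

Keeping the list as data rather than a hard-coded array is what makes per-language lists possible, which addresses the internationalization concern above.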
Comment #6
nancydru:
Gotcha.
Comment #7
nancydru:
Okay, here's the "probability=0" patch for spam core. (Sorry, I lied: I added spaces after commas in the queries.)
I think I got it everywhere. BTW, "spam_tokens_unsave" had the possibility of setting the probability to 0; I changed it to 1.
The typo fix I reported is included.
Comment #8
nancydru:
Don't know where the patch went, but here it is without the typo fix.
Comment #9
nancydru:
Updated to the Oct 18 version.
Comment #10
nancydru:
The spam_tokens module now includes the import/export capability.
Comment #11
jeremy commented:
The 5.x-1.x version of this module is no longer supported. Feel free to re-open this issue with a patch against the 5.x-3.x version of the module.