Currently the module breaks the text up into words, or tokens. Some of these should be ignored, such as "the," "and," and "I." The closest I can currently come to ignoring them is to set their probability to 50 (no "drift"). On one of my sites, these words ended up being given a probability of 1, which grossly lowered the likelihood of the post being considered spam - which it was.

I was thinking that we could change "spam_bayesian_filter" to something like this (untested):

    $p = db_fetch_object(db_query("SELECT tid, probability FROM {spam_tokens} WHERE token = '%s'", $token));
    if (!$p->tid) {
      // Unknown token: fall back to the site-wide default probability.
      $p->probability = variable_get('spam_default_probability', 40);
    }
    elseif ($p->probability == 0) {
      // Allow a token to have zero probability, which means ignore it.
      continue;
    }

We could then create tokens (using my add-on?) with a probability of zero (0) so that those words could be ignored. This would probably be easier than creating a list of words that are to be ignored - and, of course, translated.
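To illustrate the idea outside of the database layer, here is a minimal, self-contained sketch of how zero-probability tokens would drop out of the filter. The function name, the `$known` map, and the default of 40 stand in for the real `{spam_tokens}` table and the `spam_default_probability` variable; none of this is the actual module code.

```php
<?php
// Sketch only: pick the probabilities that feed the Bayesian
// combination, skipping "ignore" tokens. $known stands in for the
// {spam_tokens} table; unknown tokens get the hypothetical default 40.
function spam_token_probabilities(array $tokens, array $known) {
  $probabilities = array();
  foreach ($tokens as $token) {
    if (!isset($known[$token])) {
      // Unknown token: use the default probability.
      $probabilities[$token] = 40;
    }
    elseif ($known[$token] == 0) {
      // Probability 0 means "ignore this token entirely".
      continue;
    }
    else {
      $probabilities[$token] = $known[$token];
    }
  }
  return $probabilities;
}

// "the" is ignored, "viagra" keeps its stored probability,
// and "drupal" falls back to the default.
$known = array('the' => 0, 'and' => 0, 'viagra' => 99);
print_r(spam_token_probabilities(array('the', 'viagra', 'drupal'), $known));
```

With this shape, an ignore word never contributes a 1 (or a 50) to the combined score; it simply never enters the calculation.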

What do you think?

Attachments:
#9: spam0.txt (3.06 KB) by nancydru
#8: prob0.txt (2.96 KB) by nancydru

Comments

jeremy’s picture

Yes, I agree with this idea. However, a hard coded array would only be useful in that particular language. This needs to be possible in an international way. I'm open to suggestions. I probably won't implement this myself until the new version, but would accept a quality patch.

nancydru’s picture

Actually I had in mind the spam_tokens module rather than an array.

jeremy’s picture

I obviously didn't read your original idea carefully enough, since you had already explained this. I apologize; I was trying to get through the entire issue queue quickly.

Yes, exporting this functionality to an external module works for me. Yes, I'd be willing to merge the suggested changes into the core module.

A couple issues:

  1. The spam module will currently overwrite the probability the next time it encounters that token in new content; logic will need to be added to prevent it from doing this when the probability is set to 0.
  2. To be truly useful, you'd need an import/export ability for all tokens with a probability of 0. This would allow the definition of ignore lists for various languages and use cases; otherwise everyone has to enter them manually. Of course, entering them manually to begin with is better than not being able to do it at all.

The next step would be a patch to the core spam module for properly handling tokens that have a probability of 0. I believe your module is already capable of adding tokens with a probability of 0.
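The first of the two issues above amounts to a small guard when re-learning a token. The sketch below is hypothetical (the function and its arguments are not the actual spam module API); it only shows the rule that a manually pinned probability of 0 must never be overwritten by subsequent learning.

```php
<?php
// Hypothetical guard: when the filter re-learns a token from new
// content, leave a manually pinned probability of 0 alone instead of
// overwriting it with the newly computed value.
function spam_token_update_probability($current, $learned) {
  if ($current === 0) {
    // Probability 0 marks a manually added "noise" word; keep it at 0.
    return 0;
  }
  return $learned;
}
```

So `spam_token_update_probability(0, 72)` keeps the token ignored, while an ordinary token is updated to the newly learned value.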

nancydru’s picture

1. I'll look into doing a patch to include that and the above suggested change. (And I won't change any whitespace.)

2. I don't understand this one. Do you mean the capability to have a list somewhere that can be read (or written) to create a token list? If so, why limit it to probability = 0?

My module currently won't add a 0, but it's a minor change to one if statement. I'd probably add a comment about it too.

jeremy’s picture

Because "probability=0" tokens are special. They are manually added "noise" words. My guess is that we could quickly come up with a decent sized list of such words that everyone would want to use. "a", "the", "it", etc... The import/export idea is not a requirement, it's just a suggestion to make the functionality more useful.

nancydru’s picture

Gotcha

nancydru’s picture

Assigned: Unassigned » nancydru
Status: Active » Needs review

Okay, here's the "probability=0" patch for spam core. (Sorry, I lied: I added spaces after commas in the queries.)

I think I got it everywhere. BTW, "spam_tokens_unsave" had the possibility of setting the probability to 0; I changed it to 1.

The typo fix I reported is included.

nancydru’s picture

New file: prob0.txt (2.96 KB)

Don't know where the patch went, but here it is without the typo fix.

nancydru’s picture

New file: spam0.txt (3.06 KB)

Updated to Oct 18 version.

nancydru’s picture

The Spam_tokens module now includes the import/export capability.

jeremy’s picture

Status: Needs review » Closed (won't fix)

The 5.x-1.x version of this module is no longer supported. Feel free to re-open this issue with a patch against the 5.x-3.x version of the module.