Currently the module breaks the text up into words, or tokens. Some of these should be ignored, such as "the," "and," and "I." The closest I can currently come to ignoring them is to set their probability to 50 (no "drift"). On one of my sites, these ended up being given a probability of 1, which grossly skewed the likelihood of the post being considered spam - which it was.
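To see why a single token rated 1 skews things so badly, here is a hedged sketch (not the spam module's actual code) of a Graham-style combination of per-token probabilities on the module's 1-99 integer scale. The function name is mine, for illustration only:

```php
<?php
// Illustrative only: combine per-token spam probabilities (1-99 scale)
// the way a naive Bayesian filter does, by multiplying the "spam" and
// "not spam" likelihoods and normalizing.
function combine(array $probs): float {
  $spam = 1.0;
  $ham  = 1.0;
  foreach ($probs as $p) {
    $spam *= $p / 100;      // likelihood contribution toward "spam"
    $ham  *= 1 - $p / 100;  // likelihood contribution toward "not spam"
  }
  return 100 * $spam / ($spam + $ham);
}

// Three clearly spammy tokens alone score well above 50:
printf("%.1f\n", combine([90, 90, 90]));
// Add one noise word stored at probability 1 and the combined score
// drops sharply, even though the post is still spam:
printf("%.1f\n", combine([90, 90, 90, 1]));
```

A token at 50 multiplies both sides equally and so contributes nothing, which is why 50 is the closest thing to "ignore" today; a token at 1 pulls hard toward "not spam."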
I was thinking that we could change "spam_bayesian_filter" to something like this (untested):
$p = db_fetch_object(db_query("SELECT tid, probability FROM {spam_tokens} WHERE token = '%s'", $token));
if (!$p->tid) {
  $p->probability = variable_get('spam_default_probability', 40);
}
else {
  // Allow a token to have zero probability, which means ignore it.
  if ($p->probability == 0) {
    continue;
  }
}
We could then create tokens (using my add-on?) with a probability of zero (0) so that those words could be ignored. This would probably be easier than maintaining a hard-coded list of words to be ignored - a list that would, of course, also have to be translated.
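As a sanity check on the proposed logic, here is a runnable sketch with a plain array standing in for the {spam_tokens} table, so it works outside Drupal. The function name, the sample tokens, and the table array are all mine; the default of 40 follows the snippet above:

```php
<?php
// Stand-in for the {spam_tokens} table: token => probability.
// A probability of 0 means "ignore this token entirely".
$spam_tokens = [
  'viagra' => 99,
  'the'    => 0,
];

// Hypothetical helper mirroring the proposed per-token loop.
function token_probabilities(array $tokens, array $table): array {
  $probs = [];
  foreach ($tokens as $token) {
    if (!isset($table[$token])) {
      // Unseen token: fall back to the assumed default of 40.
      $probs[] = 40;
      continue;
    }
    if ($table[$token] == 0) {
      // Zero-probability noise word: skip it, as in the patch.
      continue;
    }
    $probs[] = $table[$token];
  }
  return $probs;
}

// 'the' contributes nothing; 'viagra' keeps 99; 'hello' gets the default.
print_r(token_probabilities(['the', 'viagra', 'hello'], $spam_tokens));
```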
What do you think?
Comments
Comment #1
jeremy commented:
Yes, I agree with this idea. However, a hard-coded array would only be useful in that particular language. This needs to be possible in an international way. I'm open to suggestions. I probably won't implement this myself until the new version, but would accept a quality patch.
Comment #2
nancydru:
Actually I had in mind the spam_tokens module rather than an array.
Comment #3
jeremy commented:
I obviously didn't read your original idea well enough, as you had already explained this. I apologize; I was trying to get through the entire issue queue quickly.
Yes, exporting this functionality to an external module works for me. Yes, I'd be willing to merge the suggested changes into the core module.
A couple of issues:
The next step would be a patch to the core spam module for properly handling tokens that have a probability of 0. I believe your module is already capable of adding tokens with a probability of 0.
Comment #4
nancydru:
1. I'll look into doing a patch to include that and the above suggested change. (And I won't change any whitespace.)
2. I don't understand this one. Do you mean the capability to have a list somewhere that can be read (or written) to create a token list? If so, why limit it to probability = 0?
My module currently won't add a 0, but it's a minor change to one if statement. I'd probably add a comment about it too.
Comment #5
jeremy commented:
Because "probability=0" tokens are special. They are manually added "noise" words. My guess is that we could quickly come up with a decent-sized list of such words that everyone would want to use: "a", "the", "it", etc. The import/export idea is not a requirement; it's just a suggestion to make the functionality more useful.
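Purely as an illustration of that suggestion, here is a possible starter list of English noise words mapped to the special probability 0, plus a minimal one-word-per-line export/import round trip. The word list and the text format are assumptions of mine, not the module's actual format:

```php
<?php
// Hypothetical starter list: each noise word maps to probability 0,
// meaning "ignore this token".
$noise_words = array_fill_keys(
  ['a', 'an', 'the', 'and', 'or', 'it', 'of', 'to', 'in', 'is', 'i'],
  0
);

// Export: one word per line; probability 0 is implied for every entry.
$exported = implode("\n", array_keys($noise_words));

// Import: parse the same text back into token => 0 pairs, skipping
// blank lines.
$reimported = array_fill_keys(
  array_filter(array_map('trim', explode("\n", $exported))),
  0
);
```

Keeping the list as data rather than a hard-coded array is what makes per-language lists possible, which addresses the internationalization concern above.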
Comment #6
nancydru:
Gotcha.
Comment #7
nancydru:
Okay, here's the "probability=0" patch for spam core. (Sorry, I lied: I added spaces after commas in the queries.)
I think I got it everywhere. BTW, "spam_tokens_unsave" had the possibility of setting the probability to 0; I changed it to 1.
The typo fix I reported is included.
Comment #8
nancydru:
Don't know where the patch went, but here it is without the typo fix.
Comment #9
nancydru:
Updated to the Oct 18 version.
Comment #10
nancydru:
The spam_tokens module now includes the import/export capability.
Comment #11
jeremy commented:
The 5.x-1.x version of this module is no longer supported. Feel free to re-open this issue with a patch against the 5.x-3.x version of the module.