Simple Chinese support

Wesley Tanaka - February 17, 2007 - 09:02
Project:Spam
Version:5.x-3.x-dev
Component:Bayesian Filter
Category:feature request
Priority:normal
Assigned:Unassigned
Status:postponed
Description

Right now, the spam module appears to consider only space-separated 1-grams. Thus, something like:

spam_tokenize("为人民服务");

would return an array with one token containing the entire string. For Chinese at least, this behavior is incorrect, since most Chinese text contains no whitespace characters. Nearly every message would therefore produce its own unique token, so the database would quickly fill up with junk and the filter would be close to useless.

I'm not sure what the "base-level implementation" corresponding to English 1-grams is, but I'll try to find out. It's probably something along the lines of "each character is a separate token" (为, 人, 民, 服, 务 in the above example) or "each pair of adjacent characters is a separate token" (为人, 人民, 民服, 服务 in the above example).
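To make the two candidates concrete, here is a minimal sketch in Python (not the module's PHP, and not a patch against spam_tokenize). The function name char_ngrams is hypothetical. It assumes the input has already been decoded to Unicode text and that splitting on whitespace first is acceptable:

```python
def char_ngrams(text, n):
    """Split text on whitespace, then emit overlapping character n-grams.

    Hypothetical helper for illustration only; it is not part of the
    spam module. Iterating over a Python str yields Unicode code points,
    so each Chinese character counts as one unit.
    """
    tokens = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunks no longer than n are kept whole as a single token.
            tokens.append(chunk)
            continue
        for i in range(len(chunk) - n + 1):
            tokens.append(chunk[i:i + n])
    return tokens


unigrams = char_ngrams("为人民服务", 1)  # one token per character
bigrams = char_ngrams("为人民服务", 2)   # one token per adjacent pair
```

With n = 1 this gives the five single-character tokens listed above, and with n = 2 the four adjacent pairs. A real tokenizer would probably want to apply this only to runs of CJK characters and keep word-level tokens for space-separated languages.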

#1

Wesley Tanaka - February 17, 2007 - 11:31

http://citeseer.ist.psu.edu/594545.html

"context information does not help in the Chinese data beyond 2-grams. The performance increase 3-4% from 1-gram to 2-gram, but does not increase any more."

"For many Asian languages such as Chinese and Japanese, where word segmentation is a hard, our character level CAN Bayes model is well suited for text classification because it avoids the need for word segmentation. For Western languages such as Greek and English, we can work at both the word and character levels. In our experiments, we actually found that the character level models worked slightly better than the word level models in the English 20 Newsgroup data set (89% vs. 88%)."

It sounds like either of the tokenizing methods described above would be a reasonable choice, and either would be a dramatic improvement over the current tokenizer for many Asian languages.

#2

Jeremy - October 15, 2007 - 14:42
Status:active» postponed

I hope to improve international support in the upcoming 5.x-2.x version of the module. Postponing this issue until then.

#3

Jeremy - November 28, 2007 - 14:31
Version:5.x-1.x-dev» 5.x-3.x-dev
Status:postponed» active

Re-opening against the 5.x-3.x development branch. Help ensuring that this new version of the module offers better international support would be much appreciated.

#4

Jeremy - April 24, 2008 - 13:37
Assigned to:Anonymous» Jeremy

Assigning this to myself, as I'd like to improve the tokenizer to support other languages.

#5

Jeremy - September 17, 2008 - 15:32
Assigned to:Jeremy» Anonymous
Status:active» postponed

I want to see better support for other languages in the tokenizer, but this will not happen until after we have a beta release at the earliest. Unassigning and postponing this issue until then, or until someone comes along with a patch.

 
 
