Simple Chinese support [#119991]

Right now, the spam module appears to consider only space-separated 1-grams. Thus, something like:

spam_tokenize("为人民服务");

would return an array with one token containing the entire string. For Chinese at least, this behavior is incorrect, because most Chinese text contains no whitespace characters. Nearly every message would therefore produce its own unique token, so the database would quickly fill up with junk and the filter would be close to useless.

I'm not sure what the "base-level implementation" corresponding to English 1-grams is, but I'll try to find out. It's probably something along the lines of: "each character is a separate token" (为, 人, 民, 服, 务 in the above example) or "each pair of adjacent characters is a separate token" (为人, 人民, 民服, 服务 in the above example).
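To make the two candidates concrete, here is a minimal Python sketch. It is illustrative only and is not the module's PHP code; `char_ngrams` is a hypothetical helper name. It contrasts plain whitespace splitting with character 1-grams and overlapping 2-grams:

```python
def char_ngrams(text, n):
    """Return overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for a full n-gram: keep whatever is there as one token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


message = "为人民服务"
print(message.split())          # whitespace split: a single token
print(char_ngrams(message, 1))  # one token per character
print(char_ngrams(message, 2))  # each pair of adjacent characters
```

With 2-grams, a five-character message yields four tokens, so tokens can recur across messages instead of each message being its own token.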

Comments

Comment #1

Wesley Tanaka commented 17 February 2007 at 11:31

http://citeseer.ist.psu.edu/594545.html

"context information does not help in the Chinese data beyond 2-grams. The performance increase 3-4% from 1-gram to 2-gram, but does not increase any more."

"For many Asian languages such as Chinese and Japanese, where word segmentation is a hard, our character level CAN Bayes model is well suited for text classification because it avoids the need for word segmentation. For Western languages such as Greek and English, we can work at both the word and character levels. In our experiments, we actually found that the character level models worked slightly better than the word level models in the English 20 Newsgroup data set (89% vs. 88%)."

It sounds like either method for tokenizing described above should be a reasonable choice, and either would be a dramatic improvement over the current tokenizer for many Asian languages.
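One way such a tokenizer could coexist with the existing whitespace behavior is sketched below in Python. This is an assumption-laden illustration, not the module's code: `tokenize_mixed` and `CJK_RUN` are hypothetical names, and the character ranges cover only the main CJK Unified Ideographs block plus Hiragana and Katakana. Non-CJK words stay whole, while runs of CJK characters are expanded into overlapping 2-grams:

```python
import re

# Assumed CJK coverage: the main CJK Unified Ideographs block plus
# Hiragana and Katakana. A real tokenizer would need a fuller list.
CJK_RUN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]+")


def tokenize_mixed(text, n=2):
    """Split on whitespace, then expand CJK runs into overlapping n-grams."""
    tokens = []
    for word in text.split():
        pos = 0
        for match in CJK_RUN.finditer(word):
            if match.start() > pos:
                tokens.append(word[pos:match.start()])  # non-CJK piece
            run = match.group()
            if len(run) <= n:
                tokens.append(run)  # run too short to split further
            else:
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
            pos = match.end()
        if pos < len(word):
            tokens.append(word[pos:])  # trailing non-CJK piece
    return tokens


print(tokenize_mixed("buy 为人民服务 now"))
```

English-only messages tokenize exactly as before under this scheme, so only text containing CJK characters would produce different tokens.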

Comment #2

jeremy commented 15 October 2007 at 14:42

Status:	Active	» Postponed

I hope to improve international support in the upcoming 5.x-2.x version of the module. Postponing this issue until then.

Comment #3

jeremy commented 28 November 2007 at 14:31

Version:	5.x-1.x-dev	» 5.x-3.x-dev
Status:	Postponed	» Active

Re-opening against the 5.x-3.x development branch. Help on ensuring that this new version of the module offers better international support would be much appreciated.

Comment #4

jeremy commented 24 April 2008 at 13:37

Assigned:	Unassigned	» jeremy

Assigning this to myself, as I'd like to improve the tokenizer to support other languages.

Comment #5

jeremy commented 17 September 2008 at 15:32

Assigned:	jeremy	» Unassigned
Status:	Active	» Postponed

I want to see better support for other languages in the tokenizer, but this will not be happening until at least after we have a beta release. Unassigning and postponing this issue until at least then, or until someone comes along with a patch.

Comment #6

killes@www.drop.org commented 7 June 2011 at 14:05

Category:	feature	» bug
Status:	Postponed	» Active

Inspiration should come from the core search module which used to have the same problem.

Not supporting more than a subset of languages is a bug.

Comment #7

AlexisWilke commented 10 June 2011 at 23:27

Version:	5.x-3.x-dev	» 6.x-1.x-dev

Bumping to 6.x since we don't support 5.x anymore.

Thank you.
Alexis

Comment #8

killes@www.drop.org commented 20 June 2011 at 19:13

This is now in the dev version, I'd appreciate some testing.

Comment #9

killes@www.drop.org commented 27 June 2011 at 13:34

Status:	Active	» Fixed

I've tested a bit myself, considering this fixed.

Comment #10

11 July 2011 at 13:41

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
