How does csplitter check for Chinese characters?

kcpau - February 23, 2009 - 19:18
Project: Chinese Word Splitter (中文分词)
Version: 6.x-1.0
Component: Code
Category: support request
Priority: normal
Assigned: Unassigned
Status: active
Description

There are a few places where the code checks whether the first byte of a UTF-8 string is greater than 176 (0xB0). On two occasions (lines 281 and 446) there is a comment that says "not Chinese". Can someone explain why there is such a check? If 0xB0 appears as the first byte, the UTF-8 string is malformed, because bytes in the range 0x80 - 0xBF are continuation bytes and can only follow a lead byte. The valid values for the first byte of a UTF-8 character are:
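0x00 - 0x7F (single-byte UTF-8 character)
0xC2 - 0xDF (two-byte UTF-8 character; 0xC0 and 0xC1 could only start overlong encodings and are invalid)
0xE0 - 0xEF (three-byte UTF-8 character)
0xF0 - 0xF4 (four-byte UTF-8 character; 0xF5 - 0xFF are not valid in UTF-8 as defined by RFC 3629)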

The first byte of a Chinese character would therefore be greater than 0xDF and has nothing to do with 0xB0.
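To illustrate the point, here is a minimal Python sketch. It is not taken from csplitter; the helper names first_byte and is_valid_utf8 are made up for this example, and the sample characters are simply the ones in the project name. It prints the first UTF-8 byte of each character and then checks whether a sequence that starts with 0xB0 decodes as UTF-8 at all.

```python
def first_byte(ch: str) -> int:
    """Return the first byte of the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")[0]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample characters chosen for illustration (from the project name).
samples = ["中", "文", "分", "词"]
lead_bytes = [first_byte(ch) for ch in samples]

for ch, b in zip(samples, lead_bytes):
    # Each of these is a three-byte character, so the lead byte is above 0xDF.
    print(f"{ch}: first byte 0x{b:02X}, greater than 0xDF: {b > 0xDF}")

# A sequence that begins with 0xB0 starts with a continuation byte,
# so a strict UTF-8 decoder rejects it.
print("0xB0 0xA1 is valid UTF-8:", is_valid_utf8(bytes([0xB0, 0xA1])))
```

If the strings passed to csplitter really are UTF-8, a comparison of the first byte against 0xB0 therefore cannot be what separates Chinese from non-Chinese text.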

Thank you.

 
 
