How does csplitter check for Chinese characters?
kcpau - February 23, 2009 - 19:18
| Project: | Chinese Word Splitter(中文分词) |
| Version: | 6.x-1.0 |
| Component: | Code |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
There are a few places where the code checks whether the first byte of a UTF-8 string is greater than 176 (0xB0). On two occasions (lines 281 and 446) there is a comment that says "not Chinese". Can someone explain why there is such a check? If 0xB0 appears as the first byte, the UTF-8 string is malformed, because bytes in the range 0x80 - 0xBF can only be continuation bytes. The valid values for the first byte of a UTF-8 string are:
0x00 - 0x7F (single-byte UTF-8 character)
0xC2 - 0xDF (two-byte UTF-8 character)
0xE0 - 0xEF (three-byte UTF-8 character)
0xF0 - 0xF4 (four-byte UTF-8 character)
The first byte of a UTF-8 encoded Chinese character would therefore be greater than 0xDF, so it has nothing to do with 0xB0.
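
To illustrate what I mean, here is a minimal Python sketch of how a lead byte can be classified. This is not csplitter's code (the module is PHP), and the function name utf8_sequence_length is my own invention for this example. It uses the lead-byte ranges permitted by RFC 3629: 0xC2 - 0xDF for two-byte sequences and 0xF0 - 0xF4 for four-byte sequences.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 when the byte cannot start a UTF-8 sequence, which
    includes the continuation-byte range 0x80 - 0xBF.
    """
    if 0x00 <= lead <= 0x7F:
        return 1  # ASCII, single byte
    if 0xC2 <= lead <= 0xDF:
        return 2  # two-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3  # three-byte sequence (most Chinese characters)
    if 0xF0 <= lead <= 0xF4:
        return 4  # four-byte sequence
    return 0      # continuation byte or otherwise invalid lead byte


# The character "中" (U+4E2D) encodes to three bytes in UTF-8.
encoded = "中".encode("utf-8")
print([hex(b) for b in encoded])          # lead byte first, then continuation bytes
print(utf8_sequence_length(encoded[0]))   # length announced by the lead byte
print(utf8_sequence_length(0xB0))         # 0xB0 cannot be a lead byte
```

With this classification, 0xB0 is rejected as a lead byte, while the lead byte of "中" announces a three-byte sequence.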
Thank you.
