How does csplitter check for Chinese characters? [#381318]

There are a few places where the code checks whether the first byte of a UTF-8 string is greater than 176 (0xB0). On two occasions (lines 281 and 446) a comment says "not Chinese". Can someone explain why this check exists? 0xB0 falls in the continuation-byte range (0x80 - 0xBF), so if it appears as the first byte, the UTF-8 string is malformed. The valid values for the first byte of a UTF-8 character are:
0x00 - 0x7F (single-byte UTF-8 character)
0xC2 - 0xDF (two-byte UTF-8 character; 0xC0 and 0xC1 would only produce overlong encodings)
0xE0 - 0xEF (three-byte UTF-8 character)
0xF0 - 0xF4 (four-byte UTF-8 character; 0xF5 - 0xFF never occur in valid UTF-8)

The first byte of a Chinese character would therefore be greater than 0xDF and has nothing to do with 0xB0.
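
To make the point concrete, here is a minimal sketch in Python (not the module's own code) that classifies a lead byte. The function name utf8_lead_byte_length is made up for illustration, and the ranges assume the strict RFC 3629 rules, which also exclude 0xC0, 0xC1 and 0xF5 - 0xFF as lead bytes.

```python
def utf8_lead_byte_length(byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 if the byte cannot start a valid UTF-8 sequence.
    Hypothetical helper for illustration; ranges follow RFC 3629.
    """
    if byte <= 0x7F:
        return 1  # ASCII, single byte
    if 0xC2 <= byte <= 0xDF:
        return 2  # two-byte sequence
    if 0xE0 <= byte <= 0xEF:
        return 3  # three-byte sequence (most Chinese characters)
    if 0xF0 <= byte <= 0xF4:
        return 4  # four-byte sequence
    # Continuation bytes (0x80-0xBF) and 0xC0, 0xC1, 0xF5-0xFF land here.
    return 0


# U+4E2D is a common Chinese character; its UTF-8 encoding starts with 0xE4.
first = "\u4e2d".encode("utf-8")[0]
print(hex(first), utf8_lead_byte_length(first))

# 0xB0 is a continuation byte, so it cannot start a character.
print(utf8_lead_byte_length(0xB0))
```

With these ranges, a first-byte comparison against 0xB0 does not separate Chinese from non-Chinese text: every valid lead byte of a multi-byte character is already above 0xB0.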

Thank you.
