Chinese Word Splitter(中文分词)

This project is not covered by Drupal’s security advisory policy.

Supports the _search_preprocess interface. This module splits Chinese text into words separated by spaces, so that the search module adds the correct Chinese words to the index table. You need to re-index your site after enabling this module.
~~The module works with a user-defined dictionary, so in fact it can also split other languages.~~
The dictionary has been rebuilt with a B-tree index. If you use this module with the indexed dictionary, it runs about 10 times faster than before and uses very little memory.
There are now two matching algorithms in the module.
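
As a rough illustration of what "splitting Chinese words with spaces" means, here is a minimal sketch in Python. It is not the module's own code: the names `split_run` and `preprocess`, the toy dictionary, and the choice of forward maximum matching are all assumptions made for the example, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical toy dictionary; the real module ships its own dictionary files.
DICTIONARY = {"支持", "中文", "分词"}
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")  # basic CJK Unified Ideographs only


def split_run(run, dictionary, max_len=4):
    """Forward maximum matching over one run of Chinese characters."""
    words = []
    pos = 0
    while pos < len(run):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(run) - pos), 0, -1):
            candidate = run[pos:pos + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words


def preprocess(text, dictionary=DICTIONARY, max_len=4):
    """Return text with each Chinese word separated by spaces."""
    spaced = CJK_RUN.sub(
        lambda m: " " + " ".join(split_run(m.group(), dictionary, max_len)) + " ",
        text,
    )
    return " ".join(spaced.split())  # collapses all whitespace runs to one space


print(preprocess("Drupal支持中文分词search"))
# prints: Drupal 支持 中文 分词 search
```

Once the words are separated by spaces, a whitespace-based indexer can store each Chinese word as its own index entry instead of one entry per character run.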

When using the module with 4.7 or above, you should disable the "simple Chinese/Japanese/Korean tokenizer" in the search.module settings.

BTW: If you have a Japanese or Korean dictionary, please kindly contact me at i.zealy ~at~ gmail.com. It should be possible to make this module process Japanese or Korean words.

Latest update:

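Drupal 6 has been out for a long time, and for a long while I had no energy to do anything for the open source community. Taking the opportunity of the Drupal 6.1 upgrade, I have made the biggest improvement ever to the Chinese word splitting module, in line with my original vision. I believe the improvements are worth getting excited about: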

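Pre-indexed segmentation dictionary files are finally implemented. They are organized with a B-Tree algorithm and allow fast file-based lookup. The benefits: the dictionary file no longer has to be loaded into memory, and using the B-tree dictionary consumes essentially no memory, so a huge dictionary can be used and PHP memory limit errors are avoided.
A huge Chinese dictionary for B-tree search, covering both Simplified and Traditional Chinese, is provided. I generated it specifically for this module, and accuracy is greatly improved.
The algorithm has been optimized; the matching loop now runs at least one third fewer iterations than before.
Two new matching algorithms are provided: forward minimum matching and reverse minimum matching. Compared with the maximum matching algorithms, they can cut the matching loop iterations by more than half, with results still in an acceptable range.
A word length option for the dictionary search is provided. It has some effect on performance, so please test to see which value is most reasonable. Because the longest words in the currently supplied dictionary are only four characters, only the length options 2, 3, and 4 are meaningful. For the sake of poetry, a dictionary with words of up to 7 characters may be provided in the future.
Segmentation errors in the original code have been fixed; accuracy on strings that mix Chinese, English, and digits is now greatly improved.
Taken together, these improvements make performance at least ten times better than before, memory consumption drops from huge to very small, and CPU usage is also low (all of this is measured on my VPS, where I run lighttpd; please send feedback on how it behaves for you). When using the module, turn off "simple CJK handling" in the search settings.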
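
To make the difference between maximum and minimum matching concrete, here is a small Python sketch that counts dictionary lookups for both orders. It is an illustration under assumptions, not the module's code: the function name `segment`, the `shortest_first` flag, the rule that minimum matching starts at two characters, and the toy dictionary are all hypothetical. The `max_len` parameter plays the role of the word length option described above.

```python
def segment(text, dictionary, max_len=4, shortest_first=False):
    """Forward matching; returns (words, number_of_dictionary_lookups)."""
    words, lookups, pos = [], 0, 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        lengths = range(2, longest + 1)          # minimum matching: 2, 3, ...
        if not shortest_first:
            lengths = reversed(lengths)          # maximum matching: ..., 3, 2
        match = text[pos]                        # fallback: a single character
        for length in lengths:
            lookups += 1
            if text[pos:pos + length] in dictionary:
                match = text[pos:pos + length]
                break
        words.append(match)
        pos += len(match)
    return words, lookups


toy_dictionary = {"中文", "分词", "模块"}   # hypothetical sample entries
sentence = "中文分词模块"

max_words, max_lookups = segment(sentence, toy_dictionary)
min_words, min_lookups = segment(sentence, toy_dictionary, shortest_first=True)
print(max_words, max_lookups)   # ['中文', '分词', '模块'] 7
print(min_words, min_lookups)   # ['中文', '分词', '模块'] 3
```

On this toy input both orders produce the same words, but shortest-first needs fewer lookups. The trade-off shows up when the dictionary also contains a longer entry such as 中文分词: longest-first returns that single word, while shortest-first still stops at 中文 and 分词.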

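This module supports the _search_preprocess interface and segments Chinese text into words, so that the search module returns correct Chinese results during pre-indexing and searching, and avoids the enormous number of search entries produced by the simple CJK handling. After installing this module you need to rebuild the search index; an index word length of 1 or 2 is recommended.
The module uses a user-defined dictionary, so with a suitable dictionary it can in fact support other languages.
Two algorithms are currently provided: forward maximum matching and reverse maximum matching.
When using the module under 4.7, you need to turn off the "Simple CJK (Chinese/Japanese/Korean characters) handling" option under Administer > Settings > Search.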
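
Forward and reverse maximum matching can disagree on the same input, which is why offering both is useful. The following Python sketch shows a well-known ambiguous phrase; the function names and the toy dictionary are assumptions for illustration and do not come from the module.

```python
def forward_max(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each step."""
    words, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            if length == 1 or word in dictionary:
                words.append(word)
                pos += length
                break
    return words


def reverse_max(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word at each step."""
    words, end = [], len(text)
    while end > 0:
        for length in range(min(max_len, end), 0, -1):
            word = text[end - length:end]
            if length == 1 or word in dictionary:
                words.append(word)
                end -= length
                break
    return words[::-1]  # words were collected back to front


toy = {"研究", "研究生", "生命", "起源"}   # hypothetical sample entries
text = "研究生命起源"
print(forward_max(text, toy))   # ['研究生', '命', '起源']
print(reverse_max(text, toy))   # ['研究', '生命', '起源']
```

Here the forward scan greedily takes 研究生 and leaves 命 stranded as a single character, while the reverse scan yields the reading most people would expect.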

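~~Note: the dictionary file is in UTF-8 format (with a BOM header). On some systems you may need to remove the BOM before the module can read the dictionary and match words correctly; otherwise segmentation may fail.~~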

Now supports 4.7, 5.x, and 6.x.

Project information

Module categories: Site Search
8 sites report using this module
Created by zealy on 14 March 2006, updated 2 December 2014
Use at your own risk! It may have publicly disclosed vulnerabilities.