https://github.com/chatopera/chop

Chinese Tokenizer module for Python
chinese-nlp chinese-segmenter nlp parser segment segmenter

Host: GitHub
URL: https://github.com/chatopera/chop
Owner: chatopera
License: mit
Created: 2017-07-20T04:09:38.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-07-03T11:55:28.000Z (over 7 years ago)
Last Synced: 2025-02-24T06:50:00.370Z (11 months ago)
Topics: chinese-nlp, chinese-segmenter, nlp, parser, segment, segmenter
Language: Python
Homepage: https://github.com/Samurais/chop-evaluate
Size: 9.32 MB
Stars: 15
Watchers: 10
Forks: 7
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

[![chatopera banner][co-banner-image]][co-url]

[co-banner-image]: https://user-images.githubusercontent.com/3538629/42217321-3d5e44f6-7ef7-11e8-94e7-1574bfa1dbb8.png

[co-url]: https://www.chatopera.com

# chop

A Chinese word segmentation toolkit for Python.

## Welcome

GitHub: https://github.com/samurais/chop

PyPI: https://pypi.python.org/pypi/chop

## Dependencies

Python 3

## Usage

The code is compatible with Python 3.

* Automatic installation: ``easy_install chop``, or ``pip install chop`` / ``pip3 install chop``

* Interface

```
from chop.hmm import Tokenizer as HMMTokenizer
from chop.mmseg import Tokenizer as MMSEGTokenizer

sentence = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。"


def main():
    # Build one tokenizer per algorithm.
    HT = HMMTokenizer()
    MT = MMSEGTokenizer()
    # cut() yields tokens; join them with spaces for display.
    print('HMM Tokenizer:', ' '.join(HT.cut(sentence)))
    print('MMSEG Tokenizer:', ' '.join(MT.cut(sentence)))


if __name__ == '__main__':
    main()
```

* The code is written to be easy to read, so the algorithms are easy to pick up.

## API

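* chop.*[mmseg|hmm]*.Tokenizer object

t = chop.mmseg.Tokenizer([dict_path="path to a custom dictionary"])

* t#cut(sentence[, punctuation = True])

Parameters:

sentence: the Chinese sentence to segment.

*punctuation=True*: include punctuation in the segmentation output.

Returns:

A *generator* that produces the tokens one at a time with *yield*.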
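Because `cut` returns a generator, its result can be iterated only once. The following is a minimal sketch of that behaviour, not chop's own code. `WhitespaceTokenizer` and `tokens_as_list` are hypothetical names: the first is a stand-in that only mimics the `cut(sentence, punctuation=True)` shape described above so the example runs without chop installed, and the second is a small helper for keeping the tokens around.

```python
class WhitespaceTokenizer:
    """Stand-in that mimics the cut() shape described above (NOT part of chop)."""

    PUNCTUATION = {".", ",", "!", "?", "。", "，"}

    def cut(self, sentence, punctuation=True):
        # Yield tokens lazily, optionally dropping punctuation tokens.
        for token in sentence.split():
            if punctuation or token not in self.PUNCTUATION:
                yield token


def tokens_as_list(tokenizer, sentence, punctuation=True):
    """Materialise the generator returned by cut() so it can be reused."""
    return list(tokenizer.cut(sentence, punctuation=punctuation))


t = WhitespaceTokenizer()
gen = t.cut("hello , world .")
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is exhausted, so nothing is left
without_punct = tokens_as_list(t, "hello , world .", punctuation=False)
```

With chop installed, the same helper should work with `chop.mmseg.Tokenizer()` or `chop.hmm.Tokenizer()` in place of the stand-in, assuming they follow the signature documented above.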

## Testing

```
./scripts/test-badcase.sh "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
```

## Algorithms

* MMSEG:

A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm

http://technology.chtsai.org/mmseg/

Other references:

http://blog.csdn.net/nciaebupt/article/details/8114460

http://www.codes51.com/itwd/1802849.html
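To give a feel for the idea behind MMSEG, here is a minimal sketch of plain forward maximum matching, the simpler of the two variants the MMSEG page describes. It is an illustration only and not chop's implementation; `forward_max_match`, `toy_vocabulary`, and `max_word_len` are hypothetical names, and the vocabulary is a made-up toy set.

```python
def forward_max_match(sentence, vocabulary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary, made up for this example.
toy_vocabulary = {"研究", "研究生", "生命", "起源"}
result = forward_max_match("研究生命起源", toy_vocabulary)
print(" ".join(result))
```

On this input the greedy pass picks 研究生 first and leaves 命 stranded, whereas the intended reading is 研究 / 生命 / 起源. The complex variant of MMSEG addresses this kind of ambiguity by comparing chunks of up to three words and applying a set of disambiguation rules.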

* HMM & Viterbi:

[Chinese Named Entity Recognition Based on Cascaded Hidden Markov Models (基于层叠隐马尔可夫模型的中文命名实体识别)](http://xueshu.baidu.com/s?wd=%E5%9F%BA%E4%BA%8E%E5%B1%82%E5%8F%A0%E9%9A%90%E9%A9%AC%E5%B0%94%E5%8F%AF%E5%A4%AB%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%AD%E6%96%87%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8&sc_hit=1)
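As a rough illustration of how an HMM with Viterbi decoding can segment text, the sketch below tags each character as B (begin), M (middle), E (end), or S (single-character word) and cuts after every E or S. This is a common formulation and only an assumption about how `chop.hmm` works; all names and all probabilities here are made up for the example.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a word, or Single-character word


def safe_log(p):
    """log(p), with impossible events mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(chars, start_p, trans_p, emit_p, unseen=1e-6):
    """Most probable B/M/E/S tag sequence for chars, computed in log space."""
    if not chars:
        return []
    # best[t][s]: best log-probability of any tag path ending in s at position t.
    best = [{s: safe_log(start_p.get(s, 0)) + safe_log(emit_p[s].get(chars[0], unseen))
             for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        best.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: best[t - 1][p] + safe_log(trans_p[p].get(s, 0)))
            best[t][s] = (best[t - 1][prev] + safe_log(trans_p[prev].get(s, 0))
                          + safe_log(emit_p[s].get(chars[t], unseen)))
            back[t][s] = prev
    # A sentence has to end on E or S; then follow the back-pointers.
    tags = [max("ES", key=lambda s: best[-1][s])]
    for t in range(len(chars) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return tags[::-1]


def tags_to_words(chars, tags):
    """Cut the character stream after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words + ([current] if current else [])


# Toy parameters, invented for this example (a real model is trained on a corpus).
START_P = {"B": 0.6, "S": 0.4}
TRANS_P = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}
EMIT_P = {"B": {"北": 0.8, "我": 0.1, "爱": 0.1}, "M": {},
          "E": {"京": 0.8, "我": 0.1, "爱": 0.1}, "S": {"我": 0.5, "爱": 0.5}}

text = "我爱北京"
tags = viterbi(text, START_P, TRANS_P, EMIT_P)
words = tags_to_words(text, tags)
print(" ".join(words))
```

Working in log space avoids numeric underflow on long sentences, and leaving a transition out of `TRANS_P` makes it impossible, which is how invalid tag sequences such as B followed by B are ruled out.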

## Dictionary

Dict:

https://github.com/Samurais/jieba/blob/master/jieba/dict.txt
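The jieba `dict.txt` linked above stores one entry per line as a word, a frequency, and a part-of-speech tag, separated by spaces. Below is a minimal sketch of reading that layout, for instance when preparing a custom dictionary. `load_dictionary` is a hypothetical helper, the sample frequencies are made up, and whether `dict_path` expects exactly this layout is an assumption that the README does not confirm.

```python
import io


def load_dictionary(lines):
    """Parse 'word frequency [tag]' lines into {word: (frequency, tag)}."""
    entries = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # Frequency and tag are treated as optional in this sketch.
        freq = int(parts[1]) if len(parts) > 1 else 1
        tag = parts[2] if len(parts) > 2 else None
        entries[word] = (freq, tag)
    return entries


# In-memory sample standing in for a dictionary file; the numbers are invented.
sample = io.StringIO("北京 34488 ns\n交换机 1203 n\n\n自定义词\n")
dictionary = load_dictionary(sample)
```

For a real file, pass an open file object instead, for example `open(path, encoding="utf-8")`.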

## Evaluation

[chop-evaluate](https://github.com/Samurais/chop-evaluate)

## Contributing

```
virtualenv --no-site-packages -p /usr/local/bin/python3.6 ~/venv-py3
# Export the log level so that child processes such as the test script inherit it.
export CHOP_LOG_LVL=DEBUG
./scripts/test.sh
```

## Acknowledgements

[hanlp](http://www.hankcs.com/nlp/ner/) 

[jieba](https://github.com/fxsjy/jieba)

[mmseg](http://technology.chtsai.org/mmseg/)

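[Implementing the mmseg segmentation algorithm in Python, with some gripes (Python实现mmseg分词算法和吐嘈)](http://blog.csdn.net/acceptedxukai/article/details/7390300)

## Benchmarks

[A review of Chinese word segmentation tools (中文分词工具测评)](http://rsarxiv.github.io/2016/11/29/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%B7%A5%E5%85%B7%E6%B5%8B%E8%AF%84/)

## License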

[MIT](./LICENSE)
