{"id":26483826,"url":"https://github.com/chatopera/chop","last_synced_at":"2025-03-20T04:58:13.575Z","repository":{"id":62562073,"uuid":"97789404","full_name":"chatopera/chop","owner":"chatopera","description":"Chinese Tokenizer module for Python","archived":false,"fork":false,"pushed_at":"2018-07-03T11:55:28.000Z","size":9772,"stargazers_count":15,"open_issues_count":1,"forks_count":7,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-02-24T06:50:00.370Z","etag":null,"topics":["chinese-nlp","chinese-segmenter","nlp","parser","segment","segmenter"],"latest_commit_sha":null,"homepage":"https://github.com/Samurais/chop-evaluate","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chatopera.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-20T04:09:38.000Z","updated_at":"2025-01-02T13:15:50.000Z","dependencies_parsed_at":"2022-11-03T15:30:27.137Z","dependency_job_id":null,"html_url":"https://github.com/chatopera/chop","commit_stats":null,"previous_names":["samurais/chop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fchop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fchop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fchop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fchop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chatopera","download_url":"https://codeload.github.com/chatopera/chop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244554067,"owners_count":20471173,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese-nlp","chinese-segmenter","nlp","parser","segment","segmenter"],"created_at":"2025-03-20T04:58:12.972Z","updated_at":"2025-03-20T04:58:13.564Z","avatar_url":"https://github.com/chatopera.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![chatoper banner][co-banner-image]][co-url]\n\n[co-banner-image]: https://user-images.githubusercontent.com/3538629/42217321-3d5e44f6-7ef7-11e8-94e7-1574bfa1dbb8.png\n[co-url]: https://www.chatopera.com\n\n# chop\nPython 中文分词工具包\n\n## 欢迎\n\nGitHub: https://github.com/samurais/chop\n\nPypi: https://pypi.python.org/pypi/chop\n\n## 依赖\n\nPython3\n\n## 使用说明\n\n代码对 Python 3 兼容\n\n* 全自动安装： ``easy_install chop`` 或者 ``pip install chop`` / ``pip3 install chop``\n\n* 接口\n\n```\nfrom chop.hmm import Tokenizer as HMMTokenizer\nfrom chop.mmseg import Tokenizer as MMSEGTokenizer\n\nsentence = \"工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。\"\n\ndef main():\n    HT = HMMTokenizer()\n    MT = MMSEGTokenizer()\n    print('HMM Tokenizer:', ' '.join(HT.cut(sentence)))\n    print('MMSEG Tokenizer:', ' '.join(MT.cut(sentence)))\n\n```\n\n* 代码通俗易懂，方便掌握算法\n\n## API\n\n* chop.*[mmseg|hmm]*.Tokenizer Object\n\nt = chop.mmseg.Tokenizer([dict_path=\"自定义词典位置\"])\n\n* t#cut(sentence[, punctuation = True])\n\n参数:\n\nsentence 中文句子\n*punctuation=True* 分词输出标点.\n\n返回:\n\nToken 使用*yield*返回的*generator*\n\n## 测试\n\n```\n./scripts/test-badcase.sh \"工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作\"\n```\n\n## 算法\n\n* MMSEG: \nA Word Identification System for Mandarin Chinese Text Based on Two\nVariants of the Maximum Matching Algorithm\nhttp://technology.chtsai.org/mmseg/\n\nOther references:\nhttp://blog.csdn.net/nciaebupt/article/details/8114460\nhttp://www.codes51.com/itwd/1802849.html\n\n* HMM \u0026 Viterbi:\n\n[基于层叠隐马尔可夫模型的中文命名实体识别](http://xueshu.baidu.com/s?wd=%E5%9F%BA%E4%BA%8E%E5%B1%82%E5%8F%A0%E9%9A%90%E9%A9%AC%E5%B0%94%E5%8F%AF%E5%A4%AB%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%AD%E6%96%87%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB\u0026tn=SE_baiduxueshu_c1gjeupa\u0026ie=utf-8\u0026sc_hit=1)\n\n## 词典\n\nDict:\nhttps://github.com/Samurais/jieba/blob/master/jieba/dict.txt\n\n## 评测\n\n[chop-evaluate](https://github.com/Samurais/chop-evaluate)\n\n## 贡献代码\n\n```\nvirtualenv --no-site-packages -p /usr/local/bin/python3.6 ~/venv-py3\nCHOP_LOG_LVL=DEBUG\n./scripts/test.sh\n```\n\n## 感谢\n\n[hanlp](http://www.hankcs.com/nlp/ner/) \n\n[jieba](https://github.com/fxsjy/jieba)\n\n[mmseg](http://technology.chtsai.org/mmseg/)\n\n[Python实现mmseg分词算法和吐嘈](http://blog.csdn.net/acceptedxukai/article/details/7390300)\n\n## 测评\n\n[中文分词工具测评](http://rsarxiv.github.io/2016/11/29/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%B7%A5%E5%85%B7%E6%B5%8B%E8%AF%84/)\n\n## 授权协议\n[MIT](./LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fchop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchatopera%2Fchop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fchop/lists"}