{"id":20693017,"url":"https://github.com/deepcs233/jieba_fast","last_synced_at":"2025-04-08T12:12:58.060Z","repository":{"id":43557291,"uuid":"114853170","full_name":"deepcs233/jieba_fast","owner":"deepcs233","description":"Use C Api and Swig to Speed up jieba 高效的中文分词库","archived":false,"fork":false,"pushed_at":"2021-08-27T18:32:21.000Z","size":41742,"stargazers_count":639,"open_issues_count":4,"forks_count":74,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-01T11:04:17.045Z","etag":null,"topics":["dag","jieba","python","swig","viterbi-hmm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deepcs233.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-12-20T06:47:50.000Z","updated_at":"2025-03-26T14:42:53.000Z","dependencies_parsed_at":"2022-07-12T18:19:06.103Z","dependency_job_id":null,"html_url":"https://github.com/deepcs233/jieba_fast","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepcs233%2Fjieba_fast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepcs233%2Fjieba_fast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepcs233%2Fjieba_fast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepcs233%2Fjieba_fast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deepcs233","download_url":"https://codeload.github.com/deepcs233/jieba_fast/tar.gz/refs/heads/master","host":{"name":"GitHub","url"
:"https://github.com","kind":"github","repositories_count":247838446,"owners_count":21004580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dag","jieba","python","swig","viterbi-hmm"],"created_at":"2024-11-16T23:24:55.161Z","updated_at":"2025-04-08T12:12:58.024Z","avatar_url":"https://github.com/deepcs233.png","language":"Python","readme":"[![Downloads](https://pepy.tech/badge/jieba-fast)](https://pepy.tech/project/jieba-fast)\n[![Downloads](https://pepy.tech/badge/jieba-fast/month)](https://pepy.tech/project/jieba-fast)\n[![Downloads](https://pepy.tech/badge/jieba-fast/week)](https://pepy.tech/project/jieba-fast)\n\njieba_fast\n========\nThe DAG computation and the HMM Viterbi function of the jieba segmentation library are rewritten in Cython, yielding a large speedup.\nUse `import jieba_fast as jieba` as a drop-in replacement for the original code.\n\nFeatures\n========\n* Speeds up both segmentation modes: accurate mode and search-engine mode\n* Reimplements the Viterbi algorithm in `cython`, greatly speeding up the default HMM-enabled segmentation mode\n* Reimplements DAG construction and the optimal-path computation over the DAG in `cython`, with a large speedup\n* Essentially only the core functions are replaced, so the changes are minimally invasive to the original source\n* MIT License\n\n\n\n\nInstallation\n=======\n\nThe code is currently compatible with Python 2/3 and works well on *nix; it has passed a local build test on Windows, but Windows support is not guaranteed.\n\n* Fully automatic: `pip install jieba_fast`\n* Semi-automatic: download http://pypi.python.org/pypi/jieba_fast/ , extract it, and run `python setup.py install`\n\nBuilding on Windows can involve some pitfalls; you can try the prebuilt versions placed under windows/, corresponding to Python 2.7 and Python 3.5 respectively.\nTo install the Python 2 version of jieba_fast, copy all directories and files under python2 into the corresponding Python's lib/site-packages.\n\nAlgorithm\n========\n\n* Efficient word-graph scanning based on a prefix dictionary, building a directed acyclic graph (DAG) of all possible word formations of the characters in a sentence\n* Dynamic programming to find the maximum-probability path, i.e. the best segmentation based on word frequency\n* For out-of-vocabulary words, an HMM based on the word-forming capacity of Chinese characters, decoded with the Viterbi algorithm\n\n\n\n\nMain features\n=======\n\nSee https://github.com/fxsjy/jieba for details\n\n\nCode example\n\n```python\n# encoding=utf-8\nimport jieba_fast as jieba\n\ntext = 
u'在输出层后再增加CRF层，加强了文本间信息的相关性，针对序列标注问题，每个句子的每个词都有一个标注结果，对句子中第i个词进行高维特征的抽取，通过学习特征到标注结果的映射，可以得到特征到任意标签的概率，通过这些概率，得到最优序列结果'\n\nprint(\"-\".join(jieba.lcut(text, HMM=True)))\nprint('-'.join(jieba.lcut(text, HMM=False)))\n\n```\n\nOutput with `HMM=True`:\n\n```python\n在-输出-层后-再-增加-CRF-层-，-加强-了-文本-间-信息-的-相关性-，-针对-序列-标注-问题-，-每个-句子-的-每个-词-都-有-一个-标注-结果-，-对-句子-中-第-i-个-词-进行-高维-特征-的-抽取-，-通过-学习-特征-到-标注-结果-的-映射-，-可以-得到-特征-到-任意-标签-的-概率-，-通过-这些-概率-，-得到-最优-序列-结果\n```\n\nOutput with `HMM=False`:\n\n```python\n在-输出-层-后-再-增加-CRF-层-，-加强-了-文本-间-信息-的-相关性-，-针对-序列-标注-问题-，-每个-句子-的-每个-词-都-有-一个-标注-结果-，-对-句子-中-第-i-个-词-进行-高维-特征-的-抽取-，-通过-学习-特征-到-标注-结果-的-映射-，-可以-得到-特征-到-任意-标签-的-概率-，-通过-这些-概率-，-得到-最优-序列-结果\n```\n\n\n\n\nBenchmarks\n=======\nTest machine: MacBook Pro 17, i7, 16 GB RAM\n\nTest procedure:\nThe novel 《围城》 (Fortress Besieged) is read line by line into an array, each line is segmented as one sentence, and the whole book is segmented 50 times in a loop. Four configurations are tested: accurate mode with HMM, accurate mode without HMM, search-engine mode with HMM, and search-engine mode without HMM.\nResults:\n\n\n|            | Accurate, HMM on | Accurate, HMM off | Search engine, HMM on | Search engine, HMM off |\n| ---------- | ---------- | ---------- | ------------ | ------------ |\n| jieba      | 65.1s      | 39.9s      | 67.5s        | 40.5s        |\n| jieba_fast | 24.5s      | 18.2s      | 25.3s        | 20.4s        |\n\nWith HMM enabled, runtime is reduced by about 60%; with HMM disabled, by about 50%.\n\n\n\nConsistency test\n======\n\nTo verify that jieba_fast and jieba produce identical segmentations, the following test was run.\n\nThe segmentation results on 《围城》 (Fortress Besieged) and 《红楼梦》 (Dream of the Red Chamber) were compared and are exactly identical:\n\n```python\n---- Test of 围城 ----\nnums of jieba      results:  164821\nnums of jieba_fast results:  164821\nAre they exactly the same?  True\n---- Test of 红楼梦 ----\nnums of jieba      results:  597151\nnums of jieba_fast results:  597151\nAre they exactly the same?  True\n```\n\n\n\nAcknowledgements\n======\n\nAuthor of the \"Jieba\" Chinese word segmentation library: [SunJunyi](https://github.com/fxsjy)\n\nSource code is under source/\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepcs233%2Fjieba_fast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepcs233%2Fjieba_fast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepcs233%2Fjieba_fast/lists"}