{"id":13849514,"url":"https://github.com/duanhongyi/genius","last_synced_at":"2026-04-09T04:04:24.263Z","repository":{"id":10156844,"uuid":"12236614","full_name":"duanhongyi/genius","owner":"duanhongyi","description":"a chinese segment base on crf","archived":false,"fork":false,"pushed_at":"2018-12-19T16:03:58.000Z","size":101484,"stargazers_count":234,"open_issues_count":0,"forks_count":65,"subscribers_count":26,"default_branch":"master","last_synced_at":"2024-11-10T10:51:54.598Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/duanhongyi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.txt","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-08-20T06:54:42.000Z","updated_at":"2024-11-06T09:37:29.000Z","dependencies_parsed_at":"2022-09-17T18:19:43.019Z","dependency_job_id":null,"html_url":"https://github.com/duanhongyi/genius","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duanhongyi%2Fgenius","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duanhongyi%2Fgenius/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duanhongyi%2Fgenius/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duanhongyi%2Fgenius/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/duanhongyi","download_url":"https://codeload.github.com/duanhongyi/genius/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225829364,"ow
ners_count":17530663,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-04T19:01:20.769Z","updated_at":"2025-12-18T11:07:33.771Z","avatar_url":"https://github.com/duanhongyi.png","language":"Python","funding_links":[],"categories":["Python","Natural Language Processing","Chinese NLP Toolkits 中文NLP工具"],"sub_categories":["General-Purpose Machine Learning","Chinese Word Segment 中文分词"],"readme":"Genius\n========\nGenius is an open-source Python component for Chinese word segmentation, based on the CRF (Conditional Random Field) algorithm.\n\nFeatures\n========\n\n* Supports Python 2.x, Python 3.x, and PyPy 2.x.\n* Supports simple pinyin segmentation.\n* Supports user-defined break rules.\n* Supports user-defined combine (merge) dictionaries.\n* Supports part-of-speech tagging.\n\nSource Install\n==========\n* Install git: 1) Ubuntu or Debian: `apt-get install git` 2) Fedora or Red Hat: `yum install git`\n* Download the code: `git clone https://github.com/duanhongyi/genius.git`\n* Install the code: `python setup.py install`\n\nPyPI Install\n============\n* Run `easy_install genius` or `pip install genius`\n\n\nAlgorithm\n==========\n* Uses a trie to look up entries in the combine dictionary.\n* Implements CRF segmentation on top of wapiti.\n* The default dictionaries can be overridden through `genius.loader.ResourceLoader`.\n\nFunction 1): Segmentation with the `genius.seg_text` method\n==============\n\n* The `genius.seg_text` function accepts 5 parameters, of which only `text` is required:\n* `text`: the first parameter, the text to segment.\n* `use_break`: whether to apply break processing to the segmentation result; default `True`.\n* `use_combine`: whether to merge words using the dictionary; default `False`.\n* `use_tagging`: whether to perform part-of-speech tagging; default `True`.\n* `use_pinyin_segment`: whether to segment pinyin; default `True`.\n\nCode example (segmentation with all features enabled)\n\n    #encoding=utf-8\n    import genius\n    text = u\"\"\"昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。\"\"\"\n    seg_list = genius.seg_text(\n        text,\n        use_combine=True,\n        use_pinyin_segment=True,\n        use_tagging=True,\n        use_break=True\n    )
\n    print('\\n'.join(['%s\\t%s' % (word.text, word.tagging) for word in seg_list]))\n\nFunction 2): Index-oriented segmentation\n==============\n* The `genius.seg_keywords` method is intended for building search-engine indexes and keeps ambiguous segmentations; only `text` is required.\n* `text`: the first parameter, the text to segment.\n* `use_break`: whether to apply break processing to the segmentation result; default `True`.\n* `use_tagging`: whether to perform part-of-speech tagging; default `False`.\n* `use_pinyin_segment`: whether to segment pinyin; default `False`.\n* Because merging conflicts with the purpose of this method, it does not offer a combine option. If you build an index with this method, calling `genius.seg_text` with `use_combine=True` at query time is not recommended.\n\nCode example\n\n    #encoding=utf-8\n    import genius\n\n    seg_list = genius.seg_keywords(u'南京市长江大桥')\n    print('\\n'.join([word.text for word in seg_list]))\n\nFunction 3): Keyword extraction\n==============\n* The `genius.extract_tag` method is intended for extracting tag keywords; only `text` is required.\n* `text`: the first parameter, the text to segment.\n* `use_break`: whether to apply break processing to the segmentation result; default `True`.\n* `use_combine`: whether to merge words using the dictionary; default `False`.\n* `use_pinyin_segment`: whether to segment pinyin; default `False`.\n\nCode example\n\n    #encoding=utf-8\n    import genius\n\n    tag_list = genius.extract_tag(u'南京市长江大桥')\n    print('\\n'.join(tag_list))\n\nOther notes 4):\n=================\n* The segmentation corpus currently comes from the January 1998 issues of People's Daily, so segmentation is most accurate on news-style text.\n* CRF segmentation quality depends heavily on the category and coverage of the training corpus; with a better corpus, both segmentation and tagging accuracy could improve considerably.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduanhongyi%2Fgenius","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fduanhongyi%2Fgenius","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduanhongyi%2Fgenius/lists"}