{"id":33209143,"url":"https://github.com/jannson/yaha","last_synced_at":"2025-11-21T04:03:25.186Z","repository":{"id":62590373,"uuid":"12047992","full_name":"jannson/yaha","owner":"jannson","description":"yaha","archived":false,"fork":false,"pushed_at":"2018-09-13T16:35:40.000Z","size":4625,"stargazers_count":266,"open_issues_count":5,"forks_count":117,"subscribers_count":38,"default_branch":"master","last_synced_at":"2025-03-30T09:13:16.409Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jannson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-08-12T04:36:38.000Z","updated_at":"2025-01-21T04:04:56.000Z","dependencies_parsed_at":"2022-11-04T08:09:13.650Z","dependency_job_id":null,"html_url":"https://github.com/jannson/yaha","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jannson/yaha","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jannson%2Fyaha","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jannson%2Fyaha/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jannson%2Fyaha/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jannson%2Fyaha/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jannson","download_url":"https://codeload.github.com/jannson/yaha/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jannson%2Fyaha/sbom","scorecard":{"id":505450,"data":{"date":"2025-08-11","repo":{"n
ame":"github.com/jannson/yaha","commit":"b3d1f0278617705915cfd79e4b716205a40eccb4"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.6,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source 
repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations 
found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-19T23:09:45.026Z","repository_id":62590373,"created_at":"2025-08-19T23:09:45.026Z","updated_at":"2025-08-19T23:09:45.026Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285553649,"owners_count":27191359,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-21T02:00:06.175Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-16T11:00:19.859Z","updated_at":"2025-11-21T04:03:25.180Z","avatar_url":"https://github.com/jannson.png","language":"Python","readme":"Yaha Chinese Word Segmentation\n========\n\"Yaha\" (哑哈) is a Chinese word segmenter: faster or more accurate, you decide. With simple customization, you can efficiently tailor the segmentation module to your needs.\n\nPS. 
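Here is [crfseg](https://github.com/jannson/crfseg), a wrapper around crf++. It is still rough, but suitable for learning.\n\nWord generation (NEWS!)\n========\nWord discovery used to be implemented in extra/seqword.cpp. It has since been upgraded, optimized, and split out into its own project: [project page](https://github.com/jannson/wordmaker)\n\nUsing multithreading and a MapReduce-like approach, it can process 50M+ of text and automatically extract technical terms, personal names, place names, and other words from it. The extracted words can then be added to the segmentation library's dictionary.\n\n\nInstallation\n======\npip install yaha\n\nQQ discussion group (also the discussion group for the vxworks-kernel-like project): 2749-83126\n\nOnline demo\n========\nDeployed on GAE: http://yahademo.appspot.com\n\nDeployed on SAE: http://yaha.sinaapp.com\n\nThe original address is no longer in use: http://yaha.v-find.com/\n\nExample code: https://github.com/jannson/yaha/blob/master/tests/test_cuttor.py\n\n\nFeature\n========\n* Basic features:\n  * Precise mode: cuts the sentence into the most reasonable words.\n  * Full mode: emits every possible word without resolving ambiguity.\n  * Search engine mode: builds on precise mode by further splitting long words to improve recall; suited to building search engine indexes.\n  * Alternative paths: generates the several best segmentation paths, from which a more accurate segmentation can be derived using other information.\n\n* Available plugins:\n  * Regular expression plugin\n  * Personal-name prefix plugin\n  * Place-name suffix plugin\n  * Customization: segmentation proceeds in 4 stages, and custom logic can be added at each stage.\n\n* Additional features:\n  * New word learning: given a large block of text, learns the new and existing words it contains. (A C++ version of maximum-entropy new word discovery, implemented by a friend of mine, has been added; it is roughly 10 times faster than the Python version.)\n  * Keyword extraction from large texts.\n  * Summary extraction from large texts.\n  * Word correction (new! commonly used in search to correct users' mistyped input)\n  * User-defined dictionaries (TODO: not yet implemented well)\n\n\n\nAlgorithm\n=========\n* The core segments a sentence by finding its maximum-probability path.\n* While preserving efficiency, each stage of segmentation is defined separately so users can easily add their own segmentation methods (regex, name prefixes, and place-name suffixes are provided by default).\n* Users can choose dynamic programming or Dijkstra's algorithm to obtain the best one or more paths, and can then use part-of-speech tags (the approach taken by ICTCLAS) or other information to select the optimal path.\n* Uses a \"maximum entropy\" algorithm to discover new words in large texts, which is well suited to building custom dictionaries or to data mining in settings such as SNS.\n* Compared with the existing jieba segmenter, it drops the memory-hungry Trie structure as well as the HMM model, whose new word discovery is not very strong (this model may be added to the module as an optional plugin in the future).\n\n\nStage walkthrough\n========\n* Stage 1 runs during sentence splitting: regular expressions directly split numbers or English words into standalone words, and these words take no part in later segmentation steps.\n* Stage 2 runs before the directed acyclic graph is built: it pre-scans the clause, adds words that might be formed, and assigns them a probability.\n* Stage 3 runs while the directed acyclic graph is being built: it obtains word probabilities from the dictionary, or finds possible words through matching patterns, and assigns them a probability.\n* Stage 4 runs after the maximum-probability path through the directed acyclic graph has been found (implemented as a shortest path in the code): it further processes single characters that could not form words; or, once the several shortest paths are available, picks the final path according to the user's interests. Users who are interested can implement part-of-speech analysis at this step.\n\n\nCurrent status\n========\nIt has been in continuous use and seems to work without problems. Finally, thanks to the author of jieba; the current dictionary was copied directly from the jieba project.\n","funding_links":[],"categories":["Chinese NLP Toolkits 中文NLP工具"],"sub_categories":["Chinese Word Segment 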
中文分词"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjannson%2Fyaha","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjannson%2Fyaha","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjannson%2Fyaha/lists"}