{"id":44689721,"url":"https://github.com/Sweetiee-yi/Jaba","last_synced_at":"2026-03-13T06:00:57.750Z","repository":{"id":40617099,"uuid":"161049853","full_name":"Sweetiee-yi/Jaba","owner":"Sweetiee-yi","description":"结巴分词(java版) ","archived":false,"fork":false,"pushed_at":"2020-10-13T11:11:59.000Z","size":2147,"stargazers_count":95,"open_issues_count":2,"forks_count":25,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-10-20T23:18:31.304Z","etag":null,"topics":["java","jieba"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sweetiee-yi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-12-09T14:51:04.000Z","updated_at":"2023-07-20T02:21:24.000Z","dependencies_parsed_at":"2022-08-27T01:00:20.907Z","dependency_job_id":null,"html_url":"https://github.com/Sweetiee-yi/Jaba","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"purl":"pkg:github/Sweetiee-yi/Jaba","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sweetiee-yi%2FJaba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sweetiee-yi%2FJaba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sweetiee-yi%2FJaba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sweetiee-yi%2FJaba/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sweetiee-yi","download_url":"https://codeload.github.com/Sweetiee-yi/Jaba/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sweetiee-yi%2FJaba/
sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30459760,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T03:55:51.346Z","status":"ssl_error","status_checked_at":"2026-03-13T03:55:33.055Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","jieba"],"created_at":"2026-02-15T07:00:30.141Z","updated_at":"2026-03-13T06:00:57.743Z","avatar_url":"https://github.com/Sweetiee-yi.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":["自然语言处理"],"readme":"Jieba word segmentation (Java version): jaba\n===============================\n\n\nThanks to [fxsjy](https://github.com/fxsjy), the original author of jieba. This project implements a Java version of jieba.\n\nWhy this project exists: the segmentation results of [jieba-analysis](https://github.com/huaban/jieba-analysis) differ from those of the Python version, and it also converts all English letters to lowercase. I therefore reimplemented jieba in Java so that the segmentation results match the Python version, and segmentation is twice as fast (not counting dictionary loading time).\n\n\nIntroduction\n====\n\nSupported segmentation modes\n------------\n\n-   CUT: precise mode, which tries to split the sentence as accurately as possible; suitable for text analysis.\n-   CUT_ALL: full mode, which scans out every possible word in the sentence; very fast, but cannot resolve ambiguity.\n-   CUT_WITHOUT_HMM: precise mode, but without using the HMM to recognize out-of-vocabulary words.\n\n\nKeyword extraction\n------------\n\nTF-IDF keyword extraction is now supported, with results consistent with the Python version. By default, the top 20 keywords are extracted.\n``` java\nTFIDFAnalyzer.getInstance().extractTags(List\u003cString\u003e words, int topK)\nTFIDFAnalyzer.getInstance().extractTags(String sentence, int topK)\n```\nTo use a custom IDF file or stop-word file:\n```java\nTFIDFAnalyzer.getInstance().loadStopWords(InputStream resourceStream) 
\nTFIDFAnalyzer.getInstance().loadIdfMap(InputStream resourceStream) \n```\n\nUsage\n========\n\n-   Demo\n\n``` {.java}\n\n@Test\npublic void testDemo() {\n    Jaba jaba = Jaba.getInstance();\n    String[] sentences =\n            new String[] {\"这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。\", \"我不喜欢日本和服。\", \"雷猴回归人间。\",\n                    \"工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作\", \"结果婚的和尚未结过婚的\"};\n    for (String sentence : sentences) {\n        System.out.println(jaba.cut(sentence, CutModeEnum.CUT).toString());\n    }\n\n    // TF-IDF keyword extraction\n    String sentence = \"此外，公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元，增资后，吉林欧亚置业注册资本由7000万元增加到5亿元。吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城市商业综合体项目。2013年，实现营业收入0万元，实现净利润-139.13万元。\";\n    TFIDFAnalyzer.getInstance().extractTags(sentence).forEach(System.out::println);\n}\n```\n\nAlgorithm\n=================\n\n-   \\[ \\] The dictionary is stored in an `AhoCorasickDoubleArrayTrie` structure, which performs better than a plain `trie`\n-   \\[ \\] Build a directed acyclic graph (`DAG`) of all possible segmentations\n-   \\[ \\] Use dynamic programming to compute the best segmentation\n-   \\[ \\] Recognize out-of-vocabulary words with an `HMM` model, using the `Viterbi` algorithm\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSweetiee-yi%2FJaba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSweetiee-yi%2FJaba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSweetiee-yi%2FJaba/lists"}