{"id":18600593,"url":"https://github.com/houbb/pinyin","last_synced_at":"2025-04-05T00:06:17.024Z","repository":{"id":43179146,"uuid":"234240960","full_name":"houbb/pinyin","owner":"houbb","description":"The high performance pinyin tool for java.(java 高性能中文转拼音工具。支持同音字。) ","archived":false,"fork":false,"pushed_at":"2023-03-27T10:12:12.000Z","size":1805,"stargazers_count":262,"open_issues_count":12,"forks_count":38,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-28T23:04:12.163Z","etag":null,"topics":["dfa","high-performance","nlp","pinyin","pinyin-analysis","pinyin-data","pinyin-segmentation","pinyin4j","segment","tiny","tiny-pinyin","tongyinzi"],"latest_commit_sha":null,"homepage":"https://houbb.github.io/opensource/pinyin","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/houbb.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-01-16T05:15:22.000Z","updated_at":"2025-03-18T14:46:38.000Z","dependencies_parsed_at":"2024-01-16T09:52:26.802Z","dependency_job_id":"ed30e52c-0158-45c9-90c9-4dcb04dcad9a","html_url":"https://github.com/houbb/pinyin","commit_stats":{"total_commits":61,"total_committers":4,"mean_commits":15.25,"dds":"0.16393442622950816","last_synced_commit":"839e4a972b539d480101841cd663f4f773d3a4cb"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houbb%2Fpinyin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houbb%2Fpinyin/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/houbb%2Fpinyin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houbb%2Fpinyin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/houbb","download_url":"https://codeload.github.com/houbb/pinyin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266563,"owners_count":20910836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dfa","high-performance","nlp","pinyin","pinyin-analysis","pinyin-data","pinyin-segmentation","pinyin4j","segment","tiny","tiny-pinyin","tongyinzi"],"created_at":"2024-11-07T02:04:34.751Z","updated_at":"2025-04-05T00:06:17.010Z","avatar_url":"https://github.com/houbb.png","language":"Java","readme":"# pinyin\n\n[pinyin](https://github.com/houbb/pinyin) is a high-performance Chinese-to-pinyin conversion tool implemented in Java.\n\n[![Build Status](https://travis-ci.com/houbb/segment.svg?branch=master)](https://travis-ci.com/houbb/pinyin)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.houbb/pinyin/badge.svg)](http://mvnrepository.com/artifact/com.github.houbb/pinyin)\n[![](https://img.shields.io/badge/license-Apache2-FF0080.svg)](https://github.com/houbb/pinyin/blob/master/LICENSE.txt)\n[![Open Source Love](https://badges.frapsoft.com/os/v2/open-source.svg?v=103)](https://github.com/houbb/nlp-common)\n\n\u003e [Try it online](https://houbb.github.io/opensource/pinyin)\n\n## Motivation\n\nThe goal is a convenient, easy-to-use pinyin tool for Java.\n\n[How to design a high-performance pinyin conversion tool for Java (pinyin4j)](https://houbb.github.io/2020/01/09/how-to-design-pinyin4j)\n\n## Features\n\n- [About 1.6 times as fast as pinyin4j in the benchmark](#benchmark)\n\n- Minimal API design\n\n- Converts long text\n\n- Supports polyphonic characters\n\n- Supports multiple pinyin notation styles\n\n- Supports Chinese word segmentation\n\n- Supports both Simplified and Traditional Chinese\n\n- Supports custom pinyin dictionaries\n\n- Checks whether two characters are homophones\n\n- Supports homophone lookup\n\n### Main changes in v0.4.0\n\n- Updated dependency versions and removed console logging\n\n# Quick start\n\n## Requirements\n\nJDK 1.7+\n\n## Maven dependency\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.github.houbb\u003c/groupId\u003e\n    \u003cartifactId\u003epinyin\u003c/artifactId\u003e\n    \u003cversion\u003e0.4.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Getting started\n\nSee [PinyinHelperTest](https://github.com/houbb/pinyin/blob/master/src/test/java/com/github/houbb/pinyin/test/util/PinyinHelperTest.java)\n\n### Method overview\n\n| Method | Return type | Description |\n|:----|:----|:----|\n| toPinyin(String) | String | Converts text to pinyin |\n| toPinyin(String, PinyinStyleEnum) | String | Converts text to pinyin in the given style |\n| toPinyin(String, PinyinStyleEnum, String) | String | Converts text to pinyin in the given style, joined with the given connector |\n| toPinyinList(char) | List\u003cString\u003e | Returns all pinyin readings of a Chinese character |\n| toPinyinList(char, PinyinStyleEnum) | List\u003cString\u003e | Returns all pinyin readings of a Chinese character in the given style |\n| hasSamePinyin(char, char) | boolean | Checks whether two Chinese characters share a reading |\n| samePinyinMap(char) | Map\u003cString, List\u003cString\u003e\u003e | Returns the homophone map of a Chinese character; keys are pinyin in NUM_LAST style |\n| samePinyinList(String) | List\u003cString\u003e | Returns the homophones for a pinyin given in NUM_LAST style |\n\n### Converting Chinese to pinyin\n\nUse `PinyinHelper.toPinyin(string)` to convert Chinese text.\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"我爱中文\");\nAssert.assertEquals(\"wǒ ài zhōng wén\", pinyin);\n```\n\n### Listing the readings of a polyphonic character\n\nUse `PinyinHelper.toPinyinList(char)` to get the list of readings of a polyphonic character.\n\n```java\nList\u003cString\u003e pinyinList = PinyinHelper.toPinyinList('重');\nAssert.assertEquals(\"[zhòng, chóng, tóng]\", pinyinList.toString());\n```\n\n### Word segmentation\n\nChinese word segmentation is enabled by default and is transparent to the caller.\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"重庆火锅\");\nAssert.assertEquals(\"chóng qìng huǒ guō\", pinyin);\n\nString pinyin2 = PinyinHelper.toPinyin(\"分词也很重要\");\nAssert.assertEquals(\"fēn cí yě hěn zhòng yào\", pinyin2);\n```\n\n# Specifying the pinyin style\n\n## API\n\n```java\n/**\n * Converts text to pinyin.\n * @param string the original text\n * @param 
styleEnum the style enum\n * @return the result\n * @since 0.0.3\n */\npublic static String toPinyin(final String string, final PinyinStyleEnum styleEnum)\n```\n\n### PinyinStyleEnum styles\n\n| Enum | Description | Example |\n|:---|:---|:---|\n| `DEFAULT` | Default style: the tone mark is placed on the first letter of the final. | pīn yīn |\n| `NORMAL` | Plain style, without tone marks. | pin yin |\n| `NUM_LAST` | Numeric style: the tone is appended to each syllable as a digit from 1 to 5. | pin1 yin1 |\n| `FIRST_LETTER` | First-letter style: only the first letter of each syllable is returned. | p y |\n| `INPUT` | Keyboard input style: v is used in place of ü. | nv hai |\n\n## Test cases\n\n### DEFAULT\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"我爱中文\", PinyinStyleEnum.DEFAULT);\nAssert.assertEquals(\"wǒ ài zhōng wén\", pinyin);\n```\n\n### NORMAL\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"我爱中文\", PinyinStyleEnum.NORMAL);\nAssert.assertEquals(\"wo ai zhong wen\", pinyin);\n```\n\n### NUM_LAST\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"我爱中文\", PinyinStyleEnum.NUM_LAST);\nAssert.assertEquals(\"wo3 ai4 zhong1 wen2\", pinyin);\n```\n\n### FIRST_LETTER\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"我爱中文\", PinyinStyleEnum.FIRST_LETTER);\nAssert.assertEquals(\"w a z w\", pinyin);\n```\n\n### Specifying the connector\n\nSometimes callers want a specific connector between syllables.\n\n```java\nfinal String text = \"我爱中文\";\nAssert.assertEquals(\"wazw\", PinyinHelper.toPinyin(text, PinyinStyleEnum.FIRST_LETTER, StringUtil.EMPTY));\n```\n\nThe third parameter is a non-null string used as the connector between syllables (the default is a space).\n\n# More features\n\n## Homophone check\n\n`PinyinHelper.hasSamePinyin()` checks whether two Chinese characters are homophones, taking polyphonic characters into account.\n\n```java\nchar one = '花';\nchar two = '重';\nchar three = '中';\nchar four = '虫';\n\nAssert.assertFalse(PinyinHelper.hasSamePinyin(one, three));\nAssert.assertTrue(PinyinHelper.hasSamePinyin(two, three));\nAssert.assertTrue(PinyinHelper.hasSamePinyin(two, four));\n```\n\n## Traditional Chinese support\n\nThe library can convert Traditional Chinese text to pinyin.\n\nYou can also use [opencc4j](https://github.com/houbb/opencc4j) to convert the text to Simplified Chinese first and then look up the pinyin, which improves accuracy.\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"奮斗\");\nAssert.assertEquals(\"fèn dòu\", pinyin);\n```\n\n# Homophones\n\n## Homophone map\n\nReturns, for a Chinese character, the list of homophones for each of its readings.\n\n```java\nfinal char hanzi2 = 
'重';\nMap\u003cString,List\u003cString\u003e\u003e map2 = PinyinHelper.samePinyinMap(hanzi2);\n```\n\nThe resulting homophones are:\n\n```\n{tong2=[㠉, 㠽, 㣚, 㣠, 㤏, 㮔, 㸗, 㼧, 㼿, 䂈, 䆚, 䮵, 䳋, 䴀, 䶱, 仝, 佟, 侗, 偅, 僮, 勭, 同, 哃, 垌, 峂, 峒, 峝, 庝, 彤, 晍, 曈, 朣, 桐, 橦, 氃, 洞, 浵, 湩, 潼, 烔, 燑, 爞, 犝, 狪, 獞, 痌, 眮, 瞳, 砼, 硐, 硧, 秱, 穜, 童, 筒, 筩, 粡, 絧, 膧, 艟, 茼, 蚒, 蜼, 蟲, 衕, 詷, 赨, 酮, 重, 鉖, 鉵, 銅, 铜, 餇, 鮦, 鲖, 鼕, 𠖄, 𡦜, 𢈉, 𢏕, 𢓘, 𣑸, 𣪯, 𤱇, 𤺄, 𥩌, 𥫂, 𦏆, 𦒍, 𦨴, 𧇌, 𧊚, 𧋒, 𧋚, 𧌝, 𧳆, 𨚯, 𨜳, 𨝯, 𨠌, 𩍅, 𩩅, 𩻡, 𪀭, 𫍣], zhong4=[㐺, 㲴, 㼿, 䱰, 中, 乑, 仲, 众, 偅, 堹, 妕, 媑, 狆, 眾, 祌, 种, 種, 穜, 筗, 緟, 茽, 蚛, 蟲, 衆, 衶, 衷, 褈, 諥, 踵, 重, 𠱧, 𡥿, 𢝆, 𣱧, 𤚏, 𥻝, 𦌋, 𦔉, 𧬤, 𧳮, 𨉢, 𩾋, 𩿀], chong2=[㓽, 㹐, 䌬, 䖝, 䳯, 崇, 崈, 漴, 烛, 爞, 痋, 种, 種, 緟, 茧, 虫, 蝩, 蟲, 褈, 酮, 重, 隀, 𡿂, 𢖄, 𢝈, 𣐯, 𧝎, 𨛱, 𩅃, 𩌨, 𩜖, 𩞉, 𩞋]}\n```\n\nEach reading is a key, and its homophones are the list value.\n\nSometimes you only need the homophones for one specific pinyin.\n\n## Homophone list\n\n```java\nfinal String pinyinNumLast = \"zhong4\";\nList\u003cString\u003e pinyinList = PinyinHelper.samePinyinList(pinyinNumLast);\n```\n\nResult:\n\n```\n[㐺, 㲴, 㼿, 䱰, 中, 乑, 仲, 众, 偅, 堹, 妕, 媑, 狆, 眾, 祌, 种, 種, 穜, 筗, 緟, 茽, 蚛, 蟲, 衆, 衶, 衷, 褈, 諥, 踵, 重, 𠱧, 𡥿, 𢝆, 𣱧, 𤚏, 𥻝, 𦌋, 𦔉, 𧬤, 𧳮, 𨉢, 𩾋, 𩿀]\n```\n\n# Custom pinyin dictionaries\n\nThe built-in dictionary cannot cover every scenario, so this tool lets you define your own pinyin dictionaries.\n\n## Custom pinyin for single characters\n\n### Custom dictionary\n\nDefine the contents of `resources/pinyin_dict_char_define.txt` in the following format:\n\n```\n莪:wǒ\n噯:ài,āi,ǎi\n```\n\nThe character and its pinyin are separated by an ASCII `:`; multiple readings of a polyphonic character are separated by an ASCII `,`.\n\n### Test\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"莪\");\nAssert.assertEquals(\"wǒ\", pinyin);\n```\n\n## Custom pinyin for phrases\n\n### Custom dictionary\n\nDefine the contents of `resources/pinyin_dict_phrase_define.txt` in the following format:\n\n```\n褈慶炎鍋:chóng qìng huǒ guō\n```\n\n### Test\n\nTake a string of Martian script (火星文, stylized variant characters) as an example.\n\n```java\nString pinyin = PinyinHelper.toPinyin(\"莪噯褈慶炎鍋\");\nAssert.assertEquals(\"wǒ ài chóng qìng huǒ guō\", pinyin);\n```\n\n## Notes\n\n1. Custom pinyin is supported for Chinese only.\n\n2. 
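To keep behavior consistent, if you define a custom reading for a Traditional character (or phrase), the corresponding Simplified form also takes the custom reading. \n\n# Benchmark\n\nSee [BenchmarkTest.java](https://github.com/houbb/pinyin/blob/master/src/test/java/com/github/houbb/pinyin/test/benchmark/BenchmarkTest.java) for the test code.\n\nThe comparison uses the same machine, the same test text, and the same number of iterations.\n\nBoth libraries were warmed up beforehand; the figures are for reference only.\n\nThe pinyin4j version compared against is v2.5.1.\n\n## Single character\n\n| Function | Iterations | Input | Time |\n|:---|:---|:---|:---|\n| `Pinyin4j toHanyuPinyinStringArray()` | 1,000,000 | One character chosen at random from the same text | 650 ms |\n| `pinyin toPinyin()` | 1,000,000 | One character chosen at random from the same text | 410 ms |\n\n## Strings\n\n| Function | Iterations | Input | Time |\n|:---|:---|:---|:---|\n| `Pinyin4j toHanyuPinyinString()` | 10,000 | The same long text | 26324 ms |\n| `pinyin toPinyin()` | 10,000 | The same long text | 16260 ms |\n| `pinyin toPinyin()` | 10,000 | The same long text, chars segmentation mode | 14804 ms |\n\npinyin4j's string conversion does not support word segmentation; even with segmentation enabled, this project is about 1.6 times as fast as pinyin4j in the measurements above (16260 ms versus 26324 ms).\n\n# Acknowledgements\n\n[pinyin-data](https://github.com/mozillazg/pinyin-data) and [phrase-pinyin-data](https://github.com/mozillazg/phrase-pinyin-data) for the pinyin data.\n\n[segment](https://github.com/houbb/segment) for Chinese word segmentation.\n\n# NLP open-source family\n\n[pinyin: Chinese characters to pinyin](https://github.com/houbb/pinyin)\n\n[pinyin2hanzi: pinyin to Chinese characters](https://github.com/houbb/pinyin2hanzi)\n\n[segment: high-performance Chinese word segmentation](https://github.com/houbb/segment)\n\n[opencc4j: Traditional/Simplified Chinese conversion](https://github.com/houbb/opencc4j)\n\n[nlp-hanzi-similar: Chinese character similarity](https://github.com/houbb/nlp-hanzi-similar)\n\n[word-checker: spell checking](https://github.com/houbb/word-checker)\n\n[sensitive-word: sensitive-word filtering](https://github.com/houbb/sensitive-word)\n    \n# Future Road-Map\n\n- [x] Keyboard-input pinyin style\n\n- [x] Custom segmentation configuration exposed through the bootstrap class\n\n- [x] Return homophone lists\n\n- [ ] Return lists of rhyming characters\n\n- [ ] Near-homophones\n\n- [ ] Pinyin to Chinese characters\n","funding_links":[],"categories":["人工智能"],"sub_categories":["自然语言处理"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoubb%2Fpinyin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhoubb%2Fpinyin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoubb%2Fpinyin/lists"}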