{"id":28558609,"url":"https://github.com/yongzhuo/char-similar","last_synced_at":"2026-03-06T14:12:26.367Z","repository":{"id":223466621,"uuid":"760401254","full_name":"yongzhuo/char-similar","owner":"yongzhuo","description":"字符相似度, 汉字字形/拼音/语义相似度(单字, 可用于数据增强, CSC错别字检测识别任务(构建混淆集)) Chinese character font/pinyin/semantic similarity (single character, can be used for data augmentation, CSC misclassified character detection and recognition tasks (building confusion sets))","archived":false,"fork":false,"pushed_at":"2025-07-05T01:55:45.000Z","size":4886,"stargazers_count":17,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-02T09:57:01.465Z","etag":null,"topics":["char-similar","chinese-spelling-correction","corpus","csc","dataset","similarity","text-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yongzhuo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-20T10:57:33.000Z","updated_at":"2025-08-26T01:42:11.000Z","dependencies_parsed_at":"2024-02-20T12:25:42.909Z","dependency_job_id":"d3c09067-aad9-4d7a-a830-868baecd5b62","html_url":"https://github.com/yongzhuo/char-similar","commit_stats":null,"previous_names":["yongzhuo/char-similar"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/yongzhuo/char-similar","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yongzhuo%2Fchar-similar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yongzhuo%2Fchar-similar
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yongzhuo%2Fchar-similar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yongzhuo%2Fchar-similar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yongzhuo","download_url":"https://codeload.github.com/yongzhuo/char-similar/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yongzhuo%2Fchar-similar/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30180644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T12:39:21.703Z","status":"ssl_error","status_checked_at":"2026-03-06T12:36:09.819Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["char-similar","chinese-spelling-correction","corpus","csc","dataset","similarity","text-similarity"],"created_at":"2025-06-10T08:09:28.105Z","updated_at":"2026-03-06T14:12:26.340Z","avatar_url":"https://github.com/yongzhuo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# char-similar\n\u003e\u003e\u003e Chinese character shape/pinyin/semantic similarity (single characters; useful for data augmentation and for CSC typo detection and recognition tasks (building confusion sets))\n\n# 1. Installation\n```\n0. Notes\n   The numpy version is not pinned by default (the reference version is numpy==1.22.4); versions that are too new or too old may not be supported\n   The reference versions of the dependencies are listed in requirements-all.txt\n   \n1. 
Install from PyPI\n   pip install char-similar\n   Or use a mirror, e.g.:\n   pip install -i https://pypi.tuna.tsinghua.edu.cn/simple char-similar\n```\n\n# 2. Usage\n\n## 2.1 Quick start\n```python3\nfrom char_similar import std_cal_sim\nchar1 = \"我\"\nchar2 = \"他\"\nres = std_cal_sim(char1, char2)\nprint(res)\n# output:\n# 0.5821\n```\n\n## 2.2 Detailed usage\n```python3\nfrom char_similar import std_cal_sim\n# \"all\" (shape:pinyin:meaning = 1:1:1)  # \"w2v\" (shape:meaning = 1:1)  # \"pinyin\" (shape:pinyin = 1:1)  # \"shape\" (shape only)\nkind = \"shape\"\nrounded = 4  # number of decimal places to keep\nchar1 = \"我\"\nchar2 = \"他\"\nres = std_cal_sim(char1, char2, rounded=rounded, kind=kind)\nprint(res)\n# output:\n# 0.5821\n```\n\n\n## 2.3 Multithreaded usage\n```python3\nfrom char_similar import pool_cal_sim\n# \"all\" (shape:pinyin:meaning = 1:1:1)  # \"w2v\" (shape:meaning = 1:1)  # \"pinyin\" (shape:pinyin = 1:1)  # \"shape\" (shape only)\nkind = \"shape\"\nrounded = 4  # number of decimal places to keep\nchar1 = \"我\"\nchar2 = \"他\"\nres = pool_cal_sim(char1, char2, rounded=rounded, kind=kind)\nprint(res)\n# output:\n# 0.5821\n```\n\n\n## 2.4 Multiprocess usage (not recommended; this implementation is slow)\n```python3\nif __name__ == '__main__':\n    from char_similar import multi_cal_sim\n    # \"all\" (shape:pinyin:meaning = 1:1:1)  # \"w2v\" (shape:meaning = 1:1)  # \"pinyin\" (shape:pinyin = 1:1)  # \"shape\" (shape only)\n    kind = \"shape\"\n    rounded = 4  # number of decimal places to keep\n    char1 = \"我\"\n    char2 = \"他\"\n    res = multi_cal_sim(char1, char2, rounded=rounded, kind=kind)\n    print(res)\n    # output:\n    # 0.5821\n```\n\n# 3. How it works\n```\nchar-similar was originally built to compute the shape (glyph) similarity of two Chinese characters (for building CSC confusion sets); pinyin similarity, semantic similarity, frequency similarity... were added later. See the source code for details.\n\n# Four-corner code (code=4, 5 digits in total): number of matching digits among the four / 4\n# Radical: 1 if identical\n# Frequency (log10): using log10 frequencies from the large-scale macropodus corpus, 1 - (absolute difference / larger of the two values)\n# Stroke count: 1 - (absolute difference / larger of the two values)\n# Component decomposition: set intersection / set union\n# Structure (character layout): 1 if identical\n# Stroke order (actually the smallest set): set intersection / set union\n# Pinyin (code=4, 4 fields in total): number of matching fields among the four (pinyin / initial / final / tone) / 4\n# Word vectors: char-word2vec, cosine\n```\n\n\n# 4. References (some dictionaries come from the following projects)\n - [https://github.com/contr4l/SimilarCharacter](https://github.com/contr4l/SimilarCharacter)\n - [https://github.com/houbb/nlp-hanzi-similar](https://github.com/houbb/nlp-hanzi-similar)\n - 
[https://github.com/mozillazg/python-pinyin](https://github.com/mozillazg/python-pinyin)\n - [https://github.com/CNMan/UnicodeCJK-WuBi](https://github.com/CNMan/UnicodeCJK-WuBi)\n - [https://github.com/yongzhuo/Macropodus](https://github.com/yongzhuo/Macropodus)\n - [https://github.com/kfcd/chaizi](https://github.com/kfcd/chaizi)\n \n\n\n# Reference\nFor citing this work, you can refer to the present GitHub project. For example, with BibTeX:\n```\n@misc{Macropodus,\n    howpublished = {https://github.com/yongzhuo/char-similar},\n    title = {char-similar},\n    author = {Yongzhuo Mo},\n    publisher = {GitHub},\n    year = {2024}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyongzhuo%2Fchar-similar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyongzhuo%2Fchar-similar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyongzhuo%2Fchar-similar/lists"}