{"id":15017708,"url":"https://github.com/gowee/zhconv-rs","last_synced_at":"2025-04-06T02:07:28.013Z","repository":{"id":39638309,"uuid":"434593979","full_name":"Gowee/zhconv-rs","owner":"Gowee","description":"🦀Fastest ever Trad/Simp and regional Chinese variants converter  | 中文简繁及地區詞轉換","archived":false,"fork":false,"pushed_at":"2025-01-15T05:46:43.000Z","size":19318,"stargazers_count":37,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-06T02:07:19.361Z","etag":null,"topics":["chinese","chinese-translation","mediawiki","opencc","simplified-chinese","traditional-chinese","wikipedia"],"latest_commit_sha":null,"homepage":"https://zhconv.pages.dev","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gowee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-03T12:47:56.000Z","updated_at":"2025-03-31T03:01:14.000Z","dependencies_parsed_at":"2024-05-15T18:23:46.933Z","dependency_job_id":"3d909d3b-59eb-4e82-b21a-bd5c17225cb9","html_url":"https://github.com/Gowee/zhconv-rs","commit_stats":{"total_commits":210,"total_committers":4,"mean_commits":52.5,"dds":0.01904761904761909,"last_synced_commit":"1d810a019b807e7d7afa58ad81a003a88a760296"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gowee%2Fzhconv-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gowee%2Fzhconv-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/Gowee%2Fzhconv-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gowee%2Fzhconv-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gowee","download_url":"https://codeload.github.com/Gowee/zhconv-rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247423513,"owners_count":20936626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","chinese-translation","mediawiki","opencc","simplified-chinese","traditional-chinese","wikipedia"],"created_at":"2024-09-24T19:50:53.057Z","updated_at":"2025-04-06T02:07:27.995Z","avatar_url":"https://github.com/Gowee.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![CI status](https://github.com/Gowee/zhconv-rs/actions/workflows/main.yml/badge.svg)](https://github.com/Gowee/zhconv-rs/actions)\n[![docs.rs](https://docs.rs/zhconv/badge.svg)](https://docs.rs/zhconv)\n[![Crates.io](https://img.shields.io/crates/v/zhconv.svg)](https://crates.io/crates/zhconv)\n[![PyPI version](https://img.shields.io/pypi/v/zhconv-rs)](https://pypi.org/project/zhconv-rs/)\n[![NPM version](https://badge.fury.io/js/zhconv.svg)](https://www.npmjs.com/package/zhconv)\n\n# zhconv-rs 中文简繁及地區詞轉換\n\nzhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. 
`zh-TW \u003c-\u003e zh-CN \u003c-\u003e zh-HK \u003c-\u003e zh-Hans \u003c-\u003e zh-Hant`), backed by rulesets from MediaWiki/Wikipedia and OpenCC.\n\nIt leverages the [Aho-Corasick](https://github.com/daac-tools/daachorse) algorithm for linear time complexity with respect to the length of input text and conversion rules (`O(n+m)`), processing dozens of MiBs text per second.\n\n🔗 **Web app (Wasm):** https://zhconv.pages.dev (w/ OpenCC dicts)\n\n⚙️ **Cli**: `cargo install zhconv-cli` or check [releases](https://github.com/Gowee/zhconv-rs/releases).\n\n🦀 **Rust crate**: `cargo add zhconv` (check [docs](https://docs.rs/zhconv/latest/zhconv/) for examples)\n\n🐍 **Python package w/ wheels**: `pip install zhconv-rs` or `pip install zhconv-rs-opencc` (w/ OpenCC dicts)\n\n\u003ca href=\"https://deploy.workers.cloudflare.com/?url=https://github.com/gowee/zhconv-rs\"\u003e\n    \u003cimg src=\"https://deploy.workers.cloudflare.com/button\" align=\"right\" alt=\"Deploy to Cloudflare Workers\"\u003e\n\u003c/a\u003e\n\n🧩 **API demo**: https://zhconv.bamboo.workers.dev\n\n\u003cdetails open\u003e\n \u003csummary\u003ePython snippet\u003c/summary\u003e\n\n```python\n# \u003e pip install zhconv_rs\n# Convert with builtin rulesets:\nfrom zhconv_rs import zhconv\nassert zhconv(\"天干物燥 小心火烛\", \"zh-tw\") == \"天乾物燥 小心火燭\"\nassert zhconv(\"霧失樓臺，月迷津渡\", \"zh-hans\") == \"雾失楼台，月迷津渡\"\nassert zhconv(\"《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。\", \"zh-cn\", mediawiki=True) == \"《三个火枪手》是亚历山大·仲马的作品。\"\nassert zhconv(\"-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。\", \"zh-tw\", True) == \"《孤雛淚》是查爾斯·狄更斯的作品。\"\n\n# Convert with custom rules:\nfrom zhconv_rs import make_converter\nassert make_converter(None, [(\"天\", \"地\"), (\"水\", \"火\")])(\"甘肅天水\") == \"甘肅地火\"\n\nimport io\nconvert = make_converter(\"zh-hans\", io.StringIO(\"䖏 处\\n罨畫 掩画\")) # or path to rule file\nassert convert(\"秀州西去湖州近 幾䖏樓臺罨畫間\") == \"秀州西去湖州近 
几处楼台掩画间\"\n```\n\n\u003c/details\u003e\n\n**JS (Webpack)**: `npm install zhconv` or `yarn add zhconv` (Wasm, [instructions](https://rustwasm.github.io/wasm-pack/book/tutorials/npm-browser-packages/using-your-library.html))\n\n**JS in browser**: https://cdn.jsdelivr.net/npm/zhconv-web@latest (Wasm)\n\n\u003cdetails\u003e\n \u003csummary\u003eHTML snippet\u003c/summary\u003e\n\n```html\n\u003cscript type=\"module\"\u003e\n    // Use ES module import syntax to import functionality from the module\n    // that we have compiled.\n    //\n    // Note that the `default` import is an initialization function which\n    // will \"boot\" the module and make it ready to use. Currently browsers\n    // don't support natively imported WebAssembly as an ES module, but\n    // eventually the manual initialization won't be required!\n    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod\n\n    async function run() {\n        await init();\n\n        alert(zhconv(prompt(\"Text to convert to zh-hans:\"), \"zh-hans\"));\n    }\n\n    run();\n\u003c/script\u003e\n```\n\n\u003c/details\u003e\n\n## Supported variants\n\n\u003cdetails\u003e\n \u003csummary\u003ezh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY\u003c/summary\u003e\n\n| Target                                 | Tag       | Script  | Description                                   |\n| -------------------------------------- | --------- | ------- | --------------------------------------------- |\n| **S**implified **C**hinese / 简体中文  | `zh-Hans` | SC / 简 | W/O substituting region-specific phrases.     |\n| **T**raditional **C**hinese / 繁體中文 | `zh-Hant` | TC / 繁 | W/O substituting region-specific phrases.     |\n| Chinese (Taiwan) / 臺灣正體            | `zh-TW`   | TC / 繁 | With Taiwan-specific phrases adapted.         |\n| Chinese (Hong Kong) / 香港繁體         | `zh-HK`   | TC / 繁 | With Hong Kong-specific phrases adapted.      
|\n| Chinese (Macau) / 澳门繁體             | `zh-MO`   | TC / 繁 | Same as `zh-HK` for now.                      |\n| Chinese (Mainland China) / 大陆简体    | `zh-CN`   | SC / 简 | With mainland China-specific phrases adapted. |\n| Chinese (Singapore) / 新加坡简体       | `zh-SG`   | SC / 简 | Same as `zh-CN` for now.                      |\n| Chinese (Malaysia) / 大马简体          | `zh-MY`   | SC / 简 | Same as `zh-CN` for now.                      |\n\n*Note:*  `zh-TW` and `zh-HK` are based on `zh-Hant`. `zh-CN` is based on `zh-Hans`. Currently, `zh-MO` shares the same rulesets as `zh-HK` unless additional rules are manually configured; `zh-MY` and `zh-SG` share the same rulesets as `zh-CN` unless additional rules are manually configured. \n\u003c/details\u003e\n\n## Performance\n\n`cargo bench` on `AMD EPYC 7B13` (GitPod) with v0.3:\n\n\u003cdetails\u003e\n\u003csummary\u003ew/ default features\u003c/summary\u003e\n\n```\nload/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]\nload/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]\nload/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]\nload/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]\nload/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]\nload/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]\nload/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]\nload/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]\nzh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]\nzh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]\nzh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]\nzh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]\nzh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]\nzh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]\nzh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]\nzh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]\nzh2Hant data3185k       time:   [15.199 
ms 15.260 ms 15.332 ms]\nzh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]\nzh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]\nis_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]\ninfer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]\nis_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]\ninfer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]\n``` \n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ew/ the additional non-default `opencc` feature\u003c/summary\u003e\n\n```\nload/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]\nload/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]\nload/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]\nload/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]\nload/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]\nload/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]\nload/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]\nload/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]\nzh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]\nzh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]\nzh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]\nzh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]\nzh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]\nzh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]\nzh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]\nzh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]\nzh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]\nzh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]\nzh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]\nis_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]\ninfer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]\nis_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]\ninfer_variant 
data3185k time:   [74.878 ms 76.262 ms 77.818 ms]\n```\n\n\u003c/details\u003e\n\u003c!--\n## Upstream rulesets\n\nzhconv-rs does not maintain any conversion rulesets/dicts. Instead, it relies on two upstream sources: MediaWiki and OpenCC. These rulesets are merged and compiled into an automaton at compile-time for optimal performance, which means rulesets cannot be dynamically selected at runtime. However, it is possible to load custom rulesets manually.\n\nBy default, only MediaWiki rulesets are used. For a Rust project, to enable additional OpenCC rulesets, activate the `opencc` feature: `zhconv = { version = \"...\", features = [ \"opencc\" ] }`. For a Python project, there are two standalone packages `zhconv-rs` (w/ MediaWiki rulesets only) and `zhconv-rs-opencc` (w/ additional OpenCC rulesets) to be installed as needed. For the API on Workers, check [worker.yml](.github/workflows/worker.yml) for instructions on configuring OpenCC rulesets. The web app is always shipped with additional OpenCC rulesets for now.--\u003e\n\n**Note:** Enabling OpenCC rulesets increases the build size by several MiB and noticeably impacts performance, although it still outperforms other implementations.\n\n\u003c!--\n## Differences from other converters\n* `ZhConver{sion,ter}.php` of MediaWiki: zhconv-rs just takes conversion tables listed in [`ZhConversion.php`](https://github.com/wikimedia/mediawiki/blob/master/includes/languages/data/ZhConversion.php#L14). MediaWiki relies on the inefficient PHP built-in function [`strtr`](https://github.com/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974). Under the basic mode, zhconv-rs guarantees linear time complexity (`T = O(n+m)` instead of `O(nm)`) and single-pass scanning of input text. 
Optionally, zhconv-rs supports the same conversion rule syntax as MediaWiki.\n* OpenCC: The [conversion rulesets](https://github.com/BYVoid/OpenCC/tree/master/data/dictionary) of OpenCC are independent of MediaWiki. The core [conversion implementation](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) of OpenCC is broadly similar to the aforementioned `strtr`. However, OpenCC supports pre-segmentation and maintains multiple rulesets which are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges rulesets from MediaWiki and OpenCC at compile time and converts text in single-pass linear time, which is considerably more efficient. Conversion results may differ in some cases, though.\n## Comparisons with other tools\n- OpenCC: Dict::MatchPrefix (iterating from maxlen to minlen character by character to match) [MatchPrefix](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Dict.cpp#L25), [segments converter](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) [segmentizer](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/MaxMatchSegmentation.cpp#L34)\n- zhConversion.php: strtr (iterating from maxlen to minlen for every known key length to match) [https://github.dev/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974]\n- zhconv-rs regex-based automaton\n--\u003e\n\n## Limitations\n\n### Accuracy\n\nA rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter employs a leftmost-longest matching strategy, prioritizing the earliest and longest matches in the text. For instance, if a ruleset includes both `干 -\u003e 幹` and `天干物燥 -\u003e 天乾物燥`, the converter applies the latter to the input `天干物燥`, because that match starts earlier than the match for `干` at a later position. 
This approach generally produces accurate results but may occasionally lead to incorrect conversions.\n\n### Wikitext support\n\nWhile the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.\n\nFor wikitext inputs containing global conversion rules (e.g., `-{H|zh-hans:鹿|zh-hant:马}-` in MediaWiki syntax), the implementation's time complexity may degrade to `O(n*m)` in the worst case, where `n` is the input text length and `m` is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.\n\n## Credits\n\nRulesets/Dictionaries: [MediaWiki](https://github.com/wikimedia/mediawiki) and [OpenCC](https://github.com/BYVoid/OpenCC).\n\nReferences:\n- https://github.com/gumblex/zhconv : Python implementation of `zhConver{ter,sion}.php`.\n- https://github.com/BYVoid/OpenCC/ : Widely adopted Chinese converter.\n- https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理\n- https://zh.wikipedia.org/wiki/Help:高级字词转换语法\n- https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php\n\u003c!--- https://www.hankcs.com/nlp/simplified-traditional-chinese-conversion.html--\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgowee%2Fzhconv-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgowee%2Fzhconv-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgowee%2Fzhconv-rs/lists"}