https://github.com/gowee/zhconv-rs
🦀Fastest ever Trad/Simp and regional Chinese variants converter | 中文简繁及地區詞轉換
chinese chinese-translation mediawiki opencc simplified-chinese traditional-chinese wikipedia
- Host: GitHub
- URL: https://github.com/gowee/zhconv-rs
- Owner: Gowee
- License: gpl-3.0
- Created: 2021-12-03T12:47:56.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-01-15T05:46:43.000Z (9 months ago)
- Last Synced: 2025-04-06T02:07:19.361Z (6 months ago)
- Topics: chinese, chinese-translation, mediawiki, opencc, simplified-chinese, traditional-chinese, wikipedia
- Language: Rust
- Homepage: https://zhconv.pages.dev
- Size: 18.4 MB
- Stars: 37
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
[CI](https://github.com/Gowee/zhconv-rs/actions)
[docs.rs](https://docs.rs/zhconv)
[crates.io](https://crates.io/crates/zhconv)
[PyPI](https://pypi.org/project/zhconv-rs/)
[npm](https://www.npmjs.com/package/zhconv)

# zhconv-rs 中文简繁及地區詞轉換
zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. `zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant`), backed by rulesets from MediaWiki/Wikipedia and OpenCC.
It leverages the [Aho-Corasick](https://github.com/daac-tools/daachorse) algorithm to achieve linear time complexity with respect to the length of the input text and of the conversion rules (`O(n+m)`), processing dozens of MiB of text per second.
🔗 **Web app (Wasm):** https://zhconv.pages.dev (w/ OpenCC dicts)

⚙️ **CLI**: `cargo install zhconv-cli` or check [releases](https://github.com/Gowee/zhconv-rs/releases).

🦀 **Rust crate**: `cargo add zhconv` (check [docs](https://docs.rs/zhconv/latest/zhconv/) for examples)

🐍 **Python package w/ wheels**: `pip install zhconv-rs` or `pip install zhconv-rs-opencc` (w/ OpenCC dicts)

🧩 **API demo**: https://zhconv.bamboo.workers.dev

Python snippet
```python
# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"
```

**JS (Webpack)**: `npm install zhconv` or `yarn add zhconv` (Wasm, [instructions](https://rustwasm.github.io/wasm-pack/book/tutorials/npm-browser-packages/using-your-library.html))

**JS in browser**: https://cdn.jsdelivr.net/npm/zhconv-web@latest (Wasm)
HTML snippet
```html
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>
```
## Supported variants
zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| -------------------------------------- | --------- | ------- | --------------------------------------------- |
| **S**implified **C**hinese / 简体中文 | `zh-Hans` | SC / 简 | W/O substituting region-specific phrases.     |
| **T**raditional **C**hinese / 繁體中文 | `zh-Hant` | TC / 繁 | W/O substituting region-specific phrases.     |
| Chinese (Taiwan) / 臺灣正體 | `zh-TW` | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | `zh-HK` | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | `zh-MO` | TC / 繁 | Same as `zh-HK` for now. |
| Chinese (Mainland China) / 大陆简体 | `zh-CN` | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | `zh-SG` | SC / 简 | Same as `zh-CN` for now. |
| Chinese (Malaysia) / 大马简体        | `zh-MY`   | SC / 简 | Same as `zh-CN` for now.                      |

*Note:* `zh-TW` and `zh-HK` are based on `zh-Hant`. `zh-CN` is based on `zh-Hans`. Currently, `zh-MO` shares the same rulesets as `zh-HK` unless additional rules are manually configured; `zh-MY` and `zh-SG` share the same rulesets as `zh-CN` unless additional rules are manually configured.
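
The relationships stated in the table and note can be written down as a small lookup. The sketch below is illustrative only: `BASED_ON`, `SHARES_RULESET_WITH`, and `ruleset_chain` are hypothetical names, not part of the zhconv API, and the returned chain describes the documented relationships rather than how the library actually loads its rulesets.

```python
# Hypothetical lookup tables transcribed from the table and note above;
# none of these names exist in zhconv itself.
BASED_ON = {"zh-TW": "zh-Hant", "zh-HK": "zh-Hant", "zh-CN": "zh-Hans"}
SHARES_RULESET_WITH = {"zh-MO": "zh-HK", "zh-SG": "zh-CN", "zh-MY": "zh-CN"}
SCRIPT_TAGS = ("zh-Hant", "zh-Hans")

# Accept tags case-insensitively, e.g. "zh-tw" as in the Python snippet.
_CANONICAL = {
    tag.lower(): tag
    for tag in (*SCRIPT_TAGS, *BASED_ON, *SHARES_RULESET_WITH)
}


def ruleset_chain(tag):
    """Return the rulesets a tag resolves to, base script first."""
    canonical = _CANONICAL.get(tag.lower())
    if canonical is None:
        raise ValueError(f"unsupported variant: {tag}")
    # Tags without rules of their own currently reuse another region's rules.
    effective = SHARES_RULESET_WITH.get(canonical, canonical)
    base = BASED_ON.get(effective)
    return [effective] if base is None else [base, effective]


for tag in ("zh-Hans", "zh-tw", "zh-MO", "zh-my"):
    print(tag, "->", ruleset_chain(tag))
```

Under these assumptions, `zh-MO` resolves to the `zh-Hant` rules followed by the `zh-HK` rules, which mirrors the "Same as `zh-HK` for now" entry in the table.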
## Performance
`cargo bench` results for v0.3 on an `AMD EPYC 7B13` (GitPod):

With default features:
```
load/zh2Hant time: [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans time: [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW time: [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK time: [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO time: [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN time: [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG time: [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY time: [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic time: [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic time: [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time: [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥 time: [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k time: [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k time: [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k time: [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k time: [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k time: [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k time: [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m time: [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k time: [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k time: [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k time: [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time: [60.205 ms 60.412 ms 60.627 ms]
```

With the additional non-default `opencc` feature:
```
load/zh2Hant time: [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans time: [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW time: [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK time: [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO time: [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN time: [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG time: [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY time: [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic time: [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic time: [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time: [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥 time: [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k time: [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k time: [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k time: [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k time: [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k time: [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k time: [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m time: [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k time: [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k time: [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k time: [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time: [74.878 ms 76.262 ms 77.818 ms]
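```

**Note:** Enabling the OpenCC rulesets increases the build size by several MiB and noticeably slows both loading and conversion (for example, `load/zh2Hant` goes from about 4.7 ms to about 22 ms in the listings above), although it still outperforms other implementations.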
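
To put a number on the cost of the `opencc` feature, the two listings can be compared row by row. The sketch below copies the middle value of a few brackets from each listing and divides them; the selection of rows and the names `ESTIMATES_MS` and `slowdown` are choices made for this illustration.

```python
# Middle value of each bracket in the two listings above, in milliseconds:
# (default features, with the opencc feature).
ESTIMATES_MS = {
    "load/zh2Hant": (4.6862, 22.338),
    "load/zh2Hans": (2.2891, 2.8126),
    "zh2TW data689k": (1.9790, 3.4896),
    "zh2TW data3185k": (15.464, 30.858),
    "zh2TW data55m": (330.53, 515.80),
}


def slowdown(default_ms, opencc_ms):
    """Return how many times slower the opencc build is for one benchmark."""
    return opencc_ms / default_ms


for name, (default_ms, opencc_ms) in ESTIMATES_MS.items():
    print(f"{name}: {slowdown(default_ms, opencc_ms):.2f}x")
```

For these rows the ratio ranges from roughly 1.2x (`load/zh2Hans`) to roughly 4.8x (`load/zh2Hant`), with the larger conversion benchmarks landing between about 1.6x and 2.0x.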
## Limitations
### Accuracy
A rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter uses a leftmost-longest matching strategy: it prefers the earliest match in the text and, among matches starting at the same position, the longest one. For instance, if a ruleset includes both `干 -> 幹` and `天干物燥 -> 天乾物燥`, the input `天干物燥` is converted to `天乾物燥`, because the longer phrase starts matching at `天`, before the single character `干` is reached. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
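
The leftmost-longest strategy can be illustrated with a deliberately naive scan. This is a sketch of the matching policy only, not the library's implementation: zhconv-rs uses an Aho-Corasick automaton, while the hypothetical `convert_leftmost_longest` below simply retries every candidate length at each position.

```python
def convert_leftmost_longest(text, rules):
    """Replace rule keys in text, scanning left to right and always taking
    the longest key that matches at the current position."""
    max_len = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer phrases win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            # No rule starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥
print(convert_leftmost_longest("干活", rules))      # 幹活
```

Because the phrase rule matches at `天`, the single-character rule for `干` never gets a chance inside `天干物燥`, while a standalone `干` is still converted. Note that this naive loop tests up to `m` substrings per position, so it runs in `O(n*m)` rather than the linear time of the automaton.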
### Wikitext support
While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.
For wikitext inputs containing global conversion rules (e.g., `-{H|zh-hans:鹿;zh-hant:马}-` in MediaWiki syntax), the implementation's time complexity may degrade to `O(n*m)` in the worst case, where `n` is the input text length and `m` is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.
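
To make the two kinds of markup concrete, here is a much-simplified sketch of how inline rules (`-{...}-`) and global rules (`-{H|...}-`) could be interpreted. It is not how zhconv-rs or MediaWiki implement conversion: the names `RULE`, `FALLBACK`, and `render` are hypothetical, the fallback order is an assumption made for this example, only the `H` flag is handled, and no script conversion is applied to the surrounding text.

```python
import re

# Matches -{body}- or -{flags|body}-; a simplification of the real syntax.
RULE = re.compile(r"-\{(?:(?P<flags>[A-Za-z;]*)\|)?(?P<body>.*?)\}-", re.S)

# ASSUMPTION: a minimal fallback order chosen for this example only.
FALLBACK = {
    "zh-cn": ("zh-cn", "zh-hans"),
    "zh-tw": ("zh-tw", "zh-hant"),
    "zh-hk": ("zh-hk", "zh-hant"),
}


def parse_body(body):
    """Split 'zh-cn:A;zh-tw:B;' into {'zh-cn': 'A', 'zh-tw': 'B'}."""
    pairs = {}
    for item in body.split(";"):
        variant, sep, text = item.partition(":")
        if sep:
            pairs[variant.strip().lower()] = text.strip()
    return pairs


def render(wikitext, target):
    """Resolve rule markup for one target variant (markup handling only)."""
    target = target.lower()
    global_rules = {}

    def handle(match):
        pairs = parse_body(match["body"])
        chosen = next(
            (pairs[t] for t in FALLBACK.get(target, (target,)) if t in pairs),
            None,
        )
        if "H" in (match["flags"] or ""):
            # Global rule: remember the mapping and emit nothing in place.
            if chosen is not None:
                for text in pairs.values():
                    if text != chosen:
                        global_rules[text] = chosen
            return ""
        return chosen if chosen is not None else match["body"]

    text = RULE.sub(handle, wikitext)
    # Apply collected global rules, longest source first.
    for source in sorted(global_rules, key=len, reverse=True):
        text = text.replace(source, global_rules[source])
    return text


print(render("《-{zh-hans:三个火枪手;zh-hant:三劍客}-》", "zh-cn"))
print(render("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;}-《雾都孤儿》", "zh-tw"))
```

The point of the sketch is that global rules extend the ruleset while the document is being processed, so a matcher built ahead of time from the builtin rules cannot account for them on its own.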
## Credits
Rulesets/Dictionaries: [MediaWiki](https://github.com/wikimedia/mediawiki) and [OpenCC](https://github.com/BYVoid/OpenCC).
References:
- https://github.com/gumblex/zhconv : Python implementation of `zhConver{ter,sion}.php`.
- https://github.com/BYVoid/OpenCC/ : Widely adopted Chinese converter.
- https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理
- https://zh.wikipedia.org/wiki/Help:高级字词转换语法
- https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php