{"id":13958573,"url":"https://github.com/wenet-e2e/WeTextProcessing","last_synced_at":"2025-07-21T00:31:14.600Z","repository":{"id":58172339,"uuid":"528029978","full_name":"wenet-e2e/WeTextProcessing","owner":"wenet-e2e","description":"Text Normalization \u0026 Inverse Text Normalization","archived":false,"fork":false,"pushed_at":"2025-07-20T07:06:23.000Z","size":916,"stargazers_count":614,"open_issues_count":25,"forks_count":87,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-07-20T09:19:39.884Z","etag":null,"topics":["normalization","production-ready","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wenet-e2e.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-08-23T14:33:26.000Z","updated_at":"2025-07-20T07:12:36.000Z","dependencies_parsed_at":"2023-12-26T15:56:10.093Z","dependency_job_id":"b8dff921-b141-45c9-abf6-282df659ec48","html_url":"https://github.com/wenet-e2e/WeTextProcessing","commit_stats":{"total_commits":81,"total_committers":8,"mean_commits":10.125,"dds":0.4814814814814815,"last_synced_commit":"4dc6f875f742ace2c3a98ba83ecf0731a115190e"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/wenet-e2e/WeTextProcessing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWeTextProcessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWeTextProcessing/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWeTextProcessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWeTextProcessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wenet-e2e","download_url":"https://codeload.github.com/wenet-e2e/WeTextProcessing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wenet-e2e%2FWeTextProcessing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266101920,"owners_count":23876784,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["normalization","production-ready","text-processing"],"created_at":"2024-08-08T13:01:44.972Z","updated_at":"2025-07-21T00:31:12.539Z","avatar_url":"https://github.com/wenet-e2e.png","language":"Python","readme":"## Text Normalization \u0026 Inverse Text Normalization\r\n\r\n### 0. 
Brief Introduction\r\n\r\n```diff\r\n- **Must Read Doc** (In Chinese): https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ\r\n```\r\n\r\n[WeTextProcessing: Production First \u0026 Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)\r\n\r\n#### 0.1 Text Normalization\r\n\r\n\u003cdiv align=center\u003e\u003cimg src=\"https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png\" alt=\"Cover\" width=\"50%\"/\u003e\u003c/div\u003e\r\n\r\n#### 0.2 Inverse Text Normalization\r\n\r\n\u003cdiv align=center\u003e\u003cimg src=\"https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png\" alt=\"Cover\" width=\"50%\"/\u003e\u003c/div\u003e\r\n\r\n### 1. How To Use\r\n\r\n#### 1.1 Quick Start:\r\n```bash\r\n# install\r\npip install WeTextProcessing\r\n```\r\n\r\nCommand-line usage:\r\n\r\n```bash\r\nwetn --text \"2.5平方电线\"\r\nweitn --text \"二点五平方电线\"\r\n```\r\n\r\nPython usage:\r\n\r\n```py\r\nfrom itn.chinese.inverse_normalizer import InverseNormalizer\r\nfrom tn.chinese.normalizer import Normalizer as ZhNormalizer\r\nfrom tn.english.normalizer import Normalizer as EnNormalizer\r\n\r\n# NOTE(xcsong): When the parameters differ from the defaults, the graph must be rebuilt.\r\n#               To rebuild it, please be sure to specify `overwrite_cache=True`.\r\n\r\nzh_tn_text = \"你好 WeTextProcessing 1.0，船新版本儿，船新体验儿，简直666，9和10\"\r\nzh_itn_text = \"你好 WeTextProcessing 一点零，船新版本儿，船新体验儿，简直六六六，九和六\"\r\nen_tn_text = \"Hello WeTextProcessing 1.0, life is short, just use wetext, 666, 9 and 10\"\r\nzh_tn_model = ZhNormalizer(remove_erhua=True, overwrite_cache=True)\r\nzh_itn_model = InverseNormalizer(enable_0_to_9=False, overwrite_cache=True)\r\nen_tn_model = EnNormalizer(overwrite_cache=True)\r\nprint(\"Chinese TN (remove erhua, rebuild graph online):\\n\\t{} =\u003e {}\".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))\r\nprint(\"Chinese ITN (standalone digits below 10 unconverted, rebuild graph online):\\n\\t{} =\u003e {}\".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))\r\nprint(\"English TN (no configurable options yet, more to come...):\\n\\t{} =\u003e {}\\n\".format(en_tn_text, en_tn_model.normalize(en_tn_text)))\r\n\r\nzh_tn_model = ZhNormalizer(overwrite_cache=False)\r\nzh_itn_model = InverseNormalizer(overwrite_cache=False)\r\nen_tn_model = EnNormalizer(overwrite_cache=False)\r\nprint(\"Chinese TN (reuse previously compiled graph):\\n\\t{} =\u003e {}\".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))\r\nprint(\"Chinese ITN (reuse previously compiled graph):\\n\\t{} =\u003e {}\".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))\r\nprint(\"English TN (reuse previously compiled graph):\\n\\t{} =\u003e {}\\n\".format(en_tn_text, en_tn_model.normalize(en_tn_text)))\r\n\r\nzh_tn_model = ZhNormalizer(remove_erhua=False, overwrite_cache=True)\r\nzh_itn_model = InverseNormalizer(enable_0_to_9=True, overwrite_cache=True)\r\nprint(\"Chinese TN (keep erhua, rebuild graph online):\\n\\t{} =\u003e {}\".format(zh_tn_text, zh_tn_model.normalize(zh_tn_text)))\r\nprint(\"Chinese ITN (standalone digits below 10 also converted, rebuild graph online):\\n\\t{} =\u003e {}\\n\".format(zh_itn_text, zh_itn_model.normalize(zh_itn_text)))\r\n```\r\n\r\n#### 1.2 Advanced Usage:\r\n\r\nDIY your own rules \u0026\u0026 deploy WeTextProcessing with the C++ runtime!\r\n\r\nFor users who want to modify and adapt the tn/itn rules to fix bad cases, please try:\r\n\r\n```bash\r\ngit clone 
https://github.com/wenet-e2e/WeTextProcessing.git\r\ncd WeTextProcessing\r\npip install -r requirements.txt\r\npre-commit install # for clean and tidy code\r\n# `overwrite_cache` will rebuild all rules according to\r\n#   your modifications to tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).\r\n#   After the rebuild, you can find the new FAR files at `$PWD/tn` and `$PWD/itn`.\r\npython -m tn --text \"2.5平方电线\" --overwrite_cache\r\npython -m itn --text \"二点五平方电线\" --overwrite_cache\r\n```\r\n\r\nOnce you have successfully rebuilt your rules, you can deploy them either with the installed PyPI package:\r\n\r\n```py\r\n# tn usage\r\n\u003e\u003e\u003e from tn.chinese.normalizer import Normalizer\r\n\u003e\u003e\u003e normalizer = Normalizer(cache_dir=\"PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn\")\r\n\u003e\u003e\u003e normalizer.normalize(\"2.5平方电线\")\r\n# itn usage\r\n\u003e\u003e\u003e from itn.chinese.inverse_normalizer import InverseNormalizer\r\n\u003e\u003e\u003e invnormalizer = InverseNormalizer(cache_dir=\"PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn\")\r\n\u003e\u003e\u003e invnormalizer.normalize(\"二点五平方电线\")\r\n```\r\n\r\nOr with the C++ runtime:\r\n\r\n```bash\r\ncmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release\r\ncmake --build build\r\n# tn usage\r\ncache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn\r\n./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text \"2.5平方电线\"\r\n# itn usage\r\ncache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn\r\n./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text \"二点五平方电线\"\r\n```\r\n\r\n### 2. TN Pipeline\r\n\r\nPlease refer to [TN.README](tn/README.md)\r\n\r\n### 3. 
ITN Pipeline\r\n\r\nPlease refer to [ITN.README](itn/README.md)\r\n\r\n## Discussion \u0026 Communication\r\n\r\nFor Chinese users, you can also scan the QR code on the left to follow the official WeNet account.\r\nWe have created a WeChat group for better discussion and quicker responses.\r\nPlease scan the personal QR code on the right; its owner will invite you to the chat group.\r\n\r\n| \u003cimg src=\"https://github.com/robin1001/qr/blob/master/wenet.jpeg\" width=\"250px\"\u003e | \u003cimg src=\"https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg\" width=\"250px\"\u003e |\r\n| ---- | ---- |\r\n\r\nOr you can directly discuss on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).\r\n\r\n## Acknowledgements\r\n\r\n1. Thanks to the authors of foundational libraries such as [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) \u0026 [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).\r\n2. Thanks to the [NeMo](https://github.com/NVIDIA/NeMo) team \u0026 the NeMo open-source community.\r\n3. Thanks to [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and the [SpeechColab](https://github.com/SpeechColab) organization.\r\n4. Referred to [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR and printing the shortest path of a lattice in the C++ runtime.\r\n5. Referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.\r\n6. 
Referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.\r\n","funding_links":[],"categories":["语音识别与合成_其他"],"sub_categories":["网络服务_其他"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenet-e2e%2FWeTextProcessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwenet-e2e%2FWeTextProcessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwenet-e2e%2FWeTextProcessing/lists"}