https://github.com/shibing624/zh-normalization
Chinese(zh) sentence NSW(Non-Standard-Word) Normalization
- Host: GitHub
- URL: https://github.com/shibing624/zh-normalization
- Owner: shibing624
- License: apache-2.0
- Created: 2024-02-05T11:14:23.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-10T03:49:56.000Z (6 months ago)
- Last Synced: 2025-05-07T23:45:40.342Z (5 months ago)
- Language: Python
- Homepage:
- Size: 53.7 KB
- Stars: 9
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# zh-normalization
Chinese sentence NSW (Non-Standard-Word) normalization.

## Supported NSW (Non-Standard-Word) Normalization

|NSW type|raw|normalized|
|:--|:-|:-|
|serial number|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
|cardinal|这块黄金重达324.75克<br>我们班的最高总分为583分|这块黄金重达三百二十四点七五克<br>我们班的最高总分为五百八十三分|
|numeric range|12\~23<br>-1.5\~2|十二到二十三<br>负一点五到二|
|date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日, 她弟弟出生于一九九五年三月一日|
|time|等会请在12:05请通知我|等会请在十二点零五分请通知我|
|temperature|今天的最低气温达到-10°C|今天的最低气温达到零下十度|
|fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
|percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨|
|money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五,三十四点五元,二十点一万|
|telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|

## Usage
```shell
pip install zh-normalization
```

Run the following code to normalize a Chinese sentence:
```python
from zh_normalization import TextNormalizer

m = TextNormalizer()
text = "电影中梁朝伟扮演的陈永仁的编号27149!"
sents = m.normalize(text)
new_text = ''.join(sents)
print(new_text)
```

Output:
```shell
电影中梁朝伟扮演的陈永仁的编号二七幺四九!
```
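
The serial number, cardinal, fraction, and percentage rows in the table differ mainly in how a digit string is read aloud: digit by digit, or as a quantity with place units. The following is a minimal, self-contained sketch of those two reading styles. It is not the code of `zh-normalization`; every function name here (`read_digits`, `read_integer`, `read_number`, `read_fraction`, `read_percent`) and the `alt_one` flag are hypothetical, and the cardinal reader only covers 0 to 9999. The `alt_one` flag is included because this README shows both 一 (in the table) and 幺 (in the sample output) for the digit 1 in a serial number.

```python
# Illustrative sketch only: hypothetical helpers, not the library's implementation.
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # place units for ones, tens, hundreds, thousands


def read_digits(s: str, alt_one: bool = False) -> str:
    """Read a digit string one digit at a time (serial or phone numbers)."""
    out = "".join(DIGITS[int(ch)] for ch in s)
    # Optionally use the spoken variant 幺 for the digit 1.
    return out.replace("一", "幺") if alt_one else out


def read_integer(n: int) -> str:
    """Read an integer in 0..9999 as a cardinal with place units."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return "零"
    s = str(n)
    parts = []
    pending_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        pos = len(s) - 1 - i  # 0 = ones, 1 = tens, ...
        if d == 0:
            pending_zero = True  # emit a single 零 only if a digit follows
            continue
        if pending_zero and parts:
            parts.append("零")
        pending_zero = False
        parts.append(DIGITS[d] + UNITS[pos])
    text = "".join(parts)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return text[1:] if text.startswith("一十") else text


def read_number(s: str) -> str:
    """Read an optionally signed decimal: cardinal part, 点, then digits."""
    sign = ""
    if s.startswith("-"):
        sign, s = "负", s[1:]
    integer, _, fraction = s.partition(".")
    text = read_integer(int(integer))
    if fraction:
        text += "点" + read_digits(fraction)
    return sign + text


def read_fraction(s: str) -> str:
    """Read 'a/b' as b分之a (denominator first)."""
    numerator, denominator = s.split("/")
    return read_number(denominator) + "分之" + read_number(numerator)


def read_percent(s: str) -> str:
    """Read 'x%' as 百分之x."""
    return "百分之" + read_number(s.rstrip("%"))


if __name__ == "__main__":
    print(read_digits("27149"))
    print(read_digits("27149", alt_one=True))
    print(read_number("324.75"))
    print(read_fraction("7/12"))
    print(read_percent("62%"))
```

A real normalizer additionally has to decide, from the surrounding text, which reading applies to each number (for example, a number following 编号 versus one followed by 克), which this sketch leaves out.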
## References
[Pull request #658 of DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech/pull/658/files)