https://github.com/seanghay/khmernormalizer
A missing toolkit for Khmer Natural Language Processing.
- Host: GitHub
- URL: https://github.com/seanghay/khmernormalizer
- Owner: seanghay
- License: mit
- Created: 2023-07-19T04:20:45.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-26T10:13:00.000Z (over 1 year ago)
- Last Synced: 2024-04-24T08:16:32.042Z (8 months ago)
- Topics: khmer, nlp, normalization, normalizer, verbalization
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-khmer-language - seanghay/khmernormalizer
README
## Khmer Normalizer
A missing toolkit for **Khmer Natural Language Processing**.
- Character reordering
- Collapse duplicate whitespace
- Remove zero-width spaces
- Remove emojis
- Fix common misspellings
- Fix Unicode issues
- Fix Khmer trailing vowels
- URL replacement
- Unicode normalization (NFKC)
- Quote symbol normalization
- Remove repeated punctuation

### Installation
```shell
pip install khmernormalizer
```

### Usage
```python
from khmernormalizer import normalize

input_str = """
តាម៖៖សេចក្តីរាយការណ៍ឲ្យដឹងថា!!!!!
https://google.com/a?x=1
កាល 😂 ពីវេលាម៉ោង ៗ ប្រមាណ១១យប់ថ្ងៃទី៤ 😂😂😂😂😂 ??
កាាាាត់
មិិិិិន
មួយរយះះះះះះះ
រយះពេល
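""".strip()

# Normalize the text, replacing emojis and URLs with empty strings
# and removing zero-width spaces.
result = normalize(
    input_str,
    emoji_replacement="",
    remove_zwsp=True,
    url_replacement="",
)
print(result)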
```

Result:
```
តាម៖សេចក្តីរាយការណ៍ឱ្យដឹងថា!កាល ពីវេលាម៉ោងៗ ប្រមាណ១១យប់ថ្ងៃទី៤?
កាត់
មិន
មួយរយៈ
រយៈពេល
```
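
To give a rough idea of what a few of the listed steps involve (NFKC normalization, zero-width space removal, repeated punctuation, repeated Khmer vowels and signs, duplicate whitespace), here is a minimal sketch that uses only the Python standard library. It is not how `khmernormalizer` is implemented: the function name `rough_normalize`, the punctuation set, and the code point range are assumptions made for illustration. It also skips character reordering, misspelling fixes, and emoji and URL handling.

```python
import re
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def rough_normalize(text: str) -> str:
    """Hypothetical helper approximating a few normalization steps.

    This is an illustration only, not part of khmernormalizer.
    """
    # Unicode compatibility normalization (NFKC), e.g. full-width Latin -> ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Remove zero-width spaces.
    text = text.replace(ZERO_WIDTH_SPACE, "")
    # Collapse runs of the same punctuation mark (assumed set: ! ? ៖ ។).
    text = re.sub(r"([!?\u17d6\u17d4])\1+", r"\1", text)
    # Collapse runs of the same Khmer dependent vowel or sign
    # (assumed range: U+17B6 to U+17D1).
    text = re.sub(r"([\u17b6-\u17d1])\1+", r"\1", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(rough_normalize("Hello \t  world!!!!"))  # Hello world!
```

For real Khmer text, the library's `normalize` function shown above is the better choice, since it also handles the cases this sketch leaves out.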