Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/seanghay/khmercut
A (fast) Khmer word segmentation toolkit.
- Host: GitHub
- URL: https://github.com/seanghay/khmercut
- Owner: seanghay
- Created: 2023-08-03T09:37:44.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-08-24T10:36:36.000Z (about 1 year ago)
- Last Synced: 2024-10-30T18:20:07.110Z (9 days ago)
- Topics: cambodia, crfsuite, khmer
- Language: Python
- Homepage: https://pypi.org/project/khmercut/
- Size: 11.7 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-khmer-language - khmercut
README
### khmercut
A (fast) Khmer word segmentation toolkit.
- A single Python file
- Uses only `pycrfsuite`
- Includes Khmer text normalization
- CLI support
- Multiprocess support

```shell
pip install khmercut
```

### Python
```python
from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
```

### CLI

For example:
```shell
khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"
```

Available options:

```
usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
```

### Reference
- [Khmer language processing toolkit](https://github.com/VietHoang1512/khmer-nltk)
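
### Example: joining tokens per line

The CLI section above describes a `-s SEPARATOR` option for joining tokens. A rough Python counterpart might look like the sketch below. It is an illustration only, not khmercut's own implementation: `segment_lines` and its parameters are hypothetical names, and the tokenizer is passed in as an argument so the helper also runs without khmercut installed.

```python
from typing import Callable, Iterable, List


def segment_lines(
    lines: Iterable[str],
    tokenizer: Callable[[str], List[str]],
    separator: str = "|",
) -> List[str]:
    """Tokenize each line and join its tokens with `separator`.

    Hypothetical helper: `tokenizer` is any callable that maps a string
    to a list of tokens, such as `khmercut.tokenize`.
    """
    segmented = []
    for line in lines:
        text = line.rstrip("\n")  # drop the trailing newline, if any
        if not text:
            segmented.append("")  # keep blank lines so line counts match
            continue
        segmented.append(separator.join(tokenizer(text)))
    return segmented
```

With khmercut installed, passing `tokenize` from the Python section as the `tokenizer` argument should yield one separator-joined string per input line. Whether that matches the CLI's output format exactly is an assumption, not something the README confirms.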