Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pucktada/cutkum
Thai Word-Segmentation with LSTM in Tensorflow
https://github.com/pucktada/cutkum
Last synced: 3 months ago
JSON representation
Thai Word-Segmentation with LSTM in Tensorflow
- Host: GitHub
- URL: https://github.com/pucktada/cutkum
- Owner: pucktada
- License: mit
- Created: 2017-05-04T07:00:25.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-12-14T18:30:46.000Z (11 months ago)
- Last Synced: 2024-07-10T18:05:47.866Z (4 months ago)
- Language: Python
- Homepage:
- Size: 70 MB
- Stars: 154
- Watchers: 17
- Forks: 34
- Open Issues: 6
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- nlp_thai_resources - GitHub
README
# Cutkum ['คัดคำ']
Cutkum ('คัดคำ') is a python code for Thai Word-Segmentation using Recurrent Neural Network (RNN) based on Tensorflow library.Cutkum is trained on BEST2010, a 5 Millions Thai words corpus by NECTEC (https://www.nectec.or.th/). It also comes with an already trained model, and can be used right out of the box. Cutkum is still a work-in-progress project. Evaluated on the 10% hold-out data from BEST2010 corpus (~600,000 words), the included trained model currently performs at
98.0% recall, 96.3% precision, 97.1% F-measure (character-level)
RC: 0.988, PC: 0.966, FC: 0.977
95% recall, 95% precision and 95.0% F-measure (word-level -- same evaluation method as BEST2010)# Update :D
A major update
1. now you dont have to load the model seperately, just do `pip install` and Cutkum is ready to use out of the box.
2. the included model is now smaller, faster, and have higher accuracy. :)# Requirements
* python = 2.7, 3.0+
* tensorflow = 1.4+# Installation
`cutkum` can be installed using `pip`
```
pip install cutkum```
# Usages
Once installed, you can use `cutkum` within your python code to tokenize thai sentences.
```
>>> from cutkum.tokenizer import Cutkum
>>> ck = Cutkum()
>>> words = ck.tokenize("สารานุกรมไทยสำหรับเยาวชนฯ")# python 3.0
>>> words
['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']# python 2.7
>>> print("|".join(words))
# สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ```
You can also use `cutkum` straight from the command line.
```
usage: cutkum [-h] [-v]
(-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)
[-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]
``````
cutkum -s "ล่าสุดกระทรวงพาณิชย์ได้ประกาศตัวเลขการส่งออกของไทย"# output as
ล่าสุด|กระทรวงพาณิชย์|ได้|ประกาศ|ตัว|เลข|การ|ส่ง|ออก|ของ|ไทย
````cutkum` can also be used to segment text within a file (with -i), or to segment all the files within a given directory (with -id).
```
cutkum -i input.txt -o output.txt
cutkum -id input_dir -od output_dir
```## Citation
```
Pucktada Treeratpituk (2017). Cutkum: Thai Word-Segmentation with LSTM in Tensorflow. May 5, 2017. See https://github.com/pucktada/cutkum
```## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details
## To Do
* Improve performance, with better better model, and better included trained-model
* Improve the speed when processing big file