https://github.com/pucktada/cutkum
Thai Word-Segmentation with LSTM in Tensorflow
- Host: GitHub
- URL: https://github.com/pucktada/cutkum
- Owner: pucktada
- License: mit
- Created: 2017-05-04T07:00:25.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2023-12-14T18:30:46.000Z (about 2 years ago)
- Last Synced: 2025-04-23T05:42:27.338Z (11 months ago)
- Language: Python
- Homepage:
- Size: 70 MB
- Stars: 154
- Watchers: 17
- Forks: 33
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- nlp_thai_resources - GitHub
- awesome-nlp - CutKum - Word segmentation with deep learning in TensorFlow. (Natural Language Processing - Thai / Libraries)
README
# Cutkum ['คัดคำ']
Cutkum ('คัดคำ') is a Python package for Thai word segmentation using a Recurrent Neural Network (RNN), built on the TensorFlow library.
Cutkum is trained on BEST2010, a 5-million-word Thai corpus by NECTEC (https://www.nectec.or.th/). It comes with a pre-trained model and can be used right out of the box. Cutkum is still a work in progress. Evaluated on 10% hold-out data from the BEST2010 corpus (~600,000 words), the included model currently performs at:
* 98.0% recall, 96.3% precision, 97.1% F-measure (character-level)
* RC: 0.988, PC: 0.966, FC: 0.977
* 95% recall, 95% precision and 95.0% F-measure (word-level -- same evaluation method as BEST2010)
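The F-measure reported above is the harmonic mean of precision and recall. The sketch below is not part of Cutkum; it shows that calculation together with one common way to score segmentation at the word level, where a predicted word counts as correct only when both of its boundaries match the reference. The function names are hypothetical, and the official BEST2010 scoring script may differ in its details.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def word_level_scores(predicted, reference):
    """Return (precision, recall, F-measure) over exactly matching word spans.

    Assumes both segmentations cover the same underlying text.
    """
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = float(len(pred & ref))  # words whose start and end both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall, f_measure(precision, recall)


# The character-level figures quoted above are consistent with this formula.
print(round(f_measure(0.963, 0.980), 3))  # 0.971
```

Comparing spans rather than word strings means a correctly spelled word placed at the wrong offset is not counted as a match.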
# Update :D
A major update
1. You no longer have to load the model separately; just run `pip install` and Cutkum is ready to use out of the box.
2. The included model is now smaller, faster, and more accurate. :)
# Requirements
* python = 2.7, 3.0+
* tensorflow = 1.4+
# Installation
`cutkum` can be installed using `pip`
```
pip install cutkum
```
# Usage
Once installed, you can use `cutkum` within your Python code to tokenize Thai sentences.
```
>>> from cutkum.tokenizer import Cutkum
>>> ck = Cutkum()
>>> words = ck.tokenize("สารานุกรมไทยสำหรับเยาวชนฯ")
# python 3.0
>>> words
['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']
# python 2.7
>>> print("|".join(words))
# สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ
```
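When segmenting many sentences from Python, it can help to create the `Cutkum` object once and reuse it. The sketch below assumes only the `Cutkum().tokenize(text)` call shown above; `segment_lines` and `demo` are hypothetical helpers, not part of the package. The helper takes the tokenizer as a plain callable, so it can be tried without the model installed.

```python
def segment_lines(tokenize, lines, sep="|"):
    """Tokenize each non-blank line and join its words with `sep`."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of passing them to the tokenizer
        results.append(sep.join(tokenize(text)))
    return results


def demo():
    """Example wiring with Cutkum; requires `pip install cutkum`."""
    from cutkum.tokenizer import Cutkum

    ck = Cutkum()  # create the tokenizer once and reuse it for every line
    for segmented in segment_lines(ck.tokenize, ["สารานุกรมไทยสำหรับเยาวชนฯ"]):
        print(segmented)
```

Calling `demo()` should print the same `|`-separated words as the interactive example above.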
You can also use `cutkum` straight from the command line.
```
usage: cutkum [-h] [-v]
(-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)
[-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]
```
```
cutkum -s "ล่าสุดกระทรวงพาณิชย์ได้ประกาศตัวเลขการส่งออกของไทย"
# output as
ล่าสุด|กระทรวงพาณิชย์|ได้|ประกาศ|ตัว|เลข|การ|ส่ง|ออก|ของ|ไทย
```
`cutkum` can also be used to segment text within a file (with -i), or to segment all the files within a given directory (with -id).
```
cutkum -i input.txt -o output.txt
cutkum -id input_dir -od output_dir
```
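The command line can also be driven from a script. The following is a minimal sketch that builds the argument list from the flags in the usage message above (`-s`, `-i`, `-id`, `-o`, `-od`) and enforces the same either/or grouping. `build_cutkum_command` and `run_cutkum` are hypothetical helpers, and the sketch assumes the `cutkum` executable is on your `PATH`.

```python
import subprocess


def build_cutkum_command(sentence=None, input_file=None, input_dir=None,
                         output_file=None, output_dir=None):
    """Build an argv list for the `cutkum` CLI from the documented flags."""
    inputs = [("-s", sentence), ("-i", input_file), ("-id", input_dir)]
    chosen = [(flag, value) for flag, value in inputs if value is not None]
    # The usage message requires exactly one input source.
    if len(chosen) != 1:
        raise ValueError("give exactly one of sentence, input_file, input_dir")
    # -o and -od are shown as alternatives, so allow at most one of them.
    if output_file is not None and output_dir is not None:
        raise ValueError("give at most one of output_file, output_dir")

    flag, value = chosen[0]
    cmd = ["cutkum", flag, value]
    if output_file is not None:
        cmd += ["-o", output_file]
    if output_dir is not None:
        cmd += ["-od", output_dir]
    return cmd


def run_cutkum(**kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_cutkum_command(**kwargs))
```

For example, `run_cutkum(input_file="input.txt", output_file="output.txt")` runs the same command as the first line of the block above. Passing the arguments as a list avoids shell quoting problems with Thai text.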
## Citation
```
Pucktada Treeratpituk (2017). Cutkum: Thai Word-Segmentation with LSTM in Tensorflow. May 5, 2017. See https://github.com/pucktada/cutkum
```
## License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
## To Do
* Improve performance with a better model and a better included trained model
* Improve speed when processing large files