https://github.com/pucktada/cutkum

Thai Word-Segmentation with LSTM in Tensorflow

Host: GitHub
URL: https://github.com/pucktada/cutkum
Owner: pucktada
License: mit
Created: 2017-05-04T07:00:25.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-12-14T18:30:46.000Z (11 months ago)
Last Synced: 2024-07-10T18:05:47.866Z (4 months ago)
Language: Python
Homepage:
Size: 70 MB
Stars: 154
Watchers: 17
Forks: 34
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

nlp_thai_resources - GitHub

README

# Cutkum ['คัดคำ']

Cutkum ('คัดคำ') is a Python library for Thai word segmentation using a Recurrent Neural Network (RNN), built on the TensorFlow library.

Cutkum is trained on BEST2010, a 5-million-word Thai corpus by NECTEC (https://www.nectec.or.th/). It ships with a pre-trained model and can be used right out of the box. Cutkum is still a work in progress. Evaluated on the 10% hold-out data from the BEST2010 corpus (~600,000 words), the included trained model currently achieves:

98.0% recall, 96.3% precision, 97.1% F-measure (character-level)

RC: 0.988, PC: 0.966, FC: 0.977

95% recall, 95% precision and 95.0% F-measure (word-level -- same evaluation method as BEST2010)
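
For readers unfamiliar with these metrics, here is a minimal Python 3 sketch of how precision, recall and F-measure could be computed for a segmenter. It is an illustration only, not the BEST2010 evaluation script or Cutkum's own evaluation code. The function names (`word_spans`, `boundary_positions`, `prf`) and the reference segmentation `gold` are hypothetical. It assumes a predicted word counts as correct only when both its start and end offsets match a reference word, and it uses word-start positions as one possible reading of a character-level score.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def boundary_positions(words):
    """Return the set of character offsets at which a word starts."""
    return {start for start, _ in word_spans(words)}


def prf(reference, predicted):
    """Precision, recall and F-measure between two sets of items."""
    correct = len(reference & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# Hypothetical reference segmentation vs. a finer-grained prediction.
gold = ["ตัวเลข", "การ", "ส่งออก"]
pred = ["ตัว", "เลข", "การ", "ส่ง", "ออก"]

# Word-level: a word is correct only if its whole span matches.
word_p, word_r, word_f = prf(word_spans(gold), word_spans(pred))

# Boundary-level: only the positions where words start are compared.
char_p, char_r, char_f = prf(boundary_positions(gold), boundary_positions(pred))
```

In this toy case only one of the five predicted words matches a reference span, so the word-level precision is 0.2 and the F-measure is 0.25, while the boundary-level recall is 1.0 because every reference boundary is also predicted. Word-level scores are stricter in the same way in the figures above.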

# Update :D

A major update:

1. You no longer have to load the model separately. Just run `pip install` and Cutkum is ready to use out of the box.

2. The included model is now smaller and faster, and has higher accuracy. :)

# Requirements

* Python 2.7 or 3.0+

* TensorFlow 1.4+

# Installation

`cutkum` can be installed using `pip` 

```
pip install cutkum
```

# Usage

Once installed, you can use `cutkum` within your Python code to tokenize Thai sentences.

```
>>> from cutkum.tokenizer import Cutkum
>>> ck = Cutkum()
>>> words = ck.tokenize("สารานุกรมไทยสำหรับเยาวชนฯ")

# python 3.0
>>> words
['สารานุกรม', 'ไทย', 'สำหรับ', 'เยาวชน', 'ฯ']

# python 2.7
>>> print("|".join(words))
# สารานุกรม|ไทย|สำหรับ|เยาวชน|ฯ
```
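
When many sentences need segmenting, it is probably worth creating the tokenizer once and reusing it, since constructing `Cutkum()` presumably loads the TensorFlow model. The Python 3 sketch below shows one way to do that. The helper `segment_all` is a hypothetical name, not part of the `cutkum` package; it accepts any callable that maps a string to a list of words, such as the bound method `ck.tokenize` shown above.

```python
def segment_all(tokenize, sentences, sep="|"):
    """Tokenize each sentence with `tokenize` and join the words with `sep`.

    `tokenize` is any callable mapping a string to a list of words,
    for example the bound method `Cutkum().tokenize`.
    """
    results = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank input rather than calling the model
        results.append(sep.join(tokenize(sentence)))
    return results


# Usage with the real tokenizer (requires `pip install cutkum`):
#   from cutkum.tokenizer import Cutkum
#   ck = Cutkum()  # create once, reuse for every sentence
#   lines = segment_all(ck.tokenize, ["สารานุกรมไทยสำหรับเยาวชนฯ"])

# Stand-in tokenizer (whitespace split) so the helper can be tried
# without installing the model.
demo = segment_all(str.split, ["a b", "   ", "c d e"])
```

Passing the tokenizer in as an argument keeps the helper independent of the model, which makes it easy to try with a stand-in such as `str.split`.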

You can also use `cutkum` straight from the command line.

```
usage: cutkum [-h] [-v]
              (-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)
              [-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]
```

```
cutkum -s "ล่าสุดกระทรวงพาณิชย์ได้ประกาศตัวเลขการส่งออกของไทย"

# output as
ล่าสุด|กระทรวงพาณิชย์|ได้|ประกาศ|ตัว|เลข|การ|ส่ง|ออก|ของ|ไทย
```

`cutkum` can also be used to segment text within a file (with `-i`), or to segment all the files within a given directory (with `-id`).

```
cutkum -i input.txt -o output.txt
cutkum -id input_dir -od output_dir
```
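
The same directory-level workflow can be approximated from Python. The sketch below is not the command-line tool's actual implementation. The function name `segment_directory` is hypothetical, and the sketch assumes Python 3, UTF-8 text files, line-by-line segmentation, and output files that reuse the input file names. As with `ck.tokenize`, `tokenize` is any callable mapping a string to a list of words.

```python
import os


def segment_directory(tokenize, input_dir, output_dir, sep="|"):
    """Segment every regular file in input_dir into a same-named file in output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue  # subdirectories are skipped in this sketch
        dst = os.path.join(output_dir, name)
        with open(src, encoding="utf-8") as fin, \
                open(dst, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.rstrip("\n")
                # Blank lines are copied through without calling the tokenizer.
                fout.write(sep.join(tokenize(line)) if line else "")
                fout.write("\n")
        written.append(dst)
    return written


# Usage with the real tokenizer (requires `pip install cutkum`):
#   from cutkum.tokenizer import Cutkum
#   ck = Cutkum()
#   segment_directory(ck.tokenize, "input_dir", "output_dir")
```

Processing one line at a time keeps memory use bounded for large files, at the cost of one tokenizer call per line.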

## Citation

```
Pucktada Treeratpituk (2017). Cutkum: Thai Word-Segmentation with LSTM in Tensorflow. May 5, 2017. See https://github.com/pucktada/cutkum
```

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## To Do

* Improve performance with a better model and a better included trained model

* Improve speed when processing large files