https://github.com/nguyendat-bit/vietokenizer
Vietnamese Tokenizer package based on deeplearning methods
nlp tensorflow tensorflow2 tokenizer vietnamese-nlp vietnamese-tokenizer word-segmentation
- Host: GitHub
- URL: https://github.com/nguyendat-bit/vietokenizer
- Owner: Nguyendat-bit
- License: apache-2.0
- Created: 2022-10-06T16:57:40.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-10-13T17:27:14.000Z (over 3 years ago)
- Last Synced: 2026-06-15T02:33:08.388Z (18 days ago)
- Topics: nlp, tensorflow, tensorflow2, tokenizer, vietnamese-nlp, vietnamese-tokenizer, word-segmentation
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# VieTokenizer
VieTokenizer uses a simple Bi-LSTM network trained by unsupervised learning on a large pre-segmented dataset. For each syllable, the model predicts 1 if the syllable is "serial", that is, joined to the syllable before it as part of the same word, and 0 otherwise. For example, "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1]: "Tiến" and "Đạt" attach to "Nguyễn", giving the single word "Nguyễn_Tiến_Đạt".
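The labeling scheme can be illustrated with a minimal sketch. The helper names `labels_to_segmented` and `segmented_to_labels` are hypothetical and are not part of the VieTokenizer API; they only assume that a label of 1 means "join this syllable to the previous one with an underscore", as the example above suggests.

```python
from typing import List, Tuple


def labels_to_segmented(syllables: List[str], labels: List[int]) -> str:
    """Join a syllable to the previous one with '_' when its label is 1."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    pieces: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and pieces:
            # "Serial" syllable: extend the word that is currently being built.
            pieces[-1] += "_" + syllable
        else:
            # Label 0 (or the very first syllable): start a new word.
            pieces.append(syllable)
    return " ".join(pieces)


def segmented_to_labels(segmented: str) -> Tuple[List[str], List[int]]:
    """Recover syllables and 0/1 labels from underscore-segmented text."""
    syllables: List[str] = []
    labels: List[int] = []
    for word in segmented.split():
        for index, part in enumerate(word.split("_")):
            syllables.append(part)
            # The first syllable of a word gets 0, every later one gets 1.
            labels.append(0 if index == 0 else 1)
    return syllables, labels


syllables = "Tôi tên là Nguyễn Tiến Đạt".split()
labels = [0, 0, 0, 0, 1, 1]
print(labels_to_segmented(syllables, labels))  # Tôi tên là Nguyễn_Tiến_Đạt
```

The second helper shows how 0/1 targets could be derived from a pre-segmented corpus; whether the package builds its training labels this way is an assumption.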
## Installation 🎉
- This repository has been tested on Python 3.7+ and TensorFlow 2.8+.
- VieTokenizer can be installed using pip as follows:
```
pip install vietokenizer
```
- VieTokenizer can also be installed from source with the following commands:
```
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .
```
## Usage 🔥
```python
>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
```
## License
[Apache 2.0 License](https://github.com/Nguyendat-bit/VieTokenizer).
Copyright © 2022 [Nguyễn Tiến Đạt](https://github.com/Nguyendat-bit). All rights reserved.