https://github.com/nguyendat-bit/vietokenizer

Vietnamese Tokenizer package based on deeplearning methods

Host: GitHub
URL: https://github.com/nguyendat-bit/vietokenizer
Owner: Nguyendat-bit
License: apache-2.0
Created: 2022-10-06T16:57:40.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2022-10-13T17:27:14.000Z (almost 4 years ago)
Last Synced: 2026-06-15T02:33:08.388Z (about 1 month ago)
Topics: nlp, tensorflow, tensorflow2, tokenizer, vietnamese-nlp, vietnamese-tokenizer, word-segmentation
Language: Python
Homepage:
Size: 13.7 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# VieTokenizer

The model is a simple Bi-LSTM network trained by unsupervised learning on a large pre-segmented dataset. For each syllable, it predicts 1 if the syllable continues the word begun by the preceding syllable ("serial") and 0 otherwise ("non-serial"). For example, "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1]: "Nguyễn" starts a new word (0), while "Tiến" and "Đạt" attach to it (1).
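
The labelling scheme can be illustrated with a small sketch. This is not the package's own code: the helper names `labels_from_segmented` and `join_syllables` are hypothetical, and the sketch assumes that words are separated by spaces and that syllables inside a word are joined with underscores, as in the output shown under Usage.

```python
def labels_from_segmented(segmented):
    """Derive 0/1 labels from a pre-segmented sentence (hypothetical helper).

    The first syllable of every word gets 0; each following syllable
    of the same word gets 1.
    """
    syllables, labels = [], []
    for word in segmented.split():
        for position, syllable in enumerate(word.split("_")):
            syllables.append(syllable)
            labels.append(0 if position == 0 else 1)
    return syllables, labels


def join_syllables(syllables, labels):
    """Rebuild the segmented sentence from syllables and 0/1 labels."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    words = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            # Label 1: attach this syllable to the previous word.
            words[-1] += "_" + syllable
        else:
            # Label 0: start a new word.
            words.append(syllable)
    return " ".join(words)


syllables, labels = labels_from_segmented("Tôi tên là Nguyễn_Tiến_Đạt")
print(labels)                             # [0, 0, 0, 0, 1, 1]
print(join_syllables(syllables, labels))  # Tôi tên là Nguyễn_Tiến_Đạt
```

The two helpers are inverses of each other on well-formed input, so a pre-segmented corpus can be turned into label sequences and a predicted label sequence can be turned back into segmented text.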

## Installation 🎉
- This repository is tested on Python 3.7+ and TensorFlow 2.8+.
- VieTokenizer can be installed using pip as follows:
```
pip install vietokenizer
```
- VieTokenizer can also be installed from source with the following commands:
```
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .
```
## Usage 🔥
```python
>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
```
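
The tokenizer returns a single string, so downstream code usually needs to split it into words. A minimal sketch, assuming the output format shown above (words separated by spaces, syllables within a word joined by underscores); the variable names are illustrative and the input string is copied from the example output rather than produced by calling the package:

```python
# Output string copied from the first usage example above.
segmented = (
    "Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên "
    "Đại_học CN GTVT tại Hà_Nội ."
)

# One token per space-separated unit; punctuation is already split off.
tokens = segmented.split()

# Restore the spaces inside multi-syllable words for display.
words = [token.replace("_", " ") for token in tokens]

# Keep only the words made of more than one syllable.
compounds = [word for token, word in zip(tokens, words) if "_" in token]

print(len(tokens))  # 14
print(compounds)    # ['Nguyễn Tiến Đạt', 'sinh viên', 'Đại học', 'Hà Nội']
```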
