https://github.com/VietHoang1512/khmer-nltk

Khmer language processing toolkit

crf khmer-language nlp nlp-library part-of-speech-tagging segmentation sentence-segmenter word-segmenter


# 🏅Khmer natural language processing toolkit🏅

[![circleci](https://circleci.com/gh/VietHoang1512/khmer-nltk/tree/main.svg?style=svg)](https://circleci.com/gh/VietHoang1512/khmer-nltk/tree/main)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/807f43366b314887946cd9e88df700c6)](https://www.codacy.com/gh/VietHoang1512/khmer-nltk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=VietHoang1512/khmer-nltk&utm_campaign=Badge_Grade)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![code style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![release](https://img.shields.io/pypi/v/khmer-nltk.svg)](https://pypi.org/project/khmer-nltk/)
![versions](https://img.shields.io/pypi/pyversions/khmer-nltk.svg)
[![downloads](https://pepy.tech/badge/khmer-nltk)](https://pepy.tech/project/khmer-nltk)
[![DOI](https://zenodo.org/badge/313328421.svg)](https://zenodo.org/badge/latestdoi/313328421)

## 🎯TODO

- [X] Sentence segmentation
- [X] Word segmentation
- [X] Part-of-speech tagging
- [ ] Named entity recognition
- [ ] Text classification

## 💪Installation

```bash
pip install khmer-nltk
```
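
To confirm which release ended up in the active environment, a small standard-library check can help. This is only a sketch: the helper name `installed_version` is made up for illustration, and it assumes the distribution is registered under its PyPI name, `khmer-nltk`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Hypothetical usage: report the khmer-nltk version, if any.
found = installed_version("khmer-nltk")
print(found if found is not None else "khmer-nltk is not installed")
```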

## 🏹 Quick tour

[[Blog]](https://towardsdatascience.com/khmer-natural-language-processing-in-python-c770afb84784)

For evaluation results of each khmer-nltk component, see the README in the corresponding sub-module.

### Sentence tokenization

```python
>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
```
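
`sentence_tokenize` is shown above on a single string. For longer inputs, one possible approach (an assumption, not something the library prescribes) is to split on line breaks first and segment each paragraph separately. The helper `sentences_by_paragraph` below is hypothetical and takes the splitter as an argument, so `sentence_tokenize` can be passed in directly; the stand-in splitter only exists so the sketch runs without the library.

```python
from typing import Callable, List


def sentences_by_paragraph(
    text: str, split_sentences: Callable[[str], List[str]]
) -> List[List[str]]:
    """Segment each non-empty line of `text` with the supplied sentence splitter."""
    paragraphs = [line.strip() for line in text.splitlines()]
    return [split_sentences(p) for p in paragraphs if p]


# Stand-in splitter so the sketch runs without khmer-nltk installed;
# in real use, pass khmernltk.sentence_tokenize instead.
def split_on_bang(paragraph: str) -> List[str]:
    return [part.strip() + "!" for part in paragraph.split("!") if part.strip()]


print(sentences_by_paragraph("a! b!\n\nc!", split_on_bang))
```

With the library installed, the call would presumably look like `sentences_by_paragraph(raw_text, sentence_tokenize)`.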

### [Word tokenization](khmernltk/word_tokenize)

```python
>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
```
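
The output above keeps spaces as separate `' '` tokens. If downstream code only needs the words and punctuation, the spaces can be filtered out afterwards. The helper name `drop_whitespace_tokens` is hypothetical, and the sample list mirrors the first few tokens of the example sentence rather than a fresh call to the library.

```python
from typing import List


def drop_whitespace_tokens(tokens: List[str]) -> List[str]:
    """Keep only tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# First few tokens of the example sentence, including the space tokens.
sample = ["ខួប", "ឆ្នាំ", "ទី", "២៨", "!", " ", "២៣", " ", "តុលា"]
print(drop_whitespace_tokens(sample))
```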

### [POS Tagging](khmernltk/pos_tag)

#### Usage

```python
>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
```
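
`pos_tag` returns a list of `(token, tag)` pairs, so standard-library tools are enough for quick summaries. The sketch below counts tags and collects the tokens carrying a given tag. The helper names are hypothetical, the sample is taken from the start of the example output, and the meaning of each tag is left to the `pos_tag` sub-module's README.

```python
from collections import Counter
from typing import List, Tuple

TaggedToken = Tuple[str, str]


def tag_counts(tagged: List[TaggedToken]) -> Counter:
    """Count how often each tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged)


def tokens_with_tag(tagged: List[TaggedToken], wanted: str) -> List[str]:
    """Return the tokens whose tag equals `wanted`, in their original order."""
    return [token for token, tag in tagged if tag == wanted]


# First five pairs of the example output.
sample = [("ខួប", "n"), ("ឆ្នាំ", "n"), ("ទី", "n"), ("២៨", "1"), ("!", ".")]
print(tag_counts(sample))
print(tokens_with_tag(sample, "n"))
```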

### ✍️ Citation

```bibtex
@misc{hoang-khmer-nltk,
author = {Phan Viet Hoang},
title = {Khmer Natural Language Processing Toolkit},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}
```
#### Used in:
- [stopes: A library for preparing data for machine translation research](https://github.com/facebookresearch/stopes)
- [LASER Language-Agnostic SEntence Representations](https://github.com/facebookresearch/LASER)
- [Pretrained Models and Evaluation Data for the Khmer Language](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9645441)
- [Multilingual Open Text 1.0: Public Domain News in 44 Languages](https://arxiv.org/pdf/2201.05609.pdf)
- [ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System](https://arxiv.org/pdf/2205.14981.pdf)
- [Shared Task on Cross-lingual Open-Retrieval QA](https://www.aclweb.org/portal/content/shared-task-cross-lingual-open-retrieval-qa)
- [No Language Left Behind: Scaling Human-Centered Machine Translation](https://research.facebook.com/publications/no-language-left-behind/)
- [Wordless](https://github.com/BLKSerene/Wordless)
- [A Simple and Fast Strategy for Handling Rare Words in Neural Machine Translation](https://aclanthology.org/2022.aacl-srw.6/)

### 👨‍🎓 References

- [NLP: Text Segmentation Using Conditional Random Fields](https://medium.com/@phylypo/nlp-text-segmentation-using-conditional-random-fields-e8ff1d2b6060)
- [Khmer Word Segmentation Using Conditional Random Fields](https://www2.nict.go.jp/astrec-att/member/ding/KhNLP2015-SEG.pdf)
- [Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)

### 📜 Advisor

- Prof. [Huong Le Thanh](https://users.soict.hust.edu.vn/huonglt/)