Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/VietHoang1512/khmer-nltk
Khmer language processing toolkit
crf khmer-language nlp nlp-library part-of-speech-tagging segmentation sentence-segmenter word-segmenter
- Host: GitHub
- URL: https://github.com/VietHoang1512/khmer-nltk
- Owner: VietHoang1512
- License: apache-2.0
- Created: 2020-11-16T14:30:10.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-10-07T03:42:56.000Z (about 1 year ago)
- Last Synced: 2024-10-29T02:45:37.866Z (10 days ago)
- Topics: crf, khmer-language, nlp, nlp-library, part-of-speech-tagging, segmentation, sentence-segmenter, word-segmenter
- Language: Python
- Homepage: https://pypi.org/project/khmer-nltk/
- Size: 10 MB
- Stars: 69
- Watchers: 2
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-khmer-language - Khmer natural language processing toolkit
README
# Khmer natural language processing toolkit
[![circleci](https://circleci.com/gh/VietHoang1512/khmer-nltk/tree/main.svg?style=svg)](https://circleci.com/gh/VietHoang1512/khmer-nltk/tree/main)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/807f43366b314887946cd9e88df700c6)](https://www.codacy.com/gh/VietHoang1512/khmer-nltk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=VietHoang1512/khmer-nltk&utm_campaign=Badge_Grade)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![code style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![release](https://img.shields.io/pypi/v/khmer-nltk.svg)](https://pypi.org/project/khmer-nltk/)
![versions](https://img.shields.io/pypi/pyversions/khmer-nltk.svg)
[![downloads](https://pepy.tech/badge/khmer-nltk)](https://pepy.tech/project/khmer-nltk)
[![DOI](https://zenodo.org/badge/313328421.svg)](https://zenodo.org/badge/latestdoi/313328421)

## TODO
- [X] Sentence Segmentation
- [X] Word Segmentation
- [X] Part of speech Tagging
- [ ] Named Entity Recognition
- [ ] Text classification

## Installation
```bash
pip install khmer-nltk
```

## Quick tour
[[Blog]](https://towardsdatascience.com/khmer-natural-language-processing-in-python-c770afb84784)
For evaluation results of each khmer-nltk component, refer to the README of the corresponding sub-module.
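The functions shown in this tour can also be chained: split the text into sentences, then tag each sentence. The following is only a minimal sketch of that idea. `analyze` is a hypothetical helper, not part of khmer-nltk, and it takes the tokenizer and tagger as parameters so the snippet runs even without the package installed. It assumes the call shapes used in the examples in this README: `sentence_tokenize(text)` returns a list of strings and `pos_tag(text)` returns a list of `(token, tag)` pairs.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]


def analyze(
    text: str,
    sentence_tokenize: Callable[[str], List[str]],
    pos_tag: Callable[[str], Tagged],
) -> List[Dict[str, object]]:
    """Hypothetical helper: tag each sentence and drop whitespace-only tokens."""
    results = []
    for sentence in sentence_tokenize(text):
        # The tagger output in this README includes ' ' tokens; filter them out here.
        tagged = [(tok, tag) for tok, tag in pos_tag(sentence) if tok.strip()]
        results.append({"sentence": sentence, "tagged": tagged})
    return results


# Assumed usage with the package installed:
# from khmernltk import sentence_tokenize, pos_tag
# report = analyze(raw_text, sentence_tokenize, pos_tag)
```

Passing the functions in keeps the helper easy to test with stand-ins. Whether whitespace tokens should be dropped depends on the downstream task, so treat the filter as optional.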
### Sentence tokenization
```python
>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
```

### [Word tokenization](khmernltk/word_tokenize)
```python
>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
```

### [POS Tagging](khmernltk/pos_tag)
### Usage
```python
>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
```

### Citation
```bibtex
@misc{hoang-khmer-nltk,
author = {Phan Viet Hoang},
title = {Khmer Natural Language Processing Toolkit},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}
```
#### Used in:
- [stopes: A library for preparing data for machine translation research](https://github.com/facebookresearch/stopes)
- [LASER Language-Agnostic SEntence Representations](https://github.com/facebookresearch/LASER)
- [Pretrained Models and Evaluation Data for the Khmer Language](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9645441)
- [Multilingual Open Text 1.0: Public Domain News in 44 Languages](https://arxiv.org/pdf/2201.05609.pdf)
- [ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System](https://arxiv.org/pdf/2205.14981.pdf)
- [Shared Task on Cross-lingual Open-Retrieval QA](https://www.aclweb.org/portal/content/shared-task-cross-lingual-open-retrieval-qa)
- [No Language Left Behind: Scaling Human-Centered Machine Translation](https://research.facebook.com/publications/no-language-left-behind/)
- [Wordless](https://github.com/BLKSerene/Wordless)
- [A Simple and Fast Strategy for Handling Rare Words in Neural Machine Translation](https://aclanthology.org/2022.aacl-srw.6/)

### References
- [NLP: Text Segmentation Using Conditional Random Fields](https://medium.com/@phylypo/nlp-text-segmentation-using-conditional-random-fields-e8ff1d2b6060)
- [Khmer Word Segmentation Using Conditional Random Fields](https://www2.nict.go.jp/astrec-att/member/ding/KhNLP2015-SEG.pdf)
- [Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)

### Advisor
- Prof. [Huong Le Thanh](https://users.soict.hust.edu.vn/huonglt/)