https://github.com/seanghay/awesome-khmer-language

A large collection of Khmer language resources. Khmer is the official language of Cambodia.
Host: GitHub
URL: https://github.com/seanghay/awesome-khmer-language
Owner: seanghay
Created: 2023-07-25T06:32:23.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-04-21T03:40:00.000Z (about 1 year ago)
Last Synced: 2024-05-22T22:00:51.580Z (about 1 year ago)
Topics: ai, cambodia, cambodian, g2p, khmer, khmer-dataset, khmer-language, khmer-nlp, khmer-research, khmer-resource, machine-learning, nlp, research, segmentation, seq2seq, transformer
Language: Python
Homepage:
Size: 5.3 MB
Stars: 59
Watchers: 8
Forks: 14
Open Issues: 0
Metadata Files:
- Readme: readme.md

## Awesome Khmer Language

A large collection of Khmer language resources. Khmer is the official language of Cambodia.

**Pull requests are very welcome!**

### 1. Specification

- [Khmer Characters - The Unicode Standard 15.0](https://www.unicode.org/charts/PDF/U1780.pdf)

- [Khmer Encoding Structure - Unicode](https://www.unicode.org/L2/L2021/21241-khmer-structure.pdf)

- [sillsdev/khmer-character-specification](https://github.com/sillsdev/khmer-character-specification)

- [Khmer Layout Requirements](https://www.w3.org/International/sealreq/khmer)

- [wiki/Khmer_language](https://en.wikipedia.org/wiki/Khmer_language)

- [wiki/Khmer_script](https://en.wikipedia.org/wiki/Khmer_script)

- [wiki/Romanization_of_Khmer](https://en.wikipedia.org/wiki/Romanization_of_Khmer)

- [http://www.eki.ee/wgrs/rom1_km.pdf](http://www.eki.ee/wgrs/rom1_km.pdf)

### 2. Toolkit

- [sillsdev/khmer-normalizer](https://github.com/sillsdev/khmer-normalizer) Normalize Khmer strings according to https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf

- [automatic-phonemic-and-phonetic-transcription](https://gitlab.com/mkrlab/automatic-phonemic-and-phonetic-transcription)

- [Khmer Word Segmentation - Rina Buoy](https://github.com/rinabuoy/KhmerNLP)

- [Khmer natural language processing toolkit](https://github.com/VietHoang1512/khmer-nltk)

- [Khmer Limon to Unicode](https://github.com/danhhong/limon_unicode_converter)

- [seanghay/split-khmer](https://github.com/seanghay/split-khmer) Split a Khmer sentence into an array of words.

- [seanghay/khmertokenizer](https://github.com/seanghay/khmertokenizer)

- [seanghay/khmerword](https://github.com/seanghay/khmerword)

- [seanghay/khmernumber](https://github.com/seanghay/khmernumber)

- [seanghay/khmernormalizer](https://github.com/seanghay/khmernormalizer)

- [khmer-ocr-benchmark-dataset](https://github.com/EKYCSolutions/khmer-ocr-benchmark-dataset) A standardized benchmark dataset for Khmer Optical Character Recognition (OCR) engines.

- [Khmer utility functions](https://github.com/seanghay/is-khmer)

- [Trey314159/KhmerSyllableReordering](https://github.com/Trey314159/KhmerSyllableReordering)

- [khmer-dictionary-tools](https://code.google.com/archive/p/khmer-dictionary-tools/)

- [nota/split-graphemes](https://github.com/nota/split-graphemes)

- [NextSpell](https://nextspell.com/) - Spell checking, Khmer OCR, word segmentation

- [khmercut](https://github.com/seanghay/khmercut) A (fast) Khmer word segmentation toolkit.

- [Socret360/akara-python](https://github.com/Socret360/akara-python) AKARA: Open-Source Khmer Spell Checker

- [khmer-latin-name-transformer](https://github.com/seanghay/khmer-latin-name-transformer)

- [native-khmer-g2p](https://github.com/seanghay/native-khmer-g2p)

- [khmerphonemizer](https://github.com/seanghay/khmerphonemizer)

- [kfa](https://github.com/seanghay/kfa) A fast Khmer Forced Aligner powered by Wav2Vec2CTC and Phonetisaurus

- [sosap(សូរសព្ទ)](https://github.com/seanghay/sosap) Python binding for Phonetisaurus

- [khmer-unicode-converter](https://github.com/seanghay/khmer-unicode-converter) Khmer Unicode Converter

- [khmerpunctuate](https://github.com/seanghay/khmerpunctuate) Punctuation restoration for the Khmer language

- [khmerocr_tools](https://github.com/MetythornPenn/khmerocr_tools) Khmer OCR Synthetic Data Generator

- [Socret360/jaws](https://github.com/Socret360/jaws) Just Another Word Segmenter (JAWS): A Graph Neural Network Model for Khmer Word Segmentation

- [seanghay/khmersegment](https://github.com/seanghay/khmersegment) A Khmer word segmentation tool built for the NIPTICT (now CADT) Khmer Word Segmentation CRF model.

- [seanghay/khmer-acoustic-model-mfa](https://github.com/seanghay/khmer-acoustic-model-mfa) Train an acoustic model for the Khmer language with Montreal Forced Aligner

- [seanghay/tha](https://github.com/seanghay/tha) Tha (ថា) - A Khmer Text Normalization and Verbalization Toolkit

- [seanghay/khmerpronounce](https://github.com/seanghay/khmerpronounce) Khmer Pronunciation Toolkit

- [seanghay/khmer2number](https://github.com/seanghay/khmer2number) A Khmer word to number converter.

### 3. Datasets

- [khPOS (Khmer Part-of-Speech) Corpus for Khmer NLP Research and Developments](https://github.com/ye-kyaw-thu/khPOS/) 

- [ParaCrawl Corpus](https://paracrawl.eu/)

- [Asian Language Treebank (ALT) Project](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/)

- [phylypo/segmentation-crf-khmer](https://github.com/phylypo/segmentation-crf-khmer)

- [google/language-resources](https://github.com/google/language-resources/tree/master/km) Lexicon, Text normalization and Verbalizer

- [Illustrations and recordings for language learning](https://www.aakanee.com/) Audio recordings and illustrations

- [seanghay/khmer-dictionary-44k](https://huggingface.co/datasets/seanghay/khmer-dictionary-44k)

- [seanghay/km-speech-corpus](https://huggingface.co/datasets/seanghay/km-speech-corpus)

- [seanghay/bookmebus-reviews](https://huggingface.co/datasets/seanghay/bookmebus-reviews)

- [seanghay/khmer_mpwt_speech](https://huggingface.co/datasets/seanghay/khmer_mpwt_speech)

- [seanghay/khmer_kheng_info_speech](https://huggingface.co/datasets/seanghay/khmer_kheng_info_speech)

- [seanghay/khmer_grkpp_speech](https://huggingface.co/datasets/seanghay/khmer_grkpp_speech)

- [High quality TTS data for Khmer](https://openslr.org/42/)

- [Google FLEURS](https://huggingface.co/datasets/google/fleurs) Audio Dataset

- [mc4](https://huggingface.co/datasets/mc4) A multilingual, colossal, cleaned version of Common Crawl's web crawl corpus

- [Khmer LineBreaking Dictionary](https://github.com/sbbic/khmerlbdict)

- [Khmer tesseract-ocr](https://github.com/tesseract-ocr/tessdata/blob/main/khm.traineddata)

- [Khmerlang Mobile Keyboard data](https://khmerlang.com/posts/4)

- [Khmer Bible Recordings](http://littlex.net/khbible/)

- [SleukRith Set](https://github.com/donavaly/SleukRith-Set)

- [Khmer annotation](https://www.kaggle.com/datasets/keatchakravuth/khmer-annotation) Annotated Khmer Dataset for Word spotting

### 4. Research Papers

- [An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention](https://arxiv.org/abs/2106.10875)

- [Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search](https://arxiv.org/abs/2112.08918)

- [Khmer Text Classification Using Word Embedding and Neural Networks](https://arxiv.org/abs/2112.06748)

- [Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning](https://arxiv.org/abs/2103.16801)

- [Building WFST based Grapheme to Phoneme Conversion for Khmer](https://ksoky.github.io/static/pdf/wfst_g2p.pdf)

- [Query Expansion for Khmer Information Retrieval](https://aclanthology.org/W10-3211.pdf)

- [Building a Syllable Database to Solve the Problem of Khmer Word Segmentation](https://arxiv.org/pdf/1703.02166.pdf)

- [Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document](http://www.apsipa.org/proceedings_2014/data/paper/1406.pdf)

- [Khmer printed character recognition using attention-based Seq2Seq network](https://journalofscience.ou.edu.vn/index.php/tech-en/article/view/2217)

- [Khmer Word Segmentation Using Conditional Random Fields](https://att-astrec.nict.go.jp/member/ding/KhNLP2015-SEG.pdf)

- [A Large-scale Study of Statistical Machine Translation Methods for Khmer Language](https://aclanthology.org/Y15-1030.pdf)

- [A Rule-based Approach for Khmer Word Extraction](https://www.ieice.org/publications/conference-FIT-DVDs/FIT2010/pdf/E/E_007.pdf)

- [Khmer Word Segmentation and Out-of-Vocabulary Words Detection Using Collocation Measurement of Repeated Characters Subsequences](./assets/GITS-GITI_2012-2013_Van.pdf)

- [The Standard Khmer vowel system: An acoustic study](http://www.rupp.edu.kh/CJBAR/files/Vol-2-Issue-2/5-CHEM-Vol-2-Issue-2.pdf)

- [Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion](https://dl.acm.org/doi/fullHtml/10.1145/3464378)

- [Towards deep learning on speech recognition for Khmer language](https://mospace.umsystem.edu/xmlui/handle/10355/56110)

- [A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory](https://journalofscience.ou.edu.vn/index.php/tech-en/article/download/2219/1680)

- [Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence](https://s3.ap-northeast-2.amazonaws.com/journal-home/journal/jips/fullText/787/jips_18_4_9.pdf)

- [Detection and Correction of Homophonous Error Word for Khmer Language](https://www.researchgate.net/profile/Sok-Chea/publication/228963957_Detection_and_Correction_of_Homophonous_Error_Word_for_Khmer_Language/links/5572617108aeacff1ffacd75/Detection-and-Correction-of-Homophonous-Error-Word-for-Khmer-Language.pdf)

- [No Language Left Behind (NLLB)](https://ai.meta.com/research/no-language-left-behind/)

- [Phonological Principles And Automatic Phonemic And Phonetic Transcription Of Khmer Words](https://drive.google.com/file/d/1c_FXNy90pv06StsBMQz4Rzk87ulMqXyM/view)

- [Multi-lingual Transformer Training for Khmer Automatic Speech Recognition](http://www.sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/SOK-APSIPA19.pdf)

- [TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies](https://repository.kulib.kyoto-u.ac.jp/dspace/bitstream/2433/276897/1/s2717554522500072.pdf)

- [Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language](http://sap.ist.i.kyoto-u.ac.jp/EN/bib/intl/SOK-ICASSP23.pdf)

- [Khmer pronouncing dictionary: standard Khmer and Phnom Penh dialect](https://unesdoc.unesco.org/ark:/48223/pf0000246360)

- [ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition](https://www.mdpi.com/2313-433X/9/12/276)

- [Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition](https://www.mdpi.com/2313-433X/9/11/248)

- [Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition](https://ieeexplore.ieee.org/document/10316307)

  

### 5. Projects/Models

- [facebookresearch/fairseq/mms](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) Text to Speech and Speech to Text

- [Khmer Language Model using ULMFiT](https://ml.tovnah.com/khmer-ulmfit/)

- [KHMER WORD SEARCH BASE ON SEMANTIC RELATION](https://nlp.techostartup.center/)

- [Khmer Audio Dictionary](https://kheng.info/)

- [Khmer to IPA Converter](https://khmerlang.com/tools/khmer-ipa)

- [Khmer Phonemizer](https://huggingface.co/spaces/seanghay/khmer-g2p-ipa)

- [Khmer Text-to-Speech MMS](https://huggingface.co/spaces/seanghay/khmer-tts)

- [Khmer Part of Speech Tagging with XLM RoBERTa](https://huggingface.co/seanghay/khmer-pos-roberta)

- [Whisper Small Khmer Fine-tuned](https://huggingface.co/seanghay/whisper-small-khmer-v2)

- [Joint Word Segmentation and POS Tagging in Keras](https://github.com/Socret360/joint-khmer-word-segmentation-and-pos-tagging)

- [Socret360/akara-android](https://github.com/Socret360/akara-android)

- [vitouphy/wav2vec2-xls-r-300m-khmer](https://huggingface.co/vitouphy/wav2vec2-xls-r-300m-khmer)

- [vitouphy/wav2vec2-xls-r-1b-khmer](https://huggingface.co/vitouphy/wav2vec2-xls-r-1b-khmer)

- [Khmer Text Classification](https://huggingface.co/seanghay/khmer-text-classification-roberta)

- [khmerlang/khmer-text-summarizer](https://github.com/khmerlang/khmer-text-summarizer)

- [khmerlang/KhmerWordPrediction](https://github.com/khmerlang/KhmerWordPrediction)

- [khmerlang/elasticsearch-analysis-khmerlang](https://github.com/khmerlang/elasticsearch-analysis-khmerlang)

- [Khmer Fingerspelling](https://github.com/cadt-g6/khmer_fingerspelling)

- [isi-nlp/uroman](https://github.com/isi-nlp/uroman) Universal Romanizer

- [pisethx/khmer-word-segmentation](https://github.com/pisethx/khmer-word-segmentation)

- [khmer-forced-aligner](https://github.com/seanghay/khmer-forced-aligner)

- [Fast Khmer Dictionary](https://khmerdict.com)

- [SEANLP: Southeast Asia Natural Language Processing](https://github.com/zhaoshiyu/SEANLP)

- [Khmerlang-Keyboard](https://github.com/khmerlang/Khmerlang-Keyboard)

- [ericvida/khtransliterator](https://github.com/ericvida/khtransliterator)

- [Khmer Unicode Converter](https://github.com/chamnap/khmer_unicode_converter)

- [chantysothy/KhmerUnicodeConverter](https://github.com/chantysothy/KhmerUnicodeConverter)

- [Pretrained-BERT-model-for-Khmer-language](https://github.com/rifatul-rifat/Pretrained-BERT-model-for-Khmer-language)

- [Khmer Language Model for Handwritten Text Recognition on Historical Documents](https://github.com/SeanghortBorn/Khmer-Language-Model-v1.0)

- [Khmer Single Word TTS](https://huggingface.co/spaces/seanghay/KLEA)

- [SeaLLMs](https://huggingface.co/SeaLLMs) Large Language Models for Southeast Asia

- [XLM-RoBERTa-Khmer](https://huggingface.co/seanghay/xlm-roberta-khmer-small) Trained from scratch with a masked language modeling task on 5M Khmer sentences (162M words, 578K unique words) for 1M steps. Smaller than XLM-RoBERTa-Base.

### 6. Blog / Slides

- [Issues in Khmer syllable validation](https://lindenbergsoftware.com/en/notes/issues-in-khmer-syllable-validation/)

- [Khmer Machine Learning (ML) Experiment](https://ml.tovnah.com/)

- [Using AI to Generate Khmer Baby Names](https://medium.com/@phylypo/using-ai-to-generate-khmer-baby-names-b9b0af79ee83)

- [How domnung.com Ranks Khmer News](https://medium.com/@phylypo/how-domnung-com-ranks-khmer-news-92bd68989f7a)

- [Text Classification with scikit-learn on Khmer Documents](https://medium.com/@phylypo/text-classification-with-scikit-learn-on-khmer-documents-1a395317d195)

- [Multi-Class Text Classification on Khmer News Articles](https://medium.com/@phylypo/multi-class-text-classification-on-khmer-news-articles-d0937281a524)

- [Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)

  - [NLP: Text Segmentation Using Dictionary Based Algorithms](https://medium.com/@phylypo/nlp-text-segmentation-using-dictionary-based-algorithms-6d0a45a76c08)

  - [NLP: Text Segmentation with Ngram](https://medium.com/@phylypo/nlp-text-segmentation-with-ngram-b5506dbb514c)

  - [NLP: Text Segmentation Using Naive Bayes](https://medium.com/@phylypo/nlp-text-segmentation-using-naive-bayes-bccdd08ccf6f)

  - [NLP: Text Segmentation Using Hidden Markov Model](https://medium.com/@phylypo/nlp-text-segmentation-using-hidden-markov-model-f238743d87eb)

  - [NLP: Text Segmentation Using Maximum Entropy Markov Model](https://medium.com/@phylypo/nlp-text-segmentation-using-maximum-entropy-markov-model-c6160b13b248)

  - [NLP: Text Segmentation Using Conditional Random Fields](https://medium.com/@phylypo/nlp-text-segmentation-using-conditional-random-fields-e8ff1d2b6060)

- [Khmer Language Model Using ULMFiT (Feb 2020)](https://medium.com/@phylypo/khmer-language-model-using-ulmfit-b0f8ca4e15be)

- [Creating a Khmer Language Model using BERT](https://medium.com/@phylypo/creating-a-khmer-language-model-using-bert-9a12d3f12b03)

- [Building a Khmer Spelling Checker](https://towardsdatascience.com/building-a-khmer-spelling-checker-7e3356677335)

- [khmerlang.com](https://khmerlang.com/)

- [Khmer word spell correction using BK-Tree data structure and Levenshtein distance](https://engleangs.medium.com/khmer-word-spell-correction-using-bk-tree-data-structure-and-levenshtein-distance-dd4d98e3766a)

- [Introduction to kNN algorithm by experiment on Khmer Handwriting classification using Java 8](https://towardsdatascience.com/introduction-to-knn-machine-learning-algorithm-by-experiment-on-khmer-handwriting-classification-66a64652a02c)

- [Speech Synthesis and Low Resource Languages](./assets/SLTU_TTS_Tutorial.pdf)

- [The inclusion of Khmer script in Unicode: documents from 1996](https://khmertypography.com/khmer-unicode-letter-1996)

### 7. Misc

- [harfbuzz](https://github.com/harfbuzz/harfbuzz) A text shaping engine that supports the Khmer script.

- [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) A better BERT with multilingual support.

- [mt5-base](https://huggingface.co/google/mt5-base) Google T5 with multilingual support.

- [byt5-base](https://huggingface.co/google/byt5-base) Google T5 without a tokenizer.

- [sentencepiece](https://github.com/google/sentencepiece) A tool to create a tokenizer

- [huggingface/transformers](https://github.com/huggingface/transformers)

- [tiktoken](https://github.com/openai/tiktoken)

- [montreal-forced-aligner](https://montreal-forced-aligner.readthedocs.io/) Acoustic Model & Alignment

- [pair_ngram](https://github.com/google-research/google-research/tree/master/pair_ngram) Building grapheme-to-phoneme (G2P) models

- [fastText](https://fasttext.cc/)

- [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus) Building grapheme-to-phoneme (G2P) models

- [Compact Language Detector v3](https://github.com/google/cld3) Language Detection tool

---

### 8. People

1. [Danh Hong](https://github.com/danhhong) OCR, Typography, Spellchecker, Standard/Specification

2. [Sovichet Tep](https://github.com/sovichet) Typography, Standard/Specification

3. [Dr. Rina Buoy](https://scholar.google.com/citations?user=Zw4RKwcAAAAJ) NLP, OCR, Document OCR

4. [Dr. Kak Soky](https://scholar.google.com/citations?user=221cQOgAAAAJ) TTS, ASR, Machine Translation

5. [Rathanak Sreang](https://github.com/RathanakSreang) NLP, SpellChecker

6. [Socret Lee](https://github.com/Socret360) OCR, SpellChecker, NLP, Other deep learning tasks.

7. [Vitou Phy](https://demystifyml.co/) Khmer OCR, ASR, SpellChecker, NLP, Other deep learning tasks.

8. [Marc Durdin](https://github.com/mcdurdin) Khmer Specification, Keyboard, Encoding

9. [Makara Sok](https://github.com/MakaraSok) Specification, Keyboard, Encoding, Phonetics

10. [Seanghay Yath](https://github.com/seanghay) TTS, ASR, NLP

11. You - Please send a Pull Request :)

---

> Khmer is not a low-resource language.