https://github.com/seanghay/awesome-khmer-language


## Awesome Khmer Language

A large collection of Khmer language resources. Khmer is the official language of Cambodia.

**Pull requests are very welcome!**

### 1. Specification

- [Khmer Characters - The Unicode Standard 15.0](https://www.unicode.org/charts/PDF/U1780.pdf)
- [Khmer Encoding Structure - Unicode](https://www.unicode.org/L2/L2021/21241-khmer-structure.pdf)
- [sillsdev/khmer-character-specification](https://github.com/sillsdev/khmer-character-specification)
- [Khmer Layout Requirements](https://www.w3.org/International/sealreq/khmer)
- [wiki/Khmer_language](https://en.wikipedia.org/wiki/Khmer_language)
- [wiki/Khmer_script](https://en.wikipedia.org/wiki/Khmer_script)
- [wiki/Romanization_of_Khmer](https://en.wikipedia.org/wiki/Romanization_of_Khmer)
- [http://www.eki.ee/wgrs/rom1_km.pdf](http://www.eki.ee/wgrs/rom1_km.pdf)
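
As a rough illustration of what the code charts above cover: the main Khmer block spans U+1780 to U+17FF, and the separate Khmer Symbols block spans U+19E0 to U+19FF. The sketch below checks whether text contains characters from either block. The helper names `is_khmer_char` and `contains_khmer` are hypothetical and are not taken from any library in this list. The check is a plain range test, so it also accepts the few unassigned code points inside the blocks.

```python
# Code point ranges taken from the Unicode code charts.
KHMER_BLOCK = (0x1780, 0x17FF)          # Khmer
KHMER_SYMBOLS_BLOCK = (0x19E0, 0x19FF)  # Khmer Symbols


def is_khmer_char(ch: str) -> bool:
    """Return True if a single character falls in either Khmer block."""
    cp = ord(ch)
    return (
        KHMER_BLOCK[0] <= cp <= KHMER_BLOCK[1]
        or KHMER_SYMBOLS_BLOCK[0] <= cp <= KHMER_SYMBOLS_BLOCK[1]
    )


def contains_khmer(text: str) -> bool:
    """Return True if any character in the text is in a Khmer block."""
    return any(is_khmer_char(ch) for ch in text)
```

For example, `contains_khmer("\u1781\u17d2\u1798\u17c2\u179a")` (the word "Khmer" written in Khmer script) is true, while `contains_khmer("hello")` is false.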

### 2. Toolkit

- [automatic-phonemic-and-phonetic-transcription](https://gitlab.com/mkrlab/automatic-phonemic-and-phonetic-transcription)
- [Khmer Word Segmentation - Rina Buoy](https://github.com/rinabuoy/KhmerNLP)
- [Khmer natural language processing toolkit](https://github.com/VietHoang1512/khmer-nltk)
- [Khmer Limon to Unicode](https://github.com/danhhong/limon_unicode_converter)
- [seanghay/split-khmer](https://github.com/seanghay/split-khmer) Split a Khmer sentence into an array of words.
- [seanghay/khmertokenizer](https://github.com/seanghay/khmertokenizer)
- [seanghay/khmerword](https://github.com/seanghay/khmerword)
- [seanghay/khmernumber](https://github.com/seanghay/khmernumber)
- [seanghay/khmernormalizer](https://github.com/seanghay/khmernormalizer)
- [khmer-ocr-benchmark-dataset](https://github.com/EKYCSolutions/khmer-ocr-benchmark-dataset) A standardized benchmark dataset for Khmer Optical Character Recognition (OCR) engines.
- [Khmer utility functions](https://github.com/seanghay/is-khmer)
- [Trey314159/KhmerSyllableReordering](https://github.com/Trey314159/KhmerSyllableReordering)
- [khmer-dictionary-tools](https://code.google.com/archive/p/khmer-dictionary-tools/)
- [nota/split-graphemes](https://github.com/nota/split-graphemes)
- [NextSpell](https://nextspell.com/) - Spell checking, Khmer OCR, word segmentation
- [khmercut](https://github.com/seanghay/khmercut) A (fast) Khmer word segmentation toolkit.
- [Socret360/akara-python](https://github.com/Socret360/akara-python) AKARA: Open-Source Khmer Spell Checker
- [khmer-latin-name-transformer](https://github.com/seanghay/khmer-latin-name-transformer)
- [native-khmer-g2p](https://github.com/seanghay/native-khmer-g2p)
- [khmerphonemizer](https://github.com/seanghay/khmerphonemizer)
- [kfa](https://github.com/seanghay/kfa) A fast Khmer Forced Aligner powered by Wav2Vec2CTC and Phonetisaurus
- [sosap(សូរសព្ទ)](https://github.com/seanghay/sosap) Python binding for Phonetisaurus
- [khmer-unicode-converter](https://github.com/seanghay/khmer-unicode-converter) Khmer Unicode Converter
- [khmerpunctuate](https://github.com/seanghay/khmerpunctuate) Punctuation restoration for the Khmer language
- [khmerocr_tools](https://github.com/MetythornPenn/khmerocr_tools) Khmer OCR Synthetic Data Generator
- [Socret360/jaws](https://github.com/Socret360/jaws) Just Another Word Segmenter (JAWS): A Graph Neural Network Model for Khmer Word Segmentation
- [seanghay/khmersegment](https://github.com/seanghay/khmersegment) A Khmer word segmentation tool built for NIPTICT (now CADT) Khmer Word Segmentation CRF model.
- [seanghay/khmer-acoustic-model-mfa](https://github.com/seanghay/khmer-acoustic-model-mfa) Train an acoustic model for the Khmer language with Montreal Forced Aligner
- [seanghay/tha](https://github.com/seanghay/tha) Tha (ថា) - A Khmer Text Normalization and Verbalization Toolkit
- [seanghay/khmerpronounce](https://github.com/seanghay/khmerpronounce) Khmer Pronunciation Toolkit
- [seanghay/khmer2number](https://github.com/seanghay/khmer2number) A Khmer word to number converter.
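
Several of the tools above deal with Khmer numbers. As a minimal sketch of the simplest part of that problem, the snippet below maps between Khmer digits (U+17E0 to U+17E9) and ASCII digits using only the Python standard library. The function names are hypothetical, and this is not how any of the listed libraries is implemented. Converting spelled-out number words is a separate, harder task that this sketch does not attempt.

```python
# Khmer digits occupy U+17E0..U+17E9, in order from zero to nine.
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# Translation tables for str.translate, one per direction.
_TO_ASCII = str.maketrans(KHMER_DIGITS, ASCII_DIGITS)
_TO_KHMER = str.maketrans(ASCII_DIGITS, KHMER_DIGITS)


def khmer_digits_to_ascii(text: str) -> str:
    """Replace Khmer digits with ASCII digits, leaving other text as-is."""
    return text.translate(_TO_ASCII)


def ascii_digits_to_khmer(text: str) -> str:
    """Replace ASCII digits with Khmer digits, leaving other text as-is."""
    return text.translate(_TO_KHMER)
```

Because Khmer digits are Unicode decimal digits, Python's built-in `int()` also parses a string made up only of Khmer digits directly. The translation tables are useful when the digits are embedded in longer text.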

### 3. Datasets

- [khPOS (Khmer Part-of-Speech) Corpus for Khmer NLP Research and Developments](https://github.com/ye-kyaw-thu/khPOS/)
- [ParaCrawl Corpus](https://paracrawl.eu/)
- [Asian Language Treebank (ALT) Project](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/)
- [phylypo/segmentation-crf-khmer](https://github.com/phylypo/segmentation-crf-khmer)
- [google/language-resources](https://github.com/google/language-resources/tree/master/km) Lexicon, Text normalization and Verbalizer
- [Illustrations and recordings for language learning](https://www.aakanee.com/) Audio recordings and illustrations
- [seanghay/khmer-dictionary-44k](https://huggingface.co/datasets/seanghay/khmer-dictionary-44k)
- [seanghay/km-speech-corpus](https://huggingface.co/datasets/seanghay/km-speech-corpus)
- [seanghay/bookmebus-reviews](https://huggingface.co/datasets/seanghay/bookmebus-reviews)
- [seanghay/khmer_mpwt_speech](https://huggingface.co/datasets/seanghay/khmer_mpwt_speech)
- [seanghay/khmer_kheng_info_speech](https://huggingface.co/datasets/seanghay/khmer_kheng_info_speech)
- [seanghay/khmer_grkpp_speech](https://huggingface.co/datasets/seanghay/khmer_grkpp_speech)
- [High quality TTS data for Khmer](https://openslr.org/42/)
- [Google FLEURS](https://huggingface.co/datasets/google/fleurs) Audio Dataset
- [mc4](https://huggingface.co/datasets/mc4) A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus
- [Khmer LineBreaking Dictionary](https://github.com/sbbic/khmerlbdict)
- [Khmer tesseract-ocr](https://github.com/tesseract-ocr/tessdata/blob/main/khm.traineddata)
- [Khmerlang Mobile Keyboard data](https://khmerlang.com/posts/4)
- [Khmer Bible Recordings](http://littlex.net/khbible/)
- [SleukRith Set](https://github.com/donavaly/SleukRith-Set)
- [Khmer annotation](https://www.kaggle.com/datasets/keatchakravuth/khmer-annotation) Annotated Khmer Dataset for Word spotting

### 4. Research Papers

- [An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention](https://arxiv.org/abs/2106.10875)
- [Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search](https://arxiv.org/abs/2112.08918)
- [Khmer Text Classification Using Word Embedding and Neural Networks](https://arxiv.org/abs/2112.06748)
- [Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning](https://arxiv.org/abs/2103.16801)
- [Building WFST based Grapheme to Phoneme Conversion for Khmer](https://ksoky.github.io/static/pdf/wfst_g2p.pdf)
- [Query Expansion for Khmer Information Retrieval](https://aclanthology.org/W10-3211.pdf)
- [Building a Syllable Database to Solve the Problem of Khmer Word Segmentation](https://arxiv.org/pdf/1703.02166.pdf)
- [Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document](http://www.apsipa.org/proceedings_2014/data/paper/1406.pdf)
- [Khmer printed character recognition using attention-based Seq2Seq network](https://journalofscience.ou.edu.vn/index.php/tech-en/article/view/2217)
- [Khmer Word Segmentation Using Conditional Random Fields](https://att-astrec.nict.go.jp/member/ding/KhNLP2015-SEG.pdf)
- [A Large-scale Study of Statistical Machine Translation Methods for Khmer Language](https://aclanthology.org/Y15-1030.pdf)
- [A Rule-based Approach for Khmer Word Extraction](https://www.ieice.org/publications/conference-FIT-DVDs/FIT2010/pdf/E/E_007.pdf)
- [Khmer Word Segmentation and Out-of-Vocabulary Words Detection Using Collocation Measurement of Repeated Characters Subsequences](./assets/GITS-GITI_2012-2013_Van.pdf)
- [The Standard Khmer vowel system: An acoustic study](http://www.rupp.edu.kh/CJBAR/files/Vol-2-Issue-2/5-CHEM-Vol-2-Issue-2.pdf)
- [Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion](https://dl.acm.org/doi/fullHtml/10.1145/3464378)
- [Towards deep learning on speech recognition for Khmer language](https://mospace.umsystem.edu/xmlui/handle/10355/56110)
- [A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory](https://journalofscience.ou.edu.vn/index.php/tech-en/article/download/2219/1680)
- [Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence](https://s3.ap-northeast-2.amazonaws.com/journal-home/journal/jips/fullText/787/jips_18_4_9.pdf)
- [Detection and Correction of Homophonous Error Word for Khmer Language](https://www.researchgate.net/profile/Sok-Chea/publication/228963957_Detection_and_Correction_of_Homophonous_Error_Word_for_Khmer_Language/links/5572617108aeacff1ffacd75/Detection-and-Correction-of-Homophonous-Error-Word-for-Khmer-Language.pdf)
- [No Language Left Behind (NLLB)](https://ai.meta.com/research/no-language-left-behind/)
- [Phonological Principles And Automatic Phonemic And Phonetic Transcription Of Khmer Words](https://drive.google.com/file/d/1c_FXNy90pv06StsBMQz4Rzk87ulMqXyM/view)
- [Multi-lingual Transformer Training for Khmer Automatic Speech Recognition](http://www.sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/SOK-APSIPA19.pdf)
- [TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies](https://repository.kulib.kyoto-u.ac.jp/dspace/bitstream/2433/276897/1/s2717554522500072.pdf)
- [Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language](http://sap.ist.i.kyoto-u.ac.jp/EN/bib/intl/SOK-ICASSP23.pdf)
- [Khmer pronouncing dictionary: standard Khmer and Phnom Penh dialect](https://unesdoc.unesco.org/ark:/48223/pf0000246360)
- [ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition](https://www.mdpi.com/2313-433X/9/12/276)
- [Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition](https://www.mdpi.com/2313-433X/9/11/248)
- [Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition](https://ieeexplore.ieee.org/document/10316307)

### 5. Projects/Models

- [facebookresearch/fairseq/mms](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) Text to Speech and Speech to Text
- [Khmer Language Model using ULMFiT](https://ml.tovnah.com/khmer-ulmfit/)
- [KHMER WORD SEARCH BASE ON SEMANTIC RELATION](https://nlp.techostartup.center/)
- [Khmer Audio Dictionary](https://kheng.info/)
- [Khmer to IPA Converter](https://khmerlang.com/tools/khmer-ipa)
- [Khmer Phonemizer](https://huggingface.co/spaces/seanghay/khmer-g2p-ipa)
- [Khmer Text-to-Speech MMS](https://huggingface.co/spaces/seanghay/khmer-tts)
- [Khmer Part of Speech Tagging with XLM RoBERTa](https://huggingface.co/seanghay/khmer-pos-roberta)
- [Whisper Small Khmer Fine-tuned](https://huggingface.co/seanghay/whisper-small-khmer-v2)
- [Joint Word Segmentation and POS Tagging in Keras](https://github.com/Socret360/joint-khmer-word-segmentation-and-pos-tagging)
- [Socret360/akara-android](https://github.com/Socret360/akara-android)
- [vitouphy/wav2vec2-xls-r-300m-khmer](https://huggingface.co/vitouphy/wav2vec2-xls-r-300m-khmer)
- [vitouphy/wav2vec2-xls-r-1b-khmer](https://huggingface.co/vitouphy/wav2vec2-xls-r-1b-khmer)
- [Khmer Text Classification](https://huggingface.co/seanghay/khmer-text-classification-roberta)
- [khmerlang/khmer-text-summarizer](https://github.com/khmerlang/khmer-text-summarizer)
- [khmerlang/KhmerWordPrediction](https://github.com/khmerlang/KhmerWordPrediction)
- [khmerlang/elasticsearch-analysis-khmerlang](https://github.com/khmerlang/elasticsearch-analysis-khmerlang)
- [Khmer Fingerspelling](https://github.com/cadt-g6/khmer_fingerspelling)
- [isi-nlp/uroman](https://github.com/isi-nlp/uroman) Universal Romanizer
- [pisethx/khmer-word-segmentation](https://github.com/pisethx/khmer-word-segmentation)
- [khmer-forced-aligner](https://github.com/seanghay/khmer-forced-aligner)
- [Fast Khmer Dictionary](https://khmerdict.com)
- [SEANLP: Southeast Asia Natural Language Processing](https://github.com/zhaoshiyu/SEANLP)
- [Khmerlang-Keyboard](https://github.com/khmerlang/Khmerlang-Keyboard)
- [ericvida/khtransliterator](https://github.com/ericvida/khtransliterator)
- [Khmer Unicode Converter](https://github.com/chamnap/khmer_unicode_converter)
- [chantysothy/KhmerUnicodeConverter](https://github.com/chantysothy/KhmerUnicodeConverter)
- [Pretrained-BERT-model-for-Khmer-language](https://github.com/rifatul-rifat/Pretrained-BERT-model-for-Khmer-language)
- [Khmer Language Model for Handwritten Text Recognition on Historical Documents](https://github.com/SeanghortBorn/Khmer-Language-Model-v1.0)
- [Khmer Single Word TTS](https://huggingface.co/spaces/seanghay/KLEA)
- [SeaLLMs](https://huggingface.co/SeaLLMs) Large Language Models for Southeast Asia
- [XLM-RoBERTa-Khmer](https://huggingface.co/seanghay/xlm-roberta-khmer-small) Trained from scratch on a masked language modeling task using 5M Khmer sentences (162M words, 578K unique words) for 1M steps. It is smaller than XLM-RoBERTa-Base.

### 6. Blog / Slides

- [Issues in Khmer syllable validation](https://lindenbergsoftware.com/en/notes/issues-in-khmer-syllable-validation/)
- [Khmer Machine Learning (ML) Experiment](https://ml.tovnah.com/)
- [Using AI to Generate Khmer Baby Names](https://medium.com/@phylypo/using-ai-to-generate-khmer-baby-names-b9b0af79ee83)
- [How domnung.com Ranks Khmer News](https://medium.com/@phylypo/how-domnung-com-ranks-khmer-news-92bd68989f7a)
- [Text Classification with scikit-learn on Khmer Documents](https://medium.com/@phylypo/text-classification-with-scikit-learn-on-khmer-documents-1a395317d195)
- [Multi-Class Text Classification on Khmer News Articles](https://medium.com/@phylypo/multi-class-text-classification-on-khmer-news-articles-d0937281a524)
- [Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)
- [NLP: Text Segmentation Using Dictionary Based Algorithms](https://medium.com/@phylypo/nlp-text-segmentation-using-dictionary-based-algorithms-6d0a45a76c08)
- [NLP: Text Segmentation with Ngram](https://medium.com/@phylypo/nlp-text-segmentation-with-ngram-b5506dbb514c)
- [NLP: Text Segmentation Using Naive Bayes](https://medium.com/@phylypo/nlp-text-segmentation-using-naive-bayes-bccdd08ccf6f)
- [NLP: Text Segmentation Using Hidden Markov Model](https://medium.com/@phylypo/nlp-text-segmentation-using-hidden-markov-model-f238743d87eb)
- [NLP: Text Segmentation Using Maximum Entropy Markov Model](https://medium.com/@phylypo/nlp-text-segmentation-using-maximum-entropy-markov-model-c6160b13b248)
- [NLP: Text Segmentation Using Conditional Random Fields](https://medium.com/@phylypo/nlp-text-segmentation-using-conditional-random-fields-e8ff1d2b6060)
- [Khmer Language Model Using ULMFiT (Feb 2020)](https://medium.com/@phylypo/khmer-language-model-using-ulmfit-b0f8ca4e15be)
- [Creating a Khmer Language Model using BERT](https://medium.com/@phylypo/creating-a-khmer-language-model-using-bert-9a12d3f12b03)
- [Building a Khmer Spelling Checker](https://towardsdatascience.com/building-a-khmer-spelling-checker-7e3356677335)
- [khmerlang.com](https://khmerlang.com/)
- [Khmer word spell correction using BK-Tree data structure and Levenshtein distance](https://engleangs.medium.com/khmer-word-spell-correction-using-bk-tree-data-structure-and-levenshtein-distance-dd4d98e3766a)
- [Introduction to kNN algorithm by experiment on Khmer Handwriting classification using Java 8](https://towardsdatascience.com/introduction-to-knn-machine-learning-algorithm-by-experiment-on-khmer-handwriting-classification-66a64652a02c)
- [Speech Synthesis and Low Resource Languages](./assets/SLTU_TTS_Tutorial.pdf)
- [The Inclusion of Khmer Script in Unicode: 1996 Documents](https://khmertypography.com/khmer-unicode-letter-1996)

### 7. Misc

- [harfbuzz](https://github.com/harfbuzz/harfbuzz) A text shaping engine that supports Khmer language.
- [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) A better BERT with multilingual support.
- [mt5-base](https://huggingface.co/google/mt5-base) Google's T5 with multilingual support.
- [byt5-base](https://huggingface.co/google/byt5-base) Google's byte-level T5, which works without a tokenizer.
- [sentencepiece](https://github.com/google/sentencepiece) A tool to create a tokenizer
- [huggingface/transformers](https://github.com/huggingface/transformers)
- [tiktoken](https://github.com/openai/tiktoken)
- [montreal-forced-aligner](https://montreal-forced-aligner.readthedocs.io/) Acoustic Model & Alignment
- [pair_ngram](https://github.com/google-research/google-research/tree/master/pair_ngram) Building grapheme-to-phoneme (G2P) models
- [fastText](https://fasttext.cc/)
- [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus) Building grapheme-to-phoneme (G2P) models
- [Compact Language Detector v3](https://github.com/google/cld3) Language Detection tool

---

> Khmer is not a low-resource language.