https://github.com/vndee/awsome-vietnamese-nlp

A collection of Vietnamese Natural Language Processing resources.

# Vietnamese Natural Language Processing Resources

> Create a pull request or open an issue to add your work to this list.

- [Large Language Models](#Large-Language-Models)
- [Corpus](#Corpus)
- [Text Processing Toolkit](#Text-Processing-Toolkit)
- [Pre-trained Language Model](#Pre-trained-Language-Model)
- [Sentiment Analysis](#Sentiment-Analysis)
- [Named Entity Recognition](#Named-Entity-Recognition)
- [Speech Processing](#Speech-Processing)

### Large Language Models
- [GemSUra](https://huggingface.co/collections/ura-hcmut/gemsura-65da96cd27be2e8c65f17131): Pretrained Large Language Models based on Gemma built by URA (HCMUT).
- [Ghost-7b](https://huggingface.co/lamhieu/ghost-7b-v0.9.0): Fine-tuned from HuggingFaceH4/zephyr-7b-beta on a small synthetic dataset (about 200MB) that is 50% English and 50% Vietnamese.
- [PhoGPT](https://github.com/VinAIResearch/PhoGPT): An open-source, state-of-the-art 7.5B-parameter generative model series for Vietnamese, comprising the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant PhoGPT-7B5-Instruct.
- [Sailor](https://huggingface.co/collections/sail/sailor-language-models-65e19a749f978976f1959825): Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
- [SeaLLM](https://huggingface.co/collections/SeaLLMs/seallms-65be16f92e67686440ae29f3): The state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭.
- [ToRoLaMa](https://github.com/allbyai/ToRoLaMa): The Vietnamese Instruction-Following and Chat Model.
- [Vistral-7B-Chat-function-calling](https://huggingface.co/hiieu/Vistral-7B-Chat-function-calling): This model was fine-tuned on Vistral-7B-chat for function calling.
- [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat): Towards a State-of-the-Art Large Language Model for Vietnamese
- [ViGPTQA](https://github.com/DopikAI-Labs/ViGPT): LLMs for Vietnamese Question Answering
- [VBD-LLaMA2-Chat](https://huggingface.co/LR-AI-Labs/vbd-llama2-7B-50b-chat): A Conversationally-tuned LLaMA2 for Vietnamese.
- [Vietnamese LLaMA 2](https://github.com/bkai-research/Vietnamese-LLaMA-2): A 7B version of LLaMA 2 with 140GB of Vietnamese text by BKAI Foundation Models Lab.
- [VinaLLaMA](https://huggingface.co/collections/vilm/vinallama-654a099308775ce78e630a6f): Another collection of Vietnamese LLaMA-tuned models.
- [Vietcuna](https://github.com/vilm-ai/vietcuna): A series of Vicuna tuned models for Vietnamese.
- [Llama2_vietnamese](https://github.com/ngoanpv/llama2_vietnamese): A fine-tuned Large Language Model (LLM) for the Vietnamese language based on the Llama 2 model.
- [Vietnamese_LLMs](https://github.com/VietnamAIHub/Vietnamese_LLMs): This project aims to create high-quality Vietnamese instruction datasets and fine-tune several open-source large language models (LLMs). So far, they have released various models based on LLaMA and BLOOMZ, along with five instruction datasets, most of which were generated by GPT-4.

### Corpus
> For more recent datasets, consider searching Hugging Face for datasets that include Vietnamese: https://huggingface.co/datasets?language=language:vi&sort=trending
- [Math Instruction datasets](https://huggingface.co/collections/5CD-AI/math-instruction-datasets-660801f244a011983be58fe0): A series of translated datasets by 5CD AI Team.
- [LLaVA - Visual Question Answering](https://huggingface.co/collections/5CD-AI/llava-visual-question-answering-6608019995db9114e35b1fb9): A series of translated datasets by 5CD AI Team.
- [CoT Instruction datasets](https://huggingface.co/collections/5CD-AI/cot-instruction-datasets-660800b52e58edd19eafe7e6): A series of translated datasets by 5CD AI Team.
- [DPO Instruction datasets](https://huggingface.co/collections/5CD-AI/dpo-instruction-datasets-6608026d80f057ee616e8bf5): A series of translated datasets by 5CD AI Team.
- [Retrieve-Rerank datasets](https://huggingface.co/collections/5CD-AI/retrieve-rerank-datasets-6660436222834190f7f26c0d): A series of translated datasets by 5CD AI Team.
- [Coding Instruction datasets](https://huggingface.co/collections/5CD-AI/coding-instruction-datasets-666fde69ad3050dd2bc67e6a): A series of translated datasets by 5CD AI Team.
- [Chat Instruction datasets](https://huggingface.co/collections/5CD-AI/chat-instruction-datasets-666fdf510de9ee884ba43026): A series of translated datasets by 5CD AI Team.
- [VN News Corpus](https://github.com/binhvq/news-corpus): 50GB of uncompressed text crawled from a wide range of news websites and topics.
- [10000 Vietnamese Books](https://www.kaggle.com/datasets/iambestfeeder/10000-vietnamese-books): 10000 Vietnamese Books from 195x.
- [CulturaX](https://huggingface.co/papers/2309.09400): A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
- [Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X): The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages.
- [OSCAR](https://oscar-corpus.com/): 68GB of text data with 12,036,845,359 words.
- [Common Crawl](https://commoncrawl.org/): Open repository of web crawl data.
- [WikiDumps](https://dumps.wikimedia.org/): You can download directly or use scripts from [viwik18](https://github.com/NTT123/viwik18), [viwik19](https://github.com/NTT123/viwik19).
- [Vietnamese Treebank](https://vlsp.hpda.vn/demo/?page=resources): VLSP Project.
- [Vietnamese Stopwords](https://github.com/stopwords/vietnamese-stopwords): Vietnamese stopwords.
- [Vietnamese Dictionary](https://www.informatik.uni-leipzig.de/~duc/Dict/): Vietnamese dictionary.
- [vietnamese-wordnet](https://github.com/zeloru/vietnamese-wordnet): Vietnamese wordnet.
- [VietnameseWAC](https://xltiengviet.fandom.com/wiki/VietnameseWAC): The dataset comprises a substantial collection of Vietnamese text, consisting of 129,781,089 tokens and 106,464,835 words, which have been automatically segmented and labeled as per Kilgarriff, A., and Le-Hong, P., 2012.
- [Vietlex Corpus](https://www.vietlex.com/help/about_corpus.htm): Vietlex's Vietnamese Corpus, a pioneering effort in Vietnam since 1998, contains about 80 million syllables from various sources.
- [Lexical Database of Vietnamese](https://era.library.ualberta.ca/items/90d5b06c-e508-45b3-8526-3509bceb930e): A lexical database of Vietnamese containing various lexical information derived from two Vietnamese corpora.

### Text Processing Toolkit
- [coccoc-tokenizer](https://github.com/coccoc/coccoc-tokenizer): High-performance tokenizer for the Vietnamese language, written in C++ with Python and Java bindings.
- [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter): Fast and accurate Vietnamese word segmenter (LREC 2018).
- [RDRPOSTagger](https://github.com/datquocnguyen/RDRPOSTagger): Fast and accurate POS and morphological tagging toolkit (EACL 2014).
- [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP): A Vietnamese natural language processing toolkit (NAACL 2018).
- [vlp-tok](https://github.com/phuonglh/vlp): Vietnamese text processing library developed in the Scala programming language.
- [ETNLP](https://github.com/vietnlp/etnlp): A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
- [VietnameseTextNormalizer](https://github.com/langmaninternet/VietnameseTextNormalizer): Vietnamese Text Normalizer.
- [nnvlp](https://github.com/pth1993/NNVLP): Neural network-based Vietnamese language processing toolkit.
- [jPTDP](https://github.com/datquocnguyen/jPTDP): Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
- [vi_spacy](https://github.com/trungtv/vi_spacy): Vietnamese language model compatible with spaCy.
- [underthesea](https://underthesea.readthedocs.io/en/latest/readme.html): Underthesea - Vietnamese NLP toolkit.
- [vnlp](https://bitbucket.org/epilab/vnlp/wiki/Home): GATE plugin for Vietnamese language processing.
- [pyvi](https://github.com/trungtv/pyvi): Python Vietnamese toolkit.
- [JVnTextPro](http://jvntextpro.sourceforge.net/): Java-based Vietnamese text processing tool.
- [DongDu](https://github.com/rockkhuya/DongDu): C++ implementation of Vietnamese word segmentation tool.
- [VLSP Toolkit](https://vlsp.hpda.vn/demo/?page=resources): Vietnamese tokenizer from VLSP.
- [vTools](https://github.com/lupanh/vTools): Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
- [JNSP](http://jnsp.sourceforge.net/): Java Implementation of Ngram Statistic Package.
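
Several of the tools above perform word segmentation, that is, grouping whitespace-separated Vietnamese syllables into multi-syllable words. The sketch below illustrates the task with a greedy longest-match lookup. It is not the algorithm of any toolkit listed here; the tiny lexicon, the `segment` function, and the convention of joining syllables with underscores are assumptions made for illustration only.

```python
import unicodedata

# Hypothetical toy lexicon of two-syllable words (not taken from any toolkit above).
# Entries are NFC-normalized so they compare equal to normalized input.
LEXICON = {
    tuple(unicodedata.normalize("NFC", word).split())
    for word in ("học sinh", "sinh viên", "đại học", "việt nam")
}
MAX_WORD_LEN = max(len(entry) for entry in LEXICON)


def segment(sentence: str) -> list[str]:
    """Greedy longest-match segmentation over whitespace-separated syllables."""
    # NFC makes composed and decomposed diacritics compare equal.
    syllables = unicodedata.normalize("NFC", sentence).split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            chunk = syllables[i:i + n]
            key = tuple(s.lower() for s in chunk)
            if n == 1 or key in LEXICON:
                words.append("_".join(chunk))
                i += n
                break
    return words


print(segment("Tôi là sinh viên đại học"))
# ['Tôi', 'là', 'sinh_viên', 'đại_học']
```

A dictionary lookup like this cannot resolve ambiguous syllable sequences, which is one reason to prefer a trained segmenter for real text.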

### Pre-trained Language Model
- [RoBERTa Vietnamese](https://github.com/nguyenvulebinh/vietnamese-roberta): Pre-trained embeddings using the RoBERTa architecture on a Vietnamese corpus.
- [PhoBERT](https://github.com/VinAIResearch/PhoBERT): Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
- [ALBERT for Vietnamese](https://github.com/ngoanpv/albert_vi): "A Lite" version of BERT for Vietnamese.
- [Vietnamese ELECTRA](https://github.com/nguyenvulebinh/vietnamese-electra): ELECTRA model pre-trained on a Vietnamese corpus.
- [word2vecVN](https://github.com/sonvx/word2vecVN): Pre-trained Word2Vec models for Vietnamese.

### Sentiment Analysis
#### Benchmark
- **[VLSP 2016 Shared Task: Sentiment Analysis](https://vlsp.org.vn/vlsp2016/eval/sa)**
- Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).
- Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

| Model | F1 | Paper | Code |
|-----------------------|----|-------|------|
| Perceptron/SVM/Maxent | 80.05 | DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews | |
| SVM/MLNN/LSTM | 71.44 | A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016 | |
| Ensemble: Random forest, SVM, Naive Bayes | 71.22 | A Lightweight Ensemble Method for Sentiment Classification Task | |
| Ensemble: SVM, LR, LSTM, CNN | 69.71 | An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis | |
| SVM | 67.54 | Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments | |
| SVM/MLNN | 67.23 | A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign | |
| Multi-channel LSTM-CNN | 59.61 | Multi-channel LSTM-CNN model for Vietnamese sentiment analysis | [official](https://github.com/ntienhuy/MultiChannel) |
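
The table above reports a single F1 number per system without stating how it is averaged. As a rough sketch, assuming a macro-averaged F1 over the three classes (an assumption, not something the list confirms), the score could be computed as follows. The function names and the toy labels are made up for illustration.

```python
LABELS = ("positive", "neutral", "negative")


def f1_per_class(gold, pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_per_class(gold, pred, label) for label in labels) / len(labels)


# Made-up toy data, not drawn from the VLSP 2016 corpus.
gold = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
pred = ["positive", "neutral", "neutral", "neutral", "negative", "positive"]
print(round(100 * macro_f1(gold, pred), 2))
# 65.56
```

Because the test set is balanced (350 sentences per class), macro and class-weighted averages would coincide on it.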

- **[VLSP 2018 Shared Task: Aspect Based Sentiment Analysis](https://vlsp.org.vn/vlsp2018)**
- **Restaurant Dataset**: 2961 reviews (training), 1290 reviews (development), 500 reviews (test).

| Model| Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
|---|---|---|---|---|
| CNN | 0.80 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
| SVM | 0.77 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
| SVM | 0.54 | 0.48 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |

- **Hotel Dataset**: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

| Model| Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
|---|---|---|---|---|
| SVM | 0.70 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
| CNN | 0.69 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
| SVM | 0.56 | 0.53 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |
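
The VLSP 2018 tables above report two scores: one for detecting aspects and one for detecting aspects together with their polarity. A minimal sketch of how such a pair of scores can be computed is shown below, assuming micro-averaged F1 over per-review label sets. The averaging scheme, the `micro_f1` and `aspects_only` helpers, and the aspect names are all assumptions for illustration, not the official scorer.

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-review sets of labels."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def aspects_only(review_sets):
    """Drop the polarity from each (aspect, polarity) pair."""
    return [{aspect for aspect, _ in labels} for labels in review_sets]


# Hypothetical annotations for two reviews (label names are illustrative).
gold = [
    {("FOOD#QUALITY", "positive"), ("SERVICE#GENERAL", "negative")},
    {("PRICE#GENERAL", "neutral")},
]
pred = [
    {("FOOD#QUALITY", "positive"), ("SERVICE#GENERAL", "positive")},
    {("PRICE#GENERAL", "neutral"), ("FOOD#QUALITY", "positive")},
]

aspect_f1 = micro_f1(aspects_only(gold), aspects_only(pred))
polarity_f1 = micro_f1(gold, pred)
print(round(aspect_f1, 2), round(polarity_f1, 2))
# 0.86 0.57
```

Under this scheme a prediction only counts toward the second score when both the aspect and its polarity match, so that score can never exceed the aspect-only score.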

- **[Vietnamese Student's Feedback Corpus (UIT-VSFC)](https://ieeexplore.ieee.org/document/8573337)**
- UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

| Model | Sentiment (F1) | Topic (F1) | Paper | Code |
|-------|----------------|------------|-------|------|
| Bi-LSTM/Word2Vec | 0.896 | 0.92 | Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus | |
| Maximum Entropy Classifier | 0.88 | 0.84 | UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis | |

### Named Entity Recognition
#### Benchmark
- **[VLSP 2016 Shared Task: Named Entity Recognition](http://vjs.ac.vn/index.php/jcc/article/view/13161/103810382796)**

| Model | F1 | Paper | Code |
|-------|----|-------|------|
| PhoBERT_large | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | [official](https://github.com/VinAIResearch/PhoBERT) |
| vELECTRA + BiLSTM + Attention | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
| PhoBERT_base | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | [official](https://github.com/VinAIResearch/PhoBERT) |
| XLM-R | 92.0 | PhoBERT: Pre-trained language models for Vietnamese | |
| VnCoreNLP-NER + ETNLP | 91.3 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
| BiLSTM-CNN-CRF + ETNLP | 91.1 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
| VNER: Attentive Neural Network | 89.6 | Attentive Neural Network for Named Entity Recognition in Vietnamese | |
| BiLSTM-CNN-CRF | 88.3 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | [official](https://github.com/vncorenlp/VnCoreNLP) |
| LSTM + CRF | 66.07 | An investigation of Vietnamese Nested Entity Recognition Models | |

- **[VLSP 2018 Shared Task: Named Entity Recognition](https://www.researchgate.net/publication/331956361_VLSP_Shared_Task_Named_Entity_Recognition)**

| Model | F1 | Paper | Code |
|-------|----|-------|------|
| vELECTRA + BiGRU | 90.31 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
| VIETNER: CRF (ngrams + word shapes + cluster + w2v) | 76.63 | A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign | |
| ZA-NER | 74.70 | ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign | |
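
NER results such as those above are commonly scored at the entity level: a predicted entity counts only if its boundaries and type both match the gold annotation. The sketch below shows that idea for flat BIO tag sequences. It is an illustration under that assumption, not the official VLSP scorer, and it does not handle nested entities. The function names and the example tags are made up.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))  # end index is exclusive
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1: a span is correct only on an exact boundary and type match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the predicted LOC span is one token too short.
gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(span_f1(gold, pred))
# 0.5
```

Stray `I-` tags that do not continue an open span of the same type are ignored here; stricter or more lenient conventions exist.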

### Speech Processing
#### Corpus
- VLSP 2020 - ASR challenge - training set: [announcement](https://institute.vinbigdata.org/events/vinbigdata-chia-se-100-gio-du-lieu-tieng-noi-cho-cong-dong/), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/vlsp2020_vinai_100h)
- VIVOS: [official link](http://ailab.hcmus.edu.vn/vivos), [mirror link on huggingface](https://huggingface.co/datasets/vivos)
- Bud500: [announcement](https://github.com/quocanh34/Bud500), [mirror link on huggingface](https://huggingface.co/datasets/linhtran92/viet_bud500)
- FOSD (FPT open speech dataset): [official link](https://data.mendeley.com/datasets/k9sxg2twv4/4), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/fpt_fosd)
- LSVSC (Large-scale Vietnamese speech corpus): [announcement](https://www.mdpi.com/2079-9292/13/5/977), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/LSVSC)
- Infore: [official link](https://www.facebook.com/groups/j2team.community/permalink/1010834009248719/), [unofficial mirror link for dataset 1 on huggingface](https://huggingface.co/datasets/doof-ferb/infore1_25hours), [unofficial mirror link for dataset 2 on huggingface](https://huggingface.co/datasets/doof-ferb/infore2_audiobooks)
- [unofficial mirror link Vivos + InfoRe 1 + InfoRe 2](https://github.com/TensorSpeech/TensorFlowASR/blob/main/README.md#vietnamese)
- [VietTTS-v1](https://github.com/NTT123/Vietnamese-Text-To-Speech-Dataset): A synthesized dataset for Vietnamese TTS task (35.1 hrs)
- [Mozilla CommonVoice](https://commonvoice.mozilla.org/vi/datasets)
- [Google FLEURS](https://huggingface.co/datasets/google/fleurs)

#### Project
- [vietTTS](https://github.com/NTT123/vietTTS): Tacotron + HiFiGAN vocoder for Vietnamese datasets.