https://github.com/vndee/awsome-vietnamese-nlp

A collection of Vietnamese Natural Language Processing resources.
Host: GitHub
URL: https://github.com/vndee/awsome-vietnamese-nlp
Owner: vndee
Created: 2020-08-08T02:58:41.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2024-07-03T08:44:32.000Z (over 1 year ago)
Last Synced: 2025-01-19T07:26:53.321Z (9 months ago)
Homepage:
Size: 109 KB
Stars: 237
Watchers: 5
Forks: 45
Open Issues: 0
Metadata Files:
- Readme: README.md

# Vietnamese Natural Language Processing Resources

> Create a pull request or issue to add your works into this list.

- [Large Language Models](#Large-Language-Models)

- [Corpus](#Corpus)

- [Text Processing Toolkit](#Text-Processing-Toolkit)

- [Pre-trained Language Model](#Pre-trained-Language-Model)

- [Sentiment Analysis](#Sentiment-Analysis)

- [Named Entity Recognition](#Named-Entity-Recognition)

- [Speech Processing](#Speech-Processing)

### Large Language Models

- [GemSUra](https://huggingface.co/collections/ura-hcmut/gemsura-65da96cd27be2e8c65f17131): Pretrained Large Language Models based on Gemma built by URA (HCMUT).

- [Ghost-7b](https://huggingface.co/lamhieu/ghost-7b-v0.9.0): Fine-tuned from HuggingFaceH4/zephyr-7b-beta on a small synthetic dataset (about 200MB) that is 50% English and 50% Vietnamese.

- [PhoGPT](https://github.com/VinAIResearch/PhoGPT): An open-source, state-of-the-art 7.5B-parameter generative model series for Vietnamese, comprising the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant PhoGPT-7B5-Instruct.

- [Sailor](https://huggingface.co/collections/sail/sailor-language-models-65e19a749f978976f1959825): Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.

- [SeaLLM](https://huggingface.co/collections/SeaLLMs/seallms-65be16f92e67686440ae29f3): The state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭.

- [ToRoLaMa](https://github.com/allbyai/ToRoLaMa): The Vietnamese Instruction-Following and Chat Model.

- [Vistral-7B-Chat-function-calling](https://huggingface.co/hiieu/Vistral-7B-Chat-function-calling): This model was fine-tuned on Vistral-7B-chat for function calling.

- [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat): Towards a State-of-the-Art Large Language Model for Vietnamese

- [ViGPTQA](https://github.com/DopikAI-Labs/ViGPT): LLMs for Vietnamese Question Answering

- [VBD-LLaMA2-Chat](https://huggingface.co/LR-AI-Labs/vbd-llama2-7B-50b-chat): A Conversationally-tuned LLaMA2 for Vietnamese.

- [Vietnamese LLaMA 2](https://github.com/bkai-research/Vietnamese-LLaMA-2): A 7B version of LLaMA 2 with 140GB of Vietnamese text by BKAI Foundation Models Lab.

- [VinaLLaMA](https://huggingface.co/collections/vilm/vinallama-654a099308775ce78e630a6f): Another collection of Vietnamese LLaMA-tuned models.

- [Vietcuna](https://github.com/vilm-ai/vietcuna): A series of Vicuna tuned models for Vietnamese.

- [Llama2_vietnamese](https://github.com/ngoanpv/llama2_vietnamese): A fine-tuned Large Language Model (LLM) for the Vietnamese language based on the Llama 2 model.

- [Vietnamese_LLMs](https://github.com/VietnamAIHub/Vietnamese_LLMs): This project aims to create high-quality Vietnamese instruction datasets and tune several open-source large language models (LLMs). So far, it has released various models, including LLaMA and BLOOMZ, as well as five instruction datasets, most of which were generated by GPT-4.

### Corpus

> For more recent updates, you can consider searching for datasets that include Vietnamese on HuggingFace here: https://huggingface.co/datasets?language=language:vi&sort=trending

- [Math Instruction datasets](https://huggingface.co/collections/5CD-AI/math-instruction-datasets-660801f244a011983be58fe0): A series of translated datasets by 5CD AI Team.

- [LLaVA - Visual Question Answering](https://huggingface.co/collections/5CD-AI/llava-visual-question-answering-6608019995db9114e35b1fb9): A series of translated datasets by 5CD AI Team.

- [CoT Instruction datasets](https://huggingface.co/collections/5CD-AI/cot-instruction-datasets-660800b52e58edd19eafe7e6): A series of translated datasets by 5CD AI Team.

- [DPO Instruction datasets](https://huggingface.co/collections/5CD-AI/dpo-instruction-datasets-6608026d80f057ee616e8bf5): A series of translated datasets by 5CD AI Team.

- [Retrieve-Rerank datasets](https://huggingface.co/collections/5CD-AI/retrieve-rerank-datasets-6660436222834190f7f26c0d): A series of translated datasets by 5CD AI Team.

- [Coding Instruction datasets](https://huggingface.co/collections/5CD-AI/coding-instruction-datasets-666fde69ad3050dd2bc67e6a): A series of translated datasets by 5CD AI Team.

- [Chat Instruction datasets](https://huggingface.co/collections/5CD-AI/chat-instruction-datasets-666fdf510de9ee884ba43026): A series of translated datasets by 5CD AI Team.

- [VN News Corpus](https://github.com/binhvq/news-corpus): 50GB of uncompressed text crawled from a wide range of news websites and topics.

- [10000 Vietnamese Books](https://www.kaggle.com/datasets/iambestfeeder/10000-vietnamese-books): 10000 Vietnamese Books from 195x.

- [CulturaX](https://huggingface.co/papers/2309.09400): A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

- [Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X): The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages.

- [OSCAR](https://oscar-corpus.com/): 68GB of text data with 12,036,845,359 words.

- [Common Crawl](https://commoncrawl.org/): Open repository of web crawl data.

- [WikiDumps](https://dumps.wikimedia.org/): You can download directly or use scripts from [viwik18](https://github.com/NTT123/viwik18), [viwik19](https://github.com/NTT123/viwik19).

- [Vietnamese Treebank](https://vlsp.hpda.vn/demo/?page=resources): VLSP Project.

- [Vietnamese Stopwords](https://github.com/stopwords/vietnamese-stopwords):  Vietnamese stopwords.

- [Vietnamese Dictionary](https://www.informatik.uni-leipzig.de/~duc/Dict/): Vietnamese dictionary.

- [vietnamese-wordnet](https://github.com/zeloru/vietnamese-wordnet): Vietnamese wordnet.

- [VietnameseWAC](https://xltiengviet.fandom.com/wiki/VietnameseWAC): The dataset comprises a substantial collection of Vietnamese text, consisting of 129,781,089 tokens and 106,464,835 words, which have been automatically segmented and labeled as per Kilgarriff, A., and Le-Hong, P., 2012.

- [Vietlex Corpus](https://www.vietlex.com/help/about_corpus.htm): Vietlex's Vietnamese Corpus, a pioneering effort in Vietnam since 1998, contains about 80 million syllables from various sources.

- [Lexical Database of Vietnamese](https://era.library.ualberta.ca/items/90d5b06c-e508-45b3-8526-3509bceb930e): A lexical database of Vietnamese containing various lexical information derived from two Vietnamese corpora.

### Text Processing Toolkit

- [coccoc-tokenizer](https://github.com/coccoc/coccoc-tokenizer): High performance tokenizer for Vietnamese language. It is written in C++ with Python and Java bindings.

- [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter):  Fast and accurate Vietnamese word segmenter (LREC 2018).

- [RDRPOSTagger](https://github.com/datquocnguyen/RDRPOSTagger): Fast and accurate POS and morphological tagging toolkit (EACL 2014).

- [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP):  A Vietnamese natural language processing toolkit (NAACL 2018).

- [vlp-tok](https://github.com/phuonglh/vlp): Vietnamese text processing library developed in the Scala programming language.

- [ETNLP](https://github.com/vietnlp/etnlp): A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.

- [VietnameseTextNormalizer](https://github.com/langmaninternet/VietnameseTextNormalizer): Vietnamese Text Normalizer.

- [nnvlp](https://github.com/pth1993/NNVLP): Neural network-based Vietnamese language processing toolkit.

- [jPTDP](https://github.com/datquocnguyen/jPTDP):  Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).

- [vi_spacy](https://github.com/trungtv/vi_spacy): Vietnamese language model compatible with spaCy.

- [underthesea](https://underthesea.readthedocs.io/en/latest/readme.html): Underthesea - Vietnamese NLP toolkit.

- [vnlp](https://bitbucket.org/epilab/vnlp/wiki/Home): GATE plugin for Vietnamese language processing.

- [pyvi](https://github.com/trungtv/pyvi): Python Vietnamese toolkit.

- [JVnTextPro](http://jvntextpro.sourceforge.net/): Java-based Vietnamese text processing tool.

- [DongDu](https://github.com/rockkhuya/DongDu): C++ implementation of Vietnamese word segmentation tool.

- [VLSP Toolkit](https://vlsp.hpda.vn/demo/?page=resources): Vietnamese tokenizer from VLSP.

- [vTools](https://github.com/lupanh/vTools): Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.

- [JNSP](http://jnsp.sourceforge.net/): Java Implementation of Ngram Statistic Package.
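
Several of the tools above perform Vietnamese word segmentation, that is, grouping whitespace-separated syllables into multi-syllable words. As a rough illustration of the task only, here is a minimal sketch of greedy longest-match segmentation. The function name `segment`, the tiny `LEXICON`, and the underscore-joined output are assumptions made for this example; none of the listed toolkits is claimed to work this way, and real segmenters use trained models rather than a fixed word list.

```python
import unicodedata


def segment(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation over a list of syllables.

    `lexicon` is expected to hold lowercase, NFC-normalized words whose
    syllables are separated by single spaces (an assumption of this sketch).
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            chunk = syllables[i:i + n]
            candidate = unicodedata.normalize("NFC", " ".join(chunk)).lower()
            if n == 1 or candidate in lexicon:
                words.append("_".join(chunk))
                i += n
                break
    return words


# Hypothetical toy lexicon of multi-syllable words, normalized to NFC.
LEXICON = {
    unicodedata.normalize("NFC", w)
    for w in ("xử lý", "ngôn ngữ", "tự nhiên", "thú vị")
}

sentence = "Xử lý ngôn ngữ tự nhiên rất thú vị"
print(" ".join(segment(sentence.split(), LEXICON)))
```

Normalizing to NFC before lookup matters for Vietnamese because the same accented syllable can be encoded with precomposed or combining characters.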

### Pre-trained Language Model

- [RoBERTa Vietnamese](https://github.com/nguyenvulebinh/vietnamese-roberta): Pre-trained embedding using RoBERTa architecture on Vietnamese corpus.

- [PhoBERT](https://github.com/VinAIResearch/PhoBERT): Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).

- [ALBERT for Vietnamese](https://github.com/ngoanpv/albert_vi): "A Lite" version of BERT for Vietnamese.

- [Vietnamese ELECTRA](https://github.com/nguyenvulebinh/vietnamese-electra):  Electra pre-trained model using Vietnamese corpus.

- [word2vecVN](https://github.com/sonvx/word2vecVN): Pre-trained Word2Vec models for Vietnamese.

### Sentiment Analysis

#### Benchmark

- **[VLSP 2016 Shared Task: Sentiment Analysis](https://vlsp.org.vn/vlsp2016/eval/sa)**

    - Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).

    - Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

        | Model | F1 | Paper | Code |
        |-------|----|-------|------|
        | Perceptron/SVM/Maxent | 80.05 | DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews | |
        | SVM/MLNN/LSTM | 71.44 | A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016 | |
        | Ensemble: Random forest, SVM, Naive Bayes | 71.22 | A Lightweight Ensemble Method for Sentiment Classification Task | |
        | Ensemble: SVM, LR, LSTM, CNN | 69.71 | An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis | |
        | SVM | 67.54 | Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments | |
        | SVM/MLNN | 67.23 | A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign | |
        | Multi-channel LSTM-CNN | 59.61 | Multi-channel LSTM-CNN model for Vietnamese sentiment analysis | [official](https://github.com/ntienhuy/MultiChannel) |

- **[VLSP 2018 Shared Task: Aspect Based Sentiment Analysis](https://vlsp.org.vn/vlsp2018)**

    - **Restaurant Dataset**: 2961 reviews (train), 1290 reviews (development), 500 reviews (test).

    

        | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
        |---|---|---|---|---|
        | CNN | 0.80 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
        | SVM | 0.77 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
        | SVM | 0.54 | 0.48 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |

    

    - **Hotel Dataset**: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

        

        | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
        |---|---|---|---|---|
        | SVM | 0.70 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
        | CNN | 0.69 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
        | SVM | 0.56 | 0.53 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |

- **[Vietnamese Student's Feedback Corpus (UIT-VSFC)](https://ieeexplore.ieee.org/document/8573337)**

    - UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

    

        | Model | Sentiment (F1) | Topic (F1) | Paper | Code |
        |-------|----------------|------------|-------|------|
        | Bi-LSTM/Word2Vec | 0.896 | 0.92 | Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus | |
        | Maximum Entropy Classifier | 0.88 | 0.84 | UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis | |
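
The sentiment benchmarks above report F1 scores. The list does not say which averaging each paper uses, and the tables mix percentage and fractional scales, so the following is only a minimal sketch of one common choice, macro-averaged F1 over class labels. The function name `macro_f1` and the toy labels are made up for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three sentiment classes (illustrative data only).
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
print(round(macro_f1(gold, pred), 3))
```

When comparing against a published number, check the paper for the exact averaging (macro, micro, or weighted) before drawing conclusions.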

### Named Entity Recognition

#### Benchmark

- **[VLSP 2016 Shared Task: Named Entity Recognition](http://vjs.ac.vn/index.php/jcc/article/view/13161/103810382796)**

    | Model | F1 | Paper | Code |
    |-------|----|-------|------|
    | PhoBERT_large | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | [official](https://github.com/VinAIResearch/PhoBERT) |
    | vELECTRA + BiLSTM + Attention | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | PhoBERT_base | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | [official](https://github.com/VinAIResearch/PhoBERT) |
    | XLM-R | 92.0 | PhoBERT: Pre-trained language models for Vietnamese | |
    | VnCoreNLP-NER + ETNLP | 91.3 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | BiLSTM-CNN-CRF + ETNLP | 91.1 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | VNER: Attentive Neural Network | 89.6 | Attentive Neural Network for Named Entity Recognition in Vietnamese | |
    | BiLSTM-CNN-CRF | 88.3 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | [official](https://github.com/vncorenlp/VnCoreNLP) |
    | LSTM + CRF | 66.07 | An investigation of Vietnamese Nested Entity Recognition Models | |

- **[VLSP 2018 Shared Task: Named Entity Recognition](https://www.researchgate.net/publication/331956361_VLSP_Shared_Task_Named_Entity_Recognition)**

    | Model | F1 | Paper | Code |
    |-------|----|-------|------|
    | vELECTRA + BiGRU | 90.31 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | VIETNER: CRF (ngrams + word shapes + cluster + w2v) | 76.63 | A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign | |
    | ZA-NER | 74.70 | ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign | |
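
NER F1 is usually computed over whole entity spans rather than individual tokens. The sketch below shows that idea for flat BIO tags, where a predicted entity counts only if its boundaries and type both match. It is an illustrative assumption, not the official VLSP scorer: the helper names `extract_spans` and `span_f1` are invented here, and nested entities, which some of the papers above address, are not handled.

```python
def extract_spans(tags):
    """Collect (start, end, type) spans from a flat BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == etype
        if continues:
            continue
        if start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag != "O":
            # "B-X" opens a span; a stray "I-X" is also treated as an
            # opening tag (a lenient choice made for this sketch).
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1 with exact boundary and type matching."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy example (illustrative tags only): one of two gold entities is found.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
print(round(span_f1(gold_tags, pred_tags), 3))
```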

    

### Speech Processing

#### Corpus

- VLSP 2020 - ASR challenge - training set: [announcement](https://institute.vinbigdata.org/events/vinbigdata-chia-se-100-gio-du-lieu-tieng-noi-cho-cong-dong/), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/vlsp2020_vinai_100h)

- VIVOS: [official link](http://ailab.hcmus.edu.vn/vivos), [mirror link on huggingface](https://huggingface.co/datasets/vivos)

- Bud500: [announcement](https://github.com/quocanh34/Bud500), [mirror link on huggingface](https://huggingface.co/datasets/linhtran92/viet_bud500)

- FOSD (FPT open speech dataset): [official link](https://data.mendeley.com/datasets/k9sxg2twv4/4), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/fpt_fosd)

- LSVSC (Large-scale Vietnamese speech corpus): [announcement](https://www.mdpi.com/2079-9292/13/5/977), [unofficial mirror link on huggingface](https://huggingface.co/datasets/doof-ferb/LSVSC)

- Infore: [official link](https://www.facebook.com/groups/j2team.community/permalink/1010834009248719/), [unofficial mirror link for dataset 1 on huggingface](https://huggingface.co/datasets/doof-ferb/infore1_25hours), [unofficial mirror link for dataset 2 on huggingface](https://huggingface.co/datasets/doof-ferb/infore2_audiobooks)

- [unofficial mirror link Vivos + InfoRe 1 + InfoRe 2](https://github.com/TensorSpeech/TensorFlowASR/blob/main/README.md#vietnamese)

- [VietTTS-v1](https://github.com/NTT123/Vietnamese-Text-To-Speech-Dataset): A synthesized dataset for Vietnamese TTS task (35.1 hrs)

- [Mozilla CommonVoice](https://commonvoice.mozilla.org/vi/datasets)

- [Google FLEURS](https://huggingface.co/datasets/google/fleurs)

#### Project

- [vietTTS](https://github.com/NTT123/vietTTS): Tacotron + HiFiGAN vocoder for Vietnamese datasets.