{"id":31302658,"url":"https://github.com/VinAIResearch/BARTpho","last_synced_at":"2025-09-25T02:22:38.869Z","repository":{"id":44847667,"uuid":"408560254","full_name":"VinAIResearch/BARTpho","owner":"VinAIResearch","description":"BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (INTERSPEECH 2022)","archived":false,"fork":false,"pushed_at":"2024-07-22T13:48:45.000Z","size":708,"stargazers_count":95,"open_issues_count":0,"forks_count":7,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-07-22T16:32:23.911Z","etag":null,"topics":["bart","bartpho","pretrained-models","sequence-to-sequence","text-summarization","vietnamese-nlp"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VinAIResearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-20T18:38:19.000Z","updated_at":"2024-07-22T13:48:49.000Z","dependencies_parsed_at":"2024-07-22T16:21:25.981Z","dependency_job_id":null,"html_url":"https://github.com/VinAIResearch/BARTpho","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/VinAIResearch/BARTpho","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FBARTpho","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FBARTpho/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FBARTpho/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vi
nAIResearch%2FBARTpho/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VinAIResearch","download_url":"https://codeload.github.com/VinAIResearch/BARTpho/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VinAIResearch%2FBARTpho/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276847277,"owners_count":25715039,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-25T02:00:09.612Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bart","bartpho","pretrained-models","sequence-to-sequence","text-summarization","vietnamese-nlp"],"created_at":"2025-09-25T02:22:36.424Z","updated_at":"2025-09-25T02:22:38.860Z","avatar_url":"https://github.com/VinAIResearch.png","language":null,"funding_links":[],"categories":["PyTorch Models","NLP per Language"],"sub_categories":["Natural Language Processing","Models and Embeddings"],"readme":"#### Table of contents\n1. [Introduction](#introduction)\n2. [Using BARTpho with `transformers`](#transformers)\n3. [Using BARTpho with `fairseq`](#fairseq)\n4. 
[Notes](#notes)\n\n# \u003ca name=\"introduction\"\u003e\u003c/a\u003e BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese\n\n\n\u003e We present BARTpho with two versions, BARTpho-syllable and BARTpho-word, which are the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the \"large\" architecture and the pre-training scheme of the sequence-to-sequence denoising autoencoder BART, thus it is especially suitable for generative NLP tasks. We conduct experiments to compare our BARTpho with its competitor mBART on a downstream task of Vietnamese text summarization and show that: in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We further evaluate and compare BARTpho and mBART on the Vietnamese capitalization and punctuation restoration tasks and also find that BARTpho is more effective than mBART on these two tasks.\n\nThe general architecture and experimental results of BARTpho can be found in our [paper](https://arxiv.org/abs/2109.09701):\n\n\t@inproceedings{bartpho,\n\t    title     = {{BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese}},\n\t    author    = {Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},\n\t    booktitle = {Proceedings of the 23rd Annual Conference of the International Speech Communication Association},\n\t    year      = {2022}\n\t}\n\n**Please CITE** our paper when BARTpho is used to help produce published results or incorporated into other software.\n\n\n## \u003ca name=\"transformers\"\u003e\u003c/a\u003e Using BARTpho in [`transformers`](https://github.com/huggingface/transformers)\n\n### Installation\n- Install `transformers` with pip: `pip install transformers`, or [install `transformers` from source](https://huggingface.co/docs/transformers/installation#installing-from-source).  
\u003cbr /\u003e\nNote that a slow tokenizer for BARTpho has been merged into the main `transformers` branch. Merging a fast tokenizer for BARTpho is still under discussion, as detailed in [this pull request](https://github.com/huggingface/transformers/pull/17254). To use the fast tokenizer, install `transformers` as follows:\n\n```\ngit clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git\ncd transformers\npip install -e .\n```\n\n- Install `sentencepiece` and `tokenizers` with pip: `pip install sentencepiece tokenizers`\n\n### Pre-trained models\n\nModel | #params | Arch. | Max length | Input text\n---|---|---|---|---\n[`vinai/bartpho-syllable-base`](https://huggingface.co/vinai/bartpho-syllable-base) | 132M | base | 1024 | Syllable level\n[`vinai/bartpho-syllable`](https://huggingface.co/vinai/bartpho-syllable) | 396M | large | 1024 | Syllable level\n[`vinai/bartpho-word-base`](https://huggingface.co/vinai/bartpho-word-base) | 150M | base | 1024 | Word level\n[`vinai/bartpho-word`](https://huggingface.co/vinai/bartpho-word) | 420M | large | 1024 | Word level\n\n### Example usage\n\n```python3\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\n\n#BARTpho-syllable\nsyllable_tokenizer = AutoTokenizer.from_pretrained(\"vinai/bartpho-syllable\")\nbartpho_syllable = AutoModel.from_pretrained(\"vinai/bartpho-syllable\")\nTXT = 'Chúng tôi là những nghiên cứu viên.'  \ninput_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']\nfeatures = bartpho_syllable(input_ids)\n\n#BARTpho-word\nword_tokenizer = AutoTokenizer.from_pretrained(\"vinai/bartpho-word\")\nbartpho_word = AutoModel.from_pretrained(\"vinai/bartpho-word\")\nTXT = 'Chúng_tôi là những nghiên_cứu_viên .'  
\ninput_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']\nfeatures = bartpho_word(input_ids)\n```\n\n## \u003ca name=\"fairseq\"\u003e\u003c/a\u003e Using BARTpho in [`fairseq`](https://github.com/pytorch/fairseq)\n\n### Installation\n\nThere is an issue with the `encode` function in the BART `hub_interface`, as discussed in this pull request: [https://github.com/pytorch/fairseq/pull/3905](https://github.com/pytorch/fairseq/pull/3905). While waiting for this pull request's approval, please install `fairseq` as follows:\n\n\tgit clone https://github.com/datquocnguyen/fairseq.git\n\tcd fairseq\n\tpip install --editable ./\n\n### Pre-trained models\n\nModel | #params | Download | Input text\n---|---|---|---\nBARTpho-syllable | 396M | [fairseq-bartpho-syllable.zip](https://drive.google.com/file/d/1iw44DztS03JyVP9IcJx0Jh2q_3Y63oio/view?usp=sharing) | Syllable level\nBARTpho-word | 420M | [fairseq-bartpho-word.zip](https://drive.google.com/file/d/1j23nCYQlqwwFQPpcwiogfZ9VHDHIO0UD/view?usp=sharing) | Word level\n\n- `unzip fairseq-bartpho-syllable.zip`\n- `unzip fairseq-bartpho-word.zip`\n\n### Example usage\n\n```python\nfrom fairseq.models.bart import BARTModel  \n\n#Load BARTpho-syllable model:  \nmodel_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/'  \nspm_model_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/sentence.bpe.model'  \nbartpho_syllable = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='sentencepiece', sentencepiece_model=spm_model_path).eval()\n#Input syllable-level/raw text:  \nsentence = 'Chúng tôi là những nghiên cứu viên.'  
\n#Apply SentencePiece to the input text\ntokenIDs = bartpho_syllable.encode(sentence, add_if_not_exist=False)\n#Extract features from BARTpho-syllable\nlast_layer_features = bartpho_syllable.extract_features(tokenIDs)\n\n#Load BARTpho-word model:  \nmodel_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/'  \nbpe_codes_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/bpe.codes'  \nbartpho_word = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='fastbpe', bpe_codes=bpe_codes_path).eval()\n#Input word-level text:  \nsentence = 'Chúng_tôi là những nghiên_cứu_viên .'  \n#Apply BPE to the input text\ntokenIDs = bartpho_word.encode(sentence, add_if_not_exist=False)\n#Extract features from BARTpho-word\nlast_layer_features = bartpho_word.extract_features(tokenIDs)\n```\n\n## \u003ca name=\"notes\"\u003e\u003c/a\u003e Notes\n\n- Before fine-tuning BARTpho on a downstream task, users should perform Vietnamese tone normalization on the downstream task's data, because the same preprocessing step was applied to the pre-training corpus. A Python script for Vietnamese tone normalization is available [here](https://github.com/VinAIResearch/BARTpho/blob/main/VietnameseToneNormalization.md).\n- For `BARTpho-word`, users should use [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to segment raw input text, because it was used to perform both Vietnamese tone normalization and word segmentation on the pre-training corpus. 
\n\n\n## License\n    \n    MIT License\n\n    Copyright (c) 2021 VinAI\n\n    Permission is hereby granted, free of charge, to any person obtaining a copy\n    of this software and associated documentation files (the \"Software\"), to deal\n    in the Software without restriction, including without limitation the rights\n    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n    copies of the Software, and to permit persons to whom the Software is\n    furnished to do so, subject to the following conditions:\n\n    The above copyright notice and this permission notice shall be included in all\n    copies or substantial portions of the Software.\n\n    THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n    SOFTWARE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVinAIResearch%2FBARTpho","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVinAIResearch%2FBARTpho","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVinAIResearch%2FBARTpho/lists"}