{"id":13564263,"url":"https://github.com/nguyenvulebinh/vietnamese-roberta","last_synced_at":"2025-12-30T02:02:03.476Z","repository":{"id":94092239,"uuid":"261631774","full_name":"nguyenvulebinh/vietnamese-roberta","owner":"nguyenvulebinh","description":"A Robustly Optimized BERT Pretraining Approach for Vietnamese","archived":false,"fork":false,"pushed_at":"2024-07-25T11:01:48.000Z","size":12,"stargazers_count":30,"open_issues_count":1,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-04T17:47:19.339Z","etag":null,"topics":["bert","bert-embeddings","fairseq","natural-language-processing","pretrained-models","pytorch","roberta","sentencepiece","transformer","vietnamese","vietnamese-nlp"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nguyenvulebinh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-06T02:24:40.000Z","updated_at":"2024-09-25T02:58:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"85d80bef-8815-488a-aa4d-07f313c290b9","html_url":"https://github.com/nguyenvulebinh/vietnamese-roberta","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenvulebinh%2Fvietnamese-roberta","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenvulebinh%2Fvietnamese-roberta/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenvulebinh%2Fvietnamese-rober
ta/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenvulebinh%2Fvietnamese-roberta/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nguyenvulebinh","download_url":"https://codeload.github.com/nguyenvulebinh/vietnamese-roberta/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247082871,"owners_count":20880732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-embeddings","fairseq","natural-language-processing","pretrained-models","pytorch","roberta","sentencepiece","transformer","vietnamese","vietnamese-nlp"],"created_at":"2024-08-01T13:01:28.925Z","updated_at":"2025-12-30T02:01:58.437Z","avatar_url":"https://github.com/nguyenvulebinh.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Pre-trained embedding using RoBERTa architecture on Vietnamese corpus\n\n## Overview\n\n[RoBERTa](https://arxiv.org/abs/1907.11692) is an improved recipe for training BERT models that can match or exceed the performance of all of the post-BERT methods. 
The differences between RoBERTa and BERT:\n\n- Training the model longer, with bigger batches, over more data.\n- Removing the next sentence prediction objective.\n- Training on longer sequences.\n- Dynamically changing the masking pattern applied to the training data.\n\nThe training data is a Vietnamese corpus crawled from many online newspapers and other websites: 50GB of text with approximately 7.7 billion words, covering domains including news, law, entertainment, Wikipedia, and so on. The data was cleaned using the [visen](https://github.com/nguyenvulebinh/visen) library and tokenized using [SentencePiece](https://github.com/google/sentencepiece). For the [envibert](https://bit.ly/envibert) model, we used another 50GB of English text, so a total of 100GB of text was used to train envibert.\n\n## Prepare environment\n\n- Download the models using the following links: [envibert model](https://bit.ly/envibert), [cased model](https://bit.ly/vibert-cased), [uncased model](https://bit.ly/vibert-uncased), and put them in the folder model-bin with the following structure:\n\n```text\nmodel-bin\n├── envibert\n│   ├── dict.txt\n│   ├── model.pt\n│   └── sentencepiece.bpe.model\n├── uncased\n│   ├── dict.txt\n│   ├── model.pt\n│   └── sentencepiece.bpe.model\n└── cased\n    ├── dict.txt\n    ├── model.pt\n    └── sentencepiece.bpe.model\n\n```\n\n- Install the required libraries\n```bash\npip install -r requirements.txt\n```\n\n## Example usage\n\n### Load [envibert](https://bit.ly/envibert) model with Hugging Face\n\n```python\nfrom transformers import RobertaModel\nfrom transformers.file_utils import cached_path, hf_bucket_url\nfrom importlib.machinery import SourceFileLoader\nimport os\n\ncache_dir='./cache'\nmodel_name='nguyenvulebinh/envibert'\n\ndef download_tokenizer_files():\n  resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']\n  for item in resources:\n    if not os.path.exists(os.path.join(cache_dir, item)):\n      tmp_file = 
hf_bucket_url(model_name, filename=item)\n      tmp_file = cached_path(tmp_file,cache_dir=cache_dir)\n      os.rename(tmp_file, os.path.join(cache_dir, item))\n      \ndownload_tokenizer_files()\ntokenizer = SourceFileLoader(\"envibert.tokenizer\", os.path.join(cache_dir,'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)\nmodel = RobertaModel.from_pretrained(model_name,cache_dir=cache_dir)\n\n# Encode text\ntext_input = 'Đại học Bách Khoa Hà Nội .'\ntext_ids = tokenizer(text_input, return_tensors='pt').input_ids\n# tensor([[   0,  705,  131, 8751, 2878,  347,  477,    5,    2]])\n\n# Extract features\ntext_features = model(text_ids)\ntext_features['last_hidden_state'].shape\n# torch.Size([1, 9, 768])\nlen(text_features['hidden_states'])\n# 7\n```\n\n### Load RoBERTa model\n\n```python\nfrom fairseq.models.roberta import XLMRModel\n\n# Using the envibert model; point this at ./model-bin/cased/ or ./model-bin/uncased/ for the other models\npretrained_path = './model-bin/envibert/'\n\n# Load the RoBERTa model; this also loads the SentencePiece model\nroberta = XLMRModel.from_pretrained(pretrained_path, checkpoint_file='model.pt')\nroberta.eval()  # disable dropout (or leave in train mode to finetune)\n```\n\n### Extract features from RoBERTa\n\n```python\ntext_input = 'Đại học Bách Khoa Hà Nội.'\n# Encode using the roberta object\ntokens_ids = roberta.encode(text_input)\n# assert tokens_ids.tolist() == [0, 451, 71, 3401, 1384, 168, 234, 5, 2]\n# Extract features using the roberta model\ntokens_embed = roberta.extract_features(tokens_ids)\n# assert tokens_embed.shape == (1, 9, 512)\n```\n\n### Filling masks\n\nRoBERTa can be used to fill \\\u003cmask\\\u003e tokens in the input.\n\n```python\nmasked_line = 'Đại học \u003cmask\u003e Khoa Hà Nội'\nroberta.fill_mask(masked_line, topk=5)\n\n#('Đại học Bách Khoa Hà Nội', 0.9954977035522461, ' Bách'),\n#('Đại học Y Khoa Hà Nội', 0.001166337518952787, ' Y'),\n#('Đại học Đa Khoa Hà Nội', 0.0005696234875358641, ' Đa'),\n#('Đại học Văn Khoa Hà Nội', 0.000467598409159109, ' Văn'),\n#('Đại học Anh 
Khoa Hà Nội', 0.00035955727798864245, ' Anh')\n```\n\n## Model detail\n\nThis model is a custom version of RoBERTa with fewer hidden layers (6 layers). There are three versions: **envibert** (case-sensitive dictionary covering two languages), **cased** (case-sensitive dictionary), and **uncased** (all words are lowercased).\n\n\n## Training model\n\nTo train this model, please follow the instructions in this [repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).\n\n## Citation\n\n```text\n@inproceedings{nguyen20d_interspeech,\n  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},\n  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},\n  year=2020,\n  booktitle={Proc. Interspeech 2020},\n  pages={4263--4267},\n  doi={10.21437/Interspeech.2020-1896}\n}\n```\n**Please CITE** our repo when it is used to help produce published results or is incorporated into other software.\n\n## Contact \n\nnguyenvulebinh@gmail.com\n\n[![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnguyenvulebinh%2Fvietnamese-roberta","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnguyenvulebinh%2Fvietnamese-roberta","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnguyenvulebinh%2Fvietnamese-roberta/lists"}