# Pre-trained embedding using RoBERTa architecture on Vietnamese corpus

## Overview

[RoBERTa](https://arxiv.org/abs/1907.11692) is an improved recipe for training BERT models that can match or exceed the performance of all post-BERT methods. The differences between RoBERTa and BERT are:

- Training the model longer, with bigger batches, over more data.
- Removing the next sentence prediction objective.
- Training on longer sequences.
- Dynamically changing the masking pattern applied to the training data.
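
To make the last point concrete, here is a toy sketch of dynamic masking (purely illustrative, not the fairseq implementation; the tiny vocabulary and the 80/10/10 masking probabilities are just the usual BERT-style defaults):

```python
import random

MASK = "<mask>"
VOCAB = ["Bách", "Khoa", "Hà", "Nội", "học"]  # toy vocabulary, for illustration only

def dynamic_mask(tokens, mask_prob=0.15):
    """Re-sample the masked positions every time a sequence is drawn (RoBERTa),
    instead of fixing them once during preprocessing (original BERT)."""
    out = list(tokens)
    for i in range(len(out)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:      # 80%: replace with the mask token
                out[i] = MASK
            elif r < 0.9:    # 10%: replace with a random token
                out[i] = random.choice(VOCAB)
            # remaining 10%: keep the original token
    return out

# Each pass over the data sees a different masking pattern for the same sentence.
sentence = "Đại học Bách Khoa Hà Nội".split()
for _ in range(3):
    print(dynamic_mask(sentence))
```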

The data used to train this model is a Vietnamese corpus crawled from many online newspapers: 50GB of text with approximately 7.7 billion words, collected from many domains on the internet including news, law, entertainment, Wikipedia, and so on. The data was cleaned using the [visen](https://github.com/nguyenvulebinh/visen) library and tokenized using [SentencePiece](https://github.com/google/sentencepiece). For the [envibert](https://bit.ly/envibert) model, we use another 50GB of English text, so a total of 100GB of text is used to train the envibert model.
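
For illustration, a minimal sketch of the tokenizer-training step with SentencePiece (the file names and vocabulary size below are hypothetical, not taken from this repo; the released checkpoints already ship with a trained `sentencepiece.bpe.model`):

```python
import sentencepiece as spm

# Train a BPE model on a cleaned corpus (one sentence per line); file name is hypothetical.
spm.SentencePieceTrainer.train(
    input='corpus_cleaned.txt',
    model_prefix='sentencepiece.bpe',
    vocab_size=32000,          # illustrative size, not the repo's actual setting
    model_type='bpe',
)

# Load the trained model and tokenize a sentence.
sp = spm.SentencePieceProcessor(model_file='sentencepiece.bpe.model')
print(sp.encode('Đại học Bách Khoa Hà Nội', out_type=str))
```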

## Prepare environment

- Download the models using the following links: [envibert model](https://bit.ly/envibert), [cased model](https://bit.ly/vibert-cased), [uncased model](https://bit.ly/vibert-uncased), and put them in the `model-bin` folder with the following structure:

```text
model-bin
├── envibert
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
├── uncased
│   ├── dict.txt
│   ├── model.pt
│   └── sentencepiece.bpe.model
└── cased
    ├── dict.txt
    ├── model.pt
    └── sentencepiece.bpe.model
```

- Install the required libraries:
```bash
pip install -r requirements.txt
```

## Example usage

### Load [envibert](https://bit.ly/envibert) model with Huggingface

```python
from transformers import RobertaModel
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os

cache_dir='./cache'
model_name='nguyenvulebinh/envibert'

def download_tokenizer_files():
    resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
    for item in resources:
        if not os.path.exists(os.path.join(cache_dir, item)):
            tmp_file = hf_bucket_url(model_name, filename=item)
            tmp_file = cached_path(tmp_file, cache_dir=cache_dir)
            os.rename(tmp_file, os.path.join(cache_dir, item))

download_tokenizer_files()
tokenizer = SourceFileLoader("envibert.tokenizer", os.path.join(cache_dir,'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name,cache_dir=cache_dir)

# Encode text
text_input = 'Đại học Bách Khoa Hà Nội .'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
# tensor([[ 0, 705, 131, 8751, 2878, 347, 477, 5, 2]])

# Extract features
text_features = model(text_ids)
text_features['last_hidden_state'].shape
# torch.Size([1, 9, 768])
len(text_features['hidden_states'])
# 7
```
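
As a usage example (an illustrative pooling choice, not something the repo prescribes), the token features above can be mean-pooled into a single sentence vector:

```python
# Continuing from the snippet above: mean-pool the last hidden state over the
# token dimension to get one fixed-size vector per sentence.
sentence_embedding = text_features['last_hidden_state'].mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```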

### Load RoBERTa model with fairseq

```python
from fairseq.models.roberta import XLMRModel

# Using the envibert model (switch to ./model-bin/cased/ or ./model-bin/uncased/ for the other versions)
pretrained_path = './model-bin/envibert/'

# Load the RoBERTa model. This also loads the sentencepiece model.
roberta = XLMRModel.from_pretrained(pretrained_path, checkpoint_file='model.pt')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
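
Since the returned hub interface is a regular `torch.nn.Module`, standard PyTorch calls apply; for example (a generic usage note, not specific to this repo), the model can be moved to GPU before extracting features:

```python
import torch

# Move the model to GPU if one is available; later encode/extract_features calls
# place their tensors on the same device as the model.
if torch.cuda.is_available():
    roberta = roberta.cuda()
```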

### Extract features from RoBERTa

```python
text_input = 'Đại học Bách Khoa Hà Nội.'
# Encode using roberta class
tokens_ids = roberta.encode(text_input)
# assert tokens_ids.tolist() == [0, 451, 71, 3401, 1384, 168, 234, 5, 2]
# Extracted feature using roberta model
tokens_embed = roberta.extract_features(tokens_ids)
# assert tokens_embed.shape == (1, 9, 512)
```
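
fairseq's `extract_features` also accepts a `return_all_hiddens` flag, which returns the output of every layer; with 6 transformer layers plus the embedding layer, 7 entries would be expected here (a small sketch building on the snippet above):

```python
# Outputs of every layer, from the embedding layer up to the last transformer layer.
all_layers = roberta.extract_features(tokens_ids, return_all_hiddens=True)
print(len(all_layers))        # expected: 7 (embeddings + 6 layers)
print(all_layers[-1].shape)   # final layer output, same shape as tokens_embed above
```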

### Filling masks

RoBERTa can be used to fill `<mask>` tokens in the input.

```python
masked_line = 'Đại học <mask> Khoa Hà Nội'
roberta.fill_mask(masked_line, topk=5)

#('Đại học Bách Khoa Hà Nội', 0.9954977035522461, ' Bách'),
#('Đại học Y Khoa Hà Nội', 0.001166337518952787, ' Y'),
#('Đại học Đa Khoa Hà Nội', 0.0005696234875358641, ' Đa'),
#('Đại học Văn Khoa Hà Nội', 0.000467598409159109, ' Văn'),
#('Đại học Anh Khoa Hà Nội', 0.00035955727798864245, ' Anh')
```
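
`fill_mask` returns a list of `(filled sentence, score, predicted token)` tuples, so the top completion can also be picked programmatically:

```python
# Take the highest-scoring completion from the candidates above.
predictions = roberta.fill_mask(masked_line, topk=5)
best_sentence, best_score, best_token = predictions[0]
print(best_sentence, best_score)
```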

## Model detail

This model is a custom version of RoBERTa with fewer hidden layers (6 layers). There are three versions: **envibert** (case-sensitive dictionary covering two languages, Vietnamese and English), **cased** (case-sensitive dictionary), and **uncased** (all words lowercased).
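
A quick way to check the reduced depth is to inspect the config of the Huggingface model loaded earlier (assuming it exposes the standard `RobertaConfig` fields):

```python
# Sanity check on the custom architecture described above.
print(model.config.num_hidden_layers)  # expected: 6
print(model.config.hidden_size)        # 768 for envibert, per the example output above
```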

## Training model

To train this model, please follow the instructions in this [repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).

## Citation

```text
@inproceedings{nguyen20d_interspeech,
  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4263--4267},
  doi={10.21437/Interspeech.2020-1896}
}
```
**Please CITE** our repo when it is used to help produce published results or is incorporated into other software.

## Contact

[email protected]

[![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)