https://github.com/neuralmind-ai/portuguese-bert

Portuguese pre-trained BERT models

Host: GitHub
URL: https://github.com/neuralmind-ai/portuguese-bert
Owner: neuralmind-ai
License: other
Created: 2020-01-14T22:56:00.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-06-16T11:30:25.000Z (about 4 years ago)
Last Synced: 2024-03-22T03:10:32.553Z (over 2 years ago)
Topics: bert, bert-model, deep-learning, natural-language-processing, nlp-resources, portuguese
Language: Python
Size: 927 KB
Stars: 765
Watchers: 53
Forks: 117
Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-nlp - BERTimbau - BERT for Brazilian Portuguese. (NLP per Language / Models)

README

          
# BERTimbau - Portuguese BERT

This repository contains pre-trained [BERT](https://github.com/google-research/bert) models for the Portuguese language. The BERT-Base and BERT-Large Cased variants were trained on [BrWaC (Brazilian Web as Corpus)](https://www.researchgate.net/publication/326303825_The_brWaC_Corpus_A_New_Open_Resource_for_Brazilian_Portuguese), a large Portuguese corpus, for 1,000,000 steps using whole-word masking. Model artifacts for TensorFlow and PyTorch can be found below.
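Whole-word masking means that when a word is split into several WordPiece tokens, all of its pieces are masked together rather than independently. The snippet below is a minimal sketch of that idea, not the pre-training code used for these models: the function name `whole_word_mask`, its parameters, and the example tokens are hypothetical, and it skips details of real BERT pre-training such as special-token handling and random or unchanged replacements.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=None, mask_token="[MASK]"):
    """Return a copy of `tokens` with whole words replaced by `mask_token`.

    WordPiece continuation pieces start with "##", so they are grouped
    with the preceding token and masked (or kept) as a single unit.
    """
    rng = random.Random(seed)

    # Group token indices into words: a "##" piece joins the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Decide once per word, then apply the decision to every piece of it.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical WordPiece split, for illustration only.
example = ["O", "Bras", "##il", "é", "gran", "##de"]
print(whole_word_mask(example, mask_prob=0.5, seed=0))
```

Whatever the random draw, `Bras` and `##il` are always masked or kept together, which is the property that distinguishes this scheme from per-token masking.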

The models are a result of an ongoing Master's Program. The [text submission for the Qualifying Exam](qualifying_exam-portuguese_named_entity_recognition_using_bert_crf.pdf) is included in the repository as a PDF. It gives more detail on the pre-training procedure, vocabulary generation, and downstream use in the Named Entity Recognition task.

## Download

The Base and Large models are available on [Hugging Face](https://huggingface.co/neuralmind).

## Evaluation benchmarks

The models were benchmarked on three tasks (Semantic Textual Similarity, Recognizing Textual Entailment, and Named Entity Recognition) and compared to previously published results and to [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md). The metrics are Pearson's correlation for STS and F1-score (in percent) for RTE and NER.

| Task | Test Dataset           | BERTimbau-Large | BERTimbau-Base | mBERT  | Previous SOTA           |
|:----:|:----------------------:|:---------------:|:--------------:|:------:|:-----------------------:|
| STS  | ASSIN2                 |    **0.852**    |     0.836      |  0.809 | 0.83 [[1]](#References) |
| RTE  | ASSIN2                 |    **90.0**     |     89.2       |  86.8  | 88.3 [[1]](#References) |
| NER  | MiniHAREM (5 classes)  |    **83.7**     |     83.1       |  79.2  | 82.3 [[2]](#References) |
| NER  | MiniHAREM (10 classes) |    **78.5**     |     77.6       |  73.1  | 74.6 [[2]](#References) |
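
For readers unfamiliar with the two metrics reported above, here is a minimal plain-Python sketch of how each is computed. It is illustrative only and is not the evaluation code behind the table: the function names are hypothetical, and this README does not say which F1 averaging scheme (micro or macro) was used, so the sketch shows only the basic formula from raw counts.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def f1_score(true_pos, false_pos, false_neg):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing was predicted correctly
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

In an STS setting, `xs` would be the gold similarity scores and `ys` the model's predictions. For NER, the counts would typically be taken over predicted and gold entities.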

### NER experiments code

Code and instructions to reproduce the Named Entity Recognition experiments are in the [`ner_evaluation/`](ner_evaluation/) directory.

## PyTorch usage example

Our PyTorch artifacts are compatible with the [🤗 Hugging Face Transformers](https://github.com/huggingface/transformers) library and are also available as [community models](https://huggingface.co/models):

- [BERTimbau Base model card](https://huggingface.co/neuralmind/bert-base-portuguese-cased)

- [BERTimbau Large model card](https://huggingface.co/neuralmind/bert-large-portuguese-cased)

```python
from transformers import AutoModel, AutoTokenizer

# Using the community model
# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large (reuses the same variable names; load only the variant you need)
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# or, using BertModel and BertTokenizer directly
from transformers import BertModel, BertTokenizer

# The models are cased, so lowercasing must stay disabled
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', do_lower_case=False)
model = BertModel.from_pretrained('path/to/bert_dir')  # Or other BERT model class
```
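
Once a tokenizer and model are loaded as above, a common next step is to turn sentences into fixed-size vectors. The following is a hedged sketch of one way to do that, using mean pooling over the final hidden states. The helper name `embed_sentences` is hypothetical, and mean pooling is an assumption made for illustration; this README does not prescribe a pooling strategy. Running it needs `torch` and `transformers` installed, plus network access on first use to download the weights.

```python
def embed_sentences(sentences, model_name="neuralmind/bert-base-portuguese-cased"):
    """Return one mean-pooled vector per sentence as a torch.Tensor."""
    # Imported here so the heavy dependencies load only when the function runs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic inference

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average over real tokens only; padding positions have attention_mask == 0.
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sentences` with a list of `n` Portuguese sentences should return a tensor of shape `(n, 768)` for the Base model, or `(n, 1024)` if `model_name` points to the Large model.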

## Acknowledgement

We would like to thank Google for Cloud credits under a research grant that allowed us to train these models.

## References

[1] [Multilingual Transformer Ensembles for Portuguese Natural Language Tasks](https://www.researchgate.net/publication/340236502_Multilingual_Transformer_Ensembles_for_Portuguese_Natural_Language_Tasks)

[2] [Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition](https://github.com/jneto04/ner-pt)

## How to cite this work

    @InProceedings{souza2020bertimbau,
        author="Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto",
        editor="Cerri, Ricardo and Prati, Ronaldo C.",
        title="BERTimbau: Pretrained BERT Models for Brazilian Portuguese",
        booktitle="Intelligent Systems",
        year="2020",
        publisher="Springer International Publishing",
        address="Cham",
        pages="403--417",
        isbn="978-3-030-61377-8"
    }

    @article{souza2019portuguese,
        title={Portuguese Named Entity Recognition using BERT-CRF},
        author={Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
        journal={arXiv preprint arXiv:1909.10649},
        url={http://arxiv.org/abs/1909.10649},
        year={2019}
    }
