https://github.com/getalp/Flaubert

Unsupervised Language Model Pre-training for French
Host: GitHub
URL: https://github.com/getalp/Flaubert
Owner: getalp
License: other
Created: 2019-11-20T08:45:36.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-04-11T13:20:16.000Z (over 2 years ago)
Last Synced: 2024-09-04T00:05:03.029Z (10 months ago)
Language: Python
Size: 568 KB
Stars: 240
Watchers: 19
Forks: 30
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

# FlauBERT and FLUE

**FlauBERT** is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) [Jean Zay](http://www.idris.fr/eng/jean-zay/) supercomputer. This repository shares everything: the pre-trained models (base and large), the data, the code to use the models, and the code to train them if you need to.

 

Along with FlauBERT comes [**FLUE**](https://github.com/getalp/Flaubert/tree/master/flue): an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language. 

 

This repository is **still under construction** and everything will be available soon. 

# Table of Contents

**1. [FlauBERT models](#1-flaubert-models)**

**2. [Using FlauBERT](#2-using-flaubert)**

&nbsp;&nbsp;&nbsp;&nbsp;2.1. [Using FlauBERT with Hugging Face's Transformers](#21-using-flaubert-with-hugging-faces-transformers)

&nbsp;&nbsp;&nbsp;&nbsp;2.2. [Using FlauBERT with Facebook XLM's library](#22-using-flaubert-with-facebook-xlms-library)

**3. [Pre-training FlauBERT](#3-pre-training-flaubert)**

&nbsp;&nbsp;&nbsp;&nbsp;3.1. [Data](#31-data)

&nbsp;&nbsp;&nbsp;&nbsp;3.2. [Training](#32-training)

&nbsp;&nbsp;&nbsp;&nbsp;3.3. [Convert an XLM pre-trained model to Hugging Face's Transformers](#33-convert-an-xlm-pre-trained-model-to-hugging-faces-transformers)

**4. [Fine-tuning FlauBERT on the FLUE benchmark](#4-fine-tuning-flaubert-on-the-flue-benchmark)**

**5. [Video presentation](#5-video-presentation)**

**6. [Citation](#6-citation)**

 

# 1. FlauBERT models

**FlauBERT** is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) [Jean Zay](http://www.idris.fr/eng/jean-zay/) supercomputer. We have released the pretrained weights for the following model sizes.

The pretrained models are available for download from [here](https://zenodo.org/record/3627732) or via Hugging Face's library.

| Model name | Number of layers | Attention heads | Embedding dimension | Total parameters |
| :------: | :---: | :---: | :---: | :---: |
| `flaubert-small-cased` | 6 | 8 | 512 | 54 M |
| `flaubert-base-uncased` | 12 | 12 | 768 | 137 M |
| `flaubert-base-cased` | 12 | 12 | 768 | 138 M |
| `flaubert-large-cased` | 24 | 16 | 1024 | 373 M |

Note: `flaubert-small-cased` is only partially trained, so its performance is not guaranteed. Consider using it for debugging purposes only.

We also provide the checkpoints for the base (cased/uncased) and large (cased) models [here](https://www.dropbox.com/s/65f8unz1imz89ew/flaubert_checkpoints.tar.gz?dl=0).
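For scripting, the model table above can be kept as a small lookup structure, for instance to check the per-head dimension of each model. This is only an illustrative sketch: `FLAUBERT_SIZES` and `head_dim` are hypothetical names that do not exist in this repository, and the numbers are copied from the table.

```python
# Values copied from the model table; the names below are illustrative only.
FLAUBERT_SIZES = {
    "flaubert-small-cased": {"layers": 6, "heads": 8, "emb_dim": 512},
    "flaubert-base-uncased": {"layers": 12, "heads": 12, "emb_dim": 768},
    "flaubert-base-cased": {"layers": 12, "heads": 12, "emb_dim": 768},
    "flaubert-large-cased": {"layers": 24, "heads": 16, "emb_dim": 1024},
}


def head_dim(model_name):
    """Return the dimension handled by each attention head."""
    cfg = FLAUBERT_SIZES[model_name]
    if cfg["emb_dim"] % cfg["heads"] != 0:
        raise ValueError("embedding dimension must be divisible by the number of heads")
    return cfg["emb_dim"] // cfg["heads"]


for name in FLAUBERT_SIZES:
    print(name, head_dim(name))
```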

# 2. Using FlauBERT

In this section, we describe two ways to obtain sentence embeddings from pretrained FlauBERT models: either via [Hugging Face's Transformers](https://github.com/huggingface/transformers) library or via [Facebook's XLM library](https://github.com/facebookresearch/XLM). We will integrate FlauBERT into [Facebook's fairseq](https://github.com/pytorch/fairseq) in the near future.

## 2.1. Using FlauBERT with Hugging Face's Transformers

You can use FlauBERT with [Hugging Face's Transformers](https://github.com/huggingface/transformers) library as follows.

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
# do_lowercase=False if using cased models, True if using uncased ones
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
```

**Note:** if your `transformers` version is <=2.10.0, `modelname` should take one of the following values:

```
['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
```
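The naming rule in the note can be wrapped in a small helper that picks the identifier style from a version string. The sketch below is hypothetical (neither `parse_version` nor `flaubert_model_id` exists in this repository or in `transformers`); it only encodes the two naming schemes listed above and compares plain numeric version components.

```python
VALID_VARIANTS = {"small_cased", "base_uncased", "base_cased", "large_cased"}


def parse_version(version):
    """Turn '2.10.0' into (2, 10, 0); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def flaubert_model_id(variant, transformers_version):
    """Return the model name to pass to from_pretrained for a given library version."""
    if variant not in VALID_VARIANTS:
        raise ValueError("unknown FlauBERT variant: %s" % variant)
    if parse_version(transformers_version) <= (2, 10, 0):
        # Older releases use the dash-separated names
        return "flaubert-" + variant.replace("_", "-")
    # Newer releases use the namespaced names
    return "flaubert/flaubert_" + variant


print(flaubert_model_id("base_cased", "2.9.1"))
print(flaubert_model_id("base_cased", "2.11.0"))
```

In practice the version string would come from `transformers.__version__`.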

## 2.2. Using FlauBERT with Facebook XLM's library

The pretrained FlauBERT models are available for download from [here](https://zenodo.org/record/3627732). Each compressed folder includes 3 files:

- `*.pth`: FlauBERT's pretrained model.

- `codes`: BPE codes learned on the training data.

- `vocab`: BPE vocabulary file.

**Note:** The following example only works with the modified XLM provided in this repo; it won't work with the [original XLM](https://github.com/facebookresearch/XLM). The code is taken from [this tutorial](https://github.com/getalp/Flaubert/blob/master/tutorials/generate_embeddings.py).

```python
import sys
import torch
import fastBPE

# Add Flaubert root to system path (change accordingly)
FLAUBERT_ROOT = '/home/user/Flaubert'
sys.path.append(FLAUBERT_ROOT)

from xlm.model.embedder import SentenceEmbedder
from xlm.data.dictionary import PAD_WORD

# Paths to model files
model_path = '/home/user/flaubert_base_cased/flaubert_base_cased_xlm.pth'
codes_path = '/home/user/flaubert_base_cased/codes'
vocab_path = '/home/user/flaubert_base_cased/vocab'
do_lowercase = False # Change this to True if you use uncased FlauBERT

bpe = fastBPE.fastBPE(codes_path, vocab_path)

sentences = "Le chat mange une pomme ."
if do_lowercase:
    sentences = sentences.lower()

# Apply BPE, then wrap each sentence in the </s> delimiters expected by the model
sentences = bpe.apply([sentences])
sentences = [(('</s> %s </s>' % sent.strip()).split()) for sent in sentences]
print(sentences)

# Create batch
bs = len(sentences)
slen = max([len(sent) for sent in sentences])

# Reload pretrained model
embedder = SentenceEmbedder.reload(model_path)
embedder.eval()
dico = embedder.dico

# Prepare inputs to model: one column per sentence, padded with PAD_WORD
word_ids = torch.LongTensor(slen, bs).fill_(dico.index(PAD_WORD))
for i in range(len(sentences)):
    sent = torch.LongTensor([dico.index(w) for w in sentences[i]])
    word_ids[:len(sent), i] = sent
lengths = torch.LongTensor([len(sent) for sent in sentences])

# Get sentence embeddings (corresponding to the BERT [CLS] token)
cls_embedding = embedder.get_embeddings(x=word_ids, lengths=lengths)
print(cls_embedding.size())

# Get the entire output tensor for all tokens
# Note that cls_embedding = tensor[0]
tensor = embedder.get_embeddings(x=word_ids, lengths=lengths, all_tokens=True)
print(tensor.size())
```
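The input-preparation step above builds a tensor of shape `(slen, bs)`: one column per sentence, with padding at the bottom of the shorter columns. The pure-Python sketch below reproduces that layout with nested lists so it can be inspected without loading a model. The toy vocabulary, its indices, and the `pad_batch` helper are hypothetical; with a real model the indices come from `embedder.dico`.

```python
def pad_batch(sentences, word_to_id, pad_id):
    """Lay out tokenized sentences as a (slen, bs) grid of word indices."""
    bs = len(sentences)
    slen = max(len(sent) for sent in sentences)
    # Start from a grid filled with the padding index
    word_ids = [[pad_id] * bs for _ in range(slen)]
    for i, sent in enumerate(sentences):
        for t, word in enumerate(sent):
            word_ids[t][i] = word_to_id[word]  # row = position, column = sentence
    lengths = [len(sent) for sent in sentences]
    return word_ids, lengths


# Toy vocabulary, for illustration only
toy_vocab = {"<pad>": 2, "le": 10, "chat": 11, "dort": 12}
batch = [["le", "chat", "dort"], ["chat"]]

word_ids, lengths = pad_batch(batch, toy_vocab, toy_vocab["<pad>"])
for row in word_ids:
    print(row)
print(lengths)
```

Because sentences are stored column-wise, row 0 holds the first token of every sentence, which is why the full output tensor can be indexed with `tensor[0]` to get one vector per sentence.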

# 3. Pre-training FlauBERT

### Install dependencies

You should clone this repo and then install [WikiExtractor](https://github.com/attardi/wikiextractor), [fastBPE](https://github.com/facebookresearch/XLM/tree/master/tools#fastbpe) and [Moses tokenizer](https://github.com/moses-smt/mosesdecoder) under `tools`:

```bash
git clone https://github.com/getalp/Flaubert.git
cd Flaubert

# Install toolkit
cd tools
git clone https://github.com/attardi/wikiextractor.git
git clone https://github.com/moses-smt/mosesdecoder.git

git clone https://github.com/glample/fastBPE.git
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
```

## 3.1. Data

In this section, we describe the pipeline to prepare the data for training FlauBERT. This is based on [Facebook XLM's library](https://github.com/facebookresearch/XLM). The steps are as follows:

1. Download, clean, and tokenize the data using the Moses tokenizer.

2. Split the cleaned data into train, validation, and test sets.

3. Learn BPE on the training set. Then apply learned BPE codes to train, validation, and test sets.

4. Binarize data.

### (1) Download and Preprocess Data

In the following, replace `$DATA_DIR` with the path to the local directory where the downloaded data should be saved, and `$corpus_name` with the name of the corpus you want to download, chosen among the options specified in the scripts.

To download and preprocess the data, execute the following commands:

```bash
./download.sh $DATA_DIR $corpus_name fr
./preprocess.sh $DATA_DIR $corpus_name fr
```

For example:

```bash
./download.sh ~/data gutenberg fr
./preprocess.sh ~/data gutenberg fr
```

The first command downloads the raw data to `$DATA_DIR/raw/fr_gutenberg`; the second one processes it and saves the result to `$DATA_DIR/processed/fr_gutenberg`.

### (2) Split Data

Run the following command to split the cleaned corpus into train, validation, and test sets. You can modify the train/validation/test ratio in the script.

```bash
bash tools/split_train_val_test.sh $DATA_PATH
```

where `$DATA_PATH` is the path to the file to be split.

The output files are `fr.train`, `fr.valid`, and `fr.test`, which are saved in the same directory as the original file.
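To make the idea of a ratio-based split concrete, here is a minimal Python sketch. It is not the implementation of `tools/split_train_val_test.sh`: the shuffling, the default ratios, and the `split_lines` function are all assumptions chosen for illustration. Only the output names `fr.train`, `fr.valid`, and `fr.test` come from the text above.

```python
import random


def split_lines(lines, valid_ratio=0.05, test_ratio=0.05, seed=0):
    """Shuffle lines and split them into train/valid/test (hypothetical ratios)."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    n_valid = int(len(lines) * valid_ratio)
    n_test = int(len(lines) * test_ratio)
    return {
        "fr.valid": lines[:n_valid],
        "fr.test": lines[n_valid:n_valid + n_test],
        "fr.train": lines[n_valid + n_test:],
    }


corpus = ["phrase %d" % i for i in range(100)]
splits = split_lines(corpus)
for name, part in splits.items():
    print(name, len(part))
```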

### (3) & (4) Learn BPE and Prepare Data

Run the following command to learn BPE codes on the training set, and apply BPE codes on the train, validation, and test sets. The data is then binarized and ready for training.

```bash
bash tools/create_pretraining_data.sh $DATA_DIR $BPE_size
```

where `$DATA_DIR` is the path to the directory where the three files above (`fr.train`, `fr.valid`, `fr.test`) are saved, and `$BPE_size` is the BPE vocabulary size in thousands, for example `30` for 30k, `50` for 50k, etc. The output files are saved in `$DATA_DIR/BPE/30k` or `$DATA_DIR/BPE/50k` correspondingly.
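If you drive the pipeline from Python, the output location described above can be derived from the two arguments. The helper below is a hypothetical convenience (it is not part of the repository); it only encodes the `$DATA_DIR/BPE/<size>k` naming stated in the text.

```python
import os


def bpe_output_dir(data_dir, bpe_size):
    """Directory where the BPE-encoded, binarized data is expected, e.g. <data_dir>/BPE/50k."""
    if bpe_size <= 0:
        raise ValueError("bpe_size is given in thousands and must be positive")
    return os.path.join(data_dir, "BPE", "%dk" % bpe_size)


print(bpe_output_dir("/home/user/data", 50))
```

The resulting path is what the training command expects as its `--data_path` argument.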

## 3.2. Training

Our codebase for pretraining FlauBERT is largely based on the [XLM repo](https://github.com/facebookresearch/XLM#i-monolingual-language-model-pretraining-bert), with some modifications. You can also use their code to train FlauBERT; it will work just fine.

Execute the following command to train FlauBERT (base) on your preprocessed data:

```bash
python train.py \
    --exp_name flaubert_base_cased \
    --dump_path $dump_path \
    --data_path $data_path \
    --amp 1 \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 16 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'
```

where `$dump_path` is the path to where you want to save your pretrained model, `$data_path` is the path to the binarized data sets, for example `$DATA_DIR/BPE/50k`.
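The `--optimizer` string in the command above selects Adam with an inverse square-root learning-rate schedule. As a rough sketch of what such a schedule typically looks like with the values used here (`lr=0.0006`, `warmup_updates=24000`): the rate rises linearly during warm-up, then decays in proportion to the inverse square root of the update number. The initial rate `warmup_init_lr=1e-7` and the function name are assumptions made for illustration; refer to the optimizer code in this repository for the exact behavior.

```python
import math


def inverse_sqrt_lr(step, lr=0.0006, warmup_updates=24000, warmup_init_lr=1e-7):
    """Learning rate at a given update (warmup_init_lr is an assumed default)."""
    if step < warmup_updates:
        # Linear warm-up from warmup_init_lr to lr
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # Inverse square-root decay, continuous with the warm-up at step == warmup_updates
    return lr * math.sqrt(warmup_updates / step)


for step in (0, 12000, 24000, 96000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the peak rate of 0.0006 is reached at update 24000 and is halved by update 96000, four times the warm-up length.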

### Run experiments on multiple GPUs and/or multiple nodes

To run experiments on multiple GPUs in a single machine, you can use the following command (the parameters after `train.py` are the same as above).

```bash
export NGPU=4
export CUDA_VISIBLE_DEVICES=0,1,2,3 # if you only use some of the GPUs in the machine
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```
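When combining several GPUs with gradient accumulation, it helps to keep track of how many sequences contribute to each optimizer update. The sketch below assumes that `--batch_size` counts sequences per GPU per forward pass and that `--bptt` is the maximum sequence length in tokens; both are assumptions about the training loop, and the function names are hypothetical.

```python
def sequences_per_update(batch_size, accumulate_gradients, n_gpus=1, n_nodes=1):
    """Number of sequences seen between two optimizer updates (under the stated assumptions)."""
    return batch_size * accumulate_gradients * n_gpus * n_nodes


def max_tokens_per_update(batch_size, accumulate_gradients, bptt, n_gpus=1, n_nodes=1):
    """Upper bound on tokens per update, reached when every sequence is bptt tokens long."""
    return sequences_per_update(batch_size, accumulate_gradients, n_gpus, n_nodes) * bptt


# Values from the training command, on a single machine with NGPU=4
print(sequences_per_update(16, 16, n_gpus=4))
print(max_tokens_per_update(16, 16, 512, n_gpus=4))
```

If you change the number of GPUs, adjusting `--accumulate_gradients` in the opposite direction keeps this product, and therefore the optimization behavior, roughly comparable.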

To run experiments on multiple nodes with multiple GPUs on a cluster that uses SLURM as its resource manager, launch training with the following command after requesting resources with `#SBATCH` (the parameters after `train.py` are the same as above, plus the `--master_port` parameter).

```bash
srun python train.py
```

## 3.3. Convert an XLM pre-trained model to Hugging Face's Transformers

To convert an XLM pre-trained model to Hugging Face's Transformers, you can use the following command.

```bash
python tools/use_flaubert_with_transformers/convert_to_transformers.py --inputdir $inputdir --outputdir $outputdir
```

where `$inputdir` is the path to the XLM pretrained model directory and `$outputdir` is the path to the output directory where you want to save the Hugging Face Transformers model.

# 4. Fine-tuning FlauBERT on the FLUE benchmark

[FLUE](https://github.com/getalp/Flaubert/tree/master/flue) (French Language Understanding Evaluation) is a general benchmark for evaluating French NLP systems. Please refer to [this page](https://github.com/getalp/Flaubert/tree/master/flue) for an example of fine-tuning FlauBERT on this benchmark.

# 5. Video presentation

You can watch this 7-minute video presentation of FlauBERT:

[VIDEO 7mn](https://www.youtube.com/watch?v=NgLM9GuwSwc)

# 6. Citation

If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:

[LREC paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.302.pdf)

```
@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}
```

[TALN paper](https://hal.archives-ouvertes.fr/hal-02784776/)

```
@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}
```

### License of the models

The models can be accessed on Hugging Face and their license is listed as MIT.