# A RoBERTa-wwm-ext Distillation Model

This is a Chinese RoBERTa-wwm distillation model distilled from RoBERTa-wwm-ext-large. The teacher model comes from this [GitHub repository](https://github.com/ymcui/Chinese-BERT-wwm); thanks to its author for the contribution.

## Based On

This model was trained based on this [paper](https://arxiv.org/abs/1910.01108) (DistilBERT), published by Hugging Face.

## Corpus

To train this model, I used baike_qa2019, news2016_zh, webtext_2019, and wiki_zh. These corpora can be found in this [GitHub repository](https://github.com/brightmart/nlp_chinese_corpus).

## Model Download

The models are currently only available via BaiduYun; the download links are below.

| Model | BaiduYun |
|:----------------------------------------------:|:------------------------------------------------------------------:|
| Roberta-wwm-ext-base-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1e7_Zqa1_TlFfzx1n51BTUw) |
| Roberta-wwm-ext-large-3layers-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1XKaqVHrp8y0tHCNK64kpYw) (extraction code: 26hu) |
| Roberta-wwm-ext-large-6layers-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1MlALh3WC71KO0lw-jTCyxw) (extraction code: seou) |

## Train Detail

Training this model took two steps:

- First, I ran the roberta_ext_wwm_large teacher model over all examples to record its outputs for the masked tokens.

- Then, I used those outputs to train the student model, which was initialized from the roberta_ext_wwm_base pretrained weights.

### Dataset

- I used 5 different fixed maskings per sentence (static masking, not dynamic masking).

- Each example has at most 20 masked tokens.
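
A minimal sketch of that static masking step, in plain Python. This is illustrative only, not the repo's `create_pretraining_data.py`; whole-word masking and the usual BERT replacement variants are omitted for brevity.

```python
# Illustrative only: each sentence gets DUPE_FACTOR fixed masked copies generated
# up front (static masking), each with at most MAX_PREDICTIONS masked positions.
import random

MASK_TOKEN = "[MASK]"
DUPE_FACTOR = 5        # five different maskings per sentence, no dynamic masking
MAX_PREDICTIONS = 20   # cap on masked tokens per example
MASKED_LM_PROB = 0.15

def make_static_masked_copies(tokens, rng=random):
    copies = []
    for _ in range(DUPE_FACTOR):
        num_to_mask = min(MAX_PREDICTIONS, max(1, round(len(tokens) * MASKED_LM_PROB)))
        positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
        masked, labels = list(tokens), {}
        for pos in positions:
            labels[pos] = masked[pos]  # the original token becomes the MLM label
            masked[pos] = MASK_TOKEN
        copies.append((masked, labels))
    return copies

# Five different masked versions of the same (already tokenized) sentence
for masked, labels in make_static_masked_copies(list("这是一个蒸馏模型")):
    print(masked, labels)
```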

### Teacher Model

- I used the RoBERTa-wwm-ext-large model to compute each masked token's output distribution over the vocabulary, and kept only the top 128 dimensions. You might ask why I did not keep more: first, the storage cost would be too high; second, I think keeping more is unnecessary.
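
A rough NumPy sketch of that truncation, illustrative rather than the repo's TensorFlow implementation. The float16 storage and the vocabulary size (21128, the standard Chinese BERT vocab) are assumptions here; only the top-128 idea comes from the description above.

```python
import numpy as np

TRUNCATION_FACTOR = 128  # matches --truncation_factor=128 in the training scripts below

def truncate_teacher_logits(logits):
    """logits: [num_masked_tokens, vocab_size] teacher outputs at the masked positions.

    Keeps only the 128 largest logits per token, plus the vocab ids they belong to."""
    top_ids = np.argsort(logits, axis=-1)[:, -TRUNCATION_FACTOR:]    # ids of the 128 largest logits
    top_vals = np.take_along_axis(logits, top_ids, axis=-1)          # their values
    return top_vals.astype(np.float16), top_ids.astype(np.int32)     # compact on-disk representation

fake_logits = np.random.randn(20, 21128).astype(np.float32)          # 20 masked tokens, Chinese BERT vocab
vals, ids = truncate_teacher_logits(fake_logits)
print(vals.shape, ids.shape)  # (20, 128) (20, 128)
```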

### Student Model

- **Loss**: In this training I used two loss functions, cross entropy and cosine loss, added together (a minimal sketch follows the parameter table below). I think another loss function could bring a further improvement, but I did not have the resources to retrain the model because my free Google TPU quota expired.

- **Other Parameters**

| Model | Batch size | Learning rate | Training steps | Warmup steps |
|:----------------------------------------------:|:----------:|:-------------:|:--------------:|:------------:|
| Roberta-wwm-ext-base-distill, Chinese | 384 | 5e-5 | 1M | 20K |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 128 | 3e-5 | 3M | 2.5K |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 512 | 8e-5 | 1M | 5K |
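
As noted in the **Loss** bullet above, the objective adds a cross-entropy term and a cosine term. Below is a minimal NumPy sketch of one way to combine them; the README does not state which tensors the cosine loss is applied to, so applying it to the truncated logit vectors here is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits_topk, teacher_logits_topk):
    """Both inputs: [num_masked_tokens, 128], gathered at the teacher's top-128 vocab ids."""
    # Soft-target cross entropy between the student's and teacher's distributions
    teacher_probs = softmax(teacher_logits_topk)
    student_log_probs = np.log(softmax(student_logits_topk) + 1e-9)
    ce = -(teacher_probs * student_log_probs).sum(axis=-1).mean()

    # Cosine distance between the normalized student and teacher vectors
    s = student_logits_topk / np.linalg.norm(student_logits_topk, axis=-1, keepdims=True)
    t = teacher_logits_topk / np.linalg.norm(teacher_logits_topk, axis=-1, keepdims=True)
    cosine = 1.0 - (s * t).sum(axis=-1).mean()

    # The two terms are simply added, as described above
    return ce + cosine

print(distill_loss(np.random.randn(20, 128), np.random.randn(20, 128)))
```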

## Comparison

For this comparison I ran each task only once; the results are below.

### Classification

| Model | AFQMC | CMNLI | TNEWS |
|:----------------------------------------------:|:----------:|:---------:|:---------:|
| Roberta-wwm-ext-base, Chinese | 74.04% | 80.51% | 56.94% |
| Roberta-wwm-ext-base-distill, Chinese | **74.44%** | **81.1%** | **57.6%** |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 68.8% | 75.5% | 55.7% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 72% | 79.3% | 56.7% |

| Model | LCQMC dev | LCQMC test |
|:----------------------------------------------:|:---------:|:----------:|
| Roberta-wwm-ext-base, Chinese | 89% | 86.5% |
| Roberta-wwm-ext-base-distill, Chinese | 89% | **87.2%** |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 85.1% | 86% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 87.7% | 86.7% |

### Reading Comprehension (SQuAD-style)

| Model | CMRC2018 dev (F1/EM) |
|:----------------------------------------------:|:--------------------:|
| Roberta-wwm-ext-base, Chinese | 84.72%/**65.24%** |
| Roberta-wwm-ext-base-distill, Chinese | **85.2%**/65.20% |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 78.5%/57.4% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 82.6%/61.7% |

You might ask why these numbers differ from the ones reported in this [GitHub repository](https://github.com/ymcui/Chinese-BERT-wwm). I am not sure why. I ran the tasks with the original base model and got higher scores, and I ran them with the distilled model using the same parameters and also got higher scores. Perhaps I used different parameters than they did.

But as you can see, under the same conditions the distilled model improves on the original model.

## How To Train

- **create pretraining data**

```bash
export DATA_DIR=YOUR_DATA_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
export VOCAB_FILE=YOUR_VOCAB_FILE

# dupe_factor=5: five static maskings per sentence; max_predictions_per_seq=20: at most 20 masked tokens per example
python create_pretraining_data.py \
    --input_dir=$DATA_DIR \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=$VOCAB_FILE \
    --do_whole_word_mask=True \
    --ramdom_next=True \
    --max_seq_length=512 \
    --max_predictions_per_seq=20 \
    --random_seed=12345 \
    --dupe_factor=5 \
    --masked_lm_prob=0.15 \
    --doc_stride=256 \
    --max_workers=2 \
    --short_seq_prob=0.1
```

- **create teacher output data**

```bash
export TF_RECORDS=YOUR_PRETRAINING_TF_RECORDS
export TEACHER_MODEL=YOUR_TEACHER_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

# truncation_factor=128: keep only the teacher's top-128 vocabulary logits per masked token
python create_teacher_output_data.py \
    --bert_config_file=$TEACHER_MODEL/bert_config.json \
    --input_file=$TF_RECORDS \
    --output_dir=$OUTPUT_DIR \
    --truncation_factor=128 \
    --init_checkpoint=$TEACHER_MODEL/bert_model.ckpt \
    --max_seq_length=512 \
    --max_predictions_per_seq=20 \
    --predict_batch_size=64
```

- **run distill**

```bash
export TF_RECORDS=YOUR_TEACHER_OUTPUT_TF_RECORDS
export STUDENT_MODEL_DIR=YOUR_STUDENT_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

# batch size 384, 1M training steps, 20K warmup steps match the Roberta-wwm-ext-base-distill row above
python run_distill.py \
    --bert_config_file=$STUDENT_MODEL_DIR/bert_config.json \
    --input_file=$TF_RECORDS \
    --output_dir=$OUTPUT_DIR \
    --init_checkpoint=$STUDENT_MODEL_DIR/bert_model.ckpt \
    --truncation_factor=128 \
    --max_seq_length=512 \
    --max_predictions_per_seq=20 \
    --do_train=True \
    --do_eval=True \
    --train_batch_size=384 \
    --eval_batch_size=1024 \
    --num_train_steps=1000000 \
    --num_warmup_steps=20000
```

## FAQ

- **We need a smaller model; yours is still base-sized.**
  1. The purpose of publishing this model is to verify the feasibility of the distillation method.
  2. As you can see, the distillation method does improve accuracy.
- **Why did you publish the 3-layer model?**
  1. Some GitHub users told me they needed a smaller model: the base version is too large and the server cost is hard to afford, so I published the smaller ones!

## Thanks

Thanks to TFRC for providing the TPUs!