Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xiongma/roberta-wwm-base-distill
A Chinese RoBERTa-wwm base model distilled from RoBERTa-wwm-ext-large (the teacher).
- Host: GitHub
- URL: https://github.com/xiongma/roberta-wwm-base-distill
- Owner: xiongma
- License: apache-2.0
- Archived: true
- Created: 2020-01-08T09:17:48.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-03-30T01:56:08.000Z (almost 5 years ago)
- Last Synced: 2024-08-12T05:02:24.224Z (6 months ago)
- Topics: bert, distillation, natural-language-processing, pretrained-models, roberta, tensorflow
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 65
- Watchers: 4
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- ATPapers - policeme / roberta-wwm-base-distill - A chinese Roberta wwm distillation model which was distilled from roberta-ext-wwm-large (Pretrained Language Model / Repository)
README
# A Roberta-ext-wwm Distillation Model
This is a Chinese RoBERTa-wwm distillation model, distilled from RoBERTa-wwm-ext-large. The large model comes from this [GitHub repository](https://github.com/ymcui/Chinese-BERT-wwm); thanks to its authors for the contribution.
## Based On
This model was trained following this [paper](https://arxiv.org/abs/1910.01108) (DistilBERT), which was published by Hugging Face.
## Corpus
To train this model, I used baike_qa2019, news2016_zh, webtext_2019, and wiki_zh. The data can be found in this [GitHub repository](https://github.com/brightmart/nlp_chinese_corpus).
## Model Download
The model is only available via BaiduYun; the download links are below.
| Model | BaiduYun |
|:----------------------------------------------:|:------------------------------------------------------------------:|
| Roberta-wwm-ext-base-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1e7_Zqa1_TlFfzx1n51BTUw) |
| Roberta-wwm-ext-large-3layers-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1XKaqVHrp8y0tHCNK64kpYw) (extraction code: 26hu) |
| Roberta-wwm-ext-large-6layers-distill, Chinese | [TensorFlow](https://pan.baidu.com/s/1MlALh3WC71KO0lw-jTCyxw) (extraction code: seou) |

## Train Detail
Training this model took two steps:
- I used the roberta_ext_wwm_large model (the teacher) to compute the outputs for every example's masked tokens.
- I used those outputs to train the student model, which was initialized with the roberta_ext_wwm_base pretrained weights.
### Dataset
- Each sentence is masked in 5 different static ways up front; no dynamic masking is used.
- Each example contains at most 20 masked tokens (see the sketch below).
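For illustration, here is a minimal sketch of that static masking step. The function name and simplifications are mine; the repo's `create_pretraining_data.py` additionally handles whole word masking and the usual [MASK]/random/keep replacement:

```python
import random

def make_static_masks(tokens, dupe_factor=5, masked_lm_prob=0.15,
                      max_predictions_per_seq=20, mask_token="[MASK]"):
    """Create `dupe_factor` independently masked copies of one tokenized sentence."""
    examples = []
    for _ in range(dupe_factor):
        # Mask roughly 15% of the tokens, but never more than 20 per example.
        num_to_mask = min(max_predictions_per_seq,
                          max(1, int(round(len(tokens) * masked_lm_prob))))
        positions = sorted(random.sample(range(len(tokens)), num_to_mask))
        masked = list(tokens)
        labels = []
        for pos in positions:
            labels.append((pos, tokens[pos]))  # keep the original token as the label
            masked[pos] = mask_token
        examples.append((masked, labels))
    return examples

# Example: 5 static maskings of one (already tokenized) sentence.
print(make_static_masks(list("我喜欢自然语言处理")))
```

Because the masks are generated once and stored, they correspond to the `--dupe_factor=5`, `--masked_lm_prob=0.15`, and `--max_predictions_per_seq=20` options used in the commands below.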
### Teacher Model
- I used the RoBERTa large model to compute every masked token's output over the vocabulary and kept at most 128 dimensions. You might ask why I didn't keep more: first, the storage cost is too high; second, I think keeping more is unnecessary. A sketch of this truncation follows.
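As a rough illustration of that truncation, here is a NumPy sketch of what the `--truncation_factor=128` option amounts to (the repo's `create_teacher_output_data.py` presumably does the equivalent inside its TensorFlow graph; the names here are mine):

```python
import numpy as np

def truncate_teacher_output(masked_token_logits, truncation_factor=128):
    """Keep only the `truncation_factor` largest vocab scores per masked token.

    masked_token_logits: [num_masked_tokens, vocab_size] teacher outputs.
    Returns (top_values, top_vocab_ids), each [num_masked_tokens, truncation_factor],
    which is far cheaper to store than the full vocab-sized distribution.
    """
    top_ids = np.argsort(-masked_token_logits, axis=-1)[:, :truncation_factor]
    top_values = np.take_along_axis(masked_token_logits, top_ids, axis=-1)
    return top_values, top_ids

# Example: 20 masked tokens, Chinese BERT vocab size 21128.
logits = np.random.randn(20, 21128).astype(np.float32)
values, ids = truncate_teacher_output(logits)
print(values.shape, ids.shape)  # (20, 128) (20, 128)
```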
### Student Model
- **Loss**: Training adds two loss functions together: cross entropy and cosine loss (a sketch follows the parameter table below). I think a different loss function could bring a further improvement, but I did not have the resources to train with it, because my free Google TPU quota expired.
- **Other Parameters**
| Model | batch size | learning rate | training steps | warmup steps |
|:----------------------------------------------:|:----------:|:-------------:|:--------------:|:------------:|
| Roberta-wwm-ext-base-distill, Chinese | 384 | 5e-5 | 1M | 20K |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 128 | 3e-5 | 3M | 2.5K |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 512 | 8e-5 | 1M | 5K |
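The sketch below illustrates the "cross entropy plus cosine" combination on the truncated teacher and student vocab scores. It is my own minimal NumPy rendering, under the assumption that the student's scores are gathered at the teacher's saved top-128 vocab ids; the exact terms and any weighting are defined in `run_distill.py` and may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_scores, teacher_scores):
    """Soft-target cross entropy plus cosine distance, added together.

    Both inputs are [num_masked_tokens, truncation_factor] scores over the
    same top-128 vocab ids saved in the teacher step.
    """
    # Cross entropy of student predictions against the teacher's soft targets.
    teacher_probs = softmax(teacher_scores)
    student_log_probs = np.log(softmax(student_scores) + 1e-12)
    cross_entropy = -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

    # Cosine distance between the two score vectors, averaged over masked tokens.
    num = np.sum(student_scores * teacher_scores, axis=-1)
    den = (np.linalg.norm(student_scores, axis=-1) *
           np.linalg.norm(teacher_scores, axis=-1) + 1e-12)
    cosine_loss = np.mean(1.0 - num / den)

    return cross_entropy + cosine_loss

print(distill_loss(np.random.randn(20, 128), np.random.randn(20, 128)))
```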
## Comparison

In this part, each task was run only once; the results are below.
### Classification
| Model | AFQMC | CMNLI | TNEWS |
|:----------------------------------------------:|:----------:|:---------:|:---------:|
| Roberta-wwm-ext-base, Chinese | 74.04% | 80.51% | 56.94% |
| Roberta-wwm-ext-base-distill, Chinese | **74.44%** | **81.1%** | **57.6%** |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 68.8% | 75.5% | 55.7% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 72% | 79.3% | 56.7% |

| Model | LCQMC dev | LCQMC test |
|:----------------------------------------------:|:---------:|:----------:|
| Roberta-wwm-ext-base, Chinese | 89% | 86.5% |
| Roberta-wwm-ext-base-distill, Chinese | 89% | **87.2%** |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 85.1% | 86% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 87.7% | 86.7% |

### SQuAD
| Model | CMRC2018 dev (F1/EM) |
|:----------------------------------------------:|:--------------------:|
| Roberta-wwm-ext-base, Chinese | 84.72%/**65.24%** |
| Roberta-wwm-ext-base-distill, Chinese | **85.2%**/65.20% |
| Roberta-wwm-ext-large-3layers-distill, Chinese | 78.5%/57.4% |
| Roberta-wwm-ext-large-6layers-distill, Chinese | 82.6%/61.7% |

You might ask why this comparison differs from this [GitHub repository](https://github.com/ymcui/Chinese-BERT-wwm). I am not sure why. I ran these tasks with the original base model and got the scores shown above, and I ran them with the distilled models using the same parameters and got the scores shown above. Perhaps I used different parameters.
But as you can see, under the same conditions the distilled model improves on the original model.
## How To Train
- **create pretraining data**
```bash
export DATA_DIR=YOUR_DATA_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
export VOCAB_FILE=YOUR_VOCAB_FILE

python create_pretraining_data.py \
--input_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--vocab_file=$VOCAB_FILE \
--do_whole_word_mask=True \
--ramdom_next=True \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--random_seed=12345 \
--dupe_factor=5 \
--masked_lm_prob=0.15 \
--doc_stride=256 \
--max_workers=2 \
--short_seq_prob=0.1
```

- **create teacher output data**
```bash
export TF_RECORDS=YOUR_PRETRAINING_TF_RECORDS
export TEACHER_MODEL=YOUR_TEACHER_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python create_teacher_output_data.py \
--bert_config_file=$TEACHER_MODEL/bert_config.json \
--input_file=$TF_RECORDS \
--output_dir=$OUTPUT_DIR \
--truncation_factor=128 \
--init_checkpoint=$TEACHER_MODEL/bert_model.ckpt \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--predict_batch_size=64
```

- **run distill**
```bash
export TF_RECORDS=YOUR_TEACHER_OUTPUT_TF_RECORDS
export STUDENT_MODEL_DIR=YOUR_STUDENT_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR

python run_distill.py \
--bert_config_file=$STUDENT_MODEL_DIR/bert_config.json \
--input_file=$TF_RECORDS \
--output_dir=$OUTPUT_DIR \
--init_checkpoint=$STUDENT_MODEL_DIR/bert_model.ckpt \
--truncation_factor=128 \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--do_train=True \
--do_eval=True \
--train_batch_size=384 \
--eval_batch_size=1024 \
--num_train_steps=1000000 \
--num_warmup_steps=20000
```

## Answers
- **We need a smaller model; yours is still base size.**
  1. The purpose of publishing this model is to verify that the distillation method is feasible.
  2. As you can see, this distillation method can improve accuracy.
- **Why did you publish the 3-layer model?**
  1. Some GitHub users told me they needed a smaller model: the BERT base version is too large and the server cost is hard to afford, so I published the smaller ones!

## Thanks
Thanks to TFRC for providing the TPUs!