# XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks
## Microsoft Research [[Task-specific]](https://aclanthology.org/2020.acl-main.202.pdf) [[Task-agnostic with Task Transfer]](https://arxiv.org/pdf/2106.04563.pdf)

Releasing **XtremeDistilTransformers** with TensorFlow 2.3 and [HuggingFace Transformers](https://huggingface.co/transformers) under a unified API with the following features:
* Distil any supported [pre-trained language models](https://huggingface.co/transformers/pretrained_models.html) as teachers (e.g., BERT, ELECTRA, RoBERTa)
* Initialize student model with any pre-trained model (e.g., MiniLM, DistilBERT, TinyBERT), or initialize from scratch
* Multilingual text classification and sequence tagging
* Distil multiple hidden states from teacher (see the generic loss sketch after this list)
* Distil deep attention networks from teacher
* Pairwise and instance-level classification tasks (e.g., MNLI, MRPC, SST)
* Progressive knowledge transfer with gradual unfreezing
* Fast mixed precision training for distillation (e.g., mixed_float16, mixed_bfloat16)
* ONNX runtime inference
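
To make the soft-label and hidden-state objectives above concrete, here is a generic knowledge-distillation loss sketch in TensorFlow. It illustrates the technique only and is not this repository's exact implementation; the layer mapping, projection to a common width, and loss weighting are assumptions.

```python
# Generic illustration of distillation objectives (NOT this repo's exact code):
# temperature-scaled soft-label KD plus hidden-state matching between teacher and student.
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, temperature=2.0):
    # Soft-target cross-entropy between temperature-scaled teacher and student logits.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    log_preds = tf.nn.log_softmax(student_logits / temperature)
    kd_loss = -tf.reduce_mean(tf.reduce_sum(soft_targets * log_preds, axis=-1))

    # MSE between a selected teacher layer and the corresponding student layer;
    # student_hidden is assumed to already match the teacher's hidden width.
    hidden_loss = tf.reduce_mean(tf.square(teacher_hidden - student_hidden))

    return kd_loss + hidden_loss  # loss weights omitted for brevity
```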

You can use the following *task-agnostic pre-distilled checkpoints* from XtremeDistilTransformers for (only) fine-tuning on labeled data from downstream tasks:
- [6/256 xtremedistil pre-trained checkpoint](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased)
- [6/384 xtremedistil pre-trained checkpoint](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased)
- [12/384 xtremedistil pre-trained checkpoint](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)

For the best performance, initialize XtremeDistilTransformers with any of the above pre-distilled checkpoints and perform *task-specific distillation* with additional unlabeled data from the downstream task.
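
As a quick start, the pre-distilled checkpoints can also be loaded directly with the standard HuggingFace Transformers API. The snippet below is a minimal sketch; the sequence-classification head is freshly initialized and `num_labels=3` is illustrative (e.g., for MNLI), not part of the checkpoint itself.

```python
# Minimal sketch: load a task-agnostic pre-distilled checkpoint for downstream fine-tuning.
# The classification head is freshly initialized; num_labels=3 is illustrative (e.g., MNLI).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

inputs = tokenizer("XtremeDistil students are tiny and fast.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```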

The following table shows the performance of the above checkpoints on the GLUE dev set and SQuAD-v2.

| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|----------------|--------|---------|------|------|------|------|------|------|--------|-------|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |

*Install requirements*
```pip install -r requirements.txt```

Tested with `tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0, python 3.6.9 and CUDA 10.2`

*Sample usages for distilling different pre-trained language models*

***Training***

*Sequence Labeling for Wiki NER*
```
PYTHONHASHSEED=42 python run_xtreme_distil.py \
--task $$PT_DATA_DIR/datasets/NER \
--model_dir $$PT_OUTPUT_DIR \
--seq_len 32 \
--transfer_file $$PT_DATA_DIR/datasets/NER/unlabeled.txt \
--do_NER \
--pt_teacher TFBertModel \
--pt_teacher_checkpoint bert-base-multilingual-cased \
--student_distil_batch_size 256 \
--student_ft_batch_size 32 \
--teacher_batch_size 128 \
--pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased \
--distil_chunk_size 10000 \
--teacher_model_dir $$PT_OUTPUT_DIR \
--distil_multi_hidden_states \
--distil_attention \
--compress_word_embedding \
--freeze_word_embedding \
--opt_policy mixed_float16
```

*Text Classification for MNLI*
```
PYTHONHASHSEED=42 python run_xtreme_distil.py \
--task $$PT_DATA_DIR/glue_data/MNLI \
--model_dir $$PT_OUTPUT_DIR \
--seq_len 128 \
--transfer_file $$PT_DATA_DIR/glue_data/MNLI/train.tsv \
--do_pairwise \
--pt_teacher TFElectraModel \
--pt_teacher_checkpoint google/electra-base-discriminator \
--student_distil_batch_size 128 \
--student_ft_batch_size 32 \
--pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased \
--teacher_model_dir $$PT_OUTPUT_DIR \
--teacher_batch_size 32 \
--distil_chunk_size 300000 \
--opt_policy mixed_float16
```

Alternatively, use the TinyBERT pre-trained student model checkpoint with `--pt_student_checkpoint nreimers/TinyBERT_L-4_H-312_v2`

*Arguments*

```
- refer to code for detailed arguments

- task folder contains
-- train/dev/test '.tsv' files with text and classification labels / token-wise tags (space-separated)
--- Example 1: feel good about themselves 1
--- Example 2: '' Atelocentra '' Meyrick , 1884 O B-LOC O O O O
-- label files containing class labels for sequence labeling
-- transfer file containing unlabeled data

- model_dir to store/restore model checkpoints

- task arguments
-- do_pairwise for pairwise classification tasks like MNLI and MRPC
-- do_NER for sequence labeling

- teacher arguments
-- pt_teacher for teacher model to distil (e.g., TFBertModel, TFRobertaModel, TFElectraModel)
-- pt_teacher_checkpoint for pre-trained teacher model checkpoints (e.g., bert-base-multilingual-cased, roberta-large, google/electra-base-discriminator)

- student arguments
-- pt_student_checkpoint to initialize from pre-trained small student models (e.g., MiniLM, DistilBert, TinyBert)
-- instead of pre-trained checkpoint, initialize a raw student from scratch with
--- hidden_size
--- num_hidden_layers
--- num_attention_heads

- distillation features
-- distil_multi_hidden_states to distil multiple hidden states from the teacher
-- distil_attention to distil deep attention network of the teacher
-- compress_word_embedding to initialize student word embedding with SVD-compressed teacher word embedding (useful for multilingual distillation)
-- freeze_word_embedding to keep student word embeddings frozen during distillation (useful for multilingual distillation)
-- opt_policy (e.g., mixed_float16 for GPU and mixed_bfloat16 for TPU)
-- distil_chunk_size for using transfer data in chunks during distillation (reduce for OOM issues, checkpoints are saved after every distil_chunk_size steps)
```
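
For reference, `opt_policy` values correspond to Keras mixed-precision policies. The sketch below shows what enabling such a policy looks like in TensorFlow (the TF >= 2.4 API is shown; the tested TF 2.3 exposes the same policy via `tf.keras.mixed_precision.experimental`); the toy model is purely illustrative.

```python
# Minimal sketch: enable a Keras mixed-precision policy before building the model.
# TF >= 2.4 API shown; TF 2.3 uses tf.keras.mixed_precision.experimental.set_policy.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")  # "mixed_bfloat16" on TPU

# Layers now compute in float16 while keeping float32 variables; keep the final
# activation in float32 for numerical stability.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(384,)),
    tf.keras.layers.Dense(2),
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
print(model.layers[0].compute_dtype, model.layers[0].variable_dtype)  # float16 float32
```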

***Model Outputs***

The above training code generates intermediate model checkpoints so that training can resume after an abrupt termination instead of starting from scratch; all checkpoints are saved in $$PT_OUTPUT_DIR. The final output of the model consists of (i) `xtremedistil.h5` with the distilled model weights, (ii) `xtremedistil-config.json` with the training configuration, and (iii) `word_embedding.npy` with the input word embeddings from the student model.
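
For a quick sanity check, the training configuration and the word-embedding matrix can be inspected with standard Python tooling. This is a hedged sketch: the `output` directory is a placeholder for $$PT_OUTPUT_DIR, and loading `xtremedistil.h5` itself requires rebuilding the student architecture as done in the prediction script.

```python
# Hedged sketch: inspect the distillation artifacts described above.
# "output" is a placeholder for the actual $$PT_OUTPUT_DIR location.
import json
import numpy as np

output_dir = "output"

with open(f"{output_dir}/xtremedistil-config.json") as f:
    config = json.load(f)
print(config)  # training configuration used for distillation

word_emb = np.load(f"{output_dir}/word_embedding.npy")
print(word_emb.shape)  # (vocab_size, embedding_dim) of the student model
```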

***Prediction***

```
PYTHONHASHSEED=42 python run_xtreme_distil_predict.py \
--do_eval \
--model_dir $$PT_OUTPUT_DIR \
--do_predict \
--pred_file ../../datasets/NER/unlabeled.txt \
--opt_policy mixed_float16
```

***ONNX Runtime Inference***

You can also use [ONNX Runtime](https://github.com/microsoft/onnxruntime) for inference speedup with the following script:

```
PYTHONHASHSEED=42 python run_xtreme_distil_predict_onnx.py \
--do_eval \
--model_dir $$PT_OUTPUT_DIR \
--do_predict \
--pred_file ../../datasets/NER/unlabeled.txt
```

For details on ONNX Runtime inference, environment setup, and arguments, refer to this [Notebook](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/Tensorflow_Keras_Bert-Squad_OnnxRuntime_CPU.ipynb).
The script performs online inference with batch_size=1.
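
For context, plain ONNX Runtime inference follows the pattern below. This is a generic sketch rather than the repo's script; the model file name (`student.onnx`) and input names are illustrative assumptions, and the exact export and input signature come from `run_xtreme_distil_predict_onnx.py`.

```python
# Generic ONNX Runtime online-inference sketch (batch_size=1); the model file name
# and input names below are illustrative assumptions, not outputs of this repo's scripts.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("student.onnx", providers=["CPUExecutionProvider"])

seq_len = 32
feed = {
    "input_ids": np.zeros((1, seq_len), dtype=np.int64),
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
}
logits = session.run(None, feed)[0]  # first output of the exported graph
print(logits.shape)
```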

***Continued Fine-tuning***

You can continue fine-tuning the distilled/compressed student model on more labeled data with the following script:

```
PYTHONHASHSEED=42 python run_xtreme_distil_ft.py --model_dir $$PT_OUTPUT_DIR
```

If you use this code, please cite:
```
@misc{mukherjee2021xtremedistiltransformers,
title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
year={2021},
eprint={2106.04563},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
```
@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
author = "Mukherjee, Subhabrata and
Hassan Awadallah, Ahmed",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.202",
pages = "2221--2234",
}
```

Code is released under the [MIT](https://github.com/MSR-LIT/XtremeDistil/blob/master/LICENSE) license.