Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/princeton-nlp/cofipruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
https://github.com/princeton-nlp/cofipruning
bert model-compression nlp pruning
Last synced: 3 months ago
JSON representation
[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
Host: GitHub
URL: https://github.com/princeton-nlp/cofipruning
Owner: princeton-nlp
License: mit
Created: 2022-03-30T01:59:57.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-05-09T01:35:01.000Z (almost 2 years ago)
Last Synced: 2023-11-07T20:21:40.975Z (over 1 year ago)
Topics: bert, model-compression, nlp, pruning
Language: Python
Homepage:
Size: 1.79 MB
Stars: 173
Watchers: 10
Forks: 32
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        ## ☕ CoFiPruning: Structured Pruning Learns Compact and Accurate Models

This repository contains the code and pruned models for our ACL'22 paper [Structured Pruning Learns Compact and Accurate Models](https://arxiv.org/pdf/2204.00408.pdf). Our talk slides can be found [here](https://github.com/princeton-nlp/CoFiPruning/blob/main/ACL2022_talk.pdf). Numerical results in the paper can be found [here](https://docs.google.com/spreadsheets/d/1IF6ReWkSDKnZomvQT9GBlEhfN48NoDyr-XHOvD2vjvs/edit#gid=0). 

**************************** **Updates** ****************************

* 05/09/2022: We release the pruned model checkpoints on RTE, MRPC and CoLA!

* 04/01/2022: We released [our paper](https://arxiv.org/pdf/2204.00408.pdf) along with pruned model checkpoints on SQuAD, SST-2, QNLI and MNLI. Check it out!

## Quick Links

- [☕ CoFiPruning: Structured Pruning Learns Compact and Accurate Models](#-cofipruning-structured-pruning-learns-compact-and-accurate-models)

- [Quick Links](#quick-links)

- [Overview](#overview)

- [Main Results](#main-results)

- [Model List](#model-list)

- [Train CoFiPruning](#train-cofipruning)

  - [Requirements](#requirements)

  - [Training](#training)

  - [Evaluation](#evaluation)

- [Bugs or Questions?](#bugs-or-questions)

- [Citation](#citation)

## Overview

We propose CoFiPruning, a task-specific, structured pruning approach (**Co**arse and **Fi**ne-grained Pruning) and show that structured pruning can achieve highly compact subnetworks and obtain large speedups and competitive accuracy as distillation approaches, while requiring much less computation. Our key insight is to jointly prune coarse-grained units (e.g., self-attention or feed-forward layers) and fine-grained units (e.g., heads, hidden dimensions) simultaneously. Different from existing works, our approach controls the pruning decision of every single parameter by multiple masks of different granularity. This is the key to large compression, as it allows the greatest flexibility of pruned structures and eases the optimization compared to only pruning small units. We also devise a layerwise distillation strategy to transfer knowledge from unpruned to pruned models during optimization. 

![](figures/main_figure.png)

## Main Results

We show the main results of CoFiPruning along with results of popular pruning and distillation methods including Block Pruning, DynaBERT, DistilBERT and TinyBERT. Please see more detailed results in our paper. 

![](figures/main_results.png)

## Model List

Our released models are listed as following. You can download these models with the following links. We use a batch size of 128 and V100 32GB GPUs for speedup evaluation. We show F1 score for SQuAD and accuracy score for GLUE datasets. `s60` denotes that the sparsity of the model is roughly 60%.

|  model name | task | sparsity | speedup | score |  

|:------|:--------:|:--------:|:-------:|:-----:|

|  [princeton-nlp/CoFi-MNLI-s60](https://huggingface.co/princeton-nlp/CoFi-MNLI-s60) | MNLI |  60.2% | 2.1 ×| 85.3 |  

| [princeton-nlp/CoFi-MNLI-s95](https://huggingface.co/princeton-nlp/CoFi-MNLI-s95) | MNLI  |  94.3% | 12.1 × | 80.6 |  

|  [princeton-nlp/CoFi-QNLI-s60](https://huggingface.co/princeton-nlp/CoFi-QNLI-s60) | QNLI |  60.3% | 2.1 × | 91.8 |  

| [princeton-nlp/CoFi-QNLI-s95](https://huggingface.co/princeton-nlp/CoFi-QNLI-s95) | QNLI  |  94.5% | 12.1 × | 86.1 |  

| [princeton-nlp/CoFi-SST2-s60](https://huggingface.co/princeton-nlp/CoFi-SST2-s60) | SST-2 |  60.1% | 2.1 ×| 93.0 |  

| [princeton-nlp/CoFi-SST2-s95](https://huggingface.co/princeton-nlp/CoFi-SST2-s95) | SST-2  |  94.5% | 12.2 ×| 90.4 |  

|  [princeton-nlp/CoFi-SQuAD-s60](https://huggingface.co/princeton-nlp/CoFi-SQuAD-s60) |  SQuAD |  59.8% | 2.0 ×| 89.1 |  

| [princeton-nlp/CoFi-SQuAD-s93](https://huggingface.co/princeton-nlp/CoFi-SQuAD-s93) | SQuAD  |  92.4% | 8.7 ×| 82.6 |  

| [princeton-nlp/CoFi-RTE-s60](https://huggingface.co/princeton-nlp/CoFi-RTE-s60) | RTE | 60.2% | 2.0 x | 72.6 |

| [princeton-nlp/CoFi-RTE-s96](https://huggingface.co/princeton-nlp/CoFi-RTE-s96) | RTE | 96.2% | 12.8 x | 66.1 |

| [princeton-nlp/CoFi-CoLA-s60](https://huggingface.co/princeton-nlp/CoFi-CoLA-s60) | CoLA | 60.4% | 2.0 x | 60.4 |

| [princeton-nlp/CoFi-CoLA-s95](https://huggingface.co/princeton-nlp/CoFi-CoLA-s95) | CoLA | 95.1% | 12.3 x | 38.9 |

| [princeton-nlp/CoFi-MRPC-s60](https://huggingface.co/princeton-nlp/CoFi-MRPC-s60) | MRPC | 61.5% | 2.0 x | 86.8 |

| [princeton-nlp/CoFi-MRPC-s95](https://huggingface.co/princeton-nlp/CoFi-MRPC-s95) | MRPC | 94.9% | 12.2 x | 83.6 |

You can use these models with the huggingface interface:

```python

from models.modeling_bert import CoFiBertForSequenceClassification

model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95") 

output = model(**inputs)

```

 

## Train CoFiPruning

In the following section, we provide instructions on training CoFi with our code.

### Requirements

Try runing the following script to install the dependencies.

Please define a lower version of transformers, because the latest version seems seems do not have `hf_bucket_url` in `transformers.file_utils` 

```bash

pip install -r requirements.txt

```

### Training

**Training scripts**

We provide example training scripts for training with CoFiPruning with different combination of training units and objectives in `scripts/run_CoFi.sh`. The script only supports single-GPU training and we explain the arguments in following:

- `--task_name`: we support sequence classification tasks and extractive question answer tasks. You can input a glue task name, e.g., `MNLI` or use `--train_file` and `--validation_file` arguments with other tasks (supported by HuggingFace).

- `--ex_name_suffix`: experiment name (for output dir)

- `--ex_cate`: experiment category name (for output dir)

- `--pruning_type`: we support all combinations of the following four types of pruning units. Default pruning type is `structured_heads+structured_mlp+hidden+layer`. Setting it to `None` falls back to standard fine-tuning.

  - `structured_heads`: head pruning

  - `structured_mlp`: mlp intermediate dimension pruning

  - `hidden`: hidden states pruning

  - `layer`: layer pruning

- `--target_sparsity`: target sparsity of the pruned model

- `--distillation_path`: the directory of the teacher model

- `--distillation_layer_loss_alpha`: weight for layer distillation

- `--distillation_ce_loss_alpha`: weight for cross entropy distillation

- `--layer_distill_version`: we recommend using version 4 for small-sized datasets to impose an explicit restriction on layer orders but for relatively larger datasets, version 3 and version 4 do not make much difference. @zhangzhenyu13 found that randomly selecting teacher layers leads to more stable results, which is version 6. Please find [this pull request](https://github.com/princeton-nlp/CoFiPruning/pull/35) for more details.

- `--sparsity_epsilon`: the epsilon to relax the sparsity constraint. If set to be larger than 0, the training process will start saving models with a sparsity `target_sparsity - sparsity_epislon`. This is recommended to be set to be 0.01 when training with 0.95 sparsity to replicate our reported numbers, so that the models with a sparsity above 0.94 will be saved.

After pruning the model, the same script could be used for further fine-tuning the pruned model with following arguments:

- `--pretrained_pruned_model`: directory of the pruned model

- `--learning_rate`: learning rate of the fine-tuning stage

Note that during fine-tuning stage, `pruning_type` should be set to `None`.

An example for training (pruning) is as follows:

```bash

TASK=MNLI

SUFFIX=sparsity0.95

EX_CATE=CoFi

PRUNING_TYPE=structured_heads+structured_mlp+hidden+layer

SPARSITY=0.95

DISTILL_LAYER_LOSS_ALPHA=0.9

DISTILL_CE_LOSS_ALPHA=0.1

LAYER_DISTILL_VERSION=4

SPARSITY_EPSILON=0.01

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON

```

An example for fine_tuning after pruning is as follows:

```bash

PRUNED_MODEL_PATH=$proj_dir/$TASK/$EX_CATE/${TASK}_${SUFFIX}/best

PRUNING_TYPE=None # Setting the pruning type to be None for standard fine-tuning.

LEARNING_RATE=3e-5

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON [PRUNED_MODEL_PATH] $LEARNING_RATE

```

The training process will save the model with the best validation accuracy under `$PRUNED_MODEL_PATH/best`. And you can use the `evaluation.py` script for evaluation.

### Evaluation

Our pruned models are served on Huggingface's model hub. You can use the script `evalution.py` to get the sparsity, inference time and development set results of a pruned model.

```bash

python evaluation.py [TASK] [MODEL_NAME_OR_DIR]

```

An example use of evaluating a sentence classification model is as follows:

```bash

python evaluation.py MNLI princeton-nlp/CoFi-MNLI-s95 

```  

The expected output of the model is as follows:

``` 

Task: MNLI

Model path: princeton-nlp/CoFi-MNLI-s95

Model size: 4920106

Sparsity: 0.943

mnli/acc: 0.8055

seconds/example: 0.010151

```

**Hyperparameters**

We use the following hyperparamters for training CoFiPruning:

|               | GLUE (small) | GLUE (large) | SQuAD|

|:--------------|:-----------:|:--------------:|:---------:|

| Batch size    | 32          | 32            | 16       |

| Pruning learning rate  | 2e-5 | 2e-5 | 3e-5 |

| Fine-tuning learning rate |     1e-5, 2e-5, 3e-5      |1e-5, 2e-5, 3e-5|1e-5, 2e-5, 3e-5|

| Layer distill. alpha | 0.9, 0.7, 0.5|0.9, 0.7, 0.5|0.9, 0.7, 0.5|

| Cross entropy distill. alpha | 0.1, 0.3, 0.5|0.1, 0.3, 0.5|0.1, 0.3, 0.5|

| Pruning epochs | 100 | 20 | 20 |

| Pre-finetuning epochs | 4 | 1 | 1 |

| Sparsity warmup epochs | 20 | 2 | 2 |

| Finetuning epochs | 20 | 20 | 20 |

GLUE (small) denotes the GLUE tasks with a relatively smaller size including CoLA, STS-B, MRPC and RTE and GLUE (large) denotes the rest of the GLUE tasks including SST-2, MNLI, QQP and QNLI. Note that hyperparameter search is essential for small-sized datasets but is less important for large-sized datasets. 

## Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Mengzhou ([email protected]) and Zexuan ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

## Citation

Please cite our paper if you use CoFiPruning in your work:

```bibtex

@inproceedings{xia2022structured,

   title={Structured Pruning Learns Compact and Accurate Models},

   author={Xia, Mengzhou and Zhong, Zexuan and Chen, Danqi},

   booktitle={Association for Computational Linguistics (ACL)},

   year={2022}

}

```