(Interspeech 2023 & ICASSP 2024) Official repository for ARMHuBERT and STaRHuBERT
https://github.com/sungnyun/armhubert
- Host: GitHub
- URL: https://github.com/sungnyun/armhubert
- Owner: sungnyun
- License: apache-2.0
- Created: 2023-05-14T11:34:28.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-29T06:56:21.000Z (about 1 year ago)
- Last Synced: 2025-03-25T22:21:33.917Z (7 months ago)
- Topics: automatic-speech-recognition, distillation, ssl-compression
- Language: Python
- Homepage:
- Size: 4.52 MB
- Stars: 39
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ♽ Recycle-and-Distill (Interspeech 2023)
[**Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation**](https://arxiv.org/abs/2305.11685), INTERSPEECH 2023.
[Kangwook Jang](https://scholar.google.com/citations?user=p8GFX-sAAAAJ&hl)\*,
[Sungnyun Kim](https://bit.ly/sungnyunkim)\*,
[Se-Young Yun](https://fbsqkd.github.io), [Hoirin Kim](https://scholar.google.com/citations?user=naLHjOsAAAAJ&hl=en)
\* equal contribution

- **Attention Map Reusing**: Reuse the previous layer's attention map to remove the key and query parameters in the Transformer.
- **Masking Distillation**: Distill masked and unmasked frames separately.
- The parameters and MACs of ARMHuBERT are reduced to **28% and 30%** of the teacher, HuBERT Base, respectively.
- ARMHuBERT achieves a **PER of 7.72% and a WER of 9.96%** on the SUPERB benchmark with end-to-end distillation.

📌 Check out our model's performance on the [SUPERB Leaderboard](https://superbbenchmark.org/leaderboard)!
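To make attention map reusing concrete, here is a toy PyTorch sketch of the idea (an illustration under our own simplifying assumptions, not the code in this repository): a layer that reuses an attention map computed by an earlier layer keeps only its value and output projections and drops the query/key projections.
```
import torch
import torch.nn as nn

class ReusedAttention(nn.Module):
    """Toy layer that consumes an attention map computed by an earlier layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)    # value projection is kept
        self.out_proj = nn.Linear(dim, dim)  # output projection is kept
        # no q_proj / k_proj: the attention map is reused, not recomputed

    def forward(self, x: torch.Tensor, attn_map: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), attn_map: (batch, time, time) from a previous layer
        v = self.v_proj(x)
        return self.out_proj(attn_map @ v)
```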
### 🤗 Checkpoints
For our model's checkpoints, go check this [link](https://huggingface.co/sungnyun/ARMHuBERT/tree/main)!

| Model name | Parameters | Teacher | Training dataset | Link |
|------------------|------------|---------|------------------| ---- |
| ARMHuBERT-960h | 26.45M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-960h.ckpt) |
| ARMHuBERT-S-100h | 22.39M | HuBERT | LibriSpeech-100h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-S-100h.ckpt) |
| ARMHuBERT-S-960h | 22.39M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-S-960h.ckpt) |
| ARMwavLM-S-100h | 22.39M | wavLM | LibriSpeech-100h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMwavLM-S-100h.ckpt) |
| ARMwavLM-S-960h | 22.39M | wavLM | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMwavLM-S-960h.ckpt) |
| MaskHuBERT-960h | 26.64M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/MaskHuBERT-960h.ckpt) |
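If you just want to grab a checkpoint programmatically, a minimal sketch using `huggingface_hub` and `torch` follows. This assumes the `.ckpt` files are standard PyTorch checkpoints; it only downloads and inspects the file.
```
import torch
from huggingface_hub import hf_hub_download

# Download one of the checkpoints listed in the table above.
path = hf_hub_download(repo_id="sungnyun/ARMHuBERT", filename="ARMHuBERT-960h.ckpt")

# Inspect it; a Lightning-style checkpoint typically stores weights under "state_dict".
ckpt = torch.load(path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
```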
# How to use this repo
## Requirements
Install the necessary packages with:
```
$ pip install -r requirements.txt
```

## Distillation
1. Download the teacher model checkpoint to perform knowledge distillation, and place it under the root path, `./`.
+ For HuBERT Base: [link](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert) (`hubert_base_ls960.pt`)
+ For wavLM Base: [link](https://huggingface.co/s3prl/converted_ckpts/tree/main) (`wavlm_base.pt`)
2. Download the [LibriSpeech](https://www.openslr.org/12) dataset.
+ For 100h distillation, download `train-clean-100`
+ For 960h distillation, download whole dataset, `train-clean-100`, `train-clean-360`, `train-other-500`
+ For validation, download `dev-clean`
+ You can also validate your model on the test set. In that case, download `test-clean` and modify `self.eval_data` in `train.py`.
3. Modify the configuration file in `./conf/[model_name]/[config].yaml`.
+ For example, the configuration file `./conf/armhubert/armhubert-960.yaml` contains all the settings for reproducing ARMHuBERT on LibriSpeech 960h dataset.
+ Set the path to the teacher model checkpoint at `teacher_model`, and the root path to the LibriSpeech dataset at `libri_root` (a small sanity-check sketch for these paths appears after the training commands below).
4. Then, run the following command:
```
python train.py -c ./conf/[model_name]/[config].yaml
```

For ARMHuBERT,
```
python train.py -c ./conf/armhubert/armhubert-960.yaml
```

After training, the model checkpoints and the corresponding configuration file will be created at `./results/pretrain/`.
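As referenced in step 3, here is a small, optional sanity check for the two paths before launching `train.py`. The key names `teacher_model` and `libri_root` come from the README; whether they sit at the top level of the YAML is an assumption, so adjust the lookup if your config nests them.
```
import os
import yaml

# Load the same config you pass to train.py.
with open("./conf/armhubert/armhubert-960.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("teacher_model", "libri_root"):
    path = cfg.get(key)  # may live under a sub-section in the actual config
    ok = bool(path) and os.path.exists(str(path))
    print(f"{key}: {path} {'(found)' if ok else '(check this path)'}")
```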
## Fine-tuning
0. If you don't feel like training your model, feel free to use our [checkpoints](https://huggingface.co/sungnyun/ARMHuBERT/tree/main).
1. Clone and install the [S3PRL toolkit](https://github.com/s3prl/s3prl) with ```pip install -e ".[all]"``` (dev mode).
2. Copy the entire `./models/[model_name]` folder into `/s3prl/upstream/`.
3. Add the upstream import line to `/s3prl/hub.py`:
```
from s3prl.upstream.[model_name].hubconf import *
```
For ARMHuBERT,
```
from s3prl.upstream.armhubert.hubconf import *
```
4. Change each config file of the s3prl downstream tasks as follows:
+ Uncomment the learning rate scheduler
+ Scale the learning rate to 10x for the speaker identification (SID) task
5. Run the following command to fine-tune the ARMHuBERT model.
For automatic speech recognition (ASR) as an example:
```
# -n: experiment name (you can set it to whatever you want)
# -k / -g: point these to the checkpoint (.ckpt) and config (.yaml) produced under ./results/pretrain/
python run_downstream.py \
-m train \
-n ARMHuBERT-ASR \
-u armhubert \
-d asr \
-k <path to the checkpoint in ./results/pretrain/> \
-g <path to the config .yaml in ./results/pretrain/>
```
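After steps 2-3, you can quickly confirm that the copied upstream is visible to s3prl before launching a downstream run. This is an optional, hypothetical check, not part of the official instructions:
```
# Assumes s3prl is installed in dev mode and the import line was added to hub.py.
import s3prl.hub as hub

entries = [name for name in dir(hub) if "armhubert" in name.lower()]
print("Registered armhubert upstreams:", entries)  # should be non-empty if step 3 worked
```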
Note: Refer to the [SUPERB docs](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/docs/superb.md) for more information on usage details and data preparation.

## Result
We evaluate our student models on the SUPERB benchmark.
MaskHuBERT substantially improves performance on content- and semantics-related tasks; see PR, ASR, SF, and IC.
ARMHuBERT improves over MaskHuBERT on the SF and SID tasks while performing at a similar level on the other tasks.
ARMHuBERT achieves a better overall score of **78.1** with fewer parameters than MaskHuBERT.
This is state-of-the-art performance among end-to-end distillation approaches such as [Deep-versus-wide 12-L](https://arxiv.org/abs/2207.06867?context=eess.AS) and [FitHuBERT](https://arxiv.org/abs/2207.00555).

You can also see that our strategy works on another Transformer backbone model, [wavLM](https://arxiv.org/abs/2110.13900).
## Try this distillation strategy with your Transformer backbone models
We have only evaluated HuBERT-based models, but this strategy applies identically to any speech model with a Transformer backbone, e.g., [AST](https://arxiv.org/abs/2104.01778) (Audio Spectrogram Transformer).

## BibTeX
If you find this repo useful for your research, please consider citing our paper:
```
@inproceedings{jang2023recycleanddistill,
title={Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation},
author={Kangwook Jang and Sungnyun Kim and Se-Young Yun and Hoirin Kim},
booktitle={Proc. INTERSPEECH 2023},
pages={316--320},
year={2023}
}
```
# STaR (ICASSP 2024)
🎉 Update (Apr 12, 2024): Our new paper, STaR, has been selected as **Best Student Paper** at ICASSP 2024!
🎉 Check out our model's performance on the [SUPERB Leaderboard](https://superbbenchmark.org/leaderboard)!
[**STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models**](https://arxiv.org/abs/2312.09040), ICASSP 2024.
[Kangwook Jang](https://scholar.google.com/citations?user=p8GFX-sAAAAJ&hl),
[Sungnyun Kim](https://bit.ly/sungnyunkim),
[Hoirin Kim](https://scholar.google.com/citations?user=naLHjOsAAAAJ&hl=en)

- **Speech Temporal Relation (STaR)**: Distill the knowledge by focusing on the pairwise **temporal relation** between two speech frames.
- **Temporal Gram Matrix (TGM)**: Propose the Temporal Gram Matrix, which aggregates channel information at two time steps.
  - Layer-wise TGM: Distill the TGM for every Transformer layer.
  - Intra-layer TGM: Modify the TGM to compute the temporal relation between the input and output of a single Transformer layer.
- Incorporating the two TGMs together as distillation objectives, our student model STaRHuBERT (22M & 26M) achieves state-of-the-art overall and generalizability scores on the SUPERB benchmark.
- Under further compression (9.39M & 14.1M), our approach is more robust against performance degradation than other works.
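As a rough sketch, one plausible reading of the Temporal Gram Matrix described above is an inner product of the channel vectors at every pair of time steps (see the paper for the exact definition used in STaR):
```
import torch

def temporal_gram(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, time, channels) -> (batch, time, time) pairwise temporal relation."""
    channels = features.size(-1)
    # Inner product between the channel vectors at time steps i and j, scaled by the channel count.
    return features @ features.transpose(1, 2) / channels
```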
## Checkpoints
For our model's checkpoints, please check the following links. All models are distilled from HuBERT Base.
- STaRHuBERT-L (26.6M): [ckpt](https://drive.google.com/file/d/1La5jh0jPv-JCk2ECT2raRHtXMYYJNtsF/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1NpHhMVj7EE1Sx5eG3tX5fMCa4UUlrnwI/view?usp=drive_link)
- STaRHuBERT (22.3M): [ckpt](https://drive.google.com/file/d/1Zu1idRx-sVaqMvRsUKtttzkSn4df6TUr/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1C6ZcYM4Fcxj0vurid5N02BAANj4smr8U/view?usp=drive_link)
- STaRHuBERT-S (14.1M): [ckpt](https://drive.google.com/file/d/1sUpbDupbDtlCvN-49TblUn8GDmhV8MT8/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1OFjIo1UjxNrxEboKaT8Z-vtuL6xmyNsu/view?usp=drive_link)
- STaRHuBERT-XS (9.39M): [ckpt](https://drive.google.com/file/d/1sUpbDupbDtlCvN-49TblUn8GDmhV8MT8/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1OFjIo1UjxNrxEboKaT8Z-vtuL6xmyNsu/view?usp=drive_link)

We also add a model distilled from WavLM Base:
- STaRWavLM (22.3M): [ckpt](https://drive.google.com/file/d/1gWq55o0HfarwgfRpY_2ncmr7jfhtIKR-/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1p79PqwbEarBDm2X7k2DEbsN_lE3pgqki/view?usp=drive_link)
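The checkpoints above are hosted on Google Drive. If you prefer fetching them programmatically, one option is the third-party `gdown` package (not mentioned in the README, so treat this as an optional convenience; `pip install gdown` first):
```
import gdown

# Example: download the STaRHuBERT (22.3M) checkpoint from its share link above.
url = "https://drive.google.com/file/d/1Zu1idRx-sVaqMvRsUKtttzkSn4df6TUr/view?usp=drive_link"
gdown.download(url=url, output="STaRHuBERT.ckpt", fuzzy=True)  # fuzzy=True parses the share link
```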
## Distillation
We do not offer official implementation code for the distillation.
Nevertheless, since STaRHuBERT is built on the ARMHuBERT backbone, you can easily re-implement our approach with this ARMHuBERT repository.

## Fine-tuning
You can reproduce our model with the given checkpoints. Please follow the steps below (this is almost the same as the ARMHuBERT case).
1. Clone and install the [S3PRL toolkit](https://github.com/s3prl/s3prl) with ```pip install -e ".[all]"``` (dev mode).
2. Copy the entire `./models/starhubert` folder into `/s3prl/upstream/`.
3. Add the upstream import line to `/s3prl/hub.py`:
```
from s3prl.upstream.starhubert.hubconf import *
```
4. Change each config file of the s3prl downstream tasks as follows:
+ Uncomment the learning rate scheduler
+ Scale the learning rate to 10x for the speaker identification (SID) task
5. Run the following command to fine-tune the STaRHuBERT model.
For automatic speech recognition (ASR) as an example:
```
# -n: experiment name (you can set it to whatever you want)
# -k / -g: point these to the downloaded (or trained) checkpoint (.ckpt) and config (.yaml)
python run_downstream.py \
-m train \
-n STaRHuBERT-ASR \
-u starhubert \
-d asr \
-k <path to the checkpoint in ./results/pretrain/> \
-g <path to the config .yaml in ./results/pretrain/>
```
Note: Refer to the [SUPERB docs](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/docs/superb.md) for more information on usage details and data preparation.

## BibTeX
If you find this repo useful for your research, please consider citing our paper:
```
@inproceedings{jang2024star,
title={STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models},
author={Jang, Kangwook and Kim, Sungnyun and Kim, Hoirin},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={10721--10725},
year={2024},
organization={IEEE}
}
```

## Contact
For any details or clarification, please reach out to
- Kangwook Jang: dnrrkdwkd12@kaist.ac.kr
- Sungnyun Kim: ksn4397@kaist.ac.kr