(Interspeech 2023 & ICASSP 2024) Official repository for ARMHuBERT and STaRHuBERT
https://github.com/sungnyun/armhubert
- Host: GitHub
- URL: https://github.com/sungnyun/armhubert
- Owner: sungnyun
- License: apache-2.0
- Created: 2023-05-14T11:34:28.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-29T06:56:21.000Z (about 1 year ago)
- Last Synced: 2025-03-25T22:21:33.917Z (7 months ago)
- Topics: automatic-speech-recognition, distillation, ssl-compression
- Language: Python
- Homepage:
- Size: 4.52 MB
- Stars: 39
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ♽ Recycle-and-Distill (Interspeech 2023)
[**Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation**](https://arxiv.org/abs/2305.11685), INTERSPEECH 2023.
[Kangwook Jang](https://scholar.google.com/citations?user=p8GFX-sAAAAJ&hl)\*,
[Sungnyun Kim](https://bit.ly/sungnyunkim)\*,
[Se-Young Yun](https://fbsqkd.github.io), [Hoirin Kim](https://scholar.google.com/citations?user=naLHjOsAAAAJ&hl=en)
\* equal contribution

- **Attention Map Reusing**: Reuse the previous layer's attention map to remove the key and query parameters in the Transformer.
- **Masking Distillation**: Distill masked and unmasked frames separately.
- The parameters and MACs of ARMHuBERT are reduced to **28% and 30%** of the teacher, HuBERT Base, respectively.
- ARMHuBERT achieves a **PER of 7.72% and a WER of 9.96%** on the SUPERB benchmark with end-to-end distillation.

📌 Check out our model's performance on the [SUPERB Leaderboard](https://superbbenchmark.org/leaderboard)!
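To make attention map reusing concrete, here is a toy PyTorch sketch of the idea (an illustration under our own simplifying assumptions, not the code in this repository): a layer that reuses an attention map computed by an earlier layer keeps only its value and output projections and drops the query/key projections.
```
import torch
import torch.nn as nn

class ReusedAttention(nn.Module):
    """Toy layer that consumes an attention map computed by an earlier layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)    # value projection is kept
        self.out_proj = nn.Linear(dim, dim)  # output projection is kept
        # no q_proj / k_proj: the attention map is reused, not recomputed

    def forward(self, x: torch.Tensor, attn_map: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), attn_map: (batch, time, time) from a previous layer
        v = self.v_proj(x)
        return self.out_proj(attn_map @ v)
```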
### 🤗 Checkpoints
For our model's checkpoints, go check this [link](https://huggingface.co/sungnyun/ARMHuBERT/tree/main)!

| Model name | Parameters | Teacher | Training dataset | Link |
|------------------|------------|---------|------------------| ---- |
| ARMHuBERT-960h | 26.45M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-960h.ckpt) |
| ARMHuBERT-S-100h | 22.39M | HuBERT | LibriSpeech-100h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-S-100h.ckpt) |
| ARMHuBERT-S-960h | 22.39M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMHuBERT-S-960h.ckpt) |
| ARMwavLM-S-100h | 22.39M | wavLM | LibriSpeech-100h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMwavLM-S-100h.ckpt) |
| ARMwavLM-S-960h | 22.39M | wavLM | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/ARMwavLM-S-960h.ckpt) |
| MaskHuBERT-960h | 26.64M | HuBERT | LibriSpeech-960h | [HF Model](https://huggingface.co/sungnyun/ARMHuBERT/blob/main/MaskHuBERT-960h.ckpt) |
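If you just want to grab a checkpoint programmatically, a minimal sketch using `huggingface_hub` and `torch` follows. This assumes the `.ckpt` files are standard PyTorch checkpoints; it only downloads and inspects the file.
```
import torch
from huggingface_hub import hf_hub_download

# Download one of the checkpoints listed in the table above.
path = hf_hub_download(repo_id="sungnyun/ARMHuBERT", filename="ARMHuBERT-960h.ckpt")

# Inspect it; a Lightning-style checkpoint typically stores weights under "state_dict".
ckpt = torch.load(path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
```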
# How to use this repo
## Requirements
Install the necessary packages with:
```
$ pip install -r requirements.txt
```

## Distillation
1. Download the teacher model checkpoint to perform knowledge distillation, and place it under the root path, `./`.
+ For HuBERT Base: [link](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert) (`hubert_base_ls960.pt`)
+ For wavLM Base: [link](https://huggingface.co/s3prl/converted_ckpts/tree/main) (`wavlm_base.pt`)
2. Download the [LibriSpeech](https://www.openslr.org/12) dataset.
+ For 100h distillation, download `train-clean-100`
+ For 960h distillation, download whole dataset, `train-clean-100`, `train-clean-360`, `train-other-500`
+ For validation, download `dev-clean`
+ You can also validate your model on the test set. In that case, download `test-clean` and modify `self.eval_data` in `train.py`.
3. Modify the configuration file in `./conf/[model_name]/[config].yaml`.
+ For example, the configuration file `./conf/armhubert/armhubert-960.yaml` contains all the settings for reproducing ARMHuBERT on LibriSpeech 960h dataset.
+ Set the path to the teacher model checkpoint at `teacher_model`, and the root path to the LibriSpeech dataset at `libri_root` (a small sanity-check sketch for these paths appears after the training commands below).
4. Then, run the following command:
```
python train.py -c ./conf/[model_name]/[config].yaml
```

For ARMHuBERT,
```
python train.py -c ./conf/armhubert/armhubert-960.yaml
```

After training, the model checkpoints and the corresponding configuration file will be created at `./results/pretrain/`.
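As referenced in step 3, here is a small, optional sanity check for the two paths before launching `train.py`. The key names `teacher_model` and `libri_root` come from the README; whether they sit at the top level of the YAML is an assumption, so adjust the lookup if your config nests them.
```
import os
import yaml

# Load the same config you pass to train.py.
with open("./conf/armhubert/armhubert-960.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("teacher_model", "libri_root"):
    path = cfg.get(key)  # may live under a sub-section in the actual config
    ok = bool(path) and os.path.exists(str(path))
    print(f"{key}: {path} {'(found)' if ok else '(check this path)'}")
```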
## Fine-tuning
0. If you don't feel like training your model, feel free to use our [checkpoints](https://huggingface.co/sungnyun/ARMHuBERT/tree/main).
1. Clone and install the [S3PRL toolkit](https://github.com/s3prl/s3prl) with ```pip install -e ".[all]"``` (dev mode).
2. Copy the entire `./models/[model_name]` folder into `/s3prl/upstream/`.
3. Add the upstream import line to `/s3prl/hub.py`:
```
from s3prl.upstream.[model_name].hubconf import *
```
For ARMHuBERT,
```
from s3prl.upstream.armhubert.hubconf import *
```
4. Change each config file of the s3prl downstream tasks as follows:
+ Uncomment the learning rate scheduler
+ Scale the learning rate to 10x for the speaker identification (SID) task
5. Run the following command to fine-tune the ARMHuBERT model.
For automatic speech recognition (ASR) as an example:
```
# -n: experiment name (you can set it to whatever you want)
# -k / -g: point these to the checkpoint (.ckpt) and config (.yaml) produced under ./results/pretrain/
python run_downstream.py \
-m train \
-n ARMHuBERT-ASR \
-u armhubert \
-d asr \
-k <path to the checkpoint in ./results/pretrain/> \
-g <path to the config .yaml in ./results/pretrain/>
```
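After steps 2-3, you can quickly confirm that the copied upstream is visible to s3prl before launching a downstream run. This is an optional, hypothetical check, not part of the official instructions:
```
# Assumes s3prl is installed in dev mode and the import line was added to hub.py.
import s3prl.hub as hub

entries = [name for name in dir(hub) if "armhubert" in name.lower()]
print("Registered armhubert upstreams:", entries)  # should be non-empty if step 3 worked
```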
Note: Refer to the [SUPERB docs](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/docs/superb.md) for more information on usage details and data preparation.

## Result
We evaluate our student models on the SUPERB benchmark.
MaskHuBERT substantially improves performance on content- and semantics-related tasks; see PR, ASR, SF, and IC.
ARMHuBERT improves over MaskHuBERT on the SF and SID tasks while performing at a similar level on the other tasks.
ARMHuBERT achieves a better overall score of **78.1** with fewer parameters than MaskHuBERT.
This is state-of-the-art performance among end-to-end distillation approaches such as [Deep-versus-wide 12-L](https://arxiv.org/abs/2207.06867?context=eess.AS) and [FitHuBERT](https://arxiv.org/abs/2207.00555).

You can also see that our strategy works on another Transformer backbone model, [wavLM](https://arxiv.org/abs/2110.13900).
## Try this distillation strategy with your Transformer backbone models
We have only evaluated HuBERT-based models, but this strategy applies identically to any speech model with a Transformer backbone, e.g., [AST](https://arxiv.org/abs/2104.01778) (Audio Spectrogram Transformer).

## BibTeX
If you find this repo useful for your research, please consider citing our paper:
```
@inproceedings{jang2023recycleanddistill,
title={Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation},
author={Kangwook Jang and Sungnyun Kim and Se-Young Yun and Hoirin Kim},
booktitle={Proc. INTERSPEECH 2023},
pages={316--320},
year={2023}
}
```
# STaR (ICASSP 2024)
🎉 Update (Apr 12, 2024): Our new paper, STaR, has been selected as **Best Student Paper** at ICASSP 2024!
🎉 Check out our model's performance on the [SUPERB Leaderboard](https://superbbenchmark.org/leaderboard)!
[**STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models**](https://arxiv.org/abs/2312.09040), ICASSP 2024.
[Kangwook Jang](https://scholar.google.com/citations?user=p8GFX-sAAAAJ&hl),
[Sungnyun Kim](https://bit.ly/sungnyunkim),
[Hoirin Kim](https://scholar.google.com/citations?user=naLHjOsAAAAJ&hl=en)

- **Speech Temporal Relation (STaR)**: Distill the knowledge by focusing on the pairwise **temporal relation** between two speech frames.
- **Temporal Gram Matrix (TGM)**: Propose the Temporal Gram Matrix, which aggregates channel information at two time steps.
  - Layer-wise TGM: Distill the TGM for every Transformer layer.
  - Intra-layer TGM: Modify the TGM to compute the temporal relation between the input and output of a single Transformer layer.
- Incorporating the two TGMs together as distillation objectives, our student model STaRHuBERT (22M & 26M) achieves state-of-the-art overall and generalizability scores on the SUPERB benchmark.
- Under further compression (9.39M & 14.1M), our approach is more robust against performance degradation than other works.
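As a rough sketch, one plausible reading of the Temporal Gram Matrix described above is an inner product of the channel vectors at every pair of time steps (see the paper for the exact definition used in STaR):
```
import torch

def temporal_gram(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, time, channels) -> (batch, time, time) pairwise temporal relation."""
    channels = features.size(-1)
    # Inner product between the channel vectors at time steps i and j, scaled by the channel count.
    return features @ features.transpose(1, 2) / channels
```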
## Checkpoints
For our model's checkpoints, please check the following links. All models are distilled from HuBERT Base.
- STaRHuBERT-L (26.6M): [ckpt](https://drive.google.com/file/d/1La5jh0jPv-JCk2ECT2raRHtXMYYJNtsF/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1NpHhMVj7EE1Sx5eG3tX5fMCa4UUlrnwI/view?usp=drive_link)
- STaRHuBERT (22.3M): [ckpt](https://drive.google.com/file/d/1Zu1idRx-sVaqMvRsUKtttzkSn4df6TUr/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1C6ZcYM4Fcxj0vurid5N02BAANj4smr8U/view?usp=drive_link)
- STaRHuBERT-S (14.1M): [ckpt](https://drive.google.com/file/d/1sUpbDupbDtlCvN-49TblUn8GDmhV8MT8/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1OFjIo1UjxNrxEboKaT8Z-vtuL6xmyNsu/view?usp=drive_link)
- STaRHuBERT-XS (9.39M): [ckpt](https://drive.google.com/file/d/1sUpbDupbDtlCvN-49TblUn8GDmhV8MT8/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1OFjIo1UjxNrxEboKaT8Z-vtuL6xmyNsu/view?usp=drive_link)

We also add a model distilled from WavLM Base:
- STaRWavLM (22.3M): [ckpt](https://drive.google.com/file/d/1gWq55o0HfarwgfRpY_2ncmr7jfhtIKR-/view?usp=drive_link), [yaml](https://drive.google.com/file/d/1p79PqwbEarBDm2X7k2DEbsN_lE3pgqki/view?usp=drive_link)
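The checkpoints above are hosted on Google Drive. If you prefer fetching them programmatically, one option is the third-party `gdown` package (not mentioned in the README, so treat this as an optional convenience; `pip install gdown` first):
```
import gdown

# Example: download the STaRHuBERT (22.3M) checkpoint from its share link above.
url = "https://drive.google.com/file/d/1Zu1idRx-sVaqMvRsUKtttzkSn4df6TUr/view?usp=drive_link"
gdown.download(url=url, output="STaRHuBERT.ckpt", fuzzy=True)  # fuzzy=True parses the share link
```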
## Distillation
We do not offer official implementation code for the distillation.
Nevertheless, since STaRHuBERT is built on the ARMHuBERT backbone, you can easily re-implement our approach with this ARMHuBERT repository.

## Fine-tuning
You can reproduce our model with the given checkpoints. Please follow the steps below (this is almost the same as the ARMHuBERT case).
1. Clone and install the [S3PRL toolkit](https://github.com/s3prl/s3prl) with ```pip install -e ".[all]"``` (dev mode).
2. Copy the entire `./models/starhubert` folder into `/s3prl/upstream/`.
3. Add the upstream import line to `/s3prl/hub.py`:
```
from s3prl.upstream.starhubert.hubconf import *
```
4. Change each config file of the s3prl downstream tasks as follows:
+ Uncomment the learning rate scheduler
+ Scale the learning rate to 10x for the speaker identification (SID) task
5. Run the following command to fine-tune the STaRHuBERT model.
For automatic speech recognition (ASR) as an example:
```
# -n: experiment name (you can set it to whatever you want)
# -k / -g: point these to the downloaded (or trained) checkpoint (.ckpt) and config (.yaml)
python run_downstream.py \
-m train \
-n STaRHuBERT-ASR \
-u starhubert \
-d asr \
-k <path to the checkpoint in ./results/pretrain/> \
-g <path to the config .yaml in ./results/pretrain/>
```
Note: Refer to the [SUPERB docs](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/docs/superb.md) for more information on usage details and data preparation.

## BibTeX
If you find this repo useful for your research, please consider citing our paper:
```
@inproceedings{jang2024star,
title={STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models},
author={Jang, Kangwook and Kim, Sungnyun and Kim, Hoirin},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={10721--10725},
year={2024},
organization={IEEE}
}
```

## Contact
For any details or clarification, please reach out to
- Kangwook Jang: dnrrkdwkd12@kaist.ac.kr
- Sungnyun Kim: ksn4397@kaist.ac.kr