[IJCAI 2024] EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
- Host: GitHub
- URL: https://github.com/cwx-worst-one/eat
- Owner: cwx-worst-one
- License: mit
- Created: 2023-12-12T10:19:55.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-23T06:12:15.000Z (about 2 months ago)
- Last Synced: 2025-02-12T15:16:46.315Z (8 days ago)
- Topics: audio, audio-classification, deep-learning, eat, fairseq, pytorch, representation-learning, self-supervised-learning
- Language: Python
- Homepage:
- Size: 5.12 MB
- Stars: 129
- Watchers: 4
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
[Linux](https://www.linux.org/) [Python](https://www.python.org/) [PyTorch](https://pytorch.org/) [arXiv](https://arxiv.org/abs/2401.03497) [fairseq](https://github.com/facebookresearch/fairseq) [EAT](https://github.com/cwx-worst-one/EAT)

**Guides**
- [Requirements and Installation](#requirements-and-installation)
- [Model Checkpoints](#model-checkpoints)
- [Feature Extraction](#feature-extraction)
- [Data Preparation](#data-preparation)
- [Pre-Training](#pre-training)
- [Fine-Tuning](#fine-tuning)
- [Inference and Evaluation](#inference-and-evaluation)

## News :fire:
- We release EAT-large (20 epochs) with SOTA performance on AS-2M, AS-20K, ESC-50 and SPC-2.
- We have updated the checkpoints and code; EAT now seamlessly supports variable-length audio throughout the training, feature extraction, inference, and evaluation phases.

## Introduction
EAT is an audio SSL model that achieves both high effectiveness and high efficiency during self-supervised pre-training. Details can be found in the paper [EAT: Self-Supervised Pre-Training with Efficient Audio Transformer](https://arxiv.org/abs/2401.03497).

## Requirements and Installation
The minimum environment requirements are `Python >= 3.8` and `PyTorch >= 1.13`. The versions of the other dependencies we use are listed in `requirements.txt`.
```shell
# clone and install fairseq in editable mode
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

# clone EAT inside the fairseq directory
git clone https://github.com/cwx-worst-one/EAT
```
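As a quick sanity check after installation, the snippet below (a minimal sketch, not part of the official repo) verifies that Python, PyTorch, and fairseq are importable and meet the stated minimum versions.

```python
# Minimal installation sanity check (illustrative, not part of the EAT repo).
import sys

import torch
import fairseq

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)   # expected >= 1.13
print("fairseq:", fairseq.__version__)
print("CUDA available:", torch.cuda.is_available())
```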
## Model Checkpoints
You can download the EAT-base (10 epochs) checkpoints from Google Drive:
- AS-2M [Pre-trained](https://drive.google.com/file/d/10pklbY_fKraQUIBizSg1kv4lJXNWxpxl/view?usp=sharing)
- AS-2M Pre-trained+[Fine-tuned](https://drive.google.com/file/d/1F07zN8N54rXU-szvKUlYaCFMCepc4wHR/view?usp=sharing) (AS-2M)
- AS-2M Pre-trained+[Fine-tuned](https://drive.google.com/file/d/1fRX_Mgj4sHxV2F6AVfoqXObfgzFMnHRA/view?usp=sharing) (AS-20K)

:warning: Because we possess only a limited amount of AudioSet data compared to other models, we highly **recommend** [pre-training](#pre-training) the EAT model on your own data, which will likely perform better than the released checkpoints.
**Update** :new: (**Recommended**)
We have introduced two new variants of the EAT pre-training model, together with their fine-tuned versions, each designed to improve performance through either extended pre-training epochs or a larger model size.

Links for model checkpoints:
- [EAT-base_epoch30](https://drive.google.com/file/d/19hfzLgHCkyqTOYmHt8dqVa9nm-weBq4f/view?usp=sharing) (pre-training)
- [EAT-base_epoch30](https://drive.google.com/file/d/1aCYiQmoZv_Gh1FxnR-CCWpNAp6DIJzn6/view?usp=sharing) (fine-tuning on AS-2M)
- [EAT-large_epoch20](https://drive.google.com/file/d/1PEgriRvHsqrtLzlA478VemX7Q0ZGl889/view?usp=sharing) (pre-training)
- [EAT-large_epoch20](https://drive.google.com/file/d/1b_f_nQAdjM1B6u72OFUtFiUu-4yM2shd/view?usp=sharing) (fine-tuning on AS-2M)

Performance metrics:
| Model | Backbone | Parameters | Pre-training Epochs | AS-20K mAP (%) | AS-2M mAP (%) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| EAT-base | ViT-B | 88M | 10 | 40.3 | 48.6 |
| EAT-base | ViT-B | 88M | 30 | 41.3 | 48.9 |
| EAT-large | ViT-L | 309M | 20 | **42.0** | **49.5** |
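For scripted downloads of the checkpoints above, a hedged sketch using the third-party `gdown` package is shown below; the file ID is taken from the EAT-base_epoch30 pre-training link, and the local filename is only illustrative.

```python
# Illustrative download of the EAT-base_epoch30 pre-training checkpoint via
# gdown (pip install gdown). The file ID comes from the Google Drive link
# above; the output filename is an arbitrary choice.
import gdown

file_id = "19hfzLgHCkyqTOYmHt8dqVa9nm-weBq4f"  # EAT-base_epoch30 (pre-training)
gdown.download(f"https://drive.google.com/uc?id={file_id}",
               "EAT-base_epoch30_pretrained.pt", quiet=False)
```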
## Feature Extraction
We provide a script for extracting audio features from the last layer of the EAT encoder. The features are stored in `.npy` format, and the extracted features have a frame rate of roughly 50 Hz. EAT can provide both frame-level features and utterance-level features (denoted by the CLS token).
To extract latent representations from audio clips, you can use our pre-trained [checkpoint](https://drive.google.com/file/d/19hfzLgHCkyqTOYmHt8dqVa9nm-weBq4f/view?usp=sharing), our fine-tuned [checkpoint](https://drive.google.com/file/d/1aCYiQmoZv_Gh1FxnR-CCWpNAp6DIJzn6/view?usp=sharing), or your own, and then run the script `feature_extract.sh`:
```bash
bash EAT/scripts/feature_extract.sh
```
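As an illustration of how the extracted `.npy` features might be consumed downstream (the exact array layout depends on the options set in `feature_extract.sh`, so the shapes below are assumptions rather than guarantees):

```python
# Load features produced by feature_extract.sh. "sample.npy" is a placeholder
# filename, and the expected shapes are assumptions based on the README.
import numpy as np

feats = np.load("sample.npy")
print(feats.shape)
# For a 10 s clip at a ~50 Hz feature rate, frame-level output is roughly
# (500, 768) for EAT-base (ViT-B hidden size 768); utterance-level (CLS)
# output would instead be a single 768-dim vector.
```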
## Data Preparation
The main dataset in our experiments is [AudioSet](https://research.google.com/audioset/). Regrettably, we are unable to release the audio data due to copyright restrictions. The data manifest is available [here](https://drive.google.com/file/d/1LH2C0q3d4zndoR3-oGkVdYYqDCIdxIsm/view?usp=drive_link). We follow the file format used in [wav2vec](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec) and [data2vec](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec), where the `.tsv` file serves as the audio index and the `.lbl` and `.csv` files are specific to the classification task. You can modify these files for your own dataset, as sketched below.
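For reference, here is a hedged sketch of building a wav2vec/data2vec-style `.tsv` manifest (root directory on the first line, then `relative_path<TAB>num_samples` per clip). The paths and file names are illustrative; compare against the released manifest for the exact layout EAT expects, including the `.lbl`/`.csv` label files.

```python
# Sketch of a wav2vec/data2vec-style manifest for a custom dataset.
# Paths and file names are illustrative assumptions; check the released
# AudioSet manifest before using.
import os
import soundfile as sf

root = "/path/to/my_audio"          # placeholder dataset root
wavs = sorted(f for f in os.listdir(root) if f.endswith(".wav"))

with open("train.tsv", "w") as tsv:
    tsv.write(root + "\n")                       # first line: root directory
    for name in wavs:
        n_frames = sf.info(os.path.join(root, name)).frames
        tsv.write(f"{name}\t{n_frames}\n")       # relative path <TAB> num samples
```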
## Pre-Training
Our code is adapted from [Audio-MAE](https://github.com/facebookresearch/AudioMAE) and [data2vec](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec). We use `pretraining_AS2M.yaml` as the default pre-training config. To pre-train the EAT model on AudioSet, run the script `pretraining_AS2M.sh`:
```bash
bash EAT/scripts/pretraining_AS2M.sh
```
If you need to pre-train the EAT model on other datasets whose audio lengths are not fixed at 10 seconds, please refer to the instructions in `feature_extract/readme.md`.

## Fine-Tuning
We use `finetuning.yaml` as our default fine-tuning config. To fine-tune the EAT model on different downstream tasks, run the script `finetuning_{task}.sh`, where `{task}` is one of `AS20K`, `AS2M`, `ESC50`, and `SPCv2`. For example, you can fine-tune EAT on `AS20K` by executing:
```bash
bash EAT/scripts/finetuning_AS20K.sh
```
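If you want to fine-tune on all four downstream tasks sequentially, a small driver sketch such as the following can be used; it is only a convenience wrapper around the released scripts, assuming they are invoked from the fairseq directory as in the single-task example above.

```python
# Convenience wrapper: fine-tune EAT on each downstream task in turn.
# Assumes EAT/scripts/finetuning_{task}.sh is runnable from the current
# working directory (the fairseq checkout).
import subprocess

for task in ["AS20K", "AS2M", "ESC50", "SPCv2"]:
    subprocess.run(["bash", f"EAT/scripts/finetuning_{task}.sh"], check=True)
```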
## Inference and Evaluation
For inference on a single AudioSet audio clip with fine-tuned models, you can use our EAT checkpoints fine-tuned on [AS-2M](https://drive.google.com/file/d/1F07zN8N54rXU-szvKUlYaCFMCepc4wHR/view?usp=sharing) (recommended) or [AS-20K](https://drive.google.com/file/d/1fRX_Mgj4sHxV2F6AVfoqXObfgzFMnHRA/view?usp=sharing), and then run the script `inference.sh`:
```bash
bash EAT/scripts/inference.sh
```
An example output is as follows:
```
# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL PREDICTION
Percussion 0.523
Drum kit 0.437
Vibraphone 0.420
Drum 0.316
Music 0.303
Snare drum 0.277
Glockenspiel 0.225
Marimba, xylophone 0.223
Cymbal 0.213
Bass drum 0.207
Hi-hat 0.196
Mallet percussion 0.170
**************************************************
```
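For intuition, the scores above behave like independent per-class probabilities (AudioSet is multi-label), so a top-k readout can be sketched as below. The label names and logits here are placeholders, not the actual inference code.

```python
# Sketch of a top-k readout for multi-label AudioSet predictions.
# `labels` and `logits` are placeholders; the real script builds them from
# the fine-tuned model and the AudioSet label map.
import torch

labels = ["Percussion", "Drum kit", "Vibraphone", "Drum"]  # placeholder names
logits = torch.tensor([2.1, 1.3, 0.9, -0.4])               # placeholder outputs

probs = torch.sigmoid(logits)        # independent per-class probabilities
top = probs.topk(k=3)                # analogue of top_k_prediction
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{labels[idx]:<16}{score:.3f}")
```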
For a comprehensive evaluation on the entire AudioSet evaluation set with fine-tuned EAT models, you can run the evaluation script `eval.sh`:
```bash
bash EAT/scripts/eval.sh
```
This script reports the mAP on the AudioSet evaluation set.
Per-class AP values are written to `./EAT/ap_log.txt`. You can also find our results for the fine-tuned EAT models on the AudioSet evaluation set under `./EAT/results`.
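If you want to post-process the per-class results, a parsing sketch is given below; it assumes each line of `ap_log.txt` holds a `class_name<TAB>AP` pair, which is an assumption about the file format rather than a documented guarantee.

```python
# Hypothetical post-processing of ./EAT/ap_log.txt.
# ASSUMPTION: each line holds "class_name<TAB>average_precision"; adjust the
# parsing if the actual file format differs.
per_class = []
with open("EAT/ap_log.txt") as f:
    for line in f:
        parts = line.strip().rsplit("\t", 1)
        if len(parts) != 2:
            continue                      # skip headers or malformed lines
        name, ap = parts
        per_class.append((name, float(ap)))

per_class.sort(key=lambda item: item[1], reverse=True)
print("Top 5 classes by AP   :", per_class[:5])
print("Bottom 5 classes by AP:", per_class[-5:])
```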
## Performance
Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification datasets, including AS-20K, AS-2M, ESC-50, and SPC-2.
## Efficiency
EAT reduces total pre-training time by roughly 15x compared to BEATs and roughly 10x compared to Audio-MAE, requiring only 10 epochs of pre-training on AS-2M.
## Experiment Logs
We log our experiments with [wandb](https://wandb.ai). We have published a short WandB report detailing the training process and performance metrics of the EAT model; you can view it [here](https://api.wandb.ai/links/wxc12/obqrpq36).

## TODO
- [x] release the final EAT large
- [x] update codes and checkpoints for friendly usage
- [ ] release the docker image

## Acknowledgement
Our codebase is based on the awesome [Audio-MAE](https://github.com/facebookresearch/AudioMAE) and [data2vec](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec) repos.

## Institutional Contributors
| Institution | Contribution |
|:------|:-----|
| [Shanghai Jiao Tong University](https://www.seiee.sjtu.edu.cn/) | Researchers; Computing power |
| [Peng Cheng Laboratory](https://data-starcloud.pcl.ac.cn/) | Researchers; Computing power |

## Citation
If you find our EAT codes and models useful, please cite the following paper:
```
@article{chen2024eat,
  title={EAT: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}
```