DCASE2024 Challenge Task 6 baseline system (Automated Audio Captioning)
- Host: GitHub
- URL: https://github.com/labbeti/dcase2024-task6-baseline
- Owner: Labbeti
- License: mit
- Created: 2024-01-30T13:29:32.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-04-19T10:01:10.000Z (9 months ago)
- Last Synced: 2024-10-07T13:41:46.093Z (3 months ago)
- Topics: audio-captioning, baseline, dcase2024
- Language: Python
- Homepage: https://dcase.community/challenge2024/task-automated-audio-captioning
- Size: 308 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
README
# dcase2024-task6-baseline
The main model is composed of a pretrained convolutional encoder to extract features and a transformer decoder to generate captions.
For more information, please refer to the corresponding [DCASE task page](https://dcase.community/challenge2024/task-automated-audio-captioning).

**This repository includes:**
- AAC model trained on the **Clotho** dataset
- Extract features using **ConvNeXt**
- System reaches **29.6% SPIDEr-FL** score on Clotho-eval (development-testing)
- Output detailed training characteristics (number of parameters, MACs, energy consumption...)

## Installation
First, you need to create an environment that contains **python>=3.11** and **pip**. You can use venv, conda, micromamba or another Python environment tool.

Here is an example with [micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html):
```bash
micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24
```

Then, you can clone this repository and install it:
```bash
git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install
```

You also need to install Java >= 1.8 and <= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable `AAC_METRICS_JAVA_PATH`.
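For example, if the required Java version is not the one on your default `PATH`, you can set the variable before running any command. The path below is only an illustration; adjust it to your machine:

```bash
# Hypothetical Java location; replace with the actual path to your Java 8-13 binary.
export AAC_METRICS_JAVA_PATH=/usr/lib/jvm/java-11-openjdk/bin/java
```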
## Usage
### Download external data, models and prepare
To download, extract and process data, you need to run:
```bash
dcase24t6-prepare
```
By default, the dataset is stored in the `./data` directory. It requires approximately 33GB of disk space.

### Train the default model
```bash
dcase24t6-train +expt=baseline
```

By default, the model and results are saved in the directory `./logs/SAVE_NAME`, where `SAVE_NAME` is the name of the script followed by the starting date.
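Since training is configured with Hydra, hyperparameters can also be overridden directly on the command line using the configuration keys listed in the hyperparameter table of the Model section below. A minimal sketch, assuming the standard Hydra override syntax (the values here are only examples):

```bash
# Train the baseline with a shorter schedule and a smaller learning rate (example values only).
dcase24t6-train +expt=baseline trainer.max_epochs=100 model.lr=1e-4
```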
Metrics are computed at the end of the training with the best checkpoint.

### Test a pretrained model
```bash
dcase24t6-test resume=./logs/SAVE_NAME
```
or specify each path separately:
```bash
dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json
```
You need to replace `SAVE_NAME` with the save directory name and `MODEL` with the checkpoint filename.

If you want to load and test the baseline pretrained weights, you can specify the baseline checkpoint path:
```bash
dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline
```

### Inference on a file
If you want to test the baseline model on a single file, you can use the `baseline_pipeline` function:

```python
import torch
from dcase24t6.nn.hub import baseline_pipeline

sr = 44100
audio = torch.rand(1, sr * 15)
model = baseline_pipeline()
item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]
print(candidate)
```

## Code overview
The source code extensively uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) for training and [Hydra](https://hydra.cc/) for configuration.
It is highly recommended to learn about them if you want to understand this code.

Installation has three main steps:
- Download external models ([ConvNeXt](https://github.com/topel/audioset-convnext-inf) to extract audio features)
- Download Clotho dataset using [aac-datasets](https://github.com/Labbeti/aac-datasets)
- Create HDF files containing each Clotho subset with preprocessed audio features using [torchoutil](https://github.com/Labbeti/torchoutil)

Training follows the standard way to create a model with Lightning:
- Initialize callbacks, tokenizer, datamodule, model.
- Start fitting the model on the specified datamodule.
- Evaluate the model using [aac-metrics](https://github.com/Labbeti/aac-metrics)

## Model
The model outperforms previous baselines with a SPIDEr-FL score of **29.6%** on the Clotho evaluation subset.
The captioning model architecture is described in [this paper](https://arxiv.org/pdf/2309.00454.pdf) and called **CNext-trans**. The encoder part (ConvNeXt) is described in more detail in [this paper](https://arxiv.org/pdf/2306.00830.pdf).

The pretrained weights of the AAC model are available on Zenodo: [ConvNeXt encoder (BL_AC)](https://zenodo.org/records/8020843), [Transformer decoder](https://zenodo.org/records/10849427). Both weights are automatically downloaded during `dcase24t6-prepare`.
### Main hyperparameters
| Hyperparameter | Value | Option |
| --- | --- | --- |
| Number of epochs | 400 | `trainer.max_epochs` |
| Batch size | 64 | `datamodule.batch_size` |
| Gradient accumulation | 8 | `trainer.accumulate_grad_batches` |
| Learning rate | 5e-4 | `model.lr` |
| Weight decay | 2 | `model.weight_decay` |
| Gradient clipping | 1 | `trainer.gradient_clip_val` |
| Beam size | 3 | `model.beam_size` |
| Model dimension size | 256 | `model.d_model` |
| Label smoothing | 0.2 | `model.label_smoothing` |
| Mixup alpha | 0.4 | `model.mixup_alpha` |

### Detailed results
| Metric | Score on Clotho-eval |
| --- | --- |
| BLEU-1 | 0.5948 |
| BLEU-2 | 0.3924 |
| BLEU-3 | 0.2603 |
| BLEU-4 | 0.1695 |
| METEOR | 0.1897 |
| ROUGE-L | 0.3927 |
| CIDEr-D | 0.4619 |
| SPICE | 0.1335 |
| SPIDEr | 0.2977 |
| SPIDEr-FL | 0.2962 |
| SBERT-sim | 0.5059 |
| FER | 0.0038 |
| FENSE | 0.5040 |
| BERTScore | 0.9766 |
| Vocabulary (words) | 551 |

Here is also an estimation of the number of parameters and multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":
| Name | Params (M) | MACs (G) |
| --- | --- | --- |
| Encoder | 29.4 | 44.4 |
| Decoder | 11.9 | 4.3 |
| Total | 41.3 | 48.8 |

## Tips
- **Modify the model**.
The model class is located in `src/dcase24t6/models/trans_decoder.py`. It is recommended to create another class and config to keep different model architectures separate.
The loss is computed in the method called `training_step`. You can also modify the model architecture in the method called `setup`.

- **Extract different audio features**.
For that, you can add a new pre-process function in `src/dcase24t6/pre_processes` and the related conf in `src/conf/pre_process`. Then, re-run `dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false` to create new HDF files with your own features.
To train a new model on these features, you can specify the required HDF files in `dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=...`. Depending on the features extracted, some model parameters may need to be modified to handle them.

- **Using as a package**.
If you do not want to use the entire codebase but only parts of it, you can install it as a package using:

```bash
pip install git+https://github.com/Labbeti/dcase2024-task6-baseline
```

Then you will be able to import any object from the code, for example `from dcase24t6.models.trans_decoder import TransDecoderModel`. There are also several important dependencies that you can install separately (see the example after this list):
- `aac-datasets` to download and load AAC datasets,
- `aac-metrics` to compute AAC metrics,
- `torchoutil[extras]` to pack datasets to HDF files.
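For instance, to pull in only these companion libraries without the rest of this repository (package names as listed above; pin versions as needed for your setup):

```bash
# Install only the companion libraries used by this baseline.
pip install aac-datasets aac-metrics "torchoutil[extras]"
```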
## Additional information

- The code was developed for **Ubuntu 20.04** and should work on more recent Ubuntu versions and other Linux-based distributions.
- The GPU used is an **NVIDIA GeForce RTX 2080 Ti** (11GB VRAM). Training lasts approximately 2h30m in the default setting.
- In this code, Clotho subsets are named according to the **Clotho convention**, not the DCASE convention. See more information [on this page](https://aac-datasets.readthedocs.io/en/stable/data_subsets.html#clotho).

## See also
- [DCASE2023 Audio Captioning baseline](https://github.com/felixgontier/dcase-2023-baseline)
- [DCASE2022 Audio Captioning baseline](https://github.com/felixgontier/dcase-2022-baseline)
- [DCASE2021 Audio Captioning baseline](https://github.com/audio-captioning/dcase-2021-baseline)
- [DCASE2020 Audio Captioning baseline](https://github.com/audio-captioning/dcase-2020-baseline)
- [aac-datasets](https://github.com/Labbeti/aac-datasets)
- [aac-metrics](https://github.com/Labbeti/aac-metrics)

## Contact
Maintainer:
- [Étienne Labbé](https://labbeti.github.io/) "Labbeti": [email protected]