{"id":15906827,"url":"https://github.com/labbeti/dcase2024-task6-baseline","last_synced_at":"2025-10-05T20:10:26.148Z","repository":{"id":230558671,"uuid":"750348126","full_name":"Labbeti/dcase2024-task6-baseline","owner":"Labbeti","description":"DCASE2024 Challenge Task 6 baseline system (Automated Audio Captioning)","archived":false,"fork":false,"pushed_at":"2024-04-19T10:01:10.000Z","size":315,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-21T00:43:11.260Z","etag":null,"topics":["audio-captioning","baseline","dcase2024"],"latest_commit_sha":null,"homepage":"https://dcase.community/challenge2024/task-automated-audio-captioning","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Labbeti.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-30T13:29:32.000Z","updated_at":"2024-11-21T09:08:09.000Z","dependencies_parsed_at":"2024-04-19T10:27:49.250Z","dependency_job_id":"2e5ef880-3002-415c-93c3-aaeec2a60b2f","html_url":"https://github.com/Labbeti/dcase2024-task6-baseline","commit_stats":null,"previous_names":["labbeti/dcase2024-task6-baseline"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fdcase2024-task6-baseline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fdcase2024-task6-baseline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fdcase2024-task6-baseline/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fdcase2024-task6-baseline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Labbeti","download_url":"https://codeload.github.com/Labbeti/dcase2024-task6-baseline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235534421,"owners_count":19005470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-captioning","baseline","dcase2024"],"created_at":"2024-10-06T13:41:52.170Z","updated_at":"2025-10-05T20:10:21.116Z","avatar_url":"https://github.com/Labbeti.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dcase2024-task6-baseline\n\n\u003cdiv align=\"center\"\u003e\n\n**DCASE2024 Challenge Task 6 baseline system of Automated Audio Captioning (AAC)**\n\n\u003ca href=\"https://www.python.org/\"\u003e\n    \u003cimg alt=\"Python\" src=\"https://img.shields.io/badge/-Python 3.11-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://pytorch.org/get-started/locally/\"\u003e\n    \u003cimg alt=\"PyTorch\" src=\"https://img.shields.io/badge/-PyTorch 2.2-ee4c2c?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://black.readthedocs.io/en/stable/\"\u003e\n    \u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge\u0026labelColor=gray\"\u003e\n\u003c/a\u003e\n\u003ca 
href=\"https://github.com/Labbeti/dcase2024-task6-baseline/actions\"\u003e\n    \u003cimg alt=\"Build\" src=\"https://img.shields.io/github/actions/workflow/status/Labbeti/dcase2024-task6-baseline/test.yaml?branch=main\u0026style=for-the-badge\u0026logo=github\"\u003e\n\u003c/a\u003e\n\n\u003c/div\u003e\n\nThe main model is composed of a pretrained convolutional encoder to extract features and a transformer decoder to generate captions.\nFor more information, please refer to the corresponding [DCASE task page](https://dcase.community/challenge2024/task-automated-audio-captioning).\n\n**This repository includes:**\n- AAC model trained on the **Clotho** dataset\n- Extract features using **ConvNeXt**\n- System reaches **29.6% SPIDEr-FL** score on Clotho-eval (development-testing)\n- Output detailed training characteristics (number of parameters, MACs, energy consumption...)\n\n\n## Installation\nFirst, you need to create an environment that contains **python\u003e=3.11** and **pip**. You can use venv, conda, micromamba or another Python environment tool.\n\nHere is an example with [micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html):\n```bash\nmicromamba env create -n env_dcase24 python=3.11 pip -c defaults\nmicromamba activate env_dcase24\n```\n\nThen, you can clone this repository and install it:\n```bash\ngit clone https://github.com/Labbeti/dcase2024-task6-baseline\ncd dcase2024-task6-baseline\npip install -e .\npre-commit install\n```\n\nYou also need to install Java \u003e= 1.8 and \u003c= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable `AAC_METRICS_JAVA_PATH`.\n\n\n## Usage\n\n### Download external data, models and prepare\n\nTo download, extract and process data, you need to run:\n```bash\ndcase24t6-prepare\n```\nBy default, the dataset is stored in the `./data` directory. 
It requires approximately 33 GB of disk space.\n\n### Train the default model\n\n```bash\ndcase24t6-train +expt=baseline\n```\n\nBy default, the model and results are saved in the directory `./logs/SAVE_NAME`. `SAVE_NAME` is the name of the script with the starting date.\nMetrics are computed at the end of the training with the best checkpoint.\n\n### Test a pretrained model\n\n```bash\ndcase24t6-test resume=./logs/SAVE_NAME\n```\nor specify each path separately:\n```bash\ndcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json\n```\nYou need to replace `SAVE_NAME` with the save directory name and `MODEL` with the checkpoint filename.\n\nIf you want to load and test the baseline pretrained weights, you can specify the baseline checkpoint weights:\n\n```bash\ndcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline\n```\n\n### Inference on a file\nIf you want to test the baseline model on a single file, you can use the `baseline_pipeline` function:\n\n```python\nimport torch\n\nfrom dcase24t6.nn.hub import baseline_pipeline\n\n# 15 seconds of random noise used as a placeholder for a real waveform\nsr = 44100\naudio = torch.rand(1, sr * 15)\n\nmodel = baseline_pipeline()\nitem = {\"audio\": audio, \"sr\": sr}\noutputs = model(item)\ncandidate = outputs[\"candidates\"][0]\n\nprint(candidate)\n```\n\n## Code overview\nThe source code uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) extensively for training and [Hydra](https://hydra.cc/) for configuration.\nIt is highly recommended to learn about them if you want to understand this code.\n\nInstallation has three main steps:\n- Download external models ([ConvNeXt](https://github.com/topel/audioset-convnext-inf) to extract audio features)\n- Download the Clotho dataset using [aac-datasets](https://github.com/Labbeti/aac-datasets)\n- Create HDF files containing each Clotho subset with preprocessed audio features using [torchoutil](https://github.com/Labbeti/torchoutil)\n\nTraining follows the 
standard way to create a model with lightning:\n- Initialize callbacks, tokenizer, datamodule, model.\n- Start fitting the model on the specified datamodule.\n- Evaluate the model using [aac-metrics](https://github.com/Labbeti/aac-metrics)\n\n\n## Model\nThe model outperforms previous baselines with a SPIDEr-FL score of **29.6%** on the Clotho evaluation subset.\nThe captioning model architecture is described in [this paper](https://arxiv.org/pdf/2309.00454.pdf) and called **CNext-trans**. The encoder part (ConvNeXt) is described in more detail in [this paper](https://arxiv.org/pdf/2306.00830.pdf).\n\nThe pretrained weights of the AAC model are available on Zenodo: [ConvNeXt encoder (BL_AC)](https://zenodo.org/records/8020843), [Transformer decoder](https://zenodo.org/records/10849427). Both weights are automatically downloaded during `dcase24t6-prepare`.\n\n### Main hyperparameters\n\n| Hyperparameter | Value | Option |\n| --- | --- | --- |\n| Number of epochs | 400 | `trainer.max_epochs` |\n| Batch size | 64 | `datamodule.batch_size` |\n| Gradient accumulation | 8 | `trainer.accumulate_grad_batches` |\n| Learning rate | 5e-4 | `model.lr` |\n| Weight decay | 2 | `model.weight_decay` |\n| Gradient clipping | 1 | `trainer.gradient_clip_val` |\n| Beam size | 3 | `model.beam_size` |\n| Model dimension size | 256 | `model.d_model` |\n| Label smoothing | 0.2 | `model.label_smoothing` |\n| Mixup alpha | 0.4 | `model.mixup_alpha` |\n\n\n### Detailed results\n\n| Metric | Score on Clotho-eval |\n| --- | --- |\n| BLEU-1 | 0.5948 |\n| BLEU-2 | 0.3924 |\n| BLEU-3 | 0.2603 |\n| BLEU-4 | 0.1695 |\n| METEOR | 0.1897 |\n| ROUGE-L | 0.3927 |\n| CIDEr-D | 0.4619 |\n| SPICE | 0.1335 |\n| SPIDEr | 0.2977 |\n| SPIDEr-FL | 0.2962 |\n| SBERT-sim | 0.5059 |\n| FER | 0.0038 |\n| FENSE | 0.5040 |\n| BERTScore | 0.9766 |\n| Vocabulary (words) | 551 |\n\nHere is also an estimation of the number of parameters and multiply-accumulate operations (MACs) during inference for the audio file 
\"Santa Motor.wav\":\n\n\u003c!--\n# encoder:\nflops: 89724036608\nmacs: 44757425184\nparams: 29388303\nduration: 0.030155420303344727\n\n# decoder:\nforcing_flops: 471009792\nforcing_macs: 235300608\nforcing_params: 11911699\nforcing_duration: 0.016583681106567383\ngenerate_flops: 5589742080\ngenerate_macs: 2793307392\ngenerate_params: 11911699\ngenerate_duration: 0.14899301528930664\n--\u003e\n\n| Name | Params (M) | MACs (G) |\n| --- | --- | --- |\n| Encoder | 29.4 | 44.4 |\n| Decoder | 11.9 | 4.3 |\n| Total | 41.3 | 48.8 |\n\n## Tips\n- **Modify the model**.\nThe model class is located in `src/dcase24t6/models/trans_decoder.py`. It is recommended to create another class and conf to keep different model architectures.\nThe loss is computed in the method called `training_step`. You can also modify the model architecture in the method called `setup`.\n\n- **Extract different audio features**.\nFor that, you can add a new pre-process function in `src/dcase24t6/pre_processes` and the related conf in `src/conf/pre_process`. Then, re-run `dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false` to create new HDF files with your own features.\nTo train a new model on these features, you can specify the HDF files required in `dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=...`. Depending on the features extracted, some parameters could be modified in the model to handle them.\n\n- **Using as a package**.\nIf you do not want to use the entire codebase but only parts of it, you can install it as a package using:\n\n```bash\npip install git+https://github.com/Labbeti/dcase2024-task6-baseline\n```\n\nThen you will be able to import any object from the code, for example `from dcase24t6.models.trans_decoder import TransDecoderModel`. 
There are also several important dependencies that you can install separately:\n\n- `aac-datasets` to download and load AAC datasets,\n- `aac-metrics` to compute AAC metrics,\n- `torchoutil[extras]` to pack datasets to HDF files.\n\n\n## Additional information\n- The code has been made for **Ubuntu 20.04** and should work on more recent Ubuntu versions and Linux-based distributions.\n- The GPU used is **NVIDIA GeForce RTX 2080 Ti** (11GB VRAM). Training lasts for approximately 2h30m in the default setting.\n- In this code, Clotho subsets are named according to the **Clotho convention**, not the DCASE convention. See more information [on this page](https://aac-datasets.readthedocs.io/en/stable/data_subsets.html#clotho).\n\n\n## See also\n- [DCASE2023 Audio Captioning baseline](https://github.com/felixgontier/dcase-2023-baseline)\n- [DCASE2022 Audio Captioning baseline](https://github.com/felixgontier/dcase-2022-baseline)\n- [DCASE2021 Audio Captioning baseline](https://github.com/audio-captioning/dcase-2021-baseline)\n- [DCASE2020 Audio Captioning baseline](https://github.com/audio-captioning/dcase-2020-baseline)\n- [aac-datasets](https://github.com/Labbeti/aac-datasets)\n- [aac-metrics](https://github.com/Labbeti/aac-metrics)\n\n\n## Contact\nMaintainer:\n- [Étienne Labbé](https://labbeti.github.io/) \"Labbeti\": labbeti.pub@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Fdcase2024-task6-baseline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabbeti%2Fdcase2024-task6-baseline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Fdcase2024-task6-baseline/lists"}