{"id":15899395,"url":"https://github.com/labbeti/conette-audio-captioning","last_synced_at":"2025-07-14T05:36:05.514Z","repository":{"id":186702865,"uuid":"670248035","full_name":"Labbeti/conette-audio-captioning","owner":"Labbeti","description":"CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding","archived":false,"fork":false,"pushed_at":"2024-12-13T13:47:01.000Z","size":5399,"stargazers_count":19,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-10T23:40:51.199Z","etag":null,"topics":["audio-captioning","automated-audio-captioning"],"latest_commit_sha":null,"homepage":"https://ieeexplore.ieee.org/document/10603439","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Labbeti.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-24T16:04:05.000Z","updated_at":"2025-05-31T17:32:38.000Z","dependencies_parsed_at":"2023-11-07T14:26:05.872Z","dependency_job_id":"34f77b34-bd25-4c8e-b4d8-451bb93cceac","html_url":"https://github.com/Labbeti/conette-audio-captioning","commit_stats":null,"previous_names":["labbeti/conette-audio-captioning"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/Labbeti/conette-audio-captioning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fconette-audio-captioning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fconette-audio-captioning/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fconette-audio-captioning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fconette-audio-captioning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Labbeti","download_url":"https://codeload.github.com/Labbeti/conette-audio-captioning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Fconette-audio-captioning/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265246023,"owners_count":23734109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-captioning","automated-audio-captioning"],"created_at":"2024-10-06T10:20:57.508Z","updated_at":"2025-07-14T05:36:05.474Z","avatar_url":"https://github.com/Labbeti.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# CoNeTTE model for Audio Captioning\n\n[![](\u003chttps://img.shields.io/badge/-Python 3.10+-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\u003e)](https://www.python.org/)\n[![](\u003chttps://img.shields.io/badge/-PyTorch 
1.10.1+-ee4c2c?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\u003e)](https://pytorch.org/get-started/locally/)\n[![](https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge\u0026labelColor=gray)](https://black.readthedocs.io/en/stable/)\n[![](https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/inference.yaml?branch=main\u0026style=for-the-badge\u0026logo=github)](https://github.com/Labbeti/conette-audio-captioning/actions)\n\n\u003c/div\u003e\n\nCoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file. The architecture and training are explained in the [corresponding paper on IEEE](https://ieeexplore.ieee.org/document/10603439) (you can also find an older pre-print version on [arXiv here](https://arxiv.org/pdf/2309.00454.pdf)). The model has been developed by me ([Étienne Labbé](https://labbeti.github.io/) :) ) during my PhD. A simple interface to test CoNeTTE is available on the [HuggingFace website](https://huggingface.co/spaces/Labbeti/conette).\n\n## Training\n### Requirements\n- Intended for Ubuntu 20.04 only. Requires **java** \u003c 1.13, **ffmpeg**, **yt-dlp**, and **zip** commands.\n- Recommended GPU: NVIDIA V100 with 32GB VRAM.\n- The WavCaps dataset might require more than 2 TB of disk storage. Other datasets require less than 50 GB.\n\n### Installation\nBy default, **only the pip inference requirements are installed for conette**. 
To install training requirements you need to use the following command:\n```bash\npython -m pip install conette[train]\n```\nIf you already installed conette for inference, it is **highly recommended to create another environment** before installing conette for training.\n\n### Download external models and data\nThese steps might take a while (a few hours to download and prepare everything depending on your CPU, GPU and SSD/HDD).\n\nFirst, download the ConvNeXt, NLTK and spacy models:\n```bash\nconette-prepare data=none default=true pack_to_hdf=false csum_in_hdf_name=false pann=false\n```\n\nThen download the 4 datasets used to train CoNeTTE:\n```bash\ncommon_args=\"data.download=true pack_to_hdf=true audio_t=resample_mean_convnext audio_t.pretrain_path=cnext_bl_75 post_hdf_name=bl pretag=cnext_bl_75\"\n\nconette-prepare data=audiocaps audio_t.src_sr=32000 ${common_args}\nconette-prepare data=clotho audio_t.src_sr=44100 ${common_args}\nconette-prepare data=macs audio_t.src_sr=48000 ${common_args}\nconette-prepare data=wavcaps audio_t.src_sr=32000 ${common_args} datafilter.min_audio_size=0.1 datafilter.max_audio_size=30.0 datafilter.sr=32000\n```\n\n### Train a model\nCNext-trans (baseline) on CL only (~3 hours on 1 GPU V100-32G)\n```bash\nconette-train expt=[clotho_cnext_bl] pl=baseline\n```\n\nCoNeTTE on AC+CL+MA+WC, specialized for CL (~4 hours on 1 GPU V100-32G)\n```bash\nconette-train expt=[camw_cnext_bl_for_c,task_ds_src_camw] pl=conette\n```\n\nCoNeTTE on AC+CL+MA+WC, specialized for AC (~3 hours on 1 GPU V100-32G)\n```bash\nconette-train expt=[camw_cnext_bl_for_a,task_ds_src_camw] pl=conette\n```\n\nNote 1: any training using AC data cannot be exactly reproduced because part of this data has been deleted from the YouTube source, and I cannot share my own audio files.\nNote 2: paper results are averaged scores over 5 seeds (1234-1238). 
The default training only uses seed 1234.\n\n## Inference only (without training)\n\n### Installation\n```bash\npython -m pip install conette[test]\n```\n\n### Usage with command line\nSimply use the command `conette-predict` with `--audio PATH1 PATH2 ...` option. You can also export results to a CSV file using `--csv_export PATH`.\n\n```bash\nconette-predict --audio \"/your/path/to/audio.wav\"\n```\n\n### Usage with python\n```py\nfrom conette import CoNeTTEConfig, CoNeTTEModel\n\nconfig = CoNeTTEConfig.from_pretrained(\"Labbeti/conette\")\nmodel = CoNeTTEModel.from_pretrained(\"Labbeti/conette\", config=config)\n\npath = \"/your/path/to/audio.wav\"\noutputs = model(path)\ncandidate = outputs[\"cands\"][0]\nprint(candidate)\n```\n\nThe model can also accept several audio files at the same time (list[str]), or a list of pre-loaded audio files (list[Tensor]). In this second case you also need to provide the sampling rate of these files:\n\n```py\nimport torchaudio\n\npath_1 = \"/your/path/to/audio_1.wav\"\npath_2 = \"/your/path/to/audio_2.wav\"\n\naudio_1, sr_1 = torchaudio.load(path_1)\naudio_2, sr_2 = torchaudio.load(path_2)\n\noutputs = model([audio_1, audio_2], sr=[sr_1, sr_2])\ncandidates = outputs[\"cands\"]\nprint(candidates)\n```\n\nThe model can also produce different captions using a Task Embedding input which indicates the dataset caption style. The default task is \"clotho\".\n\n```py\noutputs = model(path, task=\"clotho\")\ncandidate = outputs[\"cands\"][0]\nprint(candidate)\n\noutputs = model(path, task=\"audiocaps\")\ncandidate = outputs[\"cands\"][0]\nprint(candidate)\n```\n\n### Performance\nThe model has been trained on AudioCaps (AC), Clotho (CL), MACS (MA) and WavCaps (WC). 
The performance on the test subsets is:\n\n| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| AC-test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_audiocaps_test.yaml) |\n| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_clotho_eval.yaml) |\n\nThis model checkpoint has been trained with a focus on the Clotho dataset, but it can also reach good performance on AudioCaps with the \"audiocaps\" task.\n\n### Limitations\n- The model expects audio sampled at **32 kHz**. The model automatically resamples input audio files up or down to this rate. However, this might give worse results, especially when using audio with lower sampling rates.\n- The model has been trained on audio lasting from **1 to 30 seconds**. It can handle longer audio files, but it might require more memory and give worse results.\n\n## Citation\nThe final version of the paper describing CoNeTTE is available on IEEExplore: https://ieeexplore.ieee.org/document/10603439. 
A preprint version of the paper is also available on arXiv: https://arxiv.org/pdf/2309.00454.pdf.\n\n**Final version recommended for citation (IEEE):**\n```bibtex\n@article{labbe2023conetteieee,\n\ttitle        = {CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding},\n\tauthor       = {Labbé, Étienne and Pellegrini, Thomas and Pinquier, Julien},\n\tyear         = 2024,\n\tjournal      = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},\n\tvolume       = 32,\n\tnumber       = {},\n\tpages        = {3785--3794},\n\tdoi          = {10.1109/TASLP.2024.3430813},\n\turl          = {https://ieeexplore.ieee.org/document/10603439},\n\tkeywords     = {Decoding;Task analysis;Transformers;Training;Convolutional neural networks;Speech processing;Tagging;Audio-language task;automated audio captioning;dataset biases;task embedding;deep learning}\n}\n```\n\n**Preprint version (arXiv):**\n```bibtex\n@misc{labbe2023conettearxiv,\n\ttitle        = {CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding},\n\tauthor       = {Étienne Labbé and Thomas Pellegrini and Julien Pinquier},\n\tyear         = 2023,\n\tjournal      = {arXiv preprint arXiv:2309.00454},\n\turl          = {https://arxiv.org/pdf/2309.00454.pdf},\n\teprint       = {2309.00454},\n\tarchiveprefix = {arXiv},\n\tprimaryclass = {cs.SD}\n}\n```\n\n## Additional information\n- CoNeTTE stands for **Co**nv**Ne**Xt-**T**ransformer with **T**ask **E**mbedding.\n- Raw model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette\n- The weights of the encoder part of the architecture are based on a ConvNeXt model for audio classification, available here: https://zenodo.org/records/10987498 under the filename \"convnext_tiny_465mAP_BL_AC_75kit.pth\".\n\n## Contact\nMaintainer:\n- [Étienne Labbé](https://labbeti.github.io/) \"Labbeti\": 
labbeti.pub@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Fconette-audio-captioning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabbeti%2Fconette-audio-captioning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Fconette-audio-captioning/lists"}