{"id":14958194,"url":"https://github.com/labbeti/aac-datasets","last_synced_at":"2025-10-06T16:30:31.993Z","repository":{"id":41387153,"uuid":"493979158","full_name":"Labbeti/aac-datasets","owner":"Labbeti","description":"Audio Captioning datasets for PyTorch.","archived":false,"fork":false,"pushed_at":"2023-12-18T12:50:16.000Z","size":2502,"stargazers_count":66,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2023-12-18T13:54:51.190Z","etag":null,"topics":["audio","audio-captioning","caption","captioning","dataset","datasets","deep-learning","pytorch"],"latest_commit_sha":null,"homepage":"https://aac-datasets.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Labbeti.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-05-19T08:10:02.000Z","updated_at":"2023-12-20T15:53:21.820Z","dependencies_parsed_at":"2023-02-16T08:01:48.682Z","dependency_job_id":"6ed8623d-957a-4949-8019-09ce069098b5","html_url":"https://github.com/Labbeti/aac-datasets","commit_stats":null,"previous_names":[],"tags_count":9,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Faac-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Faac-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Faac-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labbeti%2Faac-datasets/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/Labbeti","download_url":"https://codeload.github.com/Labbeti/aac-datasets/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235534414,"owners_count":19005469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","audio-captioning","caption","captioning","dataset","datasets","deep-learning","pytorch"],"created_at":"2024-09-24T13:16:26.999Z","updated_at":"2025-10-06T16:30:31.988Z","avatar_url":"https://github.com/Labbeti.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- # -*- coding: utf-8 -*- --\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n# Audio Captioning datasets for PyTorch\n\n\u003ca href=\"https://www.python.org/\"\u003e\u003cimg alt=\"Python\" src=\"https://img.shields.io/badge/-Python 3.8+-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pytorch.org/get-started/locally/\"\u003e\u003cimg alt=\"PyTorch\" src=\"https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://black.readthedocs.io/en/stable/\"\u003e\u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge\u0026labelColor=gray\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/Labbeti/aac-datasets/actions\"\u003e\u003cimg alt=\"Build\" 
src=\"https://img.shields.io/github/actions/workflow/status/Labbeti/aac-datasets/python-package-pip.yaml?branch=main\u0026style=for-the-badge\u0026logo=github\"\u003e\u003c/a\u003e\n\u003ca href='https://aac-datasets.readthedocs.io/en/stable/?badge=stable'\u003e\n    \u003cimg src='https://readthedocs.org/projects/aac-datasets/badge/?version=stable\u0026style=for-the-badge' alt='Documentation Status' /\u003e\n\u003c/a\u003e\n\nUnofficial source code for the audio captioning datasets **AudioCaps** [[1]](#audiocaps), **Clotho** [[2]](#clotho), **MACS** [[3]](#macs), and **WavCaps** [[4]](#wavcaps), designed for PyTorch.\n\n\u003c/div\u003e\n\n## Installation\n```bash\npip install aac-datasets\n```\n\nTo check that the package is installed and see its version, use this command:\n```bash\naac-datasets-info\n```\n\n## Examples\n\n### Create Clotho dataset\n\n```python\nfrom aac_datasets import Clotho\n\ndataset = Clotho(root=\".\", download=True)\nitem = dataset[0]\naudio, captions = item[\"audio\"], item[\"captions\"]\n# audio: Tensor of shape (n_channels=1, audio_max_size)\n# captions: list of str\n```\n\n### Build PyTorch dataloader with Clotho\n\n```python\nfrom torch.utils.data.dataloader import DataLoader\nfrom aac_datasets import Clotho\nfrom aac_datasets.utils.collate import BasicCollate\n\ndataset = Clotho(root=\".\", download=True)\ndataloader = DataLoader(dataset, batch_size=4, collate_fn=BasicCollate())\n\nfor batch in dataloader:\n    # batch[\"audio\"]: list of 4 tensors of shape (n_channels, audio_size)\n    # batch[\"captions\"]: list of 4 lists of str\n    ...\n```\n\n## Download datasets\nTo download a dataset, you can use the `download` argument when constructing the dataset:\n```python\ndataset = Clotho(root=\".\", subset=\"dev\", download=True)\n```\nHowever, if you want to download datasets from a script, you can also use the following command:\n```bash\naac-datasets-download --root \".\" clotho --subsets \"dev\"\n```\n\n## Datasets 
information\nThe `aac-datasets` package contains 4 different datasets:\n\n\u003c!-- | | AudioCaps | Clotho | MACS | WavCaps |\n|:---:|:---:|:---:|:---:|:---:|\n| Subsets | `train`, `val`, `test` | `dev`, `val`, `eval`, `dcase_aac_test`, `dcase_aac_analysis`, `dcase_t2a_audio`, `dcase_t2a_captions` | `full` | `as`, `as_noac`, `bbc`, `fsd`, `fsd_nocl`, `sb` |\n| Sample rate (kHz) | 32 | 44.1 | 48 | 32 |\n| Estimated size (GB) | 43 | 53 | 13 | 941 |\n| Audio source | AudioSet | Freesound | TAU Urban Acoustic Scenes 2019 | AudioSet, BBC Sound Effects, Freesound, SoundBible | --\u003e\n\n| Dataset | Sampling\u003cbr\u003erate (kHz) | Estimated\u003cbr\u003esize (GB) | Source | Subsets |\n|:---:|:---:|:---:|:---:|:---:|\n| AudioCaps | 32 | 43 | AudioSet | `train`\u003cbr\u003e`val`\u003cbr\u003e`test`\u003cbr\u003e`train_fixed` |\n| Clotho | 44.1 | 53  | Freesound | `dev`\u003cbr\u003e`val`\u003cbr\u003e`eval`\u003cbr\u003e`dcase_aac_test`\u003cbr\u003e`dcase_aac_analysis`\u003cbr\u003e`dcase_t2a_audio`\u003cbr\u003e`dcase_t2a_captions` |\n| MACS | 48 | 13 | TAU Urban Acoustic Scenes 2019 | `full` |\n| WavCaps | 32 | 941 | AudioSet\u003cbr\u003eBBC Sound Effects\u003cbr\u003eFreesound\u003cbr\u003eSoundBible | `audioset`\u003cbr\u003e`audioset_no_audiocaps_v1`\u003cbr\u003e`bbc`\u003cbr\u003e`freesound`\u003cbr\u003e`freesound_no_clotho_v2`\u003cbr\u003e`soundbible` |\n\nFor Clotho, the **dev** subset should be used for training, **val** for validation, and **eval** for testing.\n\nHere are additional statistics on the training subsets of AudioCaps (v1), Clotho (v2.1), MACS and WavCaps:\n\n| | AudioCaps/train | Clotho/dev | MACS/full | WavCaps/full |\n|:---:|:---:|:---:|:---:|:---:|\n| Nb audios | 49,838 | 3,840 | 3,930 | 403,050 |\n| Total audio duration (h) | 136.6\u003csup\u003e1\u003c/sup\u003e | 24.0 | 10.9 | 7563.3 |\n| Audio duration range (s) | 0.5-10 | 15-30 | 10 | 1-67,109 |\n| Nb captions per audio | 1 | 5 | 2-5 | 1 |\n| Nb captions | 49,838 | 19,195 | 17,275 | 403,050 |\n| 
Total nb words\u003csup\u003e2\u003c/sup\u003e | 402,482 | 217,362 | 160,006 | 3,161,823 |\n| Sentence size\u003csup\u003e2\u003c/sup\u003e | 2-52 | 8-20 | 5-40 | 2-38 |\n| Vocabulary\u003csup\u003e2\u003c/sup\u003e | 4,724 | 4,369 | 2,721 | 24,600 |\n| Annotated by | Human | Human | Human | Machine |\n| Corrected by | Human | Human | None | None |\n\n\u003csup\u003e1\u003c/sup\u003e This duration is extrapolated from the 126.7 h total duration of 46,230 of the 49,838 files.\n\n\u003csup\u003e2\u003c/sup\u003e The sentences are cleaned (lowercased, punctuation removed) and tokenized using the spaCy tokenizer to count the words.\n\n## Requirements\n\nThis package has been developed for Ubuntu 20.04, and it is expected to work on most Linux-based distributions.\nIt has been tested with Python versions 3.7 and 3.13.\n\n### Python packages\n\nPython requirements are automatically installed when using pip on this repository.\n```\ntorch \u003e= 1.10.1\ntorchaudio \u003e= 0.10.1\npy7zr \u003e= 0.17.2\npyyaml \u003e= 6.0\ntqdm \u003e= 4.64.0\nhuggingface-hub \u003e= 0.15.1\nnumpy \u003e= 1.21.2\n```\n\n### External requirements (AudioCaps only)\n\nThe external requirements needed to download **AudioCaps** are **ffmpeg** and **yt-dlp**.\n**ffmpeg** can be installed on Ubuntu using `sudo apt install ffmpeg` and **yt-dlp** from the [official repo](https://github.com/yt-dlp/yt-dlp).\n\nYou can also override their paths for AudioCaps:\n```python\nfrom aac_datasets import AudioCaps\ndataset = AudioCaps(\n    download=True,\n    ffmpeg_path=\"/my/path/to/ffmpeg\",\n    ytdl_path=\"/my/path/to/ytdlp\",\n)\n```\n\nSince YouTube prevents bots from downloading videos, you might want to use the `ytdlp_opts` argument to pass cookies and work around failed downloads, e.g. `AudioCaps(ytdlp_opts=[\"--cookies-from-browser\", \"firefox\"])`. 
See more information [in the yt-dlp documentation](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp).\n\n## Additional information\n### Compatibility with audiocaps-download\nIf you want to use the [audiocaps-download 1.0](https://github.com/MorenoLaQuatra/audiocaps-download) package to download AudioCaps (v1 only), you will have to respect the AudioCaps folder tree:\n```python\nfrom audiocaps_download import Downloader\nroot = \"your/path/to/root\"\ndownloader = Downloader(root_path=f\"{root}/AUDIOCAPS/audio_32000Hz/\", n_jobs=16)\ndownloader.download(format=\"wav\")\n```\n\nThen disable audio download and set the correct audio format when initializing AudioCaps:\n```python\nfrom aac_datasets import AudioCaps\ndataset = AudioCaps(\n    root=root,\n    subset=\"train\",\n    download=True,\n    audio_format=\"wav\",\n    download_audio=False,  # this will only download labels and metadata files\n)\n```\n\n## References\n#### AudioCaps\n[1] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019. Available: https://aclanthology.org/N19-1011/\n\n#### Clotho\n[2] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” arXiv:1910.09387 [cs, eess], Oct. 2019. Available: http://arxiv.org/abs/1910.09387\n\n#### MACS\n[3] F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, and B. Elizalde, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021). Barcelona, Spain: Music Technology Group - Universitat Pompeu Fabra, Nov. 2021. Available: https://doi.org/10.5281/zenodo.5770113\n\n#### WavCaps\n[4] X. Mei et al., “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research,” arXiv preprint arXiv:2303.17395, 2023. Available: https://arxiv.org/pdf/2303.17395.pdf\n\n## Cite the aac-datasets package\nIf you use this software, please consider citing it as \"Labbe, E. 
(2025). aac-datasets: Audio Captioning datasets for PyTorch.\", or use the following BibTeX citation:\n\n```\n@software{\n    Labbe_aac_datasets_2025,\n    author = {Labbé, Étienne},\n    license = {MIT},\n    month = {07},\n    title = {{aac-datasets}},\n    url = {https://github.com/Labbeti/aac-datasets/},\n    version = {0.7.0},\n    year = {2025}\n}\n```\n\n## See also\n- [AudioCaps official data repository](https://github.com/cdjkim/audiocaps/tree/master)\n- [Clotho official data repository](https://zenodo.org/records/4783391)\n- [MACS official data repository](https://zenodo.org/records/5114771)\n- [WavCaps official data repository](https://huggingface.co/datasets/cvssp/WavCaps)\n\n## Contact\nMaintainer:\n- [Étienne Labbé](https://labbeti.github.io/) \"Labbeti\": labbeti.pub@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Faac-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabbeti%2Faac-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabbeti%2Faac-datasets/lists"}