Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/labbeti/aac-datasets
Audio Captioning datasets for PyTorch.
- Host: GitHub
- URL: https://github.com/labbeti/aac-datasets
- Owner: Labbeti
- License: mit
- Created: 2022-05-19T08:10:02.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-18T12:50:16.000Z (11 months ago)
- Last Synced: 2023-12-18T13:54:51.190Z (11 months ago)
- Topics: audio, audio-captioning, caption, captioning, dataset, datasets, deep-learning, pytorch
- Language: Python
- Homepage: https://aac-datasets.readthedocs.io/
- Size: 2.39 MB
- Stars: 66
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
README
# Audio Captioning datasets for PyTorch
Audio Captioning unofficial datasets source code for **AudioCaps** [[1]](#audiocaps), **Clotho** [[2]](#clotho), **MACS** [[3]](#macs), and **WavCaps** [[4]](#wavcaps), designed for PyTorch.
## Installation
```bash
pip install aac-datasets
```

If you want to check whether the package is installed and which version is available, you can use this command:
```bash
aac-datasets-info
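# Alternatively, pip itself can report the installed version (generic pip
# usage, not a command provided by this package):
pip show aac-datasets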
```

## Examples
### Create Clotho dataset
```python
from aac_datasets import Clotho

dataset = Clotho(root=".", download=True)
item = dataset[0]
audio, captions = item["audio"], item["captions"]
# audio: Tensor of shape (n_channels=1, audio_max_size)
# captions: list of str
```

### Build PyTorch dataloader with Clotho
```python
from torch.utils.data.dataloader import DataLoader
from aac_datasets import Clotho
from aac_datasets.utils import BasicCollate

dataset = Clotho(root=".", download=True)
dataloader = DataLoader(dataset, batch_size=4, collate_fn=BasicCollate())

for batch in dataloader:
    # batch["audio"]: list of 4 tensors of shape (n_channels, audio_size)
    # batch["captions"]: list of 4 lists of str
    ...
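    # Sketch (not part of the original example): BasicCollate returns the
    # audio as a list of variable-length tensors, so they can be padded into
    # a single batch tensor if a model expects one.
    from torch.nn.utils.rnn import pad_sequence

    audio_batch = pad_sequence(
        [audio.transpose(0, 1) for audio in batch["audio"]],
        batch_first=True,
    )
    # audio_batch: Tensor of shape (batch_size=4, max_audio_size, n_channels)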
```

## Download datasets
To download a dataset, you can use the `download` argument in the dataset constructor:
```python
dataset = Clotho(root=".", subset="dev", download=True)
```
However, if you want to download datasets from a script, you can also use the following command:
```bash
aac-datasets-download --root "." clotho --subsets "dev"
```

## Datasets information
The `aac-datasets` package contains 4 different datasets:

| Dataset | Sampling rate (kHz) | Estimated size (GB) | Source | Subsets |
|:---:|:---:|:---:|:---:|:---:|
| AudioCaps | 32 | 43 | AudioSet | `train`, `val`, `test`, `train_v2` |
| Clotho | 44.1 | 53 | Freesound | `dev`, `val`, `eval`, `dcase_aac_test`, `dcase_aac_analysis`, `dcase_t2a_audio`, `dcase_t2a_captions` |
| MACS | 48 | 13 | TAU Urban Acoustic Scenes 2019 | `full` |
| WavCaps | 32 | 941 | AudioSet, BBC Sound Effects, Freesound, SoundBible | `audioset`, `audioset_no_audiocaps`, `bbc`, `freesound`, `freesound_no_clotho`, `freesound_no_clotho_v2`, `soundbible` |

For Clotho, the **dev** subset should be used for training, val for validation and eval for testing.
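As a quick illustration of the Clotho subsets listed above, the standard splits can be created with the constructor shown earlier (a minimal sketch):

```python
from aac_datasets import Clotho

# Clotho naming: "dev" is the training split, "val" the validation split
# and "eval" the test split.
train_ds = Clotho(root=".", subset="dev", download=True)
val_ds = Clotho(root=".", subset="val", download=True)
test_ds = Clotho(root=".", subset="eval", download=True)
```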
Here are additional statistics on the train subset for AudioCaps, Clotho, MACS and WavCaps:
| | AudioCaps/train | Clotho/dev | MACS/full | WavCaps/full |
|:---:|:---:|:---:|:---:|:---:|
| Nb audios | 49,838 | 3,840 | 3,930 | 403,050 |
| Total audio duration (h) | 136.61¹ | 24.0 | 10.9 | 7563.3 |
| Audio duration range (s) | 0.5-10 | 15-30 | 10 | 1-67,109 |
| Nb captions per audio | 1 | 5 | 2-5 | 1 |
| Nb captions | 49,838 | 19,195 | 17,275 | 403,050 |
| Total nb words² | 402,482 | 217,362 | 160,006 | 3,161,823 |
| Sentence size² | 2-52 | 8-20 | 5-40 | 2-38 |
| Vocabulary² | 4724 | 4369 | 2721 | 24,600 |
| Annotated by | Human | Human | Human | Machine |
| Corrected by | Human | Human | None | None |

¹ This duration is estimated from the total duration of 46,230/49,838 files (126.7 h).
² The sentences are cleaned (lowercased, punctuation removed) and tokenized with the spaCy tokenizer to count the words.
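For reference, a minimal sketch of the cleaning and tokenization described in footnote ² might look like the following (the exact preprocessing used to compute the statistics above may differ):

```python
import string

import spacy

# A blank English pipeline only provides the tokenizer, which is all that is
# needed to count words.
nlp = spacy.blank("en")

def count_words(caption: str) -> int:
    # Lowercase and strip punctuation before tokenizing.
    cleaned = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return sum(1 for token in nlp(cleaned) if not token.is_space)

print(count_words("A man speaks while birds chirp in the background."))  # 9
```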
## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux-based distributions.

### Python packages

Python requirements are automatically installed when using pip on this repository.
```
torch >= 1.10.1
torchaudio >= 0.10.1
py7zr >= 0.17.2
pyyaml >= 6.0
tqdm >= 4.64.0
huggingface-hub >= 0.15.1
numpy >= 1.21.2
```

### External requirements (AudioCaps only)
The external requirements needed to download **AudioCaps** are **ffmpeg** and **yt-dlp**.
**ffmpeg** can be installed on Ubuntu using `sudo apt install ffmpeg`, and **yt-dlp** from its [official repo](https://github.com/yt-dlp/yt-dlp).

You can also override their paths for AudioCaps:
```python
from aac_datasets import AudioCaps
dataset = AudioCaps(
download=True,
ffmpeg_path="/my/path/to/ffmpeg",
ytdl_path="/my/path/to/ytdlp",
)
```

## Additional information
### Compatibility with audiocaps-download
If you want to use the [audiocaps-download 1.0](https://github.com/MorenoLaQuatra/audiocaps-download) package to download AudioCaps, you will have to respect the AudioCaps folder tree:
```python
from audiocaps_download import Downloader
root = "your/path/to/root"
downloader = Downloader(root_path=f"{root}/AUDIOCAPS/audio_32000Hz/", n_jobs=16)
downloader.download(format="wav")
```

Then disable the audio download and set the correct audio format before initializing AudioCaps:
```python
from aac_datasets import AudioCaps
dataset = AudioCaps(
root=root,
subset="train",
download=True,
audio_format="wav",
download_audio=False, # this will only download labels and metadata files
)
```

## References
#### AudioCaps
[1] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating Captions for Audios in the Wild,” in NAACL-HLT, 2019. Available: https://aclanthology.org/N19-1011/

#### Clotho
[2] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” arXiv:1910.09387 [cs, eess], Oct. 2019. Available: http://arxiv.org/abs/1910.09387

#### MACS
[3] F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, and B. Elizalde, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021). Barcelona, Spain: Music Technology Group - Universitat Pompeu Fabra, Nov. 2021. Available: https://doi.org/10.5281/zenodo.5770113

#### WavCaps
[4] X. Mei et al., “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research,” arXiv preprint arXiv:2303.17395, 2023. [Online]. Available: https://arxiv.org/pdf/2303.17395.pdf

## Cite the aac-datasets package
If you use this software, please consider citing it as "Labbe, E. (2024). aac-datasets: Audio Captioning datasets for PyTorch.", or use the following BibTeX citation:

```
@software{
Labbe_aac_datasets_2024,
author = {Labbé, Étienne},
license = {MIT},
month = {03},
title = {{aac-datasets}},
url = {https://github.com/Labbeti/aac-datasets/},
version = {0.5.2},
year = {2024}
}
```

## Contact
Maintainer:
- [Étienne Labbé](https://labbeti.github.io/) "Labbeti": [email protected]