{"id":13861536,"url":"https://github.com/archinetai/audio-data-pytorch","last_synced_at":"2025-10-30T01:40:11.153Z","repository":{"id":49397892,"uuid":"517277175","full_name":"archinetai/audio-data-pytorch","owner":"archinetai","description":"A collection of useful audio datasets and transforms for PyTorch.","archived":false,"fork":false,"pushed_at":"2023-02-11T23:30:31.000Z","size":49,"stargazers_count":140,"open_issues_count":3,"forks_count":23,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-10-08T17:58:08.881Z","etag":null,"topics":["artifical-intelligense","audio-generation","datasets","deep-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/archinetai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-24T09:15:36.000Z","updated_at":"2025-08-25T18:15:55.000Z","dependencies_parsed_at":"2024-08-05T06:13:37.130Z","dependency_job_id":null,"html_url":"https://github.com/archinetai/audio-data-pytorch","commit_stats":{"total_commits":30,"total_committers":3,"mean_commits":10.0,"dds":0.09999999999999998,"last_synced_commit":"5c5ee949f7f863bd9fe84b3955833f8980bbd4a6"},"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/archinetai/audio-data-pytorch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archinetai%2Faudio-data-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archinetai%2Faudio-data-pytorch/tags","releases_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archinetai%2Faudio-data-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archinetai%2Faudio-data-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/archinetai","download_url":"https://codeload.github.com/archinetai/audio-data-pytorch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archinetai%2Faudio-data-pytorch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281731374,"owners_count":26551804,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-29T02:00:06.901Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artifical-intelligense","audio-generation","datasets","deep-learning","pytorch"],"created_at":"2024-08-05T06:01:24.621Z","updated_at":"2025-10-30T01:40:11.123Z","avatar_url":"https://github.com/archinetai.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n# Audio Data - PyTorch\n\nA collection of useful audio datasets and transforms for PyTorch.\n\n## Install\n\n```bash\npip install audio-data-pytorch\n```\n\n[![PyPI - Python Version](https://img.shields.io/pypi/v/audio-data-pytorch?style=flat\u0026colorA=0f0f0f\u0026colorB=0f0f0f)](https://pypi.org/project/audio-data-pytorch/)\n\n## 
Datasets\n\n### WAV Dataset\n\nLoad one or more folders of `.wav` files as a dataset.\n\n```py\nfrom audio_data_pytorch import WAVDataset\n\ndataset = WAVDataset(path=['my/path1', 'my/path2'])\n```\n\n#### Full API:\n```py\nWAVDataset(\n    path: Union[str, Sequence[str]], # Path or list of paths from which to load files\n    recursive: bool = False, # Recursively load files from provided paths\n    sample_rate: Optional[int] = None, # Specify sample rate to convert files to on read\n    random_crop_size: int = None, # Load small portions of files randomly\n    transforms: Optional[Callable] = None, # Transforms to apply to audio files\n    check_silence: bool = True # Discards silent samples if True\n)\n```\n\n\n### AudioWebDataset\nA [`WebDataset`](https://webdataset.github.io/webdataset/) extension for audio data. Assumes that the `.tar` file comes with pairs of `.wav` (or `.flac`) and `.json` data.\n```py\nfrom audio_data_pytorch import AudioWebDataset\n\ndataset = AudioWebDataset(\n    urls='mywebdataset.tar'\n)\n\nwaveform, info = next(iter(dataset))\n\nprint(waveform.shape) # torch.Size([2, 480000])\nprint(info.keys()) # dict_keys(['text'])\n```\n\n#### Full API:\n```py\ndataset = AudioWebDataset(\n    urls: Union[str, Sequence[str]],\n    shuffle: Optional[int] = None,\n    batch_size: Optional[int] = None,\n    transforms: Optional[Callable] = None, # Transforms to apply to audio files\n    use_wav_processor: bool = False, # Set this to True if your tar files only use .wav\n    crop_size: Optional[int] = None,\n    max_crops: Optional[int] = None,\n    **kwargs, # Forwarded to WebDataset class\n)\n```\n\n### LJSpeech Dataset\nAn unsupervised dataset for LJSpeech with voice-only data.\n```py\nfrom audio_data_pytorch import LJSpeechDataset\n\ndataset = LJSpeechDataset(root='./data')\n\ndataset[0] # (1, 158621)\ndataset[1] # (1, 153757)\n```\n\n#### Full API:\n```py\nLJSpeechDataset(\n    root: str = \"./data\", # The root where the dataset will be downloaded\n    
transforms: Optional[Callable] = None, # Transforms to apply to audio files\n)\n```\n\n### LibriSpeech Dataset\nWrapper for the [LibriSpeech](https://www.openslr.org/12) dataset (EN only). Requires `pip install datasets`. Note that this dataset requires several GBs of storage.\n\n```py\nfrom audio_data_pytorch import LibriSpeechDataset\n\ndataset = LibriSpeechDataset(\n    root=\"./data\",\n)\n\ndataset[0] # (1, 222336)\n```\n\n#### Full API:\n```py\nLibriSpeechDataset(\n    root: str = \"./data\", # The root where the dataset will be downloaded\n    with_info: bool = False, # Whether to return info (i.e. text, sampling rate, speaker_id)\n    transforms: Optional[Callable] = None, # Transforms to apply to audio files\n)\n```\n\n### Common Voice Dataset\nMulti-language wrapper for the [Common Voice](https://commonvoice.mozilla.org/) dataset. Requires `pip install datasets`. Note that each language requires several GBs of storage, and that you have to confirm access for each distinct version you use, e.g. [here](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0), so that your Hugging Face access token is accepted. You can provide a list of `languages`; to avoid an unbalanced dataset, the languages are interleaved by downsampling the majority language to the same number of samples as the minority language.\n\n```py\nfrom audio_data_pytorch import CommonVoiceDataset\n\ndataset = CommonVoiceDataset(\n    auth_token=\"hf_xxx\",\n    version=1,\n    root=\"../data\",\n    languages=['it']\n)\n```\n\n#### Full API:\n```py\nCommonVoiceDataset(\n    auth_token: str, # Your Hugging Face access token\n    version: int, # Common Voice dataset version\n    sub_version: int = 0, # Subversion: common_voice_{version}_{sub_version}\n    root: str = \"./data\", # The root where the dataset will be downloaded\n    languages: Sequence[str] = ['en'], # List of languages to include in the dataset\n    with_info: bool = False, # Whether to return info (i.e. 
text, sampling rate, age, gender, accent, locale)\n    transforms: Optional[Callable] = None, # Transforms to apply to audio files\n)\n```\n\n### Youtube Dataset\nA wrapper around yt-dlp that automatically downloads the audio source of Youtube videos. Requires `pip install yt-dlp`.\n\n```py\nfrom audio_data_pytorch import YoutubeDataset\n\ndataset = YoutubeDataset(\n    root='./data',\n    urls=[\n        \"https://www.youtube.com/watch?v=dQw4w9WgXcQ\",\n        \"https://www.youtube.com/watch?v=BZ-_KQezKmU\",\n    ],\n    crop_length=10 # Crop source in 10s chunks (optional but suggested)\n)\ndataset[0] # (2, 480000)\n```\n\n#### Full API:\n```py\ndataset = YoutubeDataset(\n    urls: Sequence[str], # The list of youtube urls\n    root: str = \"./data\", # The root where the dataset will be downloaded\n    crop_length: Optional[int] = None, # Crops the source into chunks of `crop_length` seconds\n    with_sample_rate: bool = False, # Returns sample rate as second argument\n    transforms: Optional[Callable] = None, # Transforms to apply to audio files\n)\n```\n\n### Clotho Dataset\nA wrapper for the [Clotho](https://zenodo.org/record/3490684#.Y0VVVOxBwR0) dataset extending `AudioWebDataset`. 
Requires `pip install py7zr` to decompress the `.7z` archive.\n\n```py\nfrom torch import nn\nfrom audio_data_pytorch import ClothoDataset, Crop, Stereo, Mono\n\ndataset = ClothoDataset(\n    root='./data/',\n    preprocess_sample_rate=48000, # Applied to all files during preprocessing\n    preprocess_transforms=nn.Sequential(Crop(48000*10), Stereo()), # Applied to all files during preprocessing\n    transforms=Mono() # Applied dynamically at iteration time\n)\n```\n\n\n#### Full API:\n```py\ndataset = ClothoDataset(\n    root: str, # Path where the dataset is saved\n    split: str = 'train', # Dataset split, one of: 'train', 'valid'\n    preprocess_sample_rate: Optional[int] = None, # Preprocesses dataset to this sample rate\n    preprocess_transforms: Optional[Callable] = None, # Preprocesses dataset with the provided transforms\n    reset: bool = False, # Re-compute preprocessing if `True`\n    **kwargs # Forwarded to `AudioWebDataset`\n)\n```\n\n### MetaDataset\nExtends `WAVDataset` with artists and genres read from ID3 tags and returned as string arrays, or optionally mapped to integers stored in a JSON file at `metadata_mapping_path`.\n\n\n```py\nfrom audio_data_pytorch import MetaDataset\n\ndataset = MetaDataset(\n    path: Union[str, Sequence[str]], # Path or list of paths from which to load files\n    metadata_mapping_path: Optional[str] = None, # Path where mapping from artist/genres to numbers will be saved\n)\n\nwaveform, artists, genres = next(iter(dataset))\n\n# Convert an artist ID back to a string\nartist_name = dataset.mappings['artists'].invert[insert_artist_id]\n\n# Convert a genre ID back to a string\ngenre_name = dataset.mappings['genres'].invert[insert_genre_id]\n\n# If given a metadata_mapping_path, metadata is returned as an int Tensor\nwaveform, artist_genre_tensor = next(iter(dataset))\n```\n\n\n#### Full API:\n```py\ndataset = MetaDataset(\n    path: Union[str, Sequence[str]], # Path or list of paths from which to load files\n    metadata_mapping_path: Optional[str] = 
None, # Path where mapping from artist/genres to numbers will be saved\n    max_artists: int = 4, # Max number of artists to return\n    max_genres: int = 4, # Max number of genres to return\n    **kwargs # Forwarded to `WAVDataset`\n)\n```\n\n\n## Transforms\n\nYou can use the following individual transforms, or merge them with `nn.Sequential()`:\n\n```py\nfrom audio_data_pytorch import Crop\ncrop = Crop(size=22050*2, start=0) # Crop 2 seconds at 22050 Hz from the start of the file\n\nfrom audio_data_pytorch import RandomCrop\nrandom_crop = RandomCrop(size=22050*2) # Crop 2 seconds at 22050 Hz from a random position\n\nfrom audio_data_pytorch import Resample\nresample = Resample(source=48000, target=22050) # Resamples from 48 kHz to 22.05 kHz\n\nfrom audio_data_pytorch import Mono\noverlap = Mono() # Overlap channels by sum to get mono source (C, N) -\u003e (1, N)\n\nfrom audio_data_pytorch import Stereo\nstereo = Stereo() # Duplicate channels (1, N) -\u003e (2, N) or (2, N) -\u003e (2, N)\n\nfrom audio_data_pytorch import Scale\nscale = Scale(scale=0.8) # Scale waveform amplitude by 0.8\n\nfrom audio_data_pytorch import Loudness\nloudness = Loudness(sampling_rate=22050, target=-20) # Normalize loudness to -20 dB, requires `pip install pyloudnorm`\n```\n\nOr use this wrapper to apply a subset of them in one go. API:\n```py\nfrom audio_data_pytorch import AllTransform\n\ntransform = AllTransform(\n    source_rate: Optional[int] = None,\n    target_rate: Optional[int] = None,\n    crop_size: Optional[int] = None,\n    random_crop_size: Optional[int] = None,\n    loudness: Optional[int] = None,\n    scale: Optional[float] = None,\n    mono: bool = False,\n    stereo: bool = 
False,\n)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchinetai%2Faudio-data-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchinetai%2Faudio-data-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchinetai%2Faudio-data-pytorch/lists"}