{"id":42816496,"url":"https://github.com/crlandsc/tiny-audio-diffusion","last_synced_at":"2026-01-30T06:31:13.985Z","repository":{"id":173179566,"uuid":"649947225","full_name":"crlandsc/tiny-audio-diffusion","owner":"crlandsc","description":"A repository for generating and training short audio samples with unconditional waveform diffusion on accessible consumer hardware (\u003c2GB VRAM GPU)","archived":false,"fork":false,"pushed_at":"2024-06-06T00:01:04.000Z","size":15491,"stargazers_count":134,"open_issues_count":1,"forks_count":14,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-06-06T01:25:17.984Z","etag":null,"topics":["deep-learning","diffusion","generative-audio","machine-learning"],"latest_commit_sha":null,"homepage":"https://towardsdatascience.com/tiny-audio-diffusion-ddc19e90af9b","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crlandsc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-06T01:58:19.000Z","updated_at":"2024-06-06T01:25:20.184Z","dependencies_parsed_at":null,"dependency_job_id":"914b2cdb-6b38-4f24-aec4-bc63a42c0ddb","html_url":"https://github.com/crlandsc/tiny-audio-diffusion","commit_stats":null,"previous_names":["crlandsc/tiny-audio-diffusion"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/crlandsc/tiny-audio-diffusion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crlandsc%2Ftiny-audio-diffusion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crlandsc%2Ftiny-audio-diffusion/ta
gs","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crlandsc%2Ftiny-audio-diffusion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crlandsc%2Ftiny-audio-diffusion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crlandsc","download_url":"https://codeload.github.com/crlandsc/tiny-audio-diffusion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crlandsc%2Ftiny-audio-diffusion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28906586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-30T04:02:34.702Z","status":"ssl_error","status_checked_at":"2026-01-30T04:02:33.562Z","response_time":66,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","diffusion","generative-audio","machine-learning"],"created_at":"2026-01-30T06:31:11.467Z","updated_at":"2026-01-30T06:31:13.977Z","avatar_url":"https://github.com/crlandsc.png","language":"Python","funding_links":[],"categories":["Audio Gen"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1 style=\"font-size: 36px;\"\u003eTiny Audio Diffusion\u003c/h1\u003e\n  \u003cimg src=\"./images/tiny-audio-diffusion.png\" width=\"250px\" alt=\"Tiny Audio Diffusion Logo\" 
/\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n[![Hugging Face Spaces Badge](https://img.shields.io/badge/%F0%9F%A4%97_Spaces_Demo-blue)](https://huggingface.co/spaces/crlandsc/tiny-audio-diffusion) [![YouTube Tutorial Badge](https://img.shields.io/badge/Repo_Tutorial-red?logo=YouTube)](https://youtu.be/m6Eh2srtTro) [![Towards Data Science Badge](https://img.shields.io/badge/Towards_Data_Science-red?logo=Medium\u0026color=black)](https://medium.com/towards-data-science/tiny-audio-diffusion-ddc19e90af9b) [![GitHub License](https://img.shields.io/github/license/crlandsc/tiny-audio-diffusion)](https://github.com/crlandsc/tiny-audio-diffusion/blob/main/LICENSE) [![GitHub Repo stars](https://img.shields.io/github/stars/crlandsc/tiny-audio-diffusion?color=gold)](https://github.com/crlandsc/tiny-audio-diffusion/stargazers) [![GitHub forks](https://img.shields.io/github/forks/crlandsc/tiny-audio-diffusion?color=green)](https://github.com/crlandsc/tiny-audio-diffusion/forks)\n\nThis is a repository for generating short audio samples and training waveform diffusion models on a consumer-grade GPU with less than 2GB VRAM.\n\n## Motivation\n\nThe purpose of this project is to provide access to stereo high-resolution (44.1kHz) conditional and unconditional audio waveform (1D U-Net) diffusion code for those interested in exploration but who have limited resources. There are many methods for audio generation on low-end hardware, but fewer specifically for waveform-based diffusion.\n\nThe repository is built by heavily adapting code from Archinet's [audio-diffusion-pytorch](https://github.com/archinetai/audio-diffusion-pytorch) library. A huge thank you to [Flavio Schneider](https://github.com/flavioschneider) for his incredible open-source work in this field!\n\n\n## Background\n\nDirect waveform diffusion is inherently computationally intensive. For example, an audio sample with the industry standard 44.1kHz sampling rate requires 44,100 samples for just 1 second of audio. 
Now multiply that by 2 for a stereo file. However, it has a significant advantage over many methods that reduce audio into spectrograms or downsample - the network retains and learns from *phase* information. Phase is challenging to represent on its own in visual methods, such as spectrograms, as it looks similar to random noise. Because of this, many generative methods discard phase information and then implement ways of estimating and regenerating it. However, it plays a key role in defining the timbral qualities of sounds and should not be dispensed with so easily.\n\nWaveform diffusion is able to retain this important feature as it does not perform any transforms on the audio before feeding it into the network. This is how humans perceive sounds, with both amplitude and phase information bundled together in a single signal. As mentioned previously, this comes at the expense of computational requirements and is often reserved for training on a cluster of GPUs with high speeds and lots of memory. Because of this, it is hard to begin experimenting with waveform diffusion on limited resources.\n\nThis repository seeks to offer some base code to those looking to experiment with and learn more about waveform diffusion on their own computer without having to purchase cloud resources or upgrade hardware. This goes for not only *inference*, but *training* your own models as well!\n\nTo make this feasible, however, there must be a tradeoff of quality, speed, and sample length. Because of this, I have focused on training base models for one-shot drum samples - as they are inherently short in sample length.\n\nThe current configuration is set up to be able to train ~0.75 second stereo samples at 44.1kHz, allowing for the generation of high-quality one-shot audio samples. The network configuration can be adjusted to improve the resolution, sample rate, training and inference speed, sample length, etc. 
but, of course, more hardware resources will be required.\n\nCompared to this repo's raw waveform diffusion, other methods, such as diffusion in the latent space ([Stable Diffusion's](https://stability.ai/stablediffusion) secret sauce), can offer improvements and different tradeoffs between quality, memory requirements, speed, etc. To stay up to date with the latest research in generative audio, I recommend this repo: https://github.com/archinetai/audio-ai-timeline\n\nAlso recommended is [Harmonai's](https://www.harmonai.org/) community project, [Dance Diffusion](https://github.com/Harmonai-org/sample-generator), which implements similar functionality to this repo on a larger scale with several pre-trained models. [Colab notebook](https://colab.research.google.com/github/Harmonai-org/sample-generator/blob/main/Dance_Diffusion.ipynb) available.\n\n**April 2024 update:**\n\nSome additional useful generative audio tools/repos:\n- [Stable Audio Tools](https://github.com/Stability-AI/stable-audio-tools) (used in [Stable Audio](https://www.stableaudio.com/)) - Useful audio tools for building and training models.\n- [audiocraft](https://github.com/facebookresearch/audiocraft) (used in [MusicGen](https://audiocraft.metademolab.com/musicgen.html) \u0026 [AudioGen](https://audiocraft.metademolab.com/audiogen.html)) - Useful audio tools for building and training models.\n- [audiomentations](https://github.com/iver56/audiomentations) - Good library for implementing audio augmentations on CPU for training. See [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations) for GPU implementation.\n\n---\n\n## Setup\n\nFollow these steps to set up an environment for both generating audio samples and training models.\n\n*NOTE:* To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). 
More information can be found [here](https://developer.nvidia.com/cuda-toolkit).\n\n#### 1. Create a Virtual Environment:\n\nEnsure that [Anaconda (or Miniconda)](https://docs.anaconda.com/free/anaconda/install/index.html) is installed and activated. From the command line, `cd` into the [`setup/`](setup/) folder and run the following lines:\n```bash\nconda env create -f environment.yml\nconda activate tiny-audio-diffusion\n```\n\nThis will create and activate a conda environment from the [`setup/environment.yml`](setup/environment.yml) file and install the dependencies in [`setup/requirements.txt`](setup/requirements.txt).\n\n#### 2. Install Python Kernel For Jupyter Notebook\n\nRun the following line to create a kernel for the current environment to run the inference notebook.\n\n```bash\npython -m ipykernel install --user --name tiny-audio-diffusion --display-name \"tiny-audio-diffusion (Python 3.10)\"\n```\n\n#### 3. Define Environment Variables\n\nRename [`.env.tmp`](.env.tmp) to `.env` and replace the entries with your own variables (example values are random).\n\n```bash\nDIR_LOGS=/logs\nDIR_DATA=/data\n\n# Required if using Weights \u0026 Biases (W\u0026B) logger\nWANDB_PROJECT=tiny_drum_diffusion # Custom W\u0026B name for current project\nWANDB_ENTITY=johnsmith # W\u0026B username\nWANDB_API_KEY=a21dzbqlybbzccqla4txa21dzbqlybbzccqla4tx # W\u0026B API key\n```\n\n*NOTE:* Sign up for a [Weights \u0026 Biases](https://wandb.ai/site) account to log audio samples, spectrograms, and other metrics while training (it's free!).\n\nW\u0026B logging example for this repo [here](https://wandb.ai/crlandsc/unconditional-drum-diffusion?workspace=user-crlandsc).\n\n---\n\n## Pre-trained Models\n\nPretrained models can be found on Hugging Face (each model contains a `.ckpt` and `.yaml` 
file):\n\n|Model|Link|\n|---|---|\n|Kicks|[crlandsc/tiny-audio-diffusion-kicks](https://huggingface.co/crlandsc/tiny-audio-diffusion-kicks)|\n|Snares|[crlandsc/tiny-audio-diffusion-snares](https://huggingface.co/crlandsc/tiny-audio-diffusion-snares)|\n|Hi-hats|[crlandsc/tiny-audio-diffusion-hihats](https://huggingface.co/crlandsc/tiny-audio-diffusion-hihats)|\n|Percussion (all drum types)|[crlandsc/tiny-audio-diffusion-percussion](https://huggingface.co/crlandsc/tiny-audio-diffusion-percussion)|\n\n*See W\u0026B model training metrics [here](https://wandb.ai/crlandsc/unconditional-drum-diffusion?workspace=user-crlandsc).*\n\nPre-trained models can be downloaded to generate samples via the [inference notebook](Inference.ipynb). They can also be used as a base model to fine-tune on custom data. It is recommended to create subfolders within the [`saved_models`](saved_models/) folder to store each model's `.ckpt` and `.yaml` files.\n\n---\n\n## Inference\n### Hugging Face Spaces\nGenerate samples without code on [🤗 Hugging Face Spaces](https://huggingface.co/spaces/crlandsc/tiny-audio-diffusion)!\n\n### Jupyter Notebook\n#### Audio Sample Generation\nCurrent Capabilities:\n- Unconditional Generation\n- Conditional \"Style-transfer\" Generation\n\nOpen the [`Inference.ipynb`](Inference.ipynb) in Jupyter Notebook and follow the instructions to generate new audio samples. Ensure that the `\"tiny-audio-diffusion (Python 3.10)\"` kernel is active in Jupyter to run the notebook and you have downloaded the [pre-trained model](#Pre\\-trained-Models) of interest from Hugging Face.\n\n---\n\n## Train\n\nThe model architecture has been constructed with [PyTorch Lightning](https://lightning.ai/docs/pytorch/latest/) and [Hydra](https://hydra.cc/docs/intro/) frameworks. All configurations for the model are contained within `.yaml` files and should be edited there rather than hardcoded.\n\n[`exp/drum_diffusion.yaml`](exp/drum_diffusion.yaml) contains the default model configuration. 
Additional custom model configurations can be added to the [`exp`](exp/) folder.\n\nCustom models can be trained or fine-tuned on custom datasets. Datasets should consist of a folder of `.wav` audio files with a 44.1kHz sampling rate.\n\nTo train or fine-tune models, run one of the following commands in the terminal from the repo's root folder and replace `\u003cpath/to/your/train/data\u003e` with the path to your custom training set.\n\n\n**Train model from scratch (on CPU):**\n*(not recommended)*\n\n```bash\npython train.py exp=drum_diffusion datamodule.dataset.path=\u003cpath/to/your/train/data\u003e\n```\n\n\n**Train model from scratch (on GPU):**\n\n```bash\npython train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=\u003cpath/to/your/train/data\u003e\n```\n\n*NOTE:* To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). More information can be found [here](https://developer.nvidia.com/cuda-toolkit).\n\n\n**Resume run from a checkpoint (with GPU):**\n\n```bash\npython train.py exp=drum_diffusion trainer.gpus=1 +ckpt=\u003c/path/to/checkpoint.ckpt\u003e datamodule.dataset.path=\u003cpath/to/your/train/data\u003e\n```\n\n---\n\n## Dataset\n\nThe data used to train the checkpoints listed above can be found on [🤗 Hugging Face](https://huggingface.co/datasets/crlandsc/tiny-audio-diffusion-drums).\n\n***Note:*** *This is a small and unbalanced dataset consisting of free samples that I had from my music production. 
These samples are not covered under the MIT license of this repository and cannot be used to train any commercial models, but can be used in personal and research contexts.*\n\n***Note:*** *For appropriately diverse models, larger datasets should be used to avoid memorization of training data.*\n\n---\n\n## Repository Structure\n\nThe structure of this repository is as follows:\n```\n├── main\n│   ├── diffusion_module.py     - contains pl model, data loading, and logging functionalities for training\n│   └── utils.py                - contains utility functions for training\n├── exp\n│   └── *.yaml                  - Hydra configuration files\n├── setup\n│   ├── environment.yml         - file to set up conda environment\n│   └── requirements.txt        - contains repo dependencies\n├── images                      - directory containing images for README.md\n│   └── *.png\n├── samples                     - directory containing sample outputs from tiny-audio-diffusion models\n│   └── *.wav\n├── .env.tmp                    - temporary environment variables (rename to .env)\n├── .gitignore\n├── README.md\n├── Inference.ipynb             - Jupyter notebook for running inference to generate new samples\n├── config.yaml                 - Hydra base configs\n├── train.py                    - script for training\n├── data                        - directory to host custom training data\n│   └── wav_dataset\n│       └── (*.wav)\n└── saved_models                - directory to host model checkpoints and hyper-parameters for inference\n    └── (kicks/snare/etc.)\n        ├── (*.ckpt)            - pl model checkpoint file\n        └── (config.yaml)       - pl model hydra hyperparameters (required for 
inference)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrlandsc%2Ftiny-audio-diffusion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrlandsc%2Ftiny-audio-diffusion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrlandsc%2Ftiny-audio-diffusion/lists"}