# Stable Audio 3

**A state-of-the-art open platform for fast, high-quality generated audio and music.**

[Technical Report](https://arxiv.org/abs/2605.17991) · [🤗 Models](https://huggingface.co/collections/stabilityai/stable-audio-3) · [🤗 Extra Models](https://huggingface.co/collections/stabilityai/stable-audio-3-extra) · [Discord](https://discord.gg/cKpvjey8b) · [Demo](https://huggingface.co/spaces/stabilityai/stable-audio-3) · [Blog Post](https://stability.ai/news-updates/meet-stable-audio-3-the-model-family-built-for-artistic-experimentation-with-open-weight-models)

![Stable Audio 3 Architecture](stable-audio-3.png)

Stable Audio 3 is the next generation of Stable Audio: a focused, streamlined platform for inference and fine-tuning, built on lessons from [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools). If you're doing foundational research or working with previous Stable Audio models, that repo is still the place to go.

---

## Models

| Model | Model ID | Autoencoder | Hardware | Params | Max length | Use case |
|---|---|---|---|---|---|---|
| [**Stable Audio 3 Small-Music**](https://huggingface.co/stabilityai/stable-audio-3-small-music) | `small-music` | SAME-Small | CPU | 433M | 120s | Lightweight music-only inference, no GPU required |
| [**Stable Audio 3 Small-SFX**](https://huggingface.co/stabilityai/stable-audio-3-small-sfx) | `small-sfx` | SAME-Small | CPU | 433M | 120s | Lightweight sound-effects-only inference, no GPU required |
| [**Stable Audio 3 Medium**](https://huggingface.co/stabilityai/stable-audio-3-medium) | `medium` | SAME-Large | GPU (CUDA) | 1.4B | 380s | High quality, fast inference |
| **Stable Audio 3 Large** | – | SAME-Large | API only | 2.7B | 380s | Highest quality; API only and not supported by this repo (see the [API docs](https://stableaudio.com/user-guide)) |

Base (un-post-trained) checkpoints, the SAME autoencoders, and optimized variants are available in the [Extra Models collection](https://huggingface.co/collections/stabilityai/stable-audio-3-extra).

### Performance

Generation time for a clip of each duration, plus peak GPU memory. `small` refers to the Small models (`small-music` and `small-sfx`).

| Model | Duration | H200 | H200 + TensorRT | Mac CPU\* | Mac CoreML | Peak VRAM† |
|---|---|---|---|---|---|---|
| `small` | 5s | 0.41s | 0.017s | 0.70s | 0.23s | 1.69 GB |
| `small` | 30s | 0.46s | 0.022s | 1.72s | 0.63s | 1.89 GB |
| `small` | 120s | 0.45s | 0.044s | 5.92s | 3.09s | 2.40 GB |
| `medium` | 5s | 0.60s | 0.02s | – | – | 5.07 GB |
| `medium` | 30s | 0.65s | 0.05s | – | – | 5.49 GB |
| `medium` | 120s | 0.78s | 0.13s | – | – | 6.49 GB |
| `medium` | 380s | 1.31s | 0.43s | – | – | 6.52 GB |

\* CPU-only, via CoreML (Diffusion Transformer) + TFLite (SAME-S decoder).

† Peak allocated VRAM on H200 with unchunked decoding. Chunked decoding lowers it: for example, `medium` at 120s drops from 6.49 GB to ~5.14 GB.

---

## Features
- ⚡ **Fast, state-of-the-art generation:** generate minutes of audio in well under a second on an H200 (see [Performance](#performance))
- 🎛️ **Three inference modes:** text-to-audio, audio-to-audio editing, and inpainting/continuation
- ↔️ **Variable-length generation:** generates sequences of varying length without wasting inference time or VRAM on unused latents
- 🎯 **Personalization through LoRA fine-tuning:** adapt any model to a target style; LoRAs can be stacked and adjusted at runtime
- 💻 **Broad hardware support:** CPU (Small), CUDA/TensorRT (Medium), and Apple Silicon via CoreML, with more coming soon
- 🎵 **SAME autoencoder:** the new Semantic-Acoustic Music Encoder, with stereo, 44.1 kHz, 256-dimensional latents optimized for both generative tractability and high-quality reconstruction

## Installation

Stable Audio 3 uses [uv](https://github.com/astral-sh/uv) for fast, lightweight installs.
Install only what you need.

```bash
# Base install (Python API only)
uv sync

# With Gradio UI
uv sync --extra ui

# With LoRA training support
uv sync --extra lora

# Everything
uv sync --extra ui --extra lora
```

### CUDA Version

By default, `uv sync` installs PyTorch built against CUDA 12.6. If you need a different CUDA version, first install `torch` and `torchaudio` manually (pinned to the same version as in `pyproject.toml`), then sync without reinstalling them. For example:

```bash
uv pip install torch==2.7.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu118
uv sync --no-install-package torch --no-install-package torchaudio
```

Replace `cu118` with your target version. For torch 2.7.1, the available CUDA variants are `cu118`, `cu126`, and `cu128`. Not every version is published for every CUDA channel, so check the [PyTorch install page](https://pytorch.org/get-started/locally/) to confirm your target is available.

### Flash Attention

Stable Audio 3 Medium requires [Flash Attention 2](https://github.com/Dao-AILab/flash-attention).

**Install from a pre-built wheel** (fast, no compilation). The easiest source is the [flash-attention-prebuild-wheels](https://github.com/mjun0812/flash-attention-prebuild-wheels) community repo: browse the releases for a wheel matching your CUDA, PyTorch, and Python versions, then install it directly:

```bash
uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.6.3+cu126torch2.7-cp310-cp310-linux_x86_64.whl
```

The filename encodes the requirements: `cu126` is CUDA 12.6, `torch2.7` is PyTorch 2.7, and `cp310` is Python 3.10. Pick the URL that matches your environment.

If no pre-built wheel matches your setup, build from source.
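Before falling back to a source build, it can help to work out which wheel name your environment needs. The sketch below is a hypothetical helper (not part of this repo or of flash-attn) that assembles the `cuXXXtorchX.Y-cpXY` fragment from `torch.version.cuda`, `torch.__version__`, and `sys.version_info`. It assumes release filenames follow the same pattern as the example wheel above, so treat its output as a search string rather than a guaranteed match.

```python
import sys


def flash_attn_wheel_tag(cuda_version: str, torch_version: str,
                         py_major: int, py_minor: int) -> str:
    """Return a wheel-name fragment such as 'cu126torch2.7-cp310'.

    Hypothetical helper: it mirrors the naming pattern of the example
    flash_attn wheel in this README and is not part of any library.
    """
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    # Strip local build suffixes, e.g. '2.7.1+cu126' -> '2.7.1'.
    torch_major, torch_minor = torch_version.split("+")[0].split(".")[:2]
    return (f"cu{cuda_major}{cuda_minor}"
            f"torch{torch_major}.{torch_minor}"
            f"-cp{py_major}{py_minor}")


def main() -> None:
    try:
        import torch  # installed into the project venv by `uv sync`
    except ImportError:
        print("torch is not installed; run `uv sync` first")
        return
    if torch.version.cuda is None:
        print("This torch build has no CUDA support; flash-attn needs CUDA")
        return
    tag = flash_attn_wheel_tag(str(torch.version.cuda), str(torch.__version__),
                               sys.version_info.major, sys.version_info.minor)
    print(f"Look for a flash_attn wheel whose filename contains: {tag}")


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing (for example, the hypothetical `wheel_tag.py`), it can be run inside the project environment with `uv run python wheel_tag.py`.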
When building from source, install `ninja` first to speed up the C++ compile, then set the environment variables for your machine:

```bash
uv pip install ninja
# uv-created virtual environments do not include pip, so bootstrap it
.venv/bin/python -m ensurepip
FLASH_ATTENTION_SKIP_CUDA_BUILD=FALSE \
FLASH_ATTENTION_FORCE_BUILD=TRUE \
TORCH_CUDA_ARCH_LIST="9.0" \
MAX_JOBS=8 \
.venv/bin/pip3 install flash-attn --no-build-isolation --no-binary flash-attn \
    --force-reinstall --no-cache-dir --no-deps
```

- `TORCH_CUDA_ARCH_LIST`: your GPU's compute capability, e.g. `8.0` (A100), `8.6` (A10/RTX 3090), `8.9` (L4/RTX 4090), or `9.0` (H100/H200). Running `uv run python -c "import torch; print(torch.cuda.get_device_capability())"` prints it as a `(major, minor)` tuple.
- `MAX_JOBS`: the number of parallel compile jobs; 4–8 is typical. Reduce it if you run out of RAM during compilation.

**Note:** `flash-attn` is not declared in `pyproject.toml`, so a plain `uv sync` will remove it. Use `uv sync --inexact` to install or update dependencies without removing packages that aren't in the lockfile:

```bash
uv sync --inexact
```

## Quick Start

Launch the Gradio UI (requires the `ui` extra; see [Installation](#installation)):

```bash
uv run python run_gradio.py --model medium
```

This starts a local web interface with a shareable link. To load a LoRA checkpoint:

```bash
uv run python run_gradio.py --model medium --lora-ckpt-path path/to/lora.ckpt
```

## Usage

Stable Audio 3 supports several inference modes.
For full details, see [Inference Methods](docs/workflows/inference.md).

**Text-to-Audio:** generate audio from a text prompt.

```python
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")
audio = model.generate(
    prompt="House music that encapsulates the feeling of being at a festival in the sunny weather with all your friends 124 BPM",
    duration=250,  # seconds; Medium supports up to 380s
)
```

**Audio-to-Audio:** edit an existing recording, using a prompt to steer style and mood.

```python
import torchaudio
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")
# torchaudio.load returns a (waveform, sample_rate) tuple
init_audio = torchaudio.load("/path/to/audio.wav")
audio = model.generate(
    init_audio=init_audio,
    init_noise_level=0.9,
    prompt="bossa nova bassline",
    duration=30,
)
```

**Inpainting / Continuation:** regenerate a specific region of an audio file while keeping the rest intact.

```python
import torchaudio
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")

inpaint_audio = torchaudio.load("/path/to/audio.wav")
audio = model.generate(
    inpaint_audio=inpaint_audio,
    inpaint_mask_start_seconds=4.0,  # regenerate from 4 s...
    inpaint_mask_end_seconds=8.0,    # ...to 8 s; the rest is kept
    prompt="punchy kick drum fill",
    duration=30,
)
```

To regenerate **multiple non-contiguous regions** in one pass, pass lists to both mask parameters:

```python
# Regenerates 4 to 8 s and 16 to 20 s; start and end lists are paired by index
audio = model.generate(
    inpaint_audio=inpaint_audio,
    inpaint_mask_start_seconds=[4.0, 16.0],
    inpaint_mask_end_seconds=[8.0, 20.0],
    prompt="punchy kick drum fill",
    duration=30,
)
```

To extend an audio clip (continuation), set `inpaint_mask_start_seconds` to the length of the source file and choose a longer `duration`.
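The continuation rule above comes down to simple arithmetic: the mask starts where the source audio ends and runs to the end of the longer output, which matches the CLI continuation example further down (`--inpaint-start 10 --inpaint-end 30 --duration 30`). The following is a minimal sketch of that arithmetic; `continuation_kwargs` is a hypothetical helper, not part of `stable_audio_3`, and it assumes the mask end should equal `duration`.

```python
def continuation_kwargs(num_samples: int, sample_rate: int, duration: float) -> dict:
    """Hypothetical helper: mask arguments for extending a clip.

    The regenerated (masked) region starts where the source audio ends and
    runs to the end of the requested output, so the original audio is kept
    and only the new tail is generated.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    source_seconds = num_samples / sample_rate
    if duration <= source_seconds:
        raise ValueError(
            f"duration ({duration} s) must exceed the source length "
            f"({source_seconds:.2f} s) to continue the clip"
        )
    return {
        "inpaint_mask_start_seconds": source_seconds,
        "inpaint_mask_end_seconds": duration,
        "duration": duration,
    }


# Example: a 10 s clip at 44.1 kHz extended to 30 s.
# With a real file, num_samples would be waveform.shape[-1] from
# `waveform, sr = torchaudio.load(path)`.
kwargs = continuation_kwargs(num_samples=441_000, sample_rate=44_100, duration=30)
print(kwargs)
```

Assuming `generate` accepts these keyword arguments as in the examples above, the result could be passed as `model.generate(inpaint_audio=inpaint_audio, prompt="dreamy synth outro", **kwargs)`.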
See [Inference Methods](docs/workflows/inference.md) for the full controls reference.

**Encoding / Decoding:** use the autoencoder directly to encode audio to latents or decode latents back to audio.

```python
import torchaudio
from stable_audio_3 import AutoencoderModel

ae = AutoencoderModel.from_pretrained("same-l")
# waveform is a (channels, samples) tensor; sr is its sample rate
waveform, sr = torchaudio.load("audio.wav")
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)
```

See [Autoencoder Workflows](docs/workflows/autoencoder.md) for batch encoding, chunked processing, and pre-encoding datasets for LoRA training.

## CLI

A `stable-audio` CLI is included for running generation without writing any Python. If the project environment is not activated, prefix the commands with `uv run` (for example, `uv run stable-audio --help`).

**Text-to-audio:**
```bash
stable-audio --model small-music -p "lo-fi hip hop beat, 90 BPM" --duration 30 -o beat.wav
```

**Audio-to-audio:** restyle an existing recording.
```bash
stable-audio -p "bossa nova bassline" --init-audio input.wav --init-noise-level 0.8 -o out.wav
```

**Inpainting:** regenerate a region while keeping the rest.
```bash
stable-audio -p "punchy kick drum fill" --inpaint-audio input.wav --inpaint-start 4 --inpaint-end 8 -o out.wav
```

**Continuation:** extend a clip beyond its original length.
```bash
# input.wav is assumed to be 10 s long: keep it and generate 10 s to 30 s
stable-audio -p "dreamy synth outro" --inpaint-audio input.wav --inpaint-start 10 --inpaint-end 30 --duration 30 -o out.wav
```

**With a LoRA:**
```bash
stable-audio -p "orchestral strings" --lora-ckpt-path my_lora.safetensors --lora-strength 0.8 -o out.wav
```

Run `stable-audio --help` for the full list of flags.

## Hardware Support
Stable Audio 3 scales from a laptop to a GPU server.

*Hardware support scripts coming soon.*

## Docs

| Guide | Description |
|-------|-------------|
| [Inference Methods](docs/workflows/inference.md) | Overview of inference modes (text-to-audio, inpainting, etc.) |
| [LoRA Training](docs/workflows/lora.md) | Fine-tune with LoRA: setup, training loop, and checkpointing |
| [Autoencoder Workflows](docs/workflows/autoencoder.md) | Encode and decode audio with the SAME autoencoder directly |
| [Prompting Guide](docs/guides/prompting.md) | Prompt and control signal reference |
| [Model Overview](docs/guides/model-overview.md) | Architecture and design overview |

---

## Community

- [Harmonai Discord](https://discord.gg/cKpvjey8b): Check out our Harmonai Discord server, run by the research team. Besides good discussions, we host weekly office hours covering all things AI audio and music, and we want to hear what you come up with!

- [Underfit](https://github.com/dada-bots/underfit): A LoRA-training power user's dream from Dadabots. If LoRA training in this repo is not enough, check out its experimental tools, such as agentic LoRA orchestration and monitoring.

---

## Troubleshooting

#### Output audio is a static glitch sound (affects Stable Audio 3 Medium only)

This is most likely a flash-attention issue. Verify that it is importable:

```bash
uv run python -c "import flash_attn; from flash_attn import flash_attn_func; print('Version:', flash_attn.__version__, '| flash_attn_func:', flash_attn_func)"
```

If this errors, flash-attn is not installed correctly; see the [Flash Attention install instructions](#flash-attention) above.

---

## License

Please refer to the [Stability AI Community License](https://stability.ai/license).

## Testing

Install dev dependencies:

```bash
uv sync --group dev
```

Run the test suite:

```bash
uv run pytest
```

Save generated audio outputs to `test_audio_outputs/` for manual inspection:

```bash
uv run pytest --save-audio
```

## Citation

For Stable Audio 3, please cite:
```BibTeX
@misc{evans2026stableaudio3,
  title={Stable Audio 3},
  author={Zach Evans and Julian D. Parker and Matthew Rice and CJ Carr and Zack Zukowski and Josiah Taylor and Jordi Pons},
  year={2026},
  eprint={2605.17991},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2605.17991}
}
```

For SAME, please cite:
```BibTeX
@misc{parker2026SAME,
  title={SAME: A Semantically-Aligned Music Autoencoder},
  author={Julian D. Parker and Zach Evans and CJ Carr and Zack Zukowski and Josiah Taylor and Matthew Rice and Jordi Pons},
  year={2026},
  eprint={2605.18613},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2605.18613}
}
```