{"id":13677718,"url":"https://github.com/Stability-AI/stable-audio-tools","last_synced_at":"2025-04-29T11:31:35.749Z","repository":{"id":199965075,"uuid":"644481540","full_name":"Stability-AI/stable-audio-tools","owner":"Stability-AI","description":"Generative models for conditional audio generation","archived":false,"fork":false,"pushed_at":"2025-03-21T22:07:45.000Z","size":485,"stargazers_count":3032,"open_issues_count":94,"forks_count":305,"subscribers_count":42,"default_branch":"main","last_synced_at":"2025-04-23T21:50:01.595Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Stability-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-23T15:51:04.000Z","updated_at":"2025-04-23T06:19:16.000Z","dependencies_parsed_at":"2024-07-10T22:26:00.933Z","dependency_job_id":"6b2e0c6b-0016-426a-b5f9-c2743d2e35a0","html_url":"https://github.com/Stability-AI/stable-audio-tools","commit_stats":null,"previous_names":["stability-ai/stable-audio-tools","harmonai-org/harmonai-tools"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fstable-audio-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fstable-audio-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fstable-audio-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stabi
lity-AI%2Fstable-audio-tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Stability-AI","download_url":"https://codeload.github.com/Stability-AI/stable-audio-tools/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251493899,"owners_count":21598188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:46.157Z","updated_at":"2025-04-29T11:31:34.792Z","avatar_url":"https://github.com/Stability-AI.png","language":"Python","funding_links":[],"categories":["Official Resources","Python","Open-Source Music Generation Landscape","Repos","Multimodal","6. Generative Media Tools","Music Generation \u0026 AI","排行榜 [2025-03-18]"],"sub_categories":["LoRA Adapters and Quantized Models","1. Audio","Text-to-Speech"],"readme":"# stable-audio-tools\nTraining and inference code for audio generation models\n\n# Install\n\nThe library can be installed from PyPI with:\n```bash\n$ pip install stable-audio-tools\n```\n\nTo run the training scripts or inference code, you'll want to clone this repository, navigate to the root, and run:\n```bash\n$ pip install .\n```\n\n# Requirements\nRequires PyTorch 2.0 or later for Flash Attention support\n\nDevelopment for the repo is done in Python 3.8.10\n\n# Interface\n\nA basic Gradio interface is provided to test out trained models. 
\n\nFor example, to create an interface for the [`stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) model, once you've accepted the terms for the model on Hugging Face, you can run:\n```bash\n$ python3 ./run_gradio.py --pretrained-name stabilityai/stable-audio-open-1.0\n```\n\nThe `run_gradio.py` script accepts the following command line arguments:\n\n- `--pretrained-name`\n  - Hugging Face repository name for a Stable Audio Tools model\n  - Will prioritize `model.safetensors` over `model.ckpt` in the repo\n  - Optional, used in place of `model-config` and `ckpt-path` when using pre-trained model checkpoints on Hugging Face\n- `--model-config`\n  - Path to the model config file for a local model\n- `--ckpt-path`\n  - Path to the unwrapped model checkpoint file for a local model\n- `--pretransform-ckpt-path` \n  - Path to an unwrapped pretransform checkpoint, replaces the pretransform in the model, useful for testing out fine-tuned decoders\n  - Optional\n- `--share`\n  - If true, a publicly shareable link will be created for the Gradio demo\n  - Optional\n- `--username` and `--password`\n  - Used together to set a login for the Gradio demo\n  - Optional\n- `--model-half`\n  - If true, casts the model weights to half-precision\n  - Optional\n\n# Training\n\n## Prerequisites\nBefore starting your training run, you'll need a model config file, as well as a dataset config file. For more information about those, refer to the Configurations section below.\n\nThe training code also requires a Weights \u0026 Biases account to log the training outputs and demos. 
Create an account and log in with:\n```bash\n$ wandb login\n```\n\n## Start training\nTo start a training run, run the `train.py` script in the repo root with:\n```bash\n$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_train\n```\n\nThe `--name` parameter will set the project name for your Weights \u0026 Biases run.\n\n## Training wrappers and model unwrapping\n`stable-audio-tools` uses PyTorch Lightning to facilitate multi-GPU and multi-node training. \n\nWhen a model is being trained, it is wrapped in a \"training wrapper\", which is a `pl.LightningModule` that contains all of the relevant objects needed only for training. That includes things like discriminators for autoencoders, EMA copies of models, and all of the optimizer states.\n\nThe checkpoint files created during training include this training wrapper, which greatly increases the size of the checkpoint file.\n\n`unwrap_model.py` in the repo root will take in a wrapped model checkpoint and save a new checkpoint file including only the model itself.\n\nIt can be run from the repo root with:\n```bash\n$ python3 ./unwrap_model.py --model-config /path/to/model/config --ckpt-path /path/to/wrapped/ckpt --name model_unwrap\n```\n\nUnwrapped model checkpoints are required for:\n  - Inference scripts\n  - Using a model as a pretransform for another model (e.g. using an autoencoder model for latent diffusion)\n  - Fine-tuning a pre-trained model with a modified configuration (i.e. partial initialization)\n\n## Fine-tuning\nFine-tuning a model involves continuing a training run from a pre-trained checkpoint. 
\n\nTo continue a training run from a wrapped model checkpoint, you can pass in the checkpoint path to `train.py` with the `--ckpt-path` flag.\n\nTo start a fresh training run using a pre-trained unwrapped model, you can pass in the unwrapped checkpoint to `train.py` with the `--pretrained-ckpt-path` flag.\n\n## Additional training flags\n\nAdditional optional flags for `train.py` include:\n- `--config-file`\n  - The path to the defaults.ini file in the repo root, required if running `train.py` from a directory other than the repo root\n- `--pretransform-ckpt-path`\n  - Used in various model types such as latent diffusion models to load a pre-trained autoencoder. Requires an unwrapped model checkpoint.\n- `--save-dir`\n  - The directory in which to save the model checkpoints\n- `--checkpoint-every`\n  - The number of steps between saved checkpoints.\n  - *Default*: 10000\n- `--batch-size`\n  - Number of samples per-GPU during training. Should be set as large as your GPU VRAM will allow.\n  - *Default*: 8\n- `--num-gpus`\n  - Number of GPUs per-node to use for training\n  - *Default*: 1\n- `--num-nodes`\n  - Number of GPU nodes being used for training\n  - *Default*: 1\n- `--accum-batches`\n  - Enables and sets the number of batches for gradient batch accumulation. Useful for increasing effective batch size when training on smaller GPUs.\n- `--strategy`\n  - Multi-GPU strategy for distributed training. 
Setting to `deepspeed` will enable DeepSpeed ZeRO Stage 2.\n  - *Default*: `ddp` if `--num-gpus` \u003e 1, else None\n- `--precision`\n  - Floating-point precision to use during training\n  - *Default*: 16\n- `--num-workers`\n  - Number of CPU workers used by the data loader\n- `--seed`\n  - RNG seed for PyTorch, helps with deterministic training\n\n# Configurations\nTraining and inference code for `stable-audio-tools` is based around JSON configuration files that define model hyperparameters, training settings, and information about your training dataset.\n\n## Model config\nThe model config file defines all of the information needed to load a model for training or inference. It also contains the training configuration needed to fine-tune a model or train from scratch.\n\nThe following properties are defined in the top level of the model configuration:\n\n- `model_type`\n  - The type of model being defined, currently limited to one of `\"autoencoder\", \"diffusion_uncond\", \"diffusion_cond\", \"diffusion_cond_inpaint\", \"diffusion_autoencoder\", \"lm\"`.\n- `sample_size`\n  - The length of the audio provided to the model during training, in samples. For diffusion models, this is also the raw audio sample length used for inference.\n- `sample_rate`\n  - The sample rate of the audio provided to the model during training, and generated during inference, in Hz.\n- `audio_channels`\n  - The number of channels of audio provided to the model during training, and generated during inference. Defaults to 2. Set to 1 for mono.\n- `model`\n  - The specific configuration for the model being defined, varies based on `model_type`\n- `training`\n  - The training configuration for the model, varies based on `model_type`. Provides parameters for training as well as demos.\n\n## Dataset config\n`stable-audio-tools` currently supports two kinds of data sources: local directories of audio files, and WebDataset datasets stored in Amazon S3. 
More information can be found in [the dataset config documentation](docs/datasets.md)\n\n# Todo\n- [ ] Add troubleshooting section\n- [ ] Add contribution guidelines \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStability-AI%2Fstable-audio-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FStability-AI%2Fstable-audio-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStability-AI%2Fstable-audio-tools/lists"}