https://github.com/psyb0t/docker-audiolla
Self-hosted audio API in one Docker container. Stem separation, mastering, BPM/key match, fingerprinting, similarity, EQ, sidechain duck, MIDI composition + rendering, MIR analysis, effects chain, loudness normalization. REST + MCP. CPU and CUDA. Drive it from a shell, DAW pipeline, or LLM agent.
- Host: GitHub
- URL: https://github.com/psyb0t/docker-audiolla
- Owner: psyb0t
- License: wtfpl
- Created: 2026-05-31T10:12:56.000Z (9 days ago)
- Default Branch: main
- Last Pushed: 2026-06-03T17:54:38.000Z (5 days ago)
- Last Synced: 2026-06-03T19:08:35.241Z (5 days ago)
- Topics: audio, audio-fingerprinting, bpm-detection, demucs, docker, fastapi, fluidsynth, key-detection, librosa, llm-agents, loudness, mastering, matchering, mcp, midi, midi-generation, music-production, pedalboard, self-hosted, stem-separation
- Language: Python
- Homepage:
- Size: 840 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# audiolla
**Thirty audio engines. One port. Zero cloud. Fire-and-forget async jobs. Webhooks.**
You needed Demucs for stems. Then librosa for BPM and key. Then basic-pitch for MIDI transcription. Then pyannote for speaker diarization. Then DeepFilterNet for speech enhancement. Then you spent three days debugging Python version conflicts and now you hate everything.
audiolla is what happens when you stop doing that.
Every audio processing tool worth using, wrapped in one HTTP API, running in one Docker container. POST a file. Get audio, JSON, or MIDI back. Drive it from curl, shell scripts, Python notebooks, Makefiles, or point an LLM agent at the MCP endpoint and let it rip.
No account. No subscription. No per-minute billing. No vendor lock-in. `docker run` and you're done.
---
## What's in the box
| | |
|--|--|
| **Stem separation** | Demucs: htdemucs, fine-tuned, 6-stem, MDX variants |
| **Mastering** | Reference mastering (matchering) + custom pedalboard chains |
| **Analysis** | BPM · key · LUFS · beats · onsets · melody · structural segments |
| **Chords + key** | Chord detection + Krumhansl-Schmuckler key estimation |
| **Audio → MIDI** | Polyphonic transcription via Spotify's basic-pitch (ONNX, no TF) |
| **Restoration** | De-reverb · de-echo · de-noise via UVR BS-Roformer + MelBand Roformer |
| **Speech** | Enhancement (DeepFilterNet) · VAD (silero-vad) · diarization (pyannote) |
| **Visuals** | Spectrogram + waveform PNGs + 8-mode animated MP4/WebM |
| **Fingerprint** | Chromaprint acoustic fingerprinting (AcoustID-compatible) |
| **Silence** | Detect gaps · trim edges · strip all silence |
| **MIDI pipeline** | Compose from JSON · inspect · transform · render via fluidsynth |
| **Effects** | 23-effect pedalboard chain: Compressor, Reverb, PitchShift, filters… |
| **Transforms** | Sox DSP: pitch, tempo, EQ, reverb, gain |
| **Loudness** | Measure LUFS · normalize to target |
| **HPSS** | Harmonic/percussive source separation via librosa median filter |
| **Noise reduction** | Spectral noise reduction via noisereduce: stationary + adaptive modes |
| **Time-stretch** | Independent tempo factor + pitch shift via librosa phase vocoder |
| **Audio tagging** | Top-K AudioSet class labels via Audio Spectrogram Transformer |
| **Audio embeddings** | 512-dim semantic embeddings via LAION CLAP + optional text similarity |
| **Zero-shot classify** | CLAP cosine similarity against any free-form text labels: genres, moods, instruments |
| **Audio info** | ffprobe metadata: duration, sample rate, channels, codec, bit depth |
| **Trim** | Cut a clip by start/end seconds; any format in, any format out |
| **Mix** | Combine N staged tracks with per-track gain_db; pure ffmpeg, no model |
| **Concat** | Stitch N audio files end-to-end in order |
| **Speed** | Change playback speed without pitch shift (0.1× to 10×) via ffmpeg atempo |
| **Convert** | Re-encode: format, sample rate, channel count in one call |
| **Similar** | Cosine similarity between two audio files via CLAP embeddings |
| **MIDI quantize** | Snap MIDI note timings to a rhythmic grid (16th, 8th, quarter…) |
| **Fade** | Fade-in and/or fade-out with 13 curve shapes |
| **Reverse** | Flip audio backwards |
| **Loop** | Repeat audio N times |
| **BPM match** | Auto-detect BPM then stretch to a target; no manual math |
| **Loudness curve** | RMS envelope over time: time-stamped dB values for gain automation |
| **Pitch correct** | Auto-tune toward nearest chromatic semitone, configurable strength |
| **Repair** | Declip + dehum: fix clipped peaks and remove power-line hum |
| **Loop point** | Find best seamless loop boundary: score, bar count, candidates list |
| **Drum machine** | Step-sequencer spec → GM drum MIDI: 16-step pattern, swing, tempo |
| **Chords to MIDI** | Chord progression → MIDI file: root+3rd+5th voicings per segment |
| **Stereo width** | Widen or collapse the stereo image via M/S processing |
| **Split** | Split into N equal parts or on silence; returns ZIP of segments |
| **Pan** | Position audio in the stereo field (-1 = left, 0 = center, 1 = right) |
| **EQ** | Parametric EQ: JSON array of freq/gain_db/width_hz bands |
| **Key match** | Detect source key then pitch-shift to a target key |
| **Sidechain duck** | Duck music when a trigger track (voice) is loud |
| **Metadata** | Read and write ID3/Vorbis/FLAC/WAV audio tags via mutagen |
| **Clip detect** | Detect digital clipping: count, ratio, peak dBFS |
| **Mid/Side** | Encode L/R → Mid+Side or decode Mid+Side → L/R |
| **Beat slice** | Slice audio at detected beat positions; returns ZIP of segments |
| **Conv reverb** | Convolution reverb via impulse response, with wet_mix control |
| **Transient shaper** | Attack/sustain dual-compressor: punch up drums, cut room tail |
| **Multiband compress** | N-band compressor with zero-phase LR4 crossovers for mastering-grade dynamics |
| **DJ prep** | One call: BPM + key + Camelot wheel position + integrated LUFS |
| **Batch** | Run trim/convert/fade/reverse/speed/eq on staged files in sequence |
| **Presets + pipeline** | Curated YAML workflows (`master-for-spotify`, `podcast-cleanup`, …) + ad-hoc op chaining server-side |
| **Catalog** | `GET /v1/catalog`: machine-readable endpoint list grouped by category for discovery |
| **Async jobs** | Every endpoint supports `async_job=true`: fire-and-forget + webhook callbacks |
---
## Table of Contents
- [Run it](#run-it)
- [Quick start](#quick-start)
- [What it can do](#what-it-can-do)
- [Split stems](#split-stems)
- [Master](#master)
- [Analyze](#analyze)
- [Beats, onsets, melody, segments](#beats-onsets-melody-segments)
- [Silence detection and trimming](#silence-detection-and-trimming)
- [Visualize (spectrogram, waveform, video)](#visualize-spectrogram-waveform-video)
- [Acoustic fingerprint](#acoustic-fingerprint)
- [De-reverb, de-echo, de-noise](#de-reverb-de-echo-de-noise)
- [Audio-to-MIDI transcription](#audio-to-midi-transcription)
- [Neural speech and vocal enhancement](#neural-speech-and-vocal-enhancement)
- [Chord and key detection](#chord-and-key-detection)
- [Voice activity detection](#voice-activity-detection)
- [Speaker diarization](#speaker-diarization)
- [Transform](#transform)
- [Loudness measurement](#loudness-measurement)
- [Loudness curve](#loudness-curve)
- [Loudness normalization](#loudness-normalization)
- [HPSS (harmonic/percussive split)](#hpss-harmonicpercussive-split)
- [Spectral noise reduction](#spectral-noise-reduction)
- [Time-stretch and pitch-shift](#time-stretch-and-pitch-shift)
- [Pitch correct](#pitch-correct)
- [Repair](#repair)
- [Audio tagging](#audio-tagging)
- [Audio embeddings](#audio-embeddings)
- [Zero-shot classification](#zero-shot-classification)
- [Audio info](#audio-info)
- [Trim](#trim)
- [Mix](#mix)
- [Concat](#concat)
- [Speed](#speed)
- [Convert](#convert)
- [Similar](#similar)
- [MIDI quantize](#midi-quantize)
- [Fade](#fade)
- [Reverse](#reverse)
- [Loop](#loop)
- [BPM match](#bpm-match)
- [Stereo width](#stereo-width)
- [Split](#split)
- [Pan](#pan)
- [EQ](#eq)
- [Key match](#key-match)
- [Sidechain duck](#sidechain-duck)
- [Effects chain](#effects-chain)
- [Loop point](#loop-point)
- [Compose MIDI](#compose-midi)
- [Inspect MIDI](#inspect-midi)
- [Transform MIDI](#transform-midi)
- [Render MIDI to audio](#render-midi-to-audio)
- [Generate music from a spec](#generate-music-from-a-spec)
- [Drum pattern](#drum-pattern)
- [Chords to MIDI](#chords-to-midi)
- [Audio metadata tags](#audio-metadata-tags)
- [Clip detection](#clip-detection)
- [Mid/Side encode and decode](#midside-encode-and-decode)
- [Beat slice](#beat-slice)
- [Convolution reverb](#convolution-reverb)
- [Transient shaper](#transient-shaper)
- [Multiband compression](#multiband-compression)
- [DJ prep](#dj-prep)
- [De-ess](#de-ess)
- [Stereo field analysis](#stereo-field-analysis)
- [Audio thumbnail](#audio-thumbnail)
- [MIDI humanize](#midi-humanize)
- [Batch operations](#batch-operations)
- [Async jobs and webhooks](#async-jobs-and-webhooks)
- [Stage files](#stage-files)
- [Remote URLs](#remote-urls)
- [Engines](#engines)
- [Workflows: presets + pipeline](#workflows--presets--pipeline)
- [API catalog](#api-catalog)
- [Endpoints](#endpoints)
- [MCP](#mcp)
- [Configuration](#configuration)
- [What's not in here](#whats-not-in-here)
- [Build & dev](#build--dev)
- [Supply chain](#supply-chain)
- [License](#license)
---
## Run it
```bash
# no GPU
docker run --rm -it \
-v $HOME/.audiolla-data:/data \
-p 8000:8000 \
psyb0t/audiolla:latest
# GPU
docker run --rm -it --gpus all \
-v $HOME/.audiolla-data:/data \
-e AUDIOLLA_DEVICE=cuda \
-p 8000:8000 \
psyb0t/audiolla:latest-cuda
```
Demucs weights prefetch at container startup (for whichever variants are enabled) and cache in `/data/torch_cache/`. First boot downloads them; same `-v` mount next time and they're already there. Other engines (matchering, pedalboard, librosa, sox, fx, midi) have no weights, so they're ready as soon as `/healthz` is green.
---
## Quick start
Once the container is up, this is a complete audio pipeline in six curl commands:
```bash
# rip the vocals out of a track
curl -X POST http://localhost:8000/v1/audio/separate \
-F "file=@song.wav" -F "engine=htdemucs" -F "stems=vocals" \
-o vocals.wav
# what key is it in? what are the chords?
curl -X POST http://localhost:8000/v1/audio/chords -F "file=@song.wav"
# → {"key":"F# minor","key_confidence":0.91,"chords":[{"chord":"F#m","start_sec":0.0,...},...]}
# transcribe that vocal melody to MIDI
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
-F "file=@vocals.wav" -o melody.mid
# render the MIDI back to audio through a SoundFont
curl -X POST http://localhost:8000/v1/midi/render \
-F "file=@melody.mid" -o rendered.wav
# strip background noise from a voice recording
curl -X POST http://localhost:8000/v1/audio/noise-reduce/uvr-denoise \
-F "file=@interview.wav" -o clean.wav
# who's speaking and when?
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
-F "file=@interview.wav"
# → {"num_speakers":2,"segments":[{"speaker":"SPEAKER_00","start_sec":0.5,"end_sec":8.2},...]}
```
Audio in. MIDI out. Chords detected. Speakers identified. De-noised. Re-synthesized. No Python environment to set up. No API keys. No account. Just HTTP.
---
## What it can do
Output defaults to `wav`. Pass `-F "output_format=mp3"` to get mp3 instead (`flac`, `opus`, `aac`, `pcm` also work).
**Input**: every audio endpoint accepts exactly one of:
- `file`: multipart upload (the default in the examples below)
- `file_path`: path inside the `/v1/files` staging area
- `file_url`: remote URL the server fetches (disabled by default; see [Remote URLs](#remote-urls))
**Output**: audio-producing endpoints also accept:
- `output_path`: server writes to `/v1/files/`, returns JSON
- `output_url`: server PUTs to a presigned URL, returns JSON
- neither: raw audio bytes (the default)
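The "exactly one of" input rule can be checked on the client before anything is uploaded. A minimal Python sketch, assuming only the three field names listed above; `pick_input` is a hypothetical helper for illustration and is not part of audiolla:

```python
from typing import Optional, Tuple


def pick_input(file: Optional[str] = None,
               file_path: Optional[str] = None,
               file_url: Optional[str] = None) -> Tuple[str, str]:
    """Return (field_name, value) for the single input source that was given.

    Hypothetical client-side check mirroring the "exactly one of
    file / file_path / file_url" rule, so mistakes fail before any request.
    """
    candidates = (("file", file), ("file_path", file_path), ("file_url", file_url))
    given = [(name, value) for name, value in candidates if value is not None]
    if len(given) != 1:
        names = ", ".join(name for name, _ in given) or "none"
        raise ValueError(f"expected exactly one input source, got: {names}")
    return given[0]


print(pick_input(file_path="recordings/bass.wav"))  # ('file_path', 'recordings/bass.wav')
```

The returned pair maps directly onto one `-F "name=value"` form field in the curl examples.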
### Split stems
```bash
# vocals only
curl -X POST http://localhost:8000/v1/audio/separate \
-F "file=@track.wav" \
-F "engine=htdemucs" \
-F "stems=vocals" \
-o vocals.wav
# all 4 stems as a ZIP
curl -X POST http://localhost:8000/v1/audio/separate \
-F "file=@track.wav" \
-F "engine=htdemucs" \
-o stems.zip
```
### Master
```bash
# match EQ + loudness to a reference track
curl -X POST http://localhost:8000/v1/audio/master \
-F "file=@track.wav" \
-F "mode=reference" \
-F "reference=@ref.wav" \
-o mastered.wav
# run a built-in pedalboard chain (presets: transparent, loud)
curl -X POST http://localhost:8000/v1/audio/master \
-F "file=@track.wav" \
-F "mode=chain" \
-F "preset=loud" \
-o mastered.wav
```
### Analyze
```bash
# returns JSON. features: bpm, key, loudness, duration,
# spectral_centroid, rms, zcr. Omit features= to get them all.
curl -X POST http://localhost:8000/v1/audio/analyze \
-F "file=@track.wav" \
-F "features=bpm" \
-F "features=key" \
-F "features=loudness"
```
### Beats, onsets, melody, segments
```bash
# beat grid: returns bpm + beat timestamps
curl -X POST http://localhost:8000/v1/audio/beats \
-F "file=@track.wav"
# onset timestamps: note attacks, transients
curl -X POST http://localhost:8000/v1/audio/onsets \
-F "file=@track.wav"
# dominant melody contour: pitch in Hz per frame
curl -X POST http://localhost:8000/v1/audio/melody \
-F "file=@track.wav"
# structural segmentation: labels recurring sections A, B, C...
curl -X POST http://localhost:8000/v1/audio/segments \
-F "file=@track.wav" \
-F "num_segments=4"
```
Beat detection also generates a click-track file when `click_track=true`, which is handy for aligning a mix to a grid. Pass `start_bpm=140` to seed the tracker when you already know the rough tempo (faster, more accurate). Melody can be exported as a single-track MIDI file via `as_midi=true`.
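A quick way to sanity-check a beat grid against the reported tempo is to derive BPM from the spacing between beats. This is a sketch of that arithmetic in Python, not audiolla's own tracker; the timestamps are made up and the response field names are not assumed here:

```python
from statistics import median


def bpm_from_beats(beat_times_sec):
    """Estimate tempo as 60 / median inter-beat interval (seconds)."""
    if len(beat_times_sec) < 2:
        raise ValueError("need at least two beat timestamps")
    intervals = [b - a for a, b in zip(beat_times_sec, beat_times_sec[1:])]
    return 60.0 / median(intervals)


# Made-up grid: one beat every 0.5 s.
print(bpm_from_beats([0.0, 0.5, 1.0, 1.5, 2.0]))  # 120.0
```

If the result is roughly double or half the returned `bpm`, the tracker has probably locked onto the wrong metrical level, which is where a `start_bpm` hint helps.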
### Silence detection and trimming
```bash
# find silent gaps in a recording
curl -X POST http://localhost:8000/v1/audio/silence \
-F "file=@track.wav" \
-F "threshold_db=-30" \
-F "min_duration_sec=1.0"
# trim all silence and get a shorter file back
curl -X POST http://localhost:8000/v1/audio/silence \
-F "file=@track.wav" \
-F "threshold_db=-30" \
-F "min_duration_sec=0.5" \
-F "trim_mode=all" \
-o trimmed.wav
# trim only leading/trailing silence, write to staging
curl -X POST http://localhost:8000/v1/audio/silence \
-F "file=@track.wav" \
-F "threshold_db=-40" \
-F "min_duration_sec=0.3" \
-F "trim_mode=edges" \
-F "output_path=processed/trimmed.wav"
```
`trim_mode=edges`: chop leading + trailing silence only. `trim_mode=all`: remove every detected gap (compress a talk recording, tighten a loop). Without `trim_mode`, the response is JSON only: `silent_ranges`, `non_silent_ranges`, `duration`.
### Visualize (spectrogram, waveform, video)
Visual output splits into two sub-namespaces by output type:
```bash
# Static PNG spectrogram (color + scale params)
curl -X POST http://localhost:8000/v1/audio/visualize/image/spectrogram \
-F "file=@track.wav" \
-F "width=1280" \
-F "height=720" \
-o spec.png
# Static PNG waveform (color param)
curl -X POST http://localhost:8000/v1/audio/visualize/image/waveform \
-F "file=@track.wav" \
-F "width=1280" \
-F "height=240" \
-o wave.png
# Animated MP4 spectrum analyser (fps + container params)
curl -X POST http://localhost:8000/v1/audio/visualize/video/spectrum \
-F "file=@track.wav" \
-F "width=1280" \
-F "height=720" \
-F "fps=30" \
-F "container=mp4" \
-o viz.mp4
```
**`/image/spectrogram`**: returns `image/png`. Params: `width`, `height`, `color` (default `intensity`), `scale` (`log`/`lin`).
**`/image/waveform`**: returns `image/png`. Params: `width`, `height`, `color` (default `lime`).
**`/video/{mode}`**: `spectrum` (scrolling FFT), `waves` (oscilloscope), `cqt` (constant-Q transform), `freqs` (bar-graph analyzer), `volume` (VU meter), `vectorscope` (stereo X/Y scope), `phasemeter`, `histogram`. Params: `width`, `height`, `fps`, `container` (`mp4` default, `webm`).
### Acoustic fingerprint
```bash
# Chromaprint fingerprint: identifies a recording regardless of encoding
curl -X POST http://localhost:8000/v1/audio/fingerprint \
-F "file=@track.wav"
# → {"duration": 215.34, "fingerprint": "AQADtEqRRIuQ..."}
# include the raw integer array (for custom similarity scoring)
curl -X POST http://localhost:8000/v1/audio/fingerprint \
-F "file=@track.wav" \
-F "return_raw=true"
```
The base64 fingerprint string is compatible with the [AcoustID](https://acoustid.org) lookup service.
### De-reverb, de-echo, de-noise
AI audio restoration via UVR ecosystem models: BS-Roformer and MelBand Roformer. All three are unified under `POST /v1/audio/restore/{engine}`.
```bash
# Remove room reverb (BS-Roformer, SDR 19+)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-dereverb \
-F "file=@track.wav" \
-o dry.wav
# Remove echo, normal mode
curl -X POST http://localhost:8000/v1/audio/restore/uvr-deecho \
-F "file=@track.wav" -o noecho.wav
# Remove echo, aggressive mode (same engine, harder suppression)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-deecho \
-F "file=@track.wav" \
-F "aggressive=true" \
-o noecho.wav
# Remove broadband background noise with ML (MelBand Roformer, SDR 28)
curl -X POST http://localhost:8000/v1/audio/restore/uvr-denoise \
-F "file=@track.wav" \
-o clean.wav
```
All support `output_format`, `output_path`, `output_url`. For DSP-based noise reduction (no GPU) use `noise-reduce/noise-reduce`.
UVR engines also work through `/v1/audio/separate`: `uvr-vocal-bsr` (BS-Roformer, SDR 13) and `uvr-karaoke` return vocal + instrumental stems like Demucs but often with higher quality.
### Audio-to-MIDI transcription
Polyphonic audio-to-MIDI via Spotify's basic-pitch (ONNX backend, no TensorFlow). Play guitar, hum a melody, record a piano riff, and get a MIDI file back with all the notes.
```bash
# Any audio → MIDI bytes
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
-F "file=@guitar_riff.wav" \
-o riff.mid
# Tune the detection thresholds
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
-F "file=@piano.wav" \
-F "onset_threshold=0.6" \
-F "frame_threshold=0.3" \
-F "minimum_note_length_ms=80" \
-o piano.mid
# Write directly to staging
curl -X POST http://localhost:8000/v1/audio/to_midi/basic-pitch \
-F "file_path=recordings/bass.wav" \
-F "output_path=midi/bass_notes.mid"
# → {"path":"midi/bass_notes.mid","size":...,"engine":"basic-pitch","output_format":"mid"}
```
Optional params: `onset_threshold` (0–1, default 0.5), `frame_threshold` (0–1, default 0.3), `minimum_note_length_ms` (default 58), `minimum_frequency` / `maximum_frequency` (Hz, default unconstrained), `multiple_pitch_bends` (bool, default false), `melodia_trick` (bool, default true; helps with melodic content). Default engine: `basic-pitch`.
The MIDI file can be piped straight into `/v1/midi/inspect` or `/v1/midi/render`, so audio → MIDI → audio is a complete round-trip.
### Neural speech and vocal enhancement
DeepFilterNet DF3: deep learning noise suppression trained on speech. Better than broadband de-noise for voice recordings; more surgical than UVR's de-noise on vocals specifically.
```bash
# Enhance a vocal recording
curl -X POST http://localhost:8000/v1/audio/enhance/deepfilter \
-F "file=@vocal_recording.wav" \
-o enhanced.wav
# Stage the output, mp3
curl -X POST http://localhost:8000/v1/audio/enhance/deepfilter \
-F "file_path=vocals/raw.wav" \
-F "output_format=mp3" \
-F "output_path=vocals/enhanced.mp3"
```
Supports `output_format`, `output_path`, `output_url`.
### Chord and key detection
Krumhansl-Schmuckler key estimation + chroma-template chord segmentation via librosa. No extra deps beyond the librosa stack.
```bash
curl -X POST http://localhost:8000/v1/audio/chords \
-F "file=@track.wav"
# → {
# "key": "C major",
# "key_confidence": 0.87,
# "duration": 183.4,
# "chords": [
# {"chord": "C", "start_sec": 0.0, "end_sec": 2.3, "confidence": 0.91},
# {"chord": "Am", "start_sec": 2.3, "end_sec": 4.6, "confidence": 0.85},
# ...
# ]
# }
# Tune the hop length (lower = finer time resolution)
curl -X POST http://localhost:8000/v1/audio/chords \
-F "file=@track.wav" \
-F "hop_length=256"
```
Optional params: `hop_length` (default 512), `segment_min_duration_sec` (default 0.5; merges very short chord segments).
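The `chords` array is easy to post-process. As a sketch, the Python below totals the seconds spent on each chord, using the response shape shown above; `chord_durations` and the sample payload are illustrative, not part of the API:

```python
from collections import defaultdict


def chord_durations(response):
    """Sum the seconds spent on each chord label in a /v1/audio/chords response."""
    totals = defaultdict(float)
    for seg in response["chords"]:
        totals[seg["chord"]] += seg["end_sec"] - seg["start_sec"]
    return dict(totals)


# Hand-written sample in the documented response shape.
sample = {
    "key": "C major",
    "chords": [
        {"chord": "C", "start_sec": 0.0, "end_sec": 2.0, "confidence": 0.91},
        {"chord": "Am", "start_sec": 2.0, "end_sec": 4.5, "confidence": 0.85},
        {"chord": "C", "start_sec": 4.5, "end_sec": 6.0, "confidence": 0.88},
    ],
}
print(chord_durations(sample))  # {'C': 3.5, 'Am': 2.5}
```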
### Voice activity detection
silero-vad: ONNX-based VAD, fast and accurate on both speech and music. Returns timestamped speech and non-speech segments.
```bash
curl -X POST http://localhost:8000/v1/audio/vad \
-F "file=@interview.wav"
# → {
# "speech_ratio": 0.73,
# "duration": 120.0,
# "threshold": 0.5,
# "speech_segments": [
# {"start_sec": 1.2, "end_sec": 8.4},
# ...
# ],
# "non_speech_segments": [
# {"start_sec": 0.0, "end_sec": 1.2},
# ...
# ]
# }
# Tighter detection
curl -X POST http://localhost:8000/v1/audio/vad \
-F "file=@podcast.wav" \
-F "threshold=0.7" \
-F "min_speech_duration_ms=300" \
-F "min_silence_duration_ms=200"
```
Optional params: `threshold` (0–1, default 0.5), `min_speech_duration_ms` (default 250), `min_silence_duration_ms` (default 100).
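The README does not spell out how `speech_ratio` is computed. A reasonable reading, and only an assumption, is total speech time divided by file duration; the Python sketch below reproduces that from `speech_segments`:

```python
def speech_ratio(speech_segments, duration):
    """Fraction of the file covered by speech segments (assumed definition)."""
    if duration <= 0:
        return 0.0
    total = sum(seg["end_sec"] - seg["start_sec"] for seg in speech_segments)
    return total / duration


# Made-up segments in the documented shape: 7.5 s + 7.5 s of speech in 20 s.
segments = [
    {"start_sec": 1.0, "end_sec": 8.5},
    {"start_sec": 10.0, "end_sec": 17.5},
]
print(speech_ratio(segments, 20.0))  # 0.75
```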
### Speaker diarization
pyannote/speaker-diarization-3.1: state-of-the-art speaker diarization from HuggingFace Hub. Returns per-speaker timestamped segments and speaker count.
> **Note:** This engine requires a HuggingFace account. You must accept the model terms at
> [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
> and then set `HUGGINGFACE_TOKEN` when starting the container. A read-only token with model access is enough.
```bash
docker run ... \
-e HUGGINGFACE_TOKEN=hf_your_token_here \
psyb0t/audiolla:latest
```
```bash
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
-F "file=@interview.wav"
# → {
# "num_speakers": 2,
# "speakers": ["SPEAKER_00", "SPEAKER_01"],
# "duration": 120.0,
# "segments": [
# {"speaker": "SPEAKER_00", "start_sec": 0.5, "end_sec": 8.2, "duration_sec": 7.7},
# {"speaker": "SPEAKER_01", "start_sec": 8.5, "end_sec": 14.1, "duration_sec": 5.6},
# ...
# ]
# }
# Hint the expected speaker count
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
-F "file=@roundtable.wav" \
-F "num_speakers=4"
# Or constrain the range
curl -X POST http://localhost:8000/v1/audio/diarize/pyannote \
-F "file=@panel.wav" \
-F "min_speakers=2" \
-F "max_speakers=6"
```
Optional params: `num_speakers` (exact count hint), `min_speakers`, `max_speakers`.
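A common follow-up is talk time per speaker. A short Python sketch over the documented `segments` shape; `talk_time` and the sample data are illustrative only:

```python
from collections import defaultdict


def talk_time(segments):
    """Return [(speaker, total_seconds)] sorted with the longest talker first."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["duration_sec"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up segments in the documented response shape.
segments = [
    {"speaker": "SPEAKER_00", "start_sec": 0.5, "end_sec": 8.0, "duration_sec": 7.5},
    {"speaker": "SPEAKER_01", "start_sec": 8.5, "end_sec": 14.0, "duration_sec": 5.5},
    {"speaker": "SPEAKER_00", "start_sec": 14.5, "end_sec": 16.5, "duration_sec": 2.0},
]
print(talk_time(segments))  # [('SPEAKER_00', 9.5), ('SPEAKER_01', 5.5)]
```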
### Transform
```bash
# pitch shift up 2 semitones + add reverb, export mp3.
# operations is a JSON array; ops: gain, equalizer, compand, reverb,
# pitch, tempo, rate, channels, trim, pad.
curl -X POST http://localhost:8000/v1/audio/transform \
-F "file=@track.wav" \
-F 'operations=[{"op":"pitch","params":{"n_semitones":2}},{"op":"reverb","params":{"reverberance":50}}]' \
-F "output_format=mp3" \
-o out.mp3
```
### Loudness measurement
```bash
# Measure integrated LUFS: returns JSON, no audio output
curl -X POST http://localhost:8000/v1/audio/loudness \
-F "file=@track.wav"
# → {"loudness_lufs": -18.4}
```
### Loudness curve
RMS envelope over time: returns a list of `{time_sec, rms_db}` points. Useful for generating gain automation curves, finding loud and quiet sections, or visualising dynamic range before mastering.
```bash
# Default hop (512 samples): fine-grained envelope
curl -X POST http://localhost:8000/v1/audio/loudness/curve \
-F "file=@track.wav" | jq '.curve[:5]'
# → [
# {"time_sec": 0.0, "rms_db": -18.4},
# {"time_sec": 0.012, "rms_db": -17.9},
# ...
# ]
# Coarser envelope (2048-sample hop)
curl -X POST http://localhost:8000/v1/audio/loudness/curve \
-F "file=@track.wav" \
-F "hop_length=2048" | jq '{duration, sample_rate, points}'
```
Response fields: `curve` (array of `{time_sec, rms_db}`), `duration` (seconds), `sample_rate`, `points` (total curve length). Optional param: `hop_length` (default 512).
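The spacing between curve points follows from `hop_length / sample_rate`. The Python sketch below works that out and picks the loudest point; it assumes the curve is computed at the `sample_rate` reported in the response, which the README does not state explicitly:

```python
def curve_step_sec(hop_length, sample_rate):
    """Time between consecutive curve points, in seconds."""
    return hop_length / sample_rate


def loudest_point(curve):
    """Return the {time_sec, rms_db} point with the highest RMS level."""
    return max(curve, key=lambda point: point["rms_db"])


# 512-sample hop at 44.1 kHz, matching the 0.012 s step in the example output.
print(round(curve_step_sec(512, 44100), 3))  # 0.012

curve = [{"time_sec": 0.0, "rms_db": -18.4}, {"time_sec": 0.012, "rms_db": -17.9}]
print(loudest_point(curve))  # {'time_sec': 0.012, 'rms_db': -17.9}
```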
### Loudness normalization
```bash
# Normalize to -14 LUFS (streaming platform standard); returns audio
curl -X POST http://localhost:8000/v1/audio/normalize \
-F "file=@track.wav" \
-F "target_lufs=-14" \
-o normalized.wav
# Write to staging, check measured LUFS from header
curl -X POST http://localhost:8000/v1/audio/normalize \
-F "file=@track.wav" \
-F "target_lufs=-23" \
-F "output_path=mastered/norm.wav"
```
`target_lufs` is required. The response carries `X-Loudness-LUFS` with the measured pre-normalization level.
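With the measured level from `X-Loudness-LUFS` and your `target_lufs`, the gain a static normalization would apply is a simple difference. A Python sketch of that arithmetic; whether audiolla applies pure static gain or also limits peaks is not stated here, so treat this as the textbook relationship only:

```python
def gain_to_target_db(measured_lufs, target_lufs):
    """Static gain in dB needed to move measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db):
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


gain_db = gain_to_target_db(measured_lufs=-18.4, target_lufs=-14.0)
print(round(gain_db, 1), round(db_to_linear(gain_db), 2))  # 4.4 1.66
```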
### HPSS (harmonic/percussive split)
Median-filter harmonic/percussive source separation via librosa. Harmonic = tonal content (pitched instruments, pads); percussive = transients (drums, percussion). No ML: pure DSP, fast, no GPU needed.
```bash
# Get both stems in a ZIP
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
-F "file=@track.wav" \
-o stems.zip
# → stems.zip contains harmonic.wav + percussive.wav
# Wider margin = harder separation (more aggressive)
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
-F "file=@track.wav" \
-F "margin=3.0" \
-o stems.zip
# Output to staging
curl -X POST http://localhost:8000/v1/audio/separate/hpss \
-F "file=@track.wav" \
-F "output_path=hpss/stems.zip"
```
Params: `margin` (default 1.0; must be ≥1.0, higher = more aggressive), `kernel_size` (default 31; odd int, median filter width), `output_format` (default `wav`).
### Spectral noise reduction
Noise reduction with two engine options under the same endpoint: pick DSP for fast no-GPU cleanup or ML for higher-quality removal.
```bash
# DSP (noisereduce): no GPU, pure spectral subtraction + Wiener filtering
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
-F "file=@recording.wav" \
-o clean.wav
# Stationary mode: constant hum, hiss, fan noise
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
-F "file=@recording.wav" \
-F "stationary=true" \
-o clean.wav
# Partial reduction: subtle noise floor cleanup
curl -X POST http://localhost:8000/v1/audio/noise-reduce/noise-reduce \
-F "file=@recording.wav" \
-F "prop_decrease=0.5" \
-o clean.wav
# ML (UVR MelBand Roformer, SDR 28): higher quality, GPU-accelerated
curl -X POST http://localhost:8000/v1/audio/noise-reduce/uvr-denoise \
-F "file=@recording.wav" \
-o clean.wav
```
DSP params (only apply to `noise-reduce` engine): `stationary` (bool, default `false`), `prop_decrease` (0–1, default 1.0). Both engines accept `output_format`, `output_path`, `output_url`.
### Time-stretch and pitch-shift
Independent tempo factor and semitone offset via librosa phase vocoder. Slow a track down to learn it; shift a vocal up 3 semitones for a different key; transpose a MIDI melody to a different register first, then render.
```bash
# Slow down to 80% speed, no pitch change
curl -X POST http://localhost:8000/v1/audio/stretch \
-F "file=@track.wav" \
-F "tempo_factor=0.8" \
-o slow.wav
# Shift up 3 semitones, no tempo change
curl -X POST http://localhost:8000/v1/audio/stretch \
-F "file=@vocal.wav" \
-F "pitch_semitones=3" \
-o pitched.wav
# Both: half speed and +6 semitones in one call
curl -X POST http://localhost:8000/v1/audio/stretch \
-F "file=@track.wav" \
-F "tempo_factor=0.5" \
-F "pitch_semitones=6" \
-F "output_format=mp3" \
-o stretched.mp3
```
Params: `tempo_factor` (default 1.0; 0.5 = half speed), `pitch_semitones` (default 0.0; ± semitones), `output_format`, `output_path`.
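To predict what a given pair of parameters does, two formulas are enough: output duration is input duration divided by `tempo_factor`, and a shift of n semitones multiplies frequency by 2^(n/12). A small Python sketch of both (standard music-DSP arithmetic, not code from audiolla):

```python
def stretched_duration(duration_sec, tempo_factor):
    """Output length after time-stretching; 0.5 = half speed = twice as long."""
    if tempo_factor <= 0:
        raise ValueError("tempo_factor must be positive")
    return duration_sec / tempo_factor


def pitch_ratio(semitones):
    """Frequency multiplier for an equal-tempered shift of `semitones`."""
    return 2.0 ** (semitones / 12.0)


print(stretched_duration(120.0, 0.5))  # 240.0
print(round(pitch_ratio(3), 4))        # 1.1892
```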
### Pitch correct
Auto-tune audio toward the nearest chromatic semitone using librosa's phase vocoder. Full `strength=1.0` snaps hard to pitch; lower values blend the corrected and original signal.
```bash
# Hard auto-tune: snap every note to the nearest semitone
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
-F "file=@vocal.wav" \
-o tuned.wav
# Subtle correction: 50% blend
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
-F "file=@vocal.wav" \
-F "strength=0.5" \
-F "output_format=mp3" \
-o tuned.mp3
# Async for long files, staged output
curl -X POST http://localhost:8000/v1/audio/pitch-correct \
-F "file_path=sessions/take1.wav" \
-F "strength=1.0" \
-F "async_job=true" \
-F "output_path=sessions/take1_tuned.wav"
```
Params: `strength` (0.0–1.0, default 1.0), `output_format`, `output_path`, `async_job`, `webhook_url`. Requires `librosa-analyze` engine.
### Repair
Declip clipped peaks and/or remove power-line hum. Declipping uses cubic interpolation to reconstruct flattened waveform tops and bottoms. Dehumming applies a notch filter at `hum_freq` (and harmonics).
```bash
# Declip only (default)
curl -X POST http://localhost:8000/v1/audio/repair \
-F "file=@overdriven.wav" \
-o repaired.wav
# Remove 60 Hz hum (North American power grid)
curl -X POST http://localhost:8000/v1/audio/repair \
-F "file=@recording.wav" \
-F "declip=false" \
-F "dehum=true" \
-F "hum_freq=60.0" \
-o clean.wav
# Both: declip a 50 Hz humming mic recording
curl -X POST http://localhost:8000/v1/audio/repair \
-F "file=@problem_track.wav" \
-F "declip=true" \
-F "dehum=true" \
-F "hum_freq=50.0" \
-F "output_format=flac" \
-o repaired.flac
```
Params: `declip` (bool, default `true`), `dehum` (bool, default `false`), `hum_freq` (Hz, default 50.0), `output_format`, `output_path`, `async_job`, `webhook_url`.
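Mains hum shows up at the fundamental and its integer multiples, which is why the notch is applied at `hum_freq` and harmonics. The Python sketch below lists those frequencies below Nyquist; the number of harmonics audiolla actually filters is not documented here, so `max_harmonics` is a hypothetical parameter:

```python
def hum_harmonics(hum_freq, sample_rate, max_harmonics=5):
    """Frequencies (Hz) of the hum fundamental and its harmonics below Nyquist."""
    nyquist = sample_rate / 2.0
    return [hum_freq * k
            for k in range(1, max_harmonics + 1)
            if hum_freq * k < nyquist]


print(hum_harmonics(50.0, 44100))  # [50.0, 100.0, 150.0, 200.0, 250.0]
```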
### Audio tagging
Top-K AudioSet class label classification via Audio Spectrogram Transformer (MIT/ast-finetuned-audioset-10-10-0.4593). Identifies what's in a recording: music, speech, specific instruments, environmental sounds, etc.
```bash
curl -X POST http://localhost:8000/v1/audio/tag \
-F "file=@recording.wav"
# → {
# "tags": [
# {"label": "Music", "score": 0.94},
# {"label": "Drum", "score": 0.87},
# {"label": "Guitar", "score": 0.71},
# ...
# ],
# "duration": 5.2
# }
# Get top 20 results instead of the default 10
curl -X POST http://localhost:8000/v1/audio/tag \
-F "file=@soundscape.wav" \
-F "top_k=20"
```
Requires the HF model cache. First run downloads the weights to `/data/hf/`. Optional: `top_k` (default 10).
> Run the container once with `-e HF_HUB_OFFLINE=0` and send one request to pull the model down. Subsequent runs use the cache with `HF_HUB_OFFLINE=1`.
### Audio embeddings
512-dimensional L2-normalized audio embeddings via LAION CLAP (laion/larger_clap_music_and_speech). Useful for semantic audio search, similarity scoring, and clustering.
```bash
# Get the embedding vector
curl -X POST http://localhost:8000/v1/audio/embed \
-F "file=@track.wav"
# → {"embedding": [0.032, -0.11, ...], "dim": 512, "norm": 1.0}
# Semantic similarity: how well does the audio match a text description?
curl -X POST http://localhost:8000/v1/audio/embed \
-F "file=@track.wav" \
-F "query_text=energetic rock guitar riff"
# → {"embedding": [...], "dim": 512, "norm": 1.0,
# "query_text": "energetic rock guitar riff", "similarity": 0.73}
```
`similarity` is cosine similarity in [-1, 1]. Requires HF model cache; same first-run download caveat as audio tagging.
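If you store embeddings and compare them yourself, cosine similarity reduces to a dot product once both vectors are L2-normalized (which the returned `norm: 1.0` indicates). A dependency-free Python sketch with toy 2-dimensional vectors standing in for the 512-dimensional ones:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1]; normalizes defensively before the dot product."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Toy vectors, not real CLAP output.
print(round(cosine_similarity([3.0, 4.0], [4.0, 3.0]), 2))  # 0.96
```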
### Zero-shot classification
Given audio and a list of free-form text labels, return cosine similarity scores for each using the existing CLAP model. No extra model download: it uses the same `clap-embed` engine. Works for genres, moods, instruments, sonic descriptors, anything CLAP understands.
```bash
# Genre detection
curl -X POST http://localhost:8000/v1/audio/classify \
-F "file=@track.wav" \
-F 'labels=["jazz", "hip-hop", "classical", "electronic", "rock"]'
# → {"results": [
# {"label": "hip-hop", "score": 0.42},
# {"label": "electronic", "score": 0.38},
# ...
# ]}
# Mood / energy
curl -X POST http://localhost:8000/v1/audio/classify \
-F "file=@track.wav" \
-F 'labels=["energetic", "calm", "melancholic", "aggressive", "uplifting"]'
# Speaker gender
curl -X POST http://localhost:8000/v1/audio/classify \
-F "file=@interview.wav" \
-F 'labels=["male voice", "female voice", "child voice", "multiple speakers"]'
```
Results are sorted by descending score. Scores are cosine similarities in [-1, 1]; higher = more similar. Requires `clap-embed` model cache.
### Audio info
Probe any audio file for metadata without loading it into memory for processing. Uses ffprobe, so it handles any format ffprobe can read.
```bash
curl -X POST http://localhost:8000/v1/audio/info \
-F "file=@track.wav"
# → {
# "size_bytes": 52428800,
# "duration_sec": 297.241,
# "sample_rate": 44100,
# "channels": 2,
# "codec": "pcm_s16le",
# "sample_fmt": "s16",
# "format": "wav",
# "bit_depth": 16,
# "bit_rate": 1411200
# }
# Works on staged files too
curl -X POST http://localhost:8000/v1/audio/info \
-F "file_path=recordings/interview.mp3"
# → {"codec": "mp3", "bit_rate": 192000, ...}
```
### Trim
Cut a precise time range out of any audio file. Common use: extract a chorus, clip a sample, chop a stem at bar boundaries.
```bash
# Extract seconds 30–90 from a track
curl -X POST http://localhost:8000/v1/audio/trim \
-F "file=@track.wav" \
-F "start_sec=30.0" \
-F "end_sec=90.0" \
-o chorus.wav
# Clip a specific beat range, export as mp3
curl -X POST http://localhost:8000/v1/audio/trim \
-F "file=@stem.wav" \
-F "start_sec=0.0" \
-F "end_sec=8.0" \
-F "output_format=mp3" \
-o loop.mp3
# From staged file, write to staging
curl -X POST http://localhost:8000/v1/audio/trim \
-F "file_path=sessions/full.wav" \
-F "start_sec=120.5" \
-F "end_sec=180.0" \
-F "output_path=clips/verse.wav"
```
`start_sec` defaults to 0. `end_sec` is required and must be greater than `start_sec`. Supports all standard `output_format` values.
### Mix
Combine multiple staged or URL-accessible tracks into one. Per-track `gain_db` lets you balance levels before mixing. Useful for bouncing separated stems back together at custom levels, layering synth parts, or combining click-track + music.
```bash
# Mix drums and bass at equal levels
curl -X POST http://localhost:8000/v1/audio/mix \
-F 'tracks=[{"file_path":"stems/drums.wav"},{"file_path":"stems/bass.wav"}]' \
-o rhythm.wav
# Stems at custom levels (drums -3 dB, bass 0 dB, vocals +2 dB)
curl -X POST http://localhost:8000/v1/audio/mix \
-F 'tracks=[
{"file_path":"stems/drums.wav","gain_db":-3},
{"file_path":"stems/bass.wav","gain_db":0},
{"file_path":"stems/vocals.wav","gain_db":2}
]' \
-F "output_format=wav" \
-o custom_mix.wav
# Write to staging
curl -X POST http://localhost:8000/v1/audio/mix \
-F 'tracks=[{"file_path":"stems/harmonic.wav"},{"file_path":"stems/percussive.wav","gain_db":-6}]' \
-F "output_path=mixed/recombined.wav"
```
`tracks` is a required JSON array. Each entry needs `file_path` or `file_url` and an optional `gain_db` (default 0.0). Requires at least 2 tracks. Shorter tracks are padded with silence to match the longest.
### Concat
Stitch N audio files together in order. Handles different sample rates and channel counts automatically (ffmpeg resamples on the fly).
```bash
curl -X POST http://localhost:8000/v1/audio/concat \
-F 'files=[{"file_path":"intro.wav"},{"file_path":"verse.wav"},{"file_path":"outro.wav"}]' \
-o full_track.wav
# output_format and staging also work
curl -X POST http://localhost:8000/v1/audio/concat \
-F 'files=[{"file_path":"a.wav"},{"file_path":"b.wav"}]' \
-F "output_format=mp3" \
-F "output_path=concat/result.mp3"
```
`files` is a required JSON array of `{file_path?, file_url?}` objects. Requires at least 2 entries.
### Speed
Change playback speed without pitch shifting. Useful for auditioning at half/double speed, or creating slow-motion effects. Uses the ffmpeg `atempo` filter, chained for extreme multipliers.
```bash
# Half speed
curl -X POST http://localhost:8000/v1/audio/speed \
-F "file=@track.wav" -F "speed=0.5" -o slow.wav
# Double speed
curl -X POST http://localhost:8000/v1/audio/speed \
-F "file=@track.wav" -F "speed=2.0" -o fast.wav
# 4× speed (chains two atempo=2.0 filters internally)
curl -X POST http://localhost:8000/v1/audio/speed \
-F "file_path=track.wav" -F "speed=4.0" -F "output_format=mp3" -o fast.mp3
```
`speed` is required. Range: 0.1–10.0. Note: this changes duration but not pitch. For pitch-preserving tempo changes use `/v1/audio/stretch`.
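The chaining mentioned above can be sketched as follows. It assumes each `atempo` instance is kept within a factor of 0.5 to 2.0 (the filter's historical per-instance limit), so larger or smaller multipliers are split into several stages whose product equals the requested speed. `atempo_chain` and `atempo_filter` are hypothetical names; audiolla's actual decomposition may differ.

```python
def atempo_chain(speed: float) -> list[float]:
    """Split a speed multiplier into factors that each lie in [0.5, 2.0]."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors = []
    remaining = speed
    while remaining > 2.0:      # peel off 2x stages for fast playback
        factors.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:      # peel off 0.5x stages for slow playback
        factors.append(0.5)
        remaining /= 0.5
    factors.append(remaining)   # whatever is left is already in range
    return factors


def atempo_filter(speed: float) -> str:
    """Render the chain as an ffmpeg audio filter string."""
    return ",".join(f"atempo={f:g}" for f in atempo_chain(speed))
```

Under these assumptions `atempo_filter(4.0)` yields `atempo=2,atempo=2`, matching the 4× example.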
### Convert
Re-encode audio to a different format, sample rate, or channel count in a single call.
```bash
# WAV → 16 kHz mono FLAC (for speech models)
curl -X POST http://localhost:8000/v1/audio/convert \
-F "file=@recording.wav" \
-F "output_format=flac" \
-F "sample_rate=16000" \
-F "channels=1" \
-o prepared.flac
# Stereo → mono WAV
curl -X POST http://localhost:8000/v1/audio/convert \
-F "file_path=stereo.wav" \
-F "channels=1" \
-o mono.wav
# Any format → Opus at 48 kHz
curl -X POST http://localhost:8000/v1/audio/convert \
-F "file=@audio.mp3" \
-F "output_format=opus" \
-F "sample_rate=48000" \
-o out.opus
```
`output_format` defaults to `wav`. `sample_rate` and `channels` are optional; if omitted, the source values are preserved.
### Similar
Compute cosine similarity between two audio files using CLAP embeddings. Returns a score in [-1, 1]: 1 = identical sound, 0 = unrelated, negative = acoustically opposite. Useful for duplicate detection, cover matching, or finding the closest sample in a library.
```bash
curl -X POST http://localhost:8000/v1/audio/similar \
-F "file=@original.wav" \
-F "reference_file=@remix.wav"
# → {"similarity": 0.847, "dim": 512}
# Using staged files
curl -X POST http://localhost:8000/v1/audio/similar \
-F "file_path=stems/vocals.wav" \
-F "reference_file_path=stems/vocals_ref.wav"
```
Primary file: `file` / `file_path` / `file_url`. Reference file: `reference_file` / `reference_file_path` / `reference_file_url`. Requires `clap-embed` engine.
### MIDI quantize
Snap all note timings in a MIDI file to the nearest rhythmic grid. Cleaner dedicated endpoint than `/v1/midi/transform`'s `quantize_grid_beats` param.
```bash
# Quantize to 16th notes (0.25 beats)
curl -X POST http://localhost:8000/v1/midi/quantize \
-F "file=@sloppy.mid" \
-F "grid_beats=0.25" \
-o tight.mid
# 8th note grid
curl -X POST http://localhost:8000/v1/midi/quantize \
-F "file_path=recorded.mid" \
-F "grid_beats=0.5" \
-F "output_path=midi/quantized.mid"
```
`grid_beats`: grid size in beats. `0.25` = 16th note, `0.5` = 8th, `1.0` = quarter note. Default: `0.25`.
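Snapping to a grid amounts to rounding each note start to the nearest multiple of `grid_beats`. The sketch below shows that arithmetic only; the function name `quantize_beats` is hypothetical, and the tie-breaking rule (halves round up) and the choice to leave note durations untouched are assumptions, not documented server behaviour.

```python
import math


def quantize_beats(start_beats: float, grid_beats: float = 0.25) -> float:
    """Snap a note start (in beats) to the nearest grid line."""
    if grid_beats <= 0:
        raise ValueError("grid_beats must be positive")
    # floor(x + 0.5) rounds halves up, unlike Python's round() which
    # rounds halves to the nearest even integer.
    return math.floor(start_beats / grid_beats + 0.5) * grid_beats
```

With the default 16th-note grid, a note played slightly late at beat 0.27 lands on 0.25.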
### Fade
Apply fade-in, fade-out, or both. 13 curve shapes: `tri`, `qsin`, `esin`, `hsin`, `log`, `ipar`, `qua`, `cub`, `squ`, `cbr`, `par`, `exp`, `lin`.
```bash
# 2s fade-in
curl -X POST http://localhost:8000/v1/audio/fade \
-F "file=@track.wav" -F "fade_in=2.0" -o faded.wav
# 3s fade-out with exponential curve
curl -X POST http://localhost:8000/v1/audio/fade \
-F "file=@track.wav" -F "fade_out=3.0" -F "curve=exp" -o faded.wav
# Both: 1s in, 2s out
curl -X POST http://localhost:8000/v1/audio/fade \
-F "file=@track.wav" -F "fade_in=1.0" -F "fade_out=2.0" -o faded.wav
```
At least one of `fade_in` / `fade_out` must be > 0.
### Reverse
Flip audio backwards via ffmpeg `areverse`.
```bash
curl -X POST http://localhost:8000/v1/audio/reverse \
-F "file=@sample.wav" -o reversed.wav
curl -X POST http://localhost:8000/v1/audio/reverse \
-F "file_path=stems/vocals.wav" -F "output_format=mp3" -o reversed.mp3
```
### Loop
Repeat audio N times. Uses the ffmpeg `aloop` filter, so there is no re-encoding overhead per iteration.
```bash
# Play 4 times total
curl -X POST http://localhost:8000/v1/audio/loop \
-F "file=@beat.wav" -F "count=4" -o looped.wav
# 8-bar loop → 32 bars
curl -X POST http://localhost:8000/v1/audio/loop \
-F "file_path=stems/drums.wav" -F "count=4" -F "output_path=loops/drums32.wav"
```
`count` must be ≥ 2 (total plays, not extra loops).
### BPM match
Detect the source BPM via librosa, then time-stretch to the target. No manual math.
```bash
# Stretch anything to 128 BPM
curl -X POST http://localhost:8000/v1/audio/bpm-match \
-F "file=@loop.wav" -F "target_bpm=128" -o matched.wav
# Match tempo and also shift pitch
curl -X POST http://localhost:8000/v1/audio/bpm-match \
-F "file=@loop.wav" \
-F "target_bpm=140" \
-F "pitch_semitones=2" \
-o matched.wav
```
Response includes `X-Source-BPM`, `X-Target-BPM`, and `X-Tempo-Factor` headers (also in JSON when `output_path` is used). Requires both `librosa-analyze` and `stretch` engines.
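The "manual math" this endpoint saves you is small but easy to get backwards. A sketch, assuming the tempo factor is defined as target over source (the exact meaning of `X-Tempo-Factor` is not spelled out here, so treat that as an assumption); both function names are hypothetical.

```python
def tempo_factor(source_bpm: float, target_bpm: float) -> float:
    """Playback-rate multiplier needed to move from source to target tempo."""
    if source_bpm <= 0 or target_bpm <= 0:
        raise ValueError("BPM values must be positive")
    return target_bpm / source_bpm


def stretched_duration(duration_sec: float, factor: float) -> float:
    """A faster tempo (factor > 1) shortens the clip by the same ratio."""
    return duration_sec / factor
```

For example, a 100 BPM loop matched to 128 BPM needs a factor of 1.28 and ends up about 22% shorter.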
### Stereo width
Widen or collapse the stereo image via M/S processing. `width=0.0` → mono, `1.0` → original, `>1.0` → wider. Works on mono input too (upmixes first).
```bash
# Widen to 1.5×
curl -X POST http://localhost:8000/v1/audio/stereo-width \
-F "file=@mix.wav" -F "width=1.5" -o wide.wav
# Collapse to mono
curl -X POST http://localhost:8000/v1/audio/stereo-width \
-F "file=@mix.wav" -F "width=0.0" -o mono.wav
# Subtle narrowing for mix bus
curl -X POST http://localhost:8000/v1/audio/stereo-width \
-F "file_path=master/mix.wav" -F "width=0.8" -F "output_path=master/narrow.wav"
```
Range: `[0.0, 3.0]`.
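M/S width control is usually done by scaling the side signal before decoding back to left/right. The sketch below shows that textbook form on plain Python lists; `apply_width` is a hypothetical name and the server's exact gain staging is not documented here.

```python
def apply_width(left: list[float], right: list[float], width: float):
    """Scale the side (L-R) component by `width` and rebuild L/R."""
    if not 0.0 <= width <= 3.0:
        raise ValueError("width must be in [0.0, 3.0]")
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = (l + r) / 2.0    # what both channels share
        side = (l - r) / 2.0   # what differs between them
        out_left.append(mid + width * side)
        out_right.append(mid - width * side)
    return out_left, out_right
```

At `width=0.0` both outputs equal the mid signal (mono); at `width=1.0` the input comes back unchanged.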
### Split
Split a file into segments. Two modes: `equal` (N equal time parts) or `silence` (split on quiet gaps). Returns a ZIP of numbered files.
```bash
# Split into 4 equal parts
curl -X POST http://localhost:8000/v1/audio/split \
-F "file=@track.wav" -F "mode=equal" -F "count=4" -o segments.zip
# Split a DJ mix on silence
curl -X POST http://localhost:8000/v1/audio/split \
-F "file=@djmix.wav" \
-F "mode=silence" \
-F "threshold_db=-40" \
-F "min_duration_sec=1.0" \
-o tracks.zip
# Split to mp3
curl -X POST http://localhost:8000/v1/audio/split \
-F "file=@album.flac" -F "mode=equal" -F "count=10" -F "output_format=mp3" -o parts.zip
```
`mode=equal` requires `count >= 2`. `mode=silence` uses `threshold_db` (default -30) and `min_duration_sec` (default 0.5); requires the `silence-detect` engine.
### Pan
Position audio in the stereo field. Works on mono and stereo input.
```bash
# Hard left
curl -X POST http://localhost:8000/v1/audio/pan \
-F "file=@vocal.wav" -F "position=-1.0" -o left.wav
# Slight right (e.g. guitar in mix)
curl -X POST http://localhost:8000/v1/audio/pan \
-F "file_path=stems/guitar.wav" -F "position=0.4" -o guitar_panned.wav
# Center (no-op but valid)
curl -X POST http://localhost:8000/v1/audio/pan \
-F "file=@mono.wav" -F "position=0.0" -o stereo.wav
```
`position`: -1.0 = hard left, 0.0 = center, 1.0 = hard right.
### EQ
Parametric EQ via ffmpeg `equalizer` filter. Pass any number of bands, each with a center frequency, gain, and optional bandwidth.
```bash
# Low-cut + presence boost
curl -X POST http://localhost:8000/v1/audio/eq \
-F "file=@vocal.wav" \
-F 'bands=[{"freq":100,"gain_db":-6,"width_hz":80},{"freq":3000,"gain_db":3,"width_hz":500}]' \
-o eq.wav
# Single band: cut 60 Hz hum
curl -X POST http://localhost:8000/v1/audio/eq \
-F "file=@recording.wav" \
-F 'bands=[{"freq":60,"gain_db":-20,"width_hz":30}]' \
-o clean.wav
```
Each band: `freq` (Hz, required), `gain_db` (dB, required, range ±30), `width_hz` (optional, default 100).
### Key match
Detect the source key via the `chord-detect` engine, then pitch-shift to a target key. One call instead of two.
```bash
# Shift everything to C major
curl -X POST http://localhost:8000/v1/audio/key-match \
-F "file=@loop.wav" -F "target_key=C" -o matched.wav
# Match to F# (response includes source_key + semitones shifted)
curl -X POST http://localhost:8000/v1/audio/key-match \
-F "file_path=stems/melody.wav" \
-F "target_key=F#" \
-F "output_path=matched/melody_fsharp.wav"
```
`target_key`: root note, e.g. `C`, `F#`, `Bb`, `D#`. Mode suffix (`major`/`minor`/`m`) is ignored β only the root matters for pitch. Requires `chord-detect` and `stretch` engines.
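Since only the root matters, the shift is just the distance between two pitch classes. The sketch below picks the smallest shift in either direction, which is an assumption about how audiolla chooses between, say, +9 and -3 semitones; `semitone_shift` is a hypothetical name.

```python
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10,
    "B": 11,
}


def semitone_shift(source_root: str, target_root: str) -> int:
    """Smallest signed semitone shift from source root to target root."""
    diff = (PITCH_CLASS[target_root] - PITCH_CLASS[source_root]) % 12
    # Prefer shifting down by up to 5 over shifting up by 7 or more.
    return diff - 12 if diff > 6 else diff
```

A loop detected in A and matched to C would shift up 3 semitones under this rule; C to A would shift down 3.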
### Sidechain duck
Duck a primary track (music) whenever a trigger track (voice) is loud: the classic voiceover-over-music effect. Pure ffmpeg `sidechaincompress`, no model required.
```bash
curl -X POST http://localhost:8000/v1/audio/sidechain-duck \
-F "file=@music.wav" \
-F "trigger_file=@voice.wav" \
-F "threshold_db=-20" \
-F "ratio=4" \
-F "attack_ms=10" \
-F "release_ms=200" \
-o ducked.wav
# Aggressive duck for podcast-style music bed
curl -X POST http://localhost:8000/v1/audio/sidechain-duck \
-F "file_path=music/bed.wav" \
-F "trigger_file_path=voice/narration.wav" \
-F "threshold_db=-30" \
-F "ratio=10" \
-F "release_ms=400" \
  -F "output_path=final/mix.wav"
```
Primary track is compressed whenever the trigger exceeds `threshold_db`. `ratio` sets compression intensity. Files must be the same duration for best results; shorter trigger is padded with silence.
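To get a feel for how `threshold_db` and `ratio` interact, the standard static compressor curve can be sketched as below. This ignores attack, release, and any knee smoothing, and ffmpeg's actual gain computer may differ, so read it as a rough guide only; `gain_reduction_db` is a hypothetical name.

```python
def gain_reduction_db(trigger_db: float, threshold_db: float, ratio: float) -> float:
    """dB removed from the primary track for a given trigger level."""
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    over = trigger_db - threshold_db
    if over <= 0.0:
        return 0.0                      # trigger below threshold: no ducking
    # Above threshold, every `ratio` dB of input yields 1 dB of output.
    return over * (1.0 - 1.0 / ratio)
```

With `threshold_db=-20` and `ratio=4`, a voice peaking at -10 dB would pull the music down by roughly 7.5 dB once the attack has settled.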
### Effects chain
Apply an ordered chain of pedalboard effects: full catalog, you pick the order and params. Different from `/v1/audio/master` (which runs preset mastering chains).
```bash
# Compress, then add reverb, then drop -3 dB
curl -X POST http://localhost:8000/v1/audio/fx \
-F "file=@track.wav" \
-F 'effects=[
{"type":"Compressor","params":{"threshold_db":-18,"ratio":4.0}},
{"type":"Reverb","params":{"room_size":0.5,"wet_level":0.3}},
{"type":"Gain","params":{"gain_db":-3.0}}
]' \
-o out.wav
```
Allowed effects: `Compressor`, `Limiter`, `NoiseGate`, `Gain`, `Clipping`, `Distortion`, `Bitcrush`, `Reverb`, `Chorus`, `Delay`, `Phaser`, `PitchShift`, `HighShelfFilter`, `LowShelfFilter`, `PeakFilter`, `HighpassFilter`, `LowpassFilter`, `LadderFilter`, `IIRFilter`, `GSMFullRateCompressor`, `MP3Compressor`, `Resample`, `Invert`, `Convolution`.
VST3 / AudioUnit / external plugins are NOT in the allowlist because they load arbitrary native code.
### Loop point
Find the best seamless loop boundary in an audio file. audiolla analyses the beat grid and returns the start and end positions where a loop will repeat without a click or gap.
```bash
# Find best loop boundary (default: minimum 4 bars)
curl -X POST http://localhost:8000/v1/audio/loop-point \
-F "file=@beat.wav" | jq '{loop_start_sec, loop_end_sec, bars, score, tempo_bpm}'
# → {"loop_start_sec": 0.0, "loop_end_sec": 7.44, "bars": 4,
# "score": 0.94, "tempo_bpm": 128.0, "candidates": [...]}
# Require at least 8 bars, return top 3 candidates
curl -X POST http://localhost:8000/v1/audio/loop-point \
-F "file=@long_track.wav" \
-F "min_loop_bars=8" \
-F "num_candidates=3"
```
Response fields: `loop_start_sec`, `loop_end_sec`, `bars`, `score` (0–1, higher = tighter loop), `tempo_bpm`, `candidates` (array of ranked alternatives). Optional params: `min_loop_bars` (default 4), `num_candidates` (default 5). Requires `librosa-analyze` engine.
### Compose MIDI
POST a JSON song spec, get Standard MIDI File bytes back. Write the spec by hand, generate it from a tracker / DAW / sequencer, script it out of a Python notebook, or have an LLM produce it; audiolla doesn't care. No AI runs server-side; the spec is the music.
```bash
# 4-beat C major arpeggio at 120 BPM, piano + kick drum
curl -X POST http://localhost:8000/v1/midi/compose \
-H 'Content-Type: application/json' \
-d '{
"tempo_bpm": 120,
"tracks": [
{"name":"Lead","program":0,"channel":0,"notes":[
{"pitch":60,"start_beats":0.0,"duration_beats":0.5,"velocity":100},
{"pitch":64,"start_beats":0.5,"duration_beats":0.5,"velocity":100},
{"pitch":67,"start_beats":1.0,"duration_beats":0.5,"velocity":100},
{"pitch":72,"start_beats":1.5,"duration_beats":0.5,"velocity":100}
]},
{"name":"Kick","program":0,"channel":9,"notes":[
{"pitch":36,"start_beats":0.0,"duration_beats":0.1,"velocity":110},
{"pitch":36,"start_beats":1.0,"duration_beats":0.1,"velocity":110},
{"pitch":36,"start_beats":2.0,"duration_beats":0.1,"velocity":110},
{"pitch":36,"start_beats":3.0,"duration_beats":0.1,"velocity":110}
]}
]
}' \
-o song.mid
# Stage the MIDI for later via query-string output_path
curl -X POST 'http://localhost:8000/v1/midi/compose?output_path=midi/song.mid' \
-H 'Content-Type: application/json' \
-d @spec.json
```
Spec fields: `tempo_bpm` (default 120), `time_signature` (default `[4,4]`), `key_signature` (optional, e.g. `"C"`, `"Am"`), `ticks_per_beat` (default 480), `tracks[].{name, program, channel, volume, pan, notes[].{pitch, start_beats, duration_beats, velocity}}`. Time is in beats. `program` is GM program 0-127. Channel 9 is the GM drum channel; pitches there map to the drum kit (36 = kick, 38 = snare, 42 = closed hi-hat, etc.).
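Because the spec is expressed in beats, it helps to see how beats relate to MIDI ticks and wall-clock time. A minimal sketch, assuming a beat is a quarter note and the tempo is constant; the helper names are hypothetical and the server's own rounding is not documented here.

```python
def beats_to_ticks(beats: float, ticks_per_beat: int = 480) -> int:
    """Position in MIDI ticks for a position given in beats."""
    return round(beats * ticks_per_beat)


def beats_to_seconds(beats: float, tempo_bpm: float = 120.0) -> float:
    """Wall-clock time for a position in beats at a constant tempo."""
    if tempo_bpm <= 0:
        raise ValueError("tempo_bpm must be positive")
    return beats * 60.0 / tempo_bpm
```

At the defaults, the arpeggio's half-beat notes are 240 ticks long and the four-beat kick pattern spans 2 seconds.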
### Inspect MIDI
```bash
# read the structure of any Standard MIDI File
curl -X POST http://localhost:8000/v1/midi/inspect \
-F "file=@song.mid"
# → {type, ticks_per_beat, tempo_changes, time_signatures,
# tracks[{name, note_on_count, channels, programs, length_beats}], ...}
```
### Transform MIDI
```bash
# transpose all non-drum tracks up an octave
curl -X POST http://localhost:8000/v1/midi/transform \
-F "file=@song.mid" \
-F "transpose_semitones=12" \
-o transposed.mid
# override tempo to 140 BPM and save to staging
curl -X POST http://localhost:8000/v1/midi/transform \
-F "file=@song.mid" \
-F "tempo_bpm=140" \
-F "output_path=midi/fast.mid"
# drop the drum track (channel 9)
curl -X POST http://localhost:8000/v1/midi/transform \
-F "file=@song.mid" \
-F "drop_channels=9" \
-o no-drums.mid
# keep only channels 0 and 1 (comma-separated)
curl -X POST http://localhost:8000/v1/midi/transform \
-F "file=@song.mid" \
-F "keep_channels=0,1" \
-o two-ch.mid
# quantize to 1/16th notes
curl -X POST http://localhost:8000/v1/midi/transform \
-F "file=@song.mid" \
-F "quantize_grid_beats=0.25" \
-o quantized.mid
```
`transpose_semitones` accepts ±48. `quantize_grid_beats` is in beats (0.25 = 1/16th at 4/4). `keep_channels` and `drop_channels` take comma-separated channel numbers (`0,1,2`); only one can be set per request.
### Render MIDI to audio
```bash
# Synthesise via the bundled FluidR3_GM SoundFont
curl -X POST http://localhost:8000/v1/midi/render \
-F "file=@song.mid" \
-F "output_format=wav" \
-o song.wav
# Use your own SoundFont (must be staged first)
curl -X PUT http://localhost:8000/v1/files/sf/orchestral.sf2 --data-binary @my.sf2
curl -X POST http://localhost:8000/v1/midi/render \
-F "file=@song.mid" \
-F "soundfont_path=sf/orchestral.sf2" \
-F "output_format=flac" \
-o orch.flac
```
### Generate music from a spec
Compose + render in one call: spec in, WAV out.
```bash
curl -X POST 'http://localhost:8000/v1/midi/generate?output_format=wav' \
-H 'Content-Type: application/json' \
-d @spec.json \
-o song.wav
```
### Drum pattern
Step-sequencer spec → GM drum MIDI. Define a rhythmic pattern as arrays of 0/1 step values for each drum voice; the server maps them to GM channel 9 pitches and bakes a MIDI file. Optional swing shifts even-numbered 16th steps for a shuffled feel.
```bash
# 4-on-the-floor kick, snare on 2&4, busy hi-hat; 2 bars at 120 BPM
curl -X POST http://localhost:8000/v1/midi/drum \
-H "Content-Type: application/json" \
-d '{
"tempo_bpm": 120,
"steps": 16,
"bars": 2,
"swing": 0.0,
"pattern": {
"kick": [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
"snare": [0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0],
"hihat": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
}
}' \
-o beat.mid
# Swing groove: 0.1 = subtle, 0.5 = strong shuffle
curl -X POST http://localhost:8000/v1/midi/drum \
-H "Content-Type: application/json" \
-d '{
"tempo_bpm": 95,
"steps": 16,
"bars": 1,
"swing": 0.2,
"pattern": {
"kick": [1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0],
"snare": [0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0],
"hihat": [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]
}
}' \
-o groove.mid
```
Body fields: `tempo_bpm` (default 120), `steps` (steps per bar, default 16), `bars` (default 1), `swing` (0.0–0.5, default 0.0), `pattern` (object; keys are drum voice names, values are arrays of 0/1). Supported voices: `kick`, `snare`, `hihat`, `open_hihat`, `ride`, `crash`, `clap`, `tom_hi`, `tom_mid`, `tom_low`, `rim`, `cowbell`. Requires `midi-compose` engine.
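How a step grid turns into note positions can be sketched as follows. Several things here are assumptions rather than documented behaviour: a bar is 4 beats, swing delays every second step (the 2nd, 4th, ... counting from 1) by `swing` times one step length, and only the three pitches named earlier (36, 38, 42) are mapped. `pattern_to_notes` is a hypothetical helper.

```python
GM_DRUMS = {"kick": 36, "snare": 38, "hihat": 42}  # subset named in the text


def pattern_to_notes(pattern, steps=16, bars=1, swing=0.0, beats_per_bar=4.0):
    """Expand 0/1 step arrays into channel-9 note dicts with start_beats."""
    step_beats = beats_per_bar / steps
    notes = []
    for bar in range(bars):
        for voice, hits in pattern.items():
            for i, hit in enumerate(hits[:steps]):
                if not hit:
                    continue
                start = bar * beats_per_bar + i * step_beats
                if i % 2 == 1:                    # off-beat step: apply swing
                    start += swing * step_beats
                notes.append(
                    {"pitch": GM_DRUMS[voice], "start_beats": start, "channel": 9}
                )
    return sorted(notes, key=lambda n: (n["start_beats"], n["pitch"]))
```

The resulting note list has the same shape as `tracks[].notes` in the compose spec, minus `duration_beats` and `velocity`.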
### Chords to MIDI
Detect the chord progression from an audio file and convert each segment to a MIDI chord (root + 3rd + 5th). Useful for exporting a detected chord chart as playable MIDI, re-harmonising an arrangement, or seeding a DAW session.
```bash
# Audio → chord MIDI at the detected tempo
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
-F "file=@track.wav" \
-o chords.mid
# Override tempo, set velocity and octave
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
-F "file=@song.wav" \
-F "tempo_bpm=120" \
-F "velocity=90" \
-F "octave=3" \
-o chords.mid
# Stage the output
curl -X POST http://localhost:8000/v1/audio/chords-to-midi \
-F "file_path=sessions/song.wav" \
-F "output_path=midi/song_chords.mid"
```
Optional params: `tempo_bpm` (default: detected from audio), `velocity` (1–127, default 80), `octave` (0–8, default 4), `output_path`. Requires `chord-detect` engine. Each chord segment becomes a MIDI chord event (root + major 3rd/minor 3rd + perfect 5th, duration = segment length).
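The triad construction described above can be sketched as below. It assumes chord labels look like `C` or `Am` and that octave 4 puts C at MIDI pitch 60; neither convention is confirmed for the `chord-detect` engine, and `chord_pitches` is a hypothetical name.

```python
ROOTS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10,
    "B": 11,
}


def chord_pitches(chord: str, octave: int = 4) -> list[int]:
    """MIDI pitches for a major or minor triad: root, 3rd, perfect 5th."""
    minor = chord.endswith("m")
    root = chord[:-1] if minor else chord
    base = 12 * (octave + 1) + ROOTS[root]   # C4 = 60 under this convention
    third = 3 if minor else 4                # minor 3rd vs major 3rd
    return [base, base + third, base + 7]
```

Under these assumptions a detected `Am` segment at the default octave becomes the notes 69, 72, and 76.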
### Audio metadata tags
Read and write ID3 (MP3), Vorbis (OGG/FLAC), and WAV/M4A tags via mutagen. Requires the `metadata` engine.
```bash
# Read tags
curl -X POST http://localhost:8000/v1/audio/metadata \
-F "file=@track.mp3" | jq '{title, artist, bpm, key, duration_sec}'
# Write tags; returns updated tag set
curl -X POST http://localhost:8000/v1/audio/metadata \
-F "file=@track.mp3" \
-F 'tags={"title":"My Track","artist":"DJ Audiolla","bpm":"128","year":"2026"}'
```
### Clip detection
Detect digital clipping. No engine required: pure numpy arithmetic.
```bash
curl -X POST http://localhost:8000/v1/audio/clip-detect \
-F "file=@loud_master.wav" | jq '{clipped, clip_count, clip_ratio, peak_db}'
# → {"clipped":true,"clip_count":4219,"clip_ratio":0.0048,"peak_db":0.0}
```
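The response fields map onto very simple arithmetic. The sketch below uses plain Python rather than numpy and assumes samples are floats normalized to [-1, 1] with clipping counted at or above a 0.999 threshold; both the threshold and the name `detect_clipping` are illustrative, not taken from the server.

```python
import math


def detect_clipping(samples: list[float], threshold: float = 0.999) -> dict:
    """Count samples at or above full scale and report the peak in dBFS."""
    if not samples:
        raise ValueError("no samples")
    peak = max(abs(s) for s in samples)
    clip_count = sum(1 for s in samples if abs(s) >= threshold)
    return {
        "clipped": clip_count > 0,
        "clip_count": clip_count,
        "clip_ratio": clip_count / len(samples),
        "peak_db": 20.0 * math.log10(peak) if peak > 0 else float("-inf"),
    }
```

A `peak_db` of 0.0 means at least one sample sits exactly at full scale.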
### Mid/Side encode and decode
Encode L/R stereo to Mid+Side or decode back. Useful for stereo width surgery without touching the pedalboard chain.
```bash
# Encode L/R → M/S
curl -X POST http://localhost:8000/v1/audio/mid-side \
-F "file=@stereo.wav" \
-F "mode=encode" \
-o ms_encoded.wav
# Decode back to L/R
curl -X POST http://localhost:8000/v1/audio/mid-side \
-F "file=@ms_encoded.wav" \
-F "mode=decode" \
-o restored.wav
```
### Beat slice
Detect beat positions with librosa and return a ZIP of numbered WAV/MP3 slices, one file per beat interval.
```bash
curl -X POST http://localhost:8000/v1/audio/beat-slice \
-F "file=@loop.wav" \
-F "output_format=wav" \
-o slices.zip
# → slices.zip: beat_001.wav, beat_002.wav, beat_003.wav …
# With output_path: stages the ZIP and returns JSON
curl -X POST http://localhost:8000/v1/audio/beat-slice \
-F "file=@loop.wav" \
-F "output_path=beats/loop_slices.zip"
# → {"path":"beats/loop_slices.zip","beat_count":32,...}
```
### Convolution reverb
Apply an impulse response (IR) to audio via pedalboard's `Convolution`. Any WAV file can be used as the IR.
```bash
# Upload your IR first
curl -X PUT http://localhost:8000/v1/files/ir/plate.wav --data-binary @plate_reverb.wav
# Apply; wet_mix: 0.0=dry only, 1.0=wet only
curl -X POST http://localhost:8000/v1/audio/conv-reverb \
-F "file=@dry_vocal.wav" \
-F "ir_file_path=ir/plate.wav" \
-F "wet_mix=0.25" \
-F "output_format=wav" \
-o reverbed.wav
```
### Transient shaper
Attack/sustain dual-compressor blending. Positive `attack_gain_db` makes drums punchier; negative `sustain_gain_db` cuts room tail.
```bash
# Punchy drums: boost attack, cut sustain
curl -X POST http://localhost:8000/v1/audio/transient \
-F "file=@drums.wav" \
-F "attack_gain_db=6" \
-F "sustain_gain_db=-4" \
-o punchy_drums.wav
# Soft attack (pad-like)
curl -X POST http://localhost:8000/v1/audio/transient \
-F "file=@synth.wav" \
-F "attack_gain_db=-6" \
-F "sustain_gain_db=0" \
-o softened.wav
```
### Multiband compression
Split the signal into N+1 frequency bands and compress each one independently. Bands are split with zero-phase LR4-equivalent crossovers, so a bypassed chain reconstructs the original. Mastering-engineer staple: tame bass thump without squashing vocal sibilance, level out a busy mid-range, etc.
```bash
# 3-band mastering pass: low/mid/high
curl -X POST http://localhost:8000/v1/audio/multiband-compress \
-F "file=@mixdown.wav" \
-F 'crossovers_hz=[200, 3000]' \
-F 'bands=[
{"threshold_db":-18,"ratio":4,"attack_ms":15,"release_ms":150,"makeup_db":1.5},
{"threshold_db":-14,"ratio":3,"attack_ms":8, "release_ms":80, "makeup_db":1.0},
{"threshold_db":-10,"ratio":2,"attack_ms":3, "release_ms":40, "makeup_db":0.5}
]' \
-o mastered.wav
```
`crossovers_hz` length is N, `bands` length is N+1. Each band: required `threshold_db` + `ratio`, optional `attack_ms` (default 10), `release_ms` (default 100), `makeup_db` (default 0).
### DJ prep
One call returns everything a DJ needs about a track. Requires `librosa-analyze` + `chord-detect`. LUFS is reported when a loudness engine is available.
```bash
curl -X POST http://localhost:8000/v1/audio/dj-prep \
-F "file=@track.wav" | jq .
# → {"bpm":128.0,"key":"A minor","camelot":"8A","integrated_lufs":-9.4}
```
Camelot wheel positions let you quickly find harmonically compatible tracks for mixing.
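For reference, the Camelot code for a key and its usual "safe" neighbours can be computed as below. The key string format (`"A minor"`, `"C major"`) follows the example response; the helper names are hypothetical, and which neighbours count as compatible is the common DJ convention (same number other letter, or one step either way), not something audiolla returns.

```python
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10,
    "B": 11,
}


def camelot(key: str) -> str:
    """Camelot code for a key such as 'A minor' or 'C major'."""
    root, mode = key.split()
    pc = PITCH_CLASS[root]
    letter = "B"
    if mode == "minor":
        pc = (pc + 3) % 12       # a minor key shares its number with the relative major
        letter = "A"
    number = (pc * 7 + 7) % 12 + 1   # walk the circle of fifths, C major = 8
    return f"{number}{letter}"


def compatible(code: str) -> list[str]:
    """The code itself, its two wheel neighbours, and the relative key."""
    number, letter = int(code[:-1]), code[-1]
    up = number % 12 + 1
    down = (number - 2) % 12 + 1
    other = "B" if letter == "A" else "A"
    return [code, f"{up}{letter}", f"{down}{letter}", f"{number}{other}"]
```

So a track reported as `8A` would typically mix well with tracks tagged `9A`, `7A`, or `8B`.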
### De-ess
Split-band high-frequency de-esser that attenuates sibilance above `frequency_hz` without affecting the rest of the signal. Implemented with a Butterworth HPF, envelope follower, and per-channel gain reduction. No engine required.
```bash
# Default settings (threshold -20 dB, 6 kHz, 4:1 ratio)
curl -X POST http://localhost:8000/v1/audio/deess \
-F "file=@vocal.wav" \
-o deessed.wav
# Gentle pass on a mix
curl -X POST http://localhost:8000/v1/audio/deess \
-F "file=@mix.wav" \
-F "threshold_db=-15" \
-F "frequency_hz=7000" \
-F "ratio=2.5" \
-o mix_deessed.wav
# Stage output
curl -X POST http://localhost:8000/v1/audio/deess \
-F "file=@vocal.wav" \
-F "output_path=sessions/vocal_deessed.wav"
# → {"path":"sessions/vocal_deessed.wav","threshold_db":-20.0,"frequency_hz":6000.0,"ratio":4.0,...}
```
Optional params: `threshold_db` (≤ 0, default -20), `frequency_hz` (2000–15000, default 6000), `ratio` (1.0–20.0, default 4.0), `output_format` (wav/mp3/flac…), `output_path`.
### Stereo field analysis
Measure stereo width, phase correlation, mid/side balance, and mono compatibility. No engine required: pure numpy.
```bash
curl -X POST http://localhost:8000/v1/audio/stereo-field \
-F "file=@stereo_mix.wav" | jq .
# → {
# "correlation": 0.72, # Pearson L/R correlation [-1,1]
# "width": 0.41, # side_rms / mid_rms
# "balance_db": -0.3, # L vs R level difference
# "mono_compatible": true, # correlation >= 0.5
# "mid_level_db": -12.1,
# "side_level_db": -18.4,
# "phase_issues": false,
# "channels": 2,
# "sample_rate": 44100,
# "duration": 210.5
# }
# Analyze a staged file
curl -X POST http://localhost:8000/v1/audio/stereo-field \
-F "file_path=masters/track.wav" | jq '{correlation, width, mono_compatible}'
```
Mono files return `correlation=1.0`, `width=0.0`, `mono_compatible=true`. Use `correlation < 0` as a red flag for phase-cancelled material that will collapse on mono playback.
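Two of these metrics are easy to reproduce from the definitions in the example response: Pearson correlation between the channels, and width as side RMS over mid RMS. The sketch below uses plain Python lists instead of numpy; `stereo_field` is a hypothetical name, and the handling of silent or constant channels is an assumption.

```python
import math


def _rms(xs: list[float]) -> float:
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def stereo_field(left: list[float], right: list[float]) -> dict:
    """Pearson L/R correlation and side/mid RMS ratio for two channels."""
    if not left or len(left) != len(right):
        raise ValueError("need two non-empty channels of equal length")
    n = len(left)
    mean_l, mean_r = sum(left) / n, sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    denom = math.sqrt(var_l * var_r)
    correlation = cov / denom if denom > 0 else 1.0   # constant input: treat as mono
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    mid_rms = _rms(mid)
    width = _rms(side) / mid_rms if mid_rms > 0 else math.inf
    return {
        "correlation": correlation,
        "width": width,
        "mono_compatible": correlation >= 0.5,
    }
```

Identical channels give correlation 1 and width 0; a polarity-flipped right channel gives correlation -1 and a mid signal that cancels completely.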
### Audio thumbnail
Extract the most energetic segment of an audio file: the passage with the highest onset density in a given window. Useful for generating preview clips, podcast teasers, or DJ cue points. Requires `librosa-analyze`.
```bash
# Default 30-second thumbnail
curl -X POST http://localhost:8000/v1/audio/thumbnail \
-F "file=@long_track.wav" \
-o preview.wav
# 10-second teaser
curl -X POST http://localhost:8000/v1/audio/thumbnail \
-F "file=@podcast.wav" \
-F "duration_sec=10" \
-F "output_format=mp3" \
-o teaser.mp3
# Stage + get timestamps
curl -X POST http://localhost:8000/v1/audio/thumbnail \
-F "file=@album_track.wav" \
-F "duration_sec=20" \
-F "output_path=previews/track_thumb.wav"
# → {"path":"previews/track_thumb.wav","start_sec":47.3,"end_sec":67.3,"duration_sec":20.0,...}
```
Optional params: `duration_sec` (1–300, default 30), `output_format`, `output_path`. When `output_path` is set the response JSON includes `start_sec` and `end_sec` so you know exactly where in the source the thumbnail was extracted.
### MIDI humanize
Add subtle timing and velocity variations to a MIDI file to make it sound less mechanical. Jitter is uniformly distributed and, when a `seed` is provided, fully deterministic. Requires `midi-compose`.
```bash
# Gentle humanize with defaults (±10 ms timing, ±10% velocity)
curl -X POST http://localhost:8000/v1/midi/humanize \
-F "file=@rigid.mid" \
-o human.mid
# Heavier feel with a fixed seed for reproducible results
curl -X POST http://localhost:8000/v1/midi/humanize \
-F "file=@drums.mid" \
-F "timing_ms=20" \
-F "velocity_pct=15" \
-F "seed=42" \
-o drums_human.mid
# Stage output
curl -X POST http://localhost:8000/v1/midi/humanize \
-F "file=@pattern.mid" \
-F "timing_ms=8" \
-F "output_path=midi/pattern_human.mid"
# → {"path":"midi/pattern_human.mid","timing_ms":8.0,"velocity_pct":10.0,...}
```
Optional params: `timing_ms` (0–500, default 10), `velocity_pct` (0–50, default 10), `seed` (any int, optional), `output_path`. Non-MIDI input returns 400. Requires `midi-compose`.
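Seeded uniform jitter of this kind can be sketched as below. The note representation (`(start_sec, velocity)` tuples), the interpretation of `velocity_pct` as a relative scale, and the clamping choices are all assumptions made for illustration; `humanize` here is a standalone function, not the server's code.

```python
import random


def humanize(notes, timing_ms=10.0, velocity_pct=10.0, seed=None):
    """Apply uniform timing and velocity jitter to (start_sec, velocity) pairs."""
    rng = random.Random(seed)   # a private RNG: same seed, same output
    out = []
    for start_sec, velocity in notes:
        jitter_sec = rng.uniform(-timing_ms, timing_ms) / 1000.0
        scale = 1.0 + rng.uniform(-velocity_pct, velocity_pct) / 100.0
        new_velocity = max(1, min(127, round(velocity * scale)))  # valid MIDI range
        out.append((max(0.0, start_sec + jitter_sec), new_velocity))
    return out
```

Calling it twice with the same `seed` returns identical results, which is the property that makes seeded renders reproducible.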
### Batch operations
Run multiple operations on staged files in one HTTP call. Operations run sequentially; each gets an independent result entry even if earlier ops fail.
Supported ops: `convert`, `normalize`, `trim`, `fade`, `reverse`, `speed`, `eq`.
```bash
# Stage input
curl -X PUT http://localhost:8000/v1/files/work/track.wav --data-binary @track.wav
# Batch: trim, convert to MP3, reverse in one call
curl -X POST http://localhost:8000/v1/batch \
-H "Content-Type: application/json" \
-d '[
{"op":"trim","file_path":"work/track.wav","output_path":"work/chorus.wav","start_sec":30,"end_sec":60},
{"op":"convert","file_path":"work/track.wav","output_path":"work/track.mp3","output_format":"mp3"},
{"op":"reverse","file_path":"work/track.wav","output_path":"work/reversed.wav"}
]' | jq '.results[].status'
# → "ok" "ok" "ok"
```
### Async jobs and webhooks
Every audio endpoint accepts `async_job=true`: the request returns immediately with a job ID and the work happens in the background. Poll for status or register a webhook.
```bash
# Submit async with staging path; result written to /v1/files/stems/...
curl -X POST http://localhost:8000/v1/audio/separate \
-F "file=@track.wav" \
-F "engine=htdemucs" \
-F "async_job=true" \
-F "webhook_url=https://my-server.com/hooks/audio" \
-F "output_path=stems/track-vocals.wav"
# → {"job_id":"abc123","status":"pending"}
# Submit async with presigned S3 PUT URL; result uploaded on completion
curl -X POST http://localhost:8000/v1/audio/master \
-F "file=@track.wav" \
-F "async_job=true" \
-F "output_url=https://bucket.s3.amazonaws.com/result.wav?X-Amz-..."
# → {"job_id":"def456","status":"pending"}
# Poll
curl http://localhost:8000/v1/jobs/abc123 | jq '{status, duration_sec, result}'
# List all jobs (optional ?status=pending|running|completed|failed|cancelled)
curl http://localhost:8000/v1/jobs
# Cancel a running job
curl -X DELETE http://localhost:8000/v1/jobs/abc123
```
Webhook payload (POST to your URL when the job completes):
```json
{
"id": "abc123",
"endpoint": "/v1/audio/separate",
"status": "completed",
"duration_sec": 12.4,
"result": {"path": "stems/track-vocals.wav", "size": 3145728, ...}
}
```
Delivery has 4 attempts with exponential backoff (0 s, 1 s, 2 s, 4 s). Completed jobs stay in memory for `AUDIOLLA_JOB_TTL` seconds (default 1 hour) then are swept.
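If you poll instead of using a webhook, a small loop that stops on a terminal status is enough. The sketch below takes the HTTP call as a caller-supplied function so it stays transport-agnostic; `wait_for_job` and its parameters are hypothetical, and the terminal statuses are taken from the list filter shown above.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval_sec=1.0, timeout_sec=300.0,
                 sleep=time.sleep):
    """Poll fetch_status(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_sec
    while True:
        job = fetch_status(job_id)          # expected to return the job JSON as a dict
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        sleep(interval_sec)
```

In practice `fetch_status` would wrap a GET to `/v1/jobs/{job_id}` with whatever HTTP client you already use.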
### Stage files
A simple server-side file store under `/v1/files`. Upload, list, download, delete.
```bash
# upload
curl -X PUT http://localhost:8000/v1/files/mytrack.wav \
--data-binary @track.wav
# list
curl http://localhost:8000/v1/files
# download
curl http://localhost:8000/v1/files/mytrack.wav -o copy.wav
# delete
curl -X DELETE http://localhost:8000/v1/files/mytrack.wav
```
Once staged, reference the file by path on any audio endpoint via `file_path`:
```bash
# Analyze a staged file
curl -X POST http://localhost:8000/v1/audio/analyze \
-F "file_path=mytrack.wav" \
-F "features=bpm"
# Separate stems and write the result back to staging
curl -X POST http://localhost:8000/v1/audio/separate \
-F "file_path=mytrack.wav" \
-F "engine=htdemucs" \
-F "stems=vocals" \
-F "output_path=stems/mytrack-vocals.wav"
# → {"path":"stems/mytrack-vocals.wav","size":...,"output_format":"wav",...}
```
### Remote URLs
Disabled by default. To allow the server to fetch `file_url` or PUT to
`output_url`, set the policy at container start:
```bash
docker run ... \
-e AUDIOLLA_FETCH_MODE=allowlist \
-e AUDIOLLA_FETCH_HOSTS="*.s3.amazonaws.com,*.r2.cloudflarestorage.com" \
psyb0t/audiolla:latest
```
Then:
```bash
# Fetch from S3, master, PUT result back to a presigned S3 URL
curl -X POST http://localhost:8000/v1/audio/master \
-F "file_url=https://my-bucket.s3.amazonaws.com/in.wav" \
-F "reference_url=https://my-bucket.s3.amazonaws.com/ref.wav" \
-F "mode=reference" \
-F "output_url=https://my-bucket.s3.amazonaws.com/out.wav?X-Amz-Signature=..."
# → {"url":"...","size":...,"output_format":"wav",...}
```
Policy modes:
- `disabled` (default): `file_url` / `output_url` rejected with 400
- `allowlist`: only hosts matching `AUDIOLLA_FETCH_HOSTS` allowed
- `denylist`: anything except listed hosts allowed (pair with `AUDIOLLA_FETCH_ALLOW_PRIVATE=false` to block private IPs / metadata services)
Always-on protections:
- DNS-resolved private / loopback / link-local IPs rejected (unless `AUDIOLLA_FETCH_ALLOW_PRIVATE=true`)
- Only `https` by default; `http` opt-in via `AUDIOLLA_FETCH_SCHEMES`
- Redirects re-validated through the same policy
- Hard timeout (`AUDIOLLA_FETCH_TIMEOUT`) and body size cap (`AUDIOLLA_MAX_UPLOAD_BYTES`)
- Every fetch / upload URL logged
See [Configuration](#configuration) for all `AUDIOLLA_FETCH_*` env vars.
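The policy modes can be pictured with a short Python sketch. This is not audiolla's implementation: the function names are hypothetical, glob-style matching via `fnmatch` is an assumption about how host patterns such as `*.s3.amazonaws.com` are interpreted, and the private-address check only looks at IP literals, whereas the real server is described as checking DNS-resolved addresses and re-validating redirects.

```python
import ipaddress
from fnmatch import fnmatchcase
from typing import Iterable
from urllib.parse import urlsplit


def host_matches(host: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive glob match of host against each pattern (assumed grammar)."""
    host = host.lower()
    return any(fnmatchcase(host, p.strip().lower()) for p in patterns if p.strip())


def is_private_literal(host: str) -> bool:
    """True when host is an IP literal in a private, loopback, or link-local range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname; a real check would resolve DNS first
    return ip.is_private or ip.is_loopback or ip.is_link_local


def url_allowed(
    url: str,
    mode: str = "disabled",
    hosts: Iterable[str] = (),
    schemes: Iterable[str] = ("https",),
    allow_private: bool = False,
) -> bool:
    parts = urlsplit(url)
    host = parts.hostname or ""
    if mode == "disabled" or not host or parts.scheme not in tuple(schemes):
        return False
    if not allow_private and is_private_literal(host):
        return False
    listed = host_matches(host, hosts)
    if mode == "allowlist":
        return listed
    if mode == "denylist":
        return not listed
    return False  # unknown mode: fail closed
```

Note that with `fnmatch` semantics `*` also matches dots, so `*.s3.amazonaws.com` would accept `a.b.s3.amazonaws.com`; whether the server does the same is not specified here.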
---
## Engines
| Slug | What it does |
|------|--------------|
| `htdemucs` | 4-stem separation: drums, bass, other, vocals. Best speed/quality tradeoff. |
| `htdemucs_ft` | Same 4 stems, fine-tuned weights. Higher quality, ~4x slower. **CUDA-only**: rejected with 400 on the CPU image. |
| `htdemucs_6s` | 6 stems: also splits guitar and piano. Experimental. |
| `mdx_extra` | Strong on vocal isolation. MUSDB-trained, different architecture. |
| `matchering` | Reference-based mastering: EQ + loudness matched to a reference track. |
| `pedalboard-chain` | Preset mastering chains via pedalboard: `transparent` (light) or `loud` (4:1 squash). Backs `/v1/audio/master` with `mode=chain`. For arbitrary chains use `fx-chain` / `/v1/audio/fx`. |
| `librosa-analyze` | BPM, key, LUFS, duration, spectral features, beat grid, onset detection, melody (pyin), structural segmentation via librosa. |
| `sox-transform` | Gain, EQ, compression, reverb, pitch shift, tempo via pysox. |
| `fx-chain` | Arbitrary pedalboard effects chain: full catalog, your order and params. Backs `/v1/audio/fx`. |
| `midi-compose` | JSON spec → MIDI bytes. Also inspects and transforms existing MIDI files. Backs `/v1/midi/{compose,inspect,transform,generate}`. |
| `midi-render` | MIDI → audio via fluidsynth + SoundFont. Backs `/v1/midi/render` and `/v1/midi/generate`. |
| `silence-detect` | Locate silent gaps via ffmpeg `silencedetect`. Optional auto-trim. Backs `/v1/audio/silence`. |
| `ffmpeg-render` | Static PNG spectrogram/waveform + 8-mode animated MP4/WebM video via ffmpeg filters. Backs `/v1/audio/visualize/image/*` and `/v1/audio/visualize/video/{mode}`. |
| `audio-fingerprint` | Chromaprint acoustic fingerprint via `fpcalc`. Backs `/v1/audio/fingerprint`. |
| `uvr-dereverb` | BS-Roformer de-reverb: removes room reverb; `primary_stem=No Reverb`. |
| `uvr-deecho` | VR Architecture de-echo: normal and aggressive modes; pass `aggressive=true` for harder suppression. |
| `uvr-denoise` | MelBand Roformer de-noise (SDR 28): removes broadband background noise. |
| `uvr-karaoke` | MelBand Roformer karaoke: remove lead vocals, keep backing; works via `/v1/audio/separate`. |
| `uvr-vocal-bsr` | BS-Roformer vocal/instrumental (SDR 13): highest-quality vocal separation; works via `/v1/audio/separate`. |
| `basic-pitch` | Polyphonic audio-to-MIDI via Spotify basic-pitch (ONNX backend). Backs `/v1/audio/to_midi`. |
| `deepfilter` | Neural speech and vocal enhancement via DeepFilterNet DF3. Backs `/v1/audio/enhance`. |
| `chord-detect` | Chord and key detection via librosa: Krumhansl-Schmuckler key estimation + chroma template chord segmentation. Backs `/v1/audio/chords`. |
| `silero-vad` | Voice activity detection via silero-vad (ONNX): returns speech/non-speech segments with timestamps and speech ratio. Backs `/v1/audio/vad`. |
| `pyannote` | Speaker diarization via pyannote/speaker-diarization-3.1: returns per-speaker timestamped segments. Requires `HUGGINGFACE_TOKEN`. Backs `/v1/audio/diarize`. |
| `stretch` | Time-stretch + pitch-shift via librosa phase vocoder: independent tempo factor and semitone offset. Backs `/v1/audio/stretch`. |
| `ast-tag` | Audio tagging via Audio Spectrogram Transformer (MIT/ast-finetuned-audioset-10-10-0.4593): top-K AudioSet class labels. Requires HF model cache. Backs `/v1/audio/tag`. |
| `clap-embed` | 512-dim L2-normalized audio embeddings via LAION CLAP (laion/larger_clap_music_and_speech): semantic audio search. Requires HF model cache. Backs `/v1/audio/embed`. |
| `hpss` | Harmonic/percussive source separation via librosa HPSS median filter: returns harmonic + percussive stems as a ZIP. Backs `/v1/audio/separate/hpss`. |
| `noise-reduce` | Spectral noise reduction via noisereduce: stationary (constant hum/hiss) and non-stationary (adaptive) modes, no GPU required. Backs `/v1/audio/noise-reduce/noise-reduce`. |
| `metadata` | Read/write audio tags (ID3 for MP3, Vorbis for OGG/FLAC, INFO for WAV, MP4 for M4A) via mutagen. No ML weights. Backs `/v1/audio/metadata`. |
Each Demucs variant is its own checkpoint (hosted on `dl.fbaipublicfiles.com`). The entrypoint prefetches every enabled variant into `/data/torch_cache/` at startup so the first separation request doesn't sit there downloading.
`AUDIOLLA_ENABLED_ENGINES` restricts which engines are available. `AUDIOLLA_PRELOAD` loads specific engines into memory at startup instead of waiting for the first request.
---
## Workflows – presets + pipeline
Two ways to chain operations server-side without re-uploading the audio between calls:
**Curated presets**: server-side YAML workflows shipped in `presets/`. Run one with a single POST:
```bash
# Master a mix for Spotify (-14 LUFS): multiband compress + normalise
curl -X POST http://localhost:8000/v1/presets/master-for-spotify \
-F "file=@mix.wav" \
-o mastered.wav
# List available presets
curl http://localhost:8000/v1/presets | jq '.data[] | {name, description}'
# Inspect a preset's steps before running
curl http://localhost:8000/v1/presets/podcast-cleanup | jq '.steps'
```
Shipped presets: `master-for-spotify` (3-band master + -14 LUFS), `podcast-cleanup` (DeepFilterNet + de-ess + -16 LUFS), `vocal-cleanup` (UVR dereverb + denoise + de-ess + light comp). Add your own as a YAML file in `presets/`.
**Ad-hoc pipeline**: chain any registered ops in a single call:
```bash
# Restore + multiband + normalise in one request; intermediates stay
# server-side, no re-upload between steps.
curl -X POST http://localhost:8000/v1/pipeline \
-F "file=@track.wav" \
-F 'steps=[
{"op":"restore","params":{"engine":"uvr-denoise"}},
{"op":"multiband_compress","params":{
"crossovers_hz":[200,3000],
"bands":[
{"threshold_db":-18,"ratio":3},
{"threshold_db":-14,"ratio":2.5},
{"threshold_db":-10,"ratio":2}
]
}},
{"op":"normalize","params":{"target_lufs":-14}}
]' \
-o pipelined.wav
# Discover available ops
curl http://localhost:8000/v1/ops | jq .
```
The response of pipeline + preset endpoints includes a `steps` log so you can audit what ran. Both endpoints support `async_job=true`, `output_path`, `output_url` like every other audio-producing endpoint.
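When the `steps` array is generated by a script rather than typed by hand, it helps to build it from plain data and check it before sending. The sketch below is a minimal Python illustration, assuming the step shape shown in the curl example; `multiband_step` and `build_steps` are hypothetical helpers, and the checks (ascending crossovers, one more band than crossovers) are client-side sanity checks, not a statement of what the server validates.

```python
import json
from typing import Any, Dict, List


def multiband_step(crossovers_hz: List[float], bands: List[Dict[str, float]]) -> Dict[str, Any]:
    """Build a multiband_compress step; N crossovers split the spectrum into N+1 bands."""
    if list(crossovers_hz) != sorted(crossovers_hz):
        raise ValueError("crossovers_hz must be in ascending order")
    if len(bands) != len(crossovers_hz) + 1:
        raise ValueError(
            f"{len(crossovers_hz)} crossovers need {len(crossovers_hz) + 1} bands, got {len(bands)}"
        )
    return {
        "op": "multiband_compress",
        "params": {"crossovers_hz": list(crossovers_hz), "bands": list(bands)},
    }


def build_steps(*steps: Dict[str, Any]) -> str:
    """Serialize steps into the JSON string sent as the `steps` form field."""
    for step in steps:
        if "op" not in step:
            raise ValueError(f"step is missing 'op': {step!r}")
    return json.dumps(list(steps))


steps_field = build_steps(
    {"op": "restore", "params": {"engine": "uvr-denoise"}},
    multiband_step(
        [200, 3000],
        [
            {"threshold_db": -18, "ratio": 3},
            {"threshold_db": -14, "ratio": 2.5},
            {"threshold_db": -10, "ratio": 2},
        ],
    ),
    {"op": "normalize", "params": {"target_lufs": -14}},
)
```

`steps_field` can then be passed as the `steps` form value of a `POST /v1/pipeline` request; `GET /v1/ops` remains the authority on which op slugs exist.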
## API catalog
`GET /v1/catalog` returns the machine-readable list of every endpoint grouped by category (`separation`, `restoration`, `dynamics`, `eq-spatial`, `mastering`, `time-pitch`, `editing`, `analysis`, `effects-creative`, `visualize`, `midi`, `metadata`, `workflow`, `speech`, `files`, `jobs`, `management`). Use it for discovery; LLM agents and codegen scripts both consume it.
```bash
curl http://localhost:8000/v1/catalog | jq '.categories[] | {name, endpoint_count: (.endpoints | length)}'
```
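The same summary can be produced in Python once the catalog JSON has been fetched. The sketch assumes only the shape implied by the `jq` expression above (a `categories` array whose items carry `name` and `endpoints`); the sample data and the contents of each endpoint entry are made up for illustration.

```python
from typing import Any, Dict


def endpoint_counts(catalog: Dict[str, Any]) -> Dict[str, int]:
    """Map each category name to the number of endpoints listed under it."""
    return {
        category["name"]: len(category.get("endpoints", []))
        for category in catalog.get("categories", [])
    }


# Hypothetical excerpt; a real response comes from GET /v1/catalog.
sample_catalog = {
    "categories": [
        {"name": "separation", "endpoints": ["/v1/audio/separate", "/v1/audio/separate/hpss"]},
        {"name": "jobs", "endpoints": ["/v1/jobs", "/v1/jobs/{job_id}"]},
        {"name": "management", "endpoints": []},
    ]
}

counts = endpoint_counts(sample_catalog)
```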
## Endpoints
Full wire contract: [`openapi.yaml`](openapi.yaml).
### Audio processing
Every endpoint accepts exactly one of `file` / `file_path` / `file_url`.
Audio-producing endpoints additionally accept optional `output_path` /
`output_url`; when either is set, the response is JSON instead of audio
bytes.
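A client can enforce the input and output rules before it sends anything. The following Python sketch is illustrative: `request_fields` and `expects_json` are hypothetical helpers, and treating `output_path` and `output_url` as mutually exclusive on REST is an assumption carried over from the MCP section, which states it for MCP tools.

```python
from typing import Dict, Optional


def request_fields(
    *,
    upload: bool = False,
    file_path: Optional[str] = None,
    file_url: Optional[str] = None,
    output_path: Optional[str] = None,
    output_url: Optional[str] = None,
) -> Dict[str, str]:
    """Return the non-upload form fields after checking the one-input rule.

    `upload=True` means the caller will attach the multipart `file` part itself.
    """
    provided = [upload, file_path is not None, file_url is not None]
    if sum(provided) != 1:
        raise ValueError("provide exactly one of file / file_path / file_url")
    if output_path is not None and output_url is not None:
        raise ValueError("output_path and output_url are mutually exclusive")

    fields: Dict[str, str] = {}
    if file_path is not None:
        fields["file_path"] = file_path
    if file_url is not None:
        fields["file_url"] = file_url
    if output_path is not None:
        fields["output_path"] = output_path
    if output_url is not None:
        fields["output_url"] = output_url
    return fields


def expects_json(fields: Dict[str, str]) -> bool:
    """True when the response should be JSON rather than raw audio bytes."""
    return "output_path" in fields or "output_url" in fields
```

For example, `request_fields(file_path="mytrack.wav", output_path="stems/mytrack-vocals.wav")` yields two form fields and `expects_json` reports that the reply will be JSON.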
| Method | Path | Default returns |
|--------|------|-----------------|
| `POST` | `/v1/audio/separate` | audio bytes for one stem; ZIP when requesting multiple (or all) stems |
| `POST` | `/v1/audio/master` | audio bytes |
| `POST` | `/v1/audio/analyze` | JSON: BPM, key, LUFS, spectral features |
| `POST` | `/v1/audio/beats` | JSON: BPM + beat timestamps; optional click-track WAV |
| `POST` | `/v1/audio/onsets` | JSON: onset timestamps |
| `POST` | `/v1/audio/melody` | JSON: dominant melody contour; optional MIDI export |
| `POST` | `/v1/audio/segments` | JSON: structural segment labels (A, B, C…) |
| `POST` | `/v1/audio/silence` | JSON: silent/non-silent ranges; optional trimmed audio |
| `POST` | `/v1/audio/visualize/image/spectrogram` | PNG bytes: static spectrogram (`color`, `scale` params) |
| `POST` | `/v1/audio/visualize/image/waveform` | PNG bytes: static waveform (`color` param) |
| `POST` | `/v1/audio/visualize/video/{mode}` | MP4/WebM bytes: animated video (8 modes: `spectrum`, `waves`, `cqt`, …) |
| `POST` | `/v1/audio/fingerprint` | JSON: Chromaprint fingerprint string |
| `POST` | `/v1/audio/restore/{engine}` | audio bytes: reverb/echo/noise removed; `aggressive=true` for uvr-deecho hard mode |
| `POST` | `/v1/audio/to_midi/{engine}` | MIDI bytes (`audio/midi`): polyphonic transcription |
| `POST` | `/v1/audio/enhance/{engine}` | audio bytes: neural speech/vocal enhancement |
| `POST` | `/v1/audio/chords` | JSON: detected key and chord progression |
| `POST` | `/v1/audio/vad` | JSON: speech/non-speech segments with timestamps and speech ratio |
| `POST` | `/v1/audio/diarize/{engine}` | JSON: per-speaker timestamped segments |
| `POST` | `/v1/audio/transform` | audio bytes |
| `POST` | `/v1/audio/loudness` | JSON: `{loudness_lufs}` (measure only, no audio) |
| `POST` | `/v1/audio/loudness/curve` | JSON: `{curve:[{time_sec,rms_db}],duration,sample_rate,points}`; `hop_length` param |
| `POST` | `/v1/audio/normalize` | audio bytes; requires `target_lufs`; header `X-Loudness-LUFS` carries pre-normalization level |
| `POST` | `/v1/audio/separate/hpss` | ZIP containing `harmonic.` + `percussive.` |
| `POST` | `/v1/audio/noise-reduce/{engine}` | audio bytes; `engine=noise-reduce` (DSP, `stationary`/`prop_decrease`) or `uvr-denoise` (ML) |
| `POST` | `/v1/audio/stretch` | audio bytes |
| `POST` | `/v1/audio/pitch-correct` | audio bytes; `strength` [0.0–1.0]; requires `librosa-analyze` |
| `POST` | `/v1/audio/repair` | audio bytes; `declip` bool, `dehum` bool, `hum_freq` Hz |
| `POST` | `/v1/audio/tag` | JSON: top-K AudioSet labels with confidence scores |
| `POST` | `/v1/audio/embed` | JSON: 512-dim embedding; with `query_text` also returns cosine similarity |
| `POST` | `/v1/audio/classify` | JSON: `{results: [{label, score}]}` sorted descending; requires `clap-embed` |
| `POST` | `/v1/audio/info` | JSON: duration, sample_rate, channels, codec, bit_depth, format |
| `POST` | `/v1/audio/trim` | audio bytes; `start_sec` + `end_sec` required |
| `POST` | `/v1/audio/mix` | audio bytes; `tracks` JSON array required (≥2 entries) |
| `POST` | `/v1/audio/concat` | audio bytes; `files` JSON array required (≥2 entries) |
| `POST` | `/v1/audio/speed` | audio bytes; `speed` float required (0.1–10.0) |
| `POST` | `/v1/audio/convert` | audio bytes: format/sample_rate/channels conversion |
| `POST` | `/v1/audio/similar` | JSON: `{similarity, dim}`; requires `clap-embed` |
| `POST` | `/v1/audio/fade` | audio bytes; `fade_in`/`fade_out` seconds, 13 `curve` options |
| `POST` | `/v1/audio/reverse` | audio bytes: flips playback direction |
| `POST` | `/v1/audio/loop` | audio bytes; `count` total plays (≥2) |
| `POST` | `/v1/audio/bpm-match` | audio bytes; `target_bpm` required; requires `librosa-analyze` + `stretch` |
| `POST` | `/v1/audio/stereo-width` | audio bytes; `width` [0.0–3.0]; M/S stereo processing |
| `POST` | `/v1/audio/split` | ZIP; `mode=equal` (requires `count`) or `mode=silence` |
| `POST` | `/v1/audio/pan` | audio bytes; `position` [-1.0–1.0] |
| `POST` | `/v1/audio/eq` | audio bytes; `bands` JSON array of `{freq, gain_db, width_hz}` |
| `POST` | `/v1/audio/key-match` | audio bytes; `target_key` required; requires `chord-detect` + `stretch` |
| `POST` | `/v1/audio/sidechain-duck` | audio bytes; primary + `trigger_file_*`; ffmpeg sidechaincompress |
| `POST` | `/v1/audio/fx` | audio bytes |
| `POST` | `/v1/audio/metadata` | JSON: tag fields (title, artist, bpm, key, duration, sample_rate…); writes tags when `tags` JSON is provided |
| `POST` | `/v1/audio/clip-detect` | JSON: clipped, clip_count, clip_ratio, peak_db, duration_sec |
| `POST` | `/v1/audio/mid-side` | audio bytes; `mode=encode` (L/R→M/S) or `mode=decode` (M/S→L/R) |
| `POST` | `/v1/audio/beat-slice` | ZIP of numbered beat slices; requires `librosa-analyze` |
| `POST` | `/v1/audio/conv-reverb` | audio bytes; `ir_file` / `ir_file_path` / `ir_file_url` required; `wet_mix` [0.0–1.0] |
| `POST` | `/v1/audio/transient` | audio bytes; `attack_gain_db` + `sustain_gain_db` |
| `POST` | `/v1/audio/multiband-compress` | audio bytes: N-band compressor; `crossovers_hz` + `bands` JSON arrays |
| `POST` | `/v1/audio/dj-prep` | JSON: bpm, key, camelot, integrated_lufs; requires `librosa-analyze` + `chord-detect` |
| `POST` | `/v1/audio/loop-point` | JSON: `{loop_start_sec,loop_end_sec,bars,score,tempo_bpm,candidates}`; requires `librosa-analyze` |
| `POST` | `/v1/audio/chords-to-midi` | MIDI bytes: chord progression from audio; requires `chord-detect` |
| `POST` | `/v1/audio/deess` | audio bytes: split-band sibilance attenuation; `threshold_db`, `frequency_hz`, `ratio` |
| `POST` | `/v1/audio/stereo-field` | JSON: `{correlation, width, balance_db, mono_compatible, mid_level_db, side_level_db, phase_issues, …}` |
| `POST` | `/v1/audio/thumbnail` | audio bytes: most energetic `duration_sec` segment; `start_sec`/`end_sec` in JSON when `output_path` set; requires `librosa-analyze` |
### Workflow – presets, pipeline, catalog
Server-side multi-step chains + discovery. See [Workflows](#workflows--presets--pipeline) for narrative + curl examples.
| Method | Path | |
|--------|------|-|
| `GET` | `/v1/catalog` | machine-readable endpoint list grouped by category (17 categories) |
| `GET` | `/v1/ops` | list of pipeline op slugs (~24) usable in presets + `/v1/pipeline` |
| `GET` | `/v1/presets` | list curated server-side workflows (name + description) |
| `GET` | `/v1/presets/{name}` | describe one preset including all steps |
| `POST` | `/v1/presets/{name}` | audio bytes: run a curated preset (full async_job / output_path / output_url support) |
| `POST` | `/v1/pipeline` | audio bytes: ad-hoc `steps=[{op, params}, …]` chain, server-side intermediates |
### Batch
| Method | Path | |
|--------|------|-|
| `POST` | `/v1/batch` | JSON body: array of op objects `{op, file_path, output_path, …}`. Returns `{results:[…]}`; errors are reported per-op, not as a 4xx. Supported ops: `convert`, `normalize`, `trim`, `fade`, `reverse`, `speed`, `eq`. |
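Because batch errors arrive per-op inside a 200 response, a client has to inspect each result itself. The Python sketch below shows one way to do that. It assumes results come back in the same order as the submitted ops and that a successful entry has `status` equal to `"ok"`; the `"error"` status and `detail` field in the sample response are invented for illustration, and `batch_body` / `failed_ops` are hypothetical helpers.

```python
import json
from typing import Any, Dict, List, Tuple

SUPPORTED_OPS = {"convert", "normalize", "trim", "fade", "reverse", "speed", "eq"}


def batch_body(ops: List[Dict[str, Any]]) -> bytes:
    """Encode the JSON array sent as the body of POST /v1/batch."""
    for op in ops:
        if op.get("op") not in SUPPORTED_OPS:
            raise ValueError(f"unsupported batch op: {op.get('op')!r}")
    return json.dumps(ops).encode("utf-8")


def failed_ops(
    ops: List[Dict[str, Any]], response: Dict[str, Any]
) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Pair each submitted op with its result and keep the ones that did not succeed."""
    results = response.get("results", [])
    return [(op, res) for op, res in zip(ops, results) if res.get("status") != "ok"]


ops = [
    {"op": "reverse", "file_path": "mytrack.wav", "output_path": "out/reversed.wav"},
    {"op": "normalize", "file_path": "missing.wav", "output_path": "out/norm.wav", "target_lufs": -14},
]
body = batch_body(ops)

# Hypothetical response: the second op failed, the HTTP status is still 200.
sample_response = {"results": [{"status": "ok"}, {"status": "error", "detail": "not found"}]}
failures = failed_ops(ops, sample_response)
```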
### Async jobs
Every audio endpoint accepts `async_job=true` (form field), plus an optional `webhook_url` for delivery on completion.
| Method | Path | |
|--------|------|-|
| `GET` | `/v1/jobs` | list jobs; optional `?status=pending\|running\|completed\|failed\|cancelled` |
| `GET` | `/v1/jobs/{job_id}` | poll one job: returns status, result, duration_sec |
| `DELETE` | `/v1/jobs/{job_id}` | cancel running job or remove completed job |
### MIDI
| Method | Path | Default returns |
|--------|------|-----------------|
| `POST` | `/v1/midi/compose` | MIDI bytes (`audio/midi`); body is `application/json` song spec |
| `POST` | `/v1/midi/inspect` | JSON: tempo, tracks, channels, note counts, time/key signatures |
| `POST` | `/v1/midi/transform` | MIDI bytes: transpose, quantize, tempo override, channel filter |
| `POST` | `/v1/midi/quantize` | MIDI bytes; `grid_beats` snaps all note timings to a rhythmic grid |
| `POST` | `/v1/midi/render` | audio bytes; input MIDI via `file` / `file_path` / `file_url` |
| `POST` | `/v1/midi/generate` | audio bytes; body is `application/json` song spec (compose + render in one) |
| `POST` | `/v1/midi/drum` | MIDI bytes; body is `application/json` step-sequencer spec; requires `midi-compose` |
| `POST` | `/v1/midi/humanize` | MIDI bytes: timing + velocity jitter; `timing_ms`, `velocity_pct`, `seed`; requires `midi-compose` |
### File staging
| Method | Path | |
|--------|------|-|
| `GET` | `/v1/files` | list staged files |
| `PUT` | `/v1/files/{path}` | upload |
| `GET` | `/v1/files/{path}` | download |
| `DELETE` | `/v1/files/{path}` | delete |
### Management
| Method | Path | |
|--------|------|-|
| `GET` | `/healthz` | liveness; always unauthenticated |
| `GET` | `/v1/engines` | list configured engines + `loaded` / `idle_seconds` per engine |
| `GET` | `/v1/ps` | list engines in memory right now |
| `DELETE` | `/v1/ps/{engine}` | evict one engine |
| `POST` | `/v1/unload` | evict everything |
---
## MCP
audiolla exposes a [Model Context Protocol](https://modelcontextprotocol.io) server at `/v1/mcp`. Point any MCP-capable LLM agent at it and it gets the full audio processing surface as callable tools (separate stems, detect chords, transcribe to MIDI, diarize speakers, compose music from a JSON spec, read/write tags, submit async jobs), all over JSON-RPC without writing a line of integration code.
Audio over MCP supports the same three output modes as REST:
- Pass nothing: audio comes back **base64-encoded** in the response (JSON-RPC can't carry raw bytes natively).
- Pass **`output_path`**: the server stages the result in `FILES_DIR`, the response is `{path, size, ...}`, and the client retrieves it via the `get_file` tool or `/v1/files/` over HTTP.
- Pass **`output_url`** (presigned PUT): the server PUTs the encoded bytes to the URL and the response is `{url, size, ...}`.
`output_path` and `output_url` are mutually exclusive; passing both raises `ValueError`. Use `list_jobs` / `get_job` / `cancel_job` to manage long-running async work.
**Endpoint:** `http://localhost:8000/v1/mcp`
**Tools:**
| Tool | What it does |
|------|--------------|
| `list_engines` | List configured engines and whether they're loaded |
| `list_presets` | List curated server-side workflows (name + description) |
| `describe_preset` | Show full step list of a preset before running |
| `list_ops` | List the ~24 pipeline op slugs available in `run_pipeline_tool` / presets |
| `run_preset` | Run a curated preset against an input file |
| `run_pipeline_tool` | Run an ad-hoc `[{op, params}, …]` chain server-side |
| `separate` | Demucs stem separation: base64 stems back, per-stem staging via `output_paths={stem:path}`, or per-stem PUT via `output_urls={stem:url}` |
| `master` | Reference mastering (matchering) or preset chain (pedalboard) |
| `analyze` | BPM, key, LUFS, spectral features via librosa |
| `beats` | Beat grid: BPM + timestamps; optional click-track audio |
| `onsets` | Note onset timestamps |
| `melody` | Dominant melody contour in Hz; optional MIDI export |
| `segments` | Structural segmentation: recurring section labels (A, B, C…) |
| `silence` | Detect silent gaps; optional auto-trim (edges or all) |
| `visualize` | PNG spectrogram/waveform or animated MP4/WebM; `engine` + `mode` select output type |
| `fingerprint` | Chromaprint acoustic fingerprint (AcoustID-compatible) |
| `restore` | Remove reverb/echo/noise via UVR; `engine` selects model; `aggressive=true` for harder echo suppression |
| `denoise` | Thin shim; prefer `restore` with `engine=uvr-denoise` or `noise_reduce` with `engine=uvr-denoise` |
| `audio_to_midi` | Polyphonic audio-to-MIDI transcription via basic-pitch (ONNX); returns MIDI base64 |
| `enhance` | Neural speech and vocal enhancement via DeepFilterNet DF3 |
| `chords` | Chord and key detection via librosa: key + per-segment chord labels |
| `vad` | Voice activity detection via silero-vad: speech/non-speech segments with timestamps |
| `diarize` | Speaker diarization via pyannote: per-speaker timestamped segments |
| `transform` | Sox DSP chain: gain, EQ, reverb, pitch, tempo, etc. |
| `loudness` | Measure integrated LUFS; returns JSON only |
| `loudness_curve` | RMS envelope over time: `{curve:[{time_sec,rms_db}],duration,sample_rate,points}` |
| `normalize` | Normalize audio to a target LUFS level; returns base64 audio |
| `hpss` | Harmonic/percussive separation; returns per-stem base64 audio |
| `noise_reduce` | Noise reduction: `engine=noise-reduce` (DSP, stationary/prop_decrease) or `engine=uvr-denoise` (ML) |
| `stretch` | Time-stretch + pitch-shift via librosa phase vocoder |
| `pitch_correct` | Auto-tune toward nearest chromatic semitone; `strength` [0.0–1.0]; requires `librosa-analyze` |
| `repair_audio` | Declip + dehum; `declip` bool, `dehum` bool, `hum_freq` Hz |
| `tag` | Audio tagging via AST: top-K AudioSet labels with confidence scores |
| `embed` | 512-dim CLAP audio embedding; with `query_text` returns cosine similarity |
| `classify` | Zero-shot CLAP classification: cosine similarity against any list of text labels |
| `info` | Probe audio metadata: duration, sample_rate, channels, codec, bit_depth |
| `trim` | Cut audio to [start_sec, end_sec); returns base64 audio |
| `mix` | Mix N tracks with per-track gain; `tracks` list of {file_path/url, gain_db} |
| `concat` | Stitch N audio files end-to-end in order; `files` list of {file_path/url} |
| `speed` | Change playback speed without pitch shift; `speed` float (0.1–10.0) |
| `convert` | Re-encode: format, sample_rate, channels in one call |
| `similar` | Cosine similarity between two audio files via CLAP; returns `{similarity, dim}` |
| `midi_quantize` | Snap MIDI note timings to a rhythmic grid; `grid_beats` in beats |
| `fade` | Fade-in/fade-out with configurable duration and curve shape |
| `reverse` | Flip audio backwards |
| `loop` | Repeat audio N times; `count` total plays |
| `bpm_match` | Detect BPM then stretch to `target_bpm`; returns source/target BPM + tempo_factor |
| `stereo_width` | M/S stereo width: `width=0` mono, `1` original, `>1` wider |
| `split` | Split into equal parts or on silence; returns `{segments:[{name,audio_base64}]}` |
| `pan` | Pan in the stereo field; `position` [-1.0–1.0] |
| `eq` | Parametric EQ; `bands` list of `{freq, gain_db, width_hz}` |
| `key_match` | Detect key then pitch-shift to `target_key`; returns source_key + semitones |
| `sidechain_duck` | Duck primary track on trigger; `threshold_db`, `ratio`, `attack_ms`, `release_ms` |
| `fx` | Generic pedalboard effects chain: full catalog, your order and params |
| `midi_compose` | JSON song spec → MIDI bytes (base64 or staged) |
| `midi_inspect` | Read MIDI structure: tempo, tracks, channels, note counts |
| `midi_transform` | Transpose, quantize, tempo override, channel filter on an existing MIDI file |
| `midi_render` | MIDI → audio via fluidsynth + SoundFont |
| `midi_generate` | One-shot compose + render: spec in, audio out |
| `drum_pattern` | Step-sequencer JSON spec → GM drum MIDI; `pattern` object of voice arrays, `swing`, `steps`, `bars` |
| `chords_to_midi` | Chord progression detected from audio → MIDI file; `tempo_bpm`, `velocity`, `octave` params |
| `audio_metadata` | Read or write audio tags; pass `tags` dict to write, omit to read |
| `detect_clipping` | Report digital clipping: clipped, clip_count, clip_ratio, peak_db |
| `mid_side` | M/S encode (`mode=encode`) or decode (`mode=decode`) stereo audio |
| `slice_at_beats` | Slice audio at beat positions; returns `{zip_base64, beat_count}` |
| `convolution_reverb` | Apply IR reverb; `ir_file_path`/`ir_file_url` + `wet_mix` [0.0–1.0] |
| `transient_shaper` | Attack/sustain shaping; `attack_gain_db`, `sustain_gain_db` |
| `multiband_compress` | N-band compressor; `crossovers_hz` list + `bands` list of per-band specs |
| `dj_prep` | BPM + key + Camelot wheel + LUFS in one call |
| `find_loop_point` | Find best seamless loop boundary: `{loop_start_sec,loop_end_sec,bars,score,tempo_bpm,candidates}` |
| `deess` | Split-band sibilance attenuation; `threshold_db`, `frequency_hz`, `ratio` |
| `stereo_field` | Stereo field analysis: correlation, width, balance_db, mono_compatible, mid/side levels |
| `audio_thumbnail` | Extract most energetic segment; `duration_sec`; returns base64 audio + `start_sec`/`end_sec` |
| `midi_humanize` | Add timing + velocity jitter to MIDI; `timing_ms`, `velocity_pct`, optional `seed` for deterministic output |
| `list_jobs` | List async jobs; optional `status` filter |
| `get_job` | Poll one async job by `job_id` |
| `cancel_job` | Cancel a running job or remove a completed one |
| `list_files` | List staged files |
| `put_file` | Upload a file (base64) to the staging area |
| `get_file` | Read a staged file back (base64) |
| `delete_file` | Remove a staged file |
Auth (`AUDIOLLA_AUTH_TOKEN`) covers `/v1/mcp` the same as the REST endpoints: pass the bearer token in the `Authorization` header.
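For readers who want to see what a tool invocation looks like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` message (the method MCP uses to invoke a tool) together with the bearer header. It is a rough illustration: `auth_headers` and `tool_call` are hypothetical helpers, the `file_path` and `features` argument names for `analyze` are assumed to mirror the REST form fields, and a real session also needs the MCP initialization handshake, which an MCP client library normally handles.

```python
import json
from typing import Any, Dict, Optional


def auth_headers(token: Optional[str] = None) -> Dict[str, str]:
    """HTTP headers for the MCP endpoint; adds the bearer token when auth is enabled."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def tool_call(name: str, arguments: Dict[str, Any], request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names below are assumptions, not taken from the tool schema.
request = tool_call("analyze", {"file_path": "mytrack.wav", "features": "bpm"})
payload = json.dumps(request)
headers = auth_headers("my-secret-token")
```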
---
## Configuration
| Variable | Default | |
|----------|---------|-|
| `AUDIOLLA_DEVICE` | `auto` | `auto`, `cpu`, `cuda`, or `cuda:N` |
| `AUDIOLLA_ENGINES_FILE` | `/app/engines.json` | path to engines registry |
| `AUDIOLLA_PRESETS_DIR` | `/app/presets` | directory of `*.yaml` preset workflows loaded at startup |
| `AUDIOLLA_DATA_DIR` | `/data` | where models and staged files live |
| `AUDIOLLA_UVR_MODELS_DIR` | `/uvr_models` | where UVR model files are cached |
| `AUDIOLLA_AUTH_TOKEN` | _(none)_ | bearer token; empty means no auth |
| `HUGGINGFACE_TOKEN` | _(none)_ | HuggingFace access token; required for `pyannote` speaker diarization (accept model terms at huggingface.co/pyannote/speaker-diarization-3.1 first) |
| `AUDIOLLA_ENABLED_ENGINES` | _(all)_ | comma-separated slugs to allow; empty = all |
| `AUDIOLLA_PRELOAD` | _(none)_ | comma-separated slugs to load at startup |
| `AUDIOLLA_ENGINE_TTL` | `600` | seconds idle before an engine is unloaded (`10m` also works) |
| `AUDIOLLA_SWEEPER_INTERVAL` | `60` | how often the idle sweeper checks, in seconds |
| `AUDIOLLA_MAX_UPLOAD_BYTES` | `209715200` | upload cap (200 MB); also caps URL fetch body size |
| `AUDIOLLA_FETCH_MODE` | `disabled` | `disabled`, `allowlist`, or `denylist`; controls server-side fetching for file_url / output_url |
| `AUDIOLLA_FETCH_HOSTS` | _(none)_ | comma-separated host patterns (`bucket.s3.amazonaws.com`, `*.s3.amazonaws.com`). Required when mode=allowlist. |
| `AUDIOLLA_FETCH_SCHEMES` | `https` | comma-separated schemes: `https`, `http` (http opt-in only) |
| `AUDIOLLA_FETCH_ALLOW_PRIVATE` | `false` | allow URLs that resolve to private / loopback / link-local IPs |
| `AUDIOLLA_FETCH_TIMEOUT` | `30` | hard timeout per fetch/upload, in seconds (also accepts `30s`, `1m`) |
| `AUDIOLLA_FETCH_MAX_REDIRECTS` | `5` | max redirects per fetch; each Location re-validated through the policy |
| `AUDIOLLA_JOB_TTL` | `3600` | Seconds a completed/failed/cancelled job stays in memory before being swept. Also accepts `1h`, `30m`. |
| `AUDIOLLA_JOB_MAX_CONCURRENT` | `8` | Maximum number of async jobs that can run simultaneously. |
| `AUDIOLLA_SOUNDFONT` | `/usr/share/sounds/sf2/FluidR3_GM.sf2` (prod images) | Default SoundFont path for `/v1/midi/render`. Override per request via `soundfont_path`. |
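Several of the variables above accept either plain seconds or a suffixed value such as `10m`, `30s`, or `1h`. The Python sketch below shows one plausible reading of that format. It is an assumption for illustration, not audiolla's parser: `parse_duration` is a hypothetical name, only the `s`, `m`, and `h` suffixes seen in the table are handled, and compound values like `1h30m` are rejected.

```python
import re

_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"(\d+)([smh]?)")


def parse_duration(value: str) -> int:
    """Convert '600', '30s', '10m', or '1h' into a number of seconds."""
    match = _DURATION_RE.fullmatch(value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]
```

Under this reading `AUDIOLLA_ENGINE_TTL=10m` and `AUDIOLLA_ENGINE_TTL=600` are equivalent, as are `AUDIOLLA_JOB_TTL=1h` and `3600`.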
---
## What's not in here
| | Why |
|-|-----|
| Music generation | MusicGen is CC-BY-NC. Stable Audio Open needs a Stability AI commercial agreement. Nothing permissively licensed at production quality exists yet. |
| Essentia analysis | AGPL v3: any network service using it has to publish full source. librosa handles the common cases without that. |
| Streaming separation | Demucs needs the whole file. No chunked or real-time inference. |
| VST3 plugin hosting | Pedalboard can do it but you'd need to mount your host plugin directory. Out of scope for the default image. |
| rubberband pitch/time-stretch | GPL v2 + commercial license. Sox handles basic pitch and tempo. Add it yourself if you accept the terms. |
---
## Build & dev
```bash
make build # CPU image
make build-cuda # CUDA image
make run # CPU image on port 8000
make run-cuda # CUDA image on port 8000
```
```bash
make dev-image # build the dev container
make shell # shell inside it
make lint # flake8 + mypy
make format # isort + black
make test-unit # unit tests (no GPU, no ML deps needed)
make test-unit-cov-gate # fail if coverage on support modules drops below 80%
make test-integration # integration tests (spins up Docker containers)
make generate # regenerate src/audiolla/schema/ from openapi.yaml
make clean # wipe build/cache artifacts
```
```bash
make pkg-lock # refresh uv.lock
make pkg-add PKG=name[==ver] # add a dep
make pkg-update PKG=name # upgrade one dep
make pkg-upgrade # upgrade everything
make pkg-remove PKG=name # remove a dep
make pkg-compile-heavy # recompile requirements-heavy-{cpu,cuda}.txt
```
Every `make pkg-*` bumps `[tool.uv] exclude-newer` to UTC midnight **7 days before** the bump date before touching anything, so packages published in the last week are invisible to the resolver. The 7-day floor covers the supply-chain attack window: fresh wheels (typosquats, hijacked maintainer releases) typically get caught and yanked within hours to days, so the floor gives malicious uploads a week of community scrutiny before they're eligible to enter the lockfile. Everything runs inside the dev container. Host needs `docker`, `make`, `git`.
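The cutoff arithmetic is simple enough to show directly. This Python sketch computes "UTC midnight 7 days before the bump date" as an RFC 3339 timestamp, a format uv's `exclude-newer` setting accepts. The function name is hypothetical and the exact string the Makefile writes is an assumption.

```python
from datetime import date, datetime, time, timedelta, timezone


def exclude_newer_cutoff(bump_date: date, floor_days: int = 7) -> str:
    """UTC midnight `floor_days` before bump_date, formatted as RFC 3339."""
    cutoff = datetime.combine(
        bump_date - timedelta(days=floor_days), time.min, tzinfo=timezone.utc
    )
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def is_visible(uploaded_at: datetime, cutoff: str) -> bool:
    """True when a release uploaded at `uploaded_at` is not newer than the cutoff."""
    cutoff_dt = datetime.strptime(cutoff, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return uploaded_at <= cutoff_dt


cutoff = exclude_newer_cutoff(date(2026, 6, 3))
```

For a bump on 2026-06-03 the cutoff is `2026-05-27T00:00:00Z`, so a wheel uploaded on 2026-06-01 would stay invisible to the resolver until a later bump.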
---
## Supply chain
Both prod images do a two-layer install.
**Light deps** (`fastapi`, `uvicorn`, `pydantic`, etc.): locked in `uv.lock`, installed with `uv sync --frozen --no-dev`. Build fails if the lockfile doesn't match `pyproject.toml`. Wheel hashes verified by uv.
**Heavy ML/DSP deps** (torch, demucs, matchering, pedalboard, librosa, sox, numpy, soundfile, huggingface-hub): one hash-locked requirements file per image variant (`requirements-heavy-cpu.txt`, `requirements-heavy-cuda.txt`), because the torch wheel differs between CPU and CUDA and lives on a different index. Human specs in `scripts/heavy-deps-{cpu,cuda}.in`, compiled via `make pkg-compile-heavy`, installed with `uv pip install --require-hashes`. Both files are committed.
Base images and the `uv` binary are pinned by `@sha256:` digest.
---
## License
[WTFPL](LICENSE).
matchering and pedalboard are GPL v3. Fine for self-hosted use. Distributing the image as a product needs a GPL compliance review.