https://github.com/blaizzy/mlx-audio
A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.
apple-silicon audio-processing mlx multimodal speech-recognition speech-synthesis speech-to-text text-to-speech transformers
- Host: GitHub
- URL: https://github.com/blaizzy/mlx-audio
- Owner: Blaizzy
- License: mit
- Created: 2024-11-27T21:14:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-10T13:14:57.000Z (9 months ago)
- Last Synced: 2025-06-26T18:12:37.235Z (9 months ago)
- Topics: apple-silicon, audio-processing, mlx, multimodal, speech-recognition, speech-synthesis, speech-to-text, text-to-speech, transformers
- Language: Python
- Homepage:
- Size: 87.4 MB
- Stars: 2,422
- Watchers: 22
- Forks: 178
- Open Issues: 58
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# MLX-Audio
An audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
## Features
- Fast inference optimized for Apple Silicon (M series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration
## Installation
```bash
pip install mlx-audio
```
For development or web interface:
```bash
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```
## Quick Start
### Command Line
```bash
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!"
# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --voice af_heart --speed 1.2
# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --play
# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --output_path ./my_audio
```
### Python API
```python
from mlx_audio.tts.utils import load_model
# Load model
model = load_model("mlx-community/Kokoro-82M-bf16")
# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
```
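To keep the generated waveform, one option is to write it to a WAV file with the standard library. The helper below is a minimal sketch and not part of MLX-Audio: `save_wav` is a hypothetical name, the 24000 Hz default is an assumption about the model's output rate, and the input is assumed to be a flat sequence of floats in the range -1.0 to 1.0.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper; sample_rate=24000 is an assumed default.
    Returns the number of frames written.
    """
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow 16-bit PCM.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // 2


# Possible usage, assuming result.audio converts to a flat Python list:
# save_wav(result.audio.tolist(), "hello.wav")
```

WAV is used here because it needs no external encoder.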
## Supported Models
### Text-to-Speech (TTS)
| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Kokoro** | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | [mlx-community/Kokoro-82M-bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16) |
| **Qwen3-TTS** | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16](https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16) |
| **CSM** | Conversational Speech Model with voice cloning | EN | [mlx-community/csm-1b](https://huggingface.co/mlx-community/csm-1b) |
| **Dia** | Dialogue-focused TTS | EN | [mlx-community/Dia-1.6B-bf16](https://huggingface.co/mlx-community/Dia-1.6B-bf16) |
| **OuteTTS** | Efficient TTS model | EN | [mlx-community/OuteTTS-0.2-500M](https://huggingface.co/mlx-community/OuteTTS-0.2-500M) |
| **Spark** | SparkTTS model | EN, ZH | [mlx-community/SparkTTS-0.5B-bf16](https://huggingface.co/mlx-community/SparkTTS-0.5B-bf16) |
| **Chatterbox** | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | [mlx-community/Chatterbox-bf16](https://huggingface.co/mlx-community/Chatterbox-bf16) |
| **Soprano** | High-quality TTS | EN | [mlx-community/Soprano-bf16](https://huggingface.co/mlx-community/Soprano-bf16) |
### Speech-to-Text (STT)
| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Whisper** | OpenAI's robust STT model | 99+ languages | [mlx-community/whisper-large-v3-turbo-asr-fp16](https://huggingface.co/mlx-community/whisper-large-v3-turbo-asr-fp16) |
| **Parakeet** | NVIDIA's accurate STT | EN | [mlx-community/parakeet-tdt-0.6b-v2](https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v2) |
| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |
### Speech-to-Speech (STS)
| Model | Description | Use Case | Repo |
|-------|-------------|----------|------|
| **SAM-Audio** | Text-guided source separation | Extract specific sounds | [mlx-community/sam-audio-large](https://huggingface.co/mlx-community/sam-audio-large) |
| **Liquid2.5-Audio** | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | [mlx-community/LFM2.5-Audio-1.5B-8bit](https://huggingface.co/mlx-community/LFM2.5-Audio-1.5B-8bit) |
| **MossFormer2 SE** | Speech enhancement | Noise removal | [starkdmi/MossFormer2_SE_48K_MLX](https://huggingface.co/starkdmi/MossFormer2_SE_48K_MLX) |
## Model Examples
### Kokoro TTS
Kokoro is a fast, multilingual TTS model with 54 voice presets.
```python
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Kokoro-82M-bf16")
# Generate with different voices
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a",  # American English
):
    audio = result.audio
```
**Available Voices:**
- American English: `af_heart`, `af_bella`, `af_nova`, `af_sky`, `am_adam`, `am_echo`, etc.
- British English: `bf_alice`, `bf_emma`, `bm_daniel`, `bm_george`, etc.
- Japanese: `jf_alpha`, `jm_kumo`, etc.
- Chinese: `zf_xiaobei`, `zm_yunxi`, etc.
**Language Codes:**
| Code | Language | Note |
|------|----------|------|
| `a` | American English | Default |
| `b` | British English | |
| `j` | Japanese | Requires `pip install misaki[ja]` |
| `z` | Mandarin Chinese | Requires `pip install misaki[zh]` |
| `e` | Spanish | |
| `f` | French | |
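In the voices listed above, the first letter of each voice name matches its language code (`af_heart` with `a`, `bf_alice` with `b`, `jf_alpha` with `j`, `zf_xiaobei` with `z`). Assuming that convention holds for every voice, a small helper can derive `lang_code` from the voice name. `lang_code_for_voice` is a hypothetical function, not part of MLX-Audio.

```python
# Language codes copied from the table above.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "e": "Spanish",
    "f": "French",
}


def lang_code_for_voice(voice):
    """Return the language code implied by a voice name's first letter.

    Assumption: voice names follow the <lang><gender>_<name> pattern
    seen in the examples above (e.g. "af_heart").
    """
    code = voice[:1]
    if code not in LANG_CODES:
        raise ValueError(f"Cannot infer a language code from voice {voice!r}")
    return code


print(lang_code_for_voice("af_heart"))   # a
print(lang_code_for_voice("bm_george"))  # b
```

Raising on an unknown prefix avoids silently falling back to the default language.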
### Qwen3-TTS
Alibaba's state-of-the-art multilingual TTS with three model variants:
```python
from mlx_audio.tts.utils import load_model
# Base model with predefined voices
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))
# CustomVoice model - predefined voices with emotion control
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16")
results = list(model.generate_custom_voice(
    text="I'm so excited to meet you!",
    speaker="Vivian",
    language="English",
    instruct="Very happy and excited.",
))
# VoiceDesign model - create any voice from text description
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16")
results = list(model.generate_voice_design(
    text="Big brother, you're back!",
    language="English",
    instruct="A cheerful young female voice with high pitch and energetic tone.",
))
# Access generated audio
audio = results[0].audio # mx.array
```
**Available Models:**
| Model | Method | Description |
|-------|--------|-------------|
| `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16` | `generate()` | Fast, predefined voices |
| `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` | `generate()` | Higher quality |
| `mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16` | `generate_custom_voice()` | Voices + emotion |
| `mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16` | `generate_custom_voice()` | Better emotion control |
| `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16` | `generate_voice_design()` | Create any voice |
**Speakers (Base/CustomVoice):** `Chelsie`, `Ethan`, `Serena`, `Vivian`, `Ryan`, `Aiden`, `Eric`, `Dylan`
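Because each variant uses a different generation method, code that accepts an arbitrary Qwen3-TTS repo may want to look the method up by name. The sketch below is an illustration only: `generation_method` is a hypothetical helper, and it assumes the variant name always appears in the repo ID as it does in the table above.

```python
# Variant-to-method mapping taken from the "Available Models" table above.
METHOD_BY_VARIANT = {
    "Base": "generate",
    "CustomVoice": "generate_custom_voice",
    "VoiceDesign": "generate_voice_design",
}


def generation_method(repo_id):
    """Return the method name to call for a Qwen3-TTS repo ID.

    Assumption: the variant appears as "-<Variant>-" in the repo name,
    e.g. "...-1.7B-CustomVoice-bf16".
    """
    for variant, method in METHOD_BY_VARIANT.items():
        if f"-{variant}-" in repo_id:
            return method
    raise ValueError(f"Unrecognized Qwen3-TTS variant in {repo_id!r}")


repo = "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16"
print(generation_method(repo))  # generate_voice_design

# Possible usage with a loaded model:
# generate = getattr(model, generation_method(repo))
```

Note that the methods take different keyword arguments (`voice`, `speaker`, `instruct`), so the caller still has to pass the right ones for the chosen variant.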
### CSM (Voice Cloning)
Clone any voice using a reference audio sample:
```bash
mlx_audio.tts.generate \
--model mlx-community/csm-1b \
--text "Hello from Sesame." \
--ref_audio ./reference_voice.wav \
--play
```
### Whisper STT
```python
from mlx_audio.stt.utils import load_model, transcribe
model = load_model("mlx-community/whisper-large-v3-turbo-asr-fp16")
result = transcribe("audio.wav", model=model)
print(result["text"])
```
### SAM-Audio (Source Separation)
Separate specific sounds from audio using text prompts:
```python
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio
model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")
batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)
result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2/32},
)
save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")
```
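The `chunk_seconds` and `overlap_seconds` arguments suggest that long inputs are processed in overlapping windows. The sketch below shows one common way to lay out such windows. It is an illustration of the general technique, not the actual `separate_long` implementation, and `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(total_seconds, chunk_seconds=10.0, overlap_seconds=3.0):
    """Return (start, end) windows that cover the audio with overlap.

    Each window starts chunk_seconds - overlap_seconds after the previous
    one, so neighbouring windows share overlap_seconds of audio.
    """
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap_seconds must be smaller than chunk_seconds")
    stride = chunk_seconds - overlap_seconds
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += stride
    return spans


# A 25-second clip with the settings used above:
print(chunk_spans(25.0))
# [(0.0, 10.0), (7.0, 17.0), (14.0, 24.0), (21.0, 25.0)]
```

Overlapping regions are typically blended when the chunks are stitched back together, which reduces audible seams at chunk boundaries.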
### MossFormer2 (Speech Enhancement)
Remove noise from speech recordings:
```python
from mlx_audio.sts import MossFormer2SEModel, save_audio
model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)
```
## Web Interface & API Server
MLX-Audio includes a modern web interface and OpenAI-compatible API.
### Starting the Server
```bash
# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000
# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev
```
### API Endpoints
**Text-to-Speech** (OpenAI-compatible):
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
--output speech.wav
```
**Speech-to-Text**:
```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
```
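The text-to-speech request above can also be built from Python using only the standard library. This is a minimal sketch: `build_speech_request` is a hypothetical helper, and the URL and JSON fields simply mirror the `curl` example. The request is only sent when `urlopen` is called, which requires the server to be running.

```python
import json
import urllib.request


def build_speech_request(
    text,
    model="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    base_url="http://localhost:8000",
):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request("Hello!")
print(request.get_method(), request.full_url)

# With the server running, the audio could then be saved like this:
# with urllib.request.urlopen(request) as response:
#     with open("speech.wav", "wb") as f:
#         f.write(response.read())
```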
## Swift
Looking for Swift/iOS support? Check out [mlx-audio-swift](https://github.com/Blaizzy/mlx-audio-swift) for on-device TTS using MLX on macOS and iOS.
## Quantization
Reduce model size and improve performance with quantization using the convert script:
```bash
# Convert and quantize to 4-bit
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-4bit \
--quantize \
--q-bits 4 \
--upload-repo username/Kokoro-82M-4bit  # optional: upload the converted model to Hugging Face
# Convert with specific dtype (bfloat16)
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-bf16 \
--dtype bfloat16 \
--upload-repo username/Kokoro-82M-bf16  # optional: upload the converted model to Hugging Face
```
**Options:**
| Flag | Description |
|------|-------------|
| `--hf-path` | Source Hugging Face model or local path |
| `--mlx-path` | Output directory for converted model |
| `-q, --quantize` | Enable quantization |
| `--q-bits` | Bits per weight (4, 6, or 8) |
| `--q-group-size` | Group size for quantization (default: 64) |
| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload converted model to HF Hub |
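To get a rough feel for how `--q-bits` and `--q-group-size` affect size, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under stated assumptions, not a description of what the convert script does: it assumes group-wise quantization that stores one 16-bit scale and one 16-bit bias per group, and it assumes every parameter is quantized. Real checkpoints usually keep some layers in higher precision, so actual files will be larger. Both function names are hypothetical.

```python
def effective_bits_per_weight(q_bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight including per-group overhead.

    Assumption: each group of `group_size` weights stores one scale and
    one bias alongside the packed weights.
    """
    return q_bits + (scale_bits + bias_bits) / group_size


def estimated_size_mb(num_params, q_bits, group_size=64):
    """Estimated size in megabytes if all parameters were quantized."""
    return num_params * effective_bits_per_weight(q_bits, group_size) / 8 / 1e6


print(effective_bits_per_weight(4, 64))  # 4.5

# The parameter count is read off the model name "Kokoro-82M".
for bits in (4, 6, 8):
    print(bits, round(estimated_size_mb(82_000_000, bits), 1), "MB")
```

A smaller group size raises the overhead per weight but generally tracks the original weights more closely.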
## Requirements
- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
- **ffmpeg** (required for MP3/FLAC audio encoding)
- FastAPI and Uvicorn (for the web interface and API server)
### Installing ffmpeg
ffmpeg is required for saving audio in MP3 or FLAC format. Install it using:
```bash
# macOS (using Homebrew)
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
```
WAV format works without ffmpeg.
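A script that saves audio can check for ffmpeg up front and fall back to WAV when it is missing. The helper below is a minimal sketch using only the standard library. `pick_audio_format` is a hypothetical function, not part of MLX-Audio.

```python
import shutil

# Formats that need ffmpeg, per the note above.
FFMPEG_FORMATS = {"mp3", "flac"}


def pick_audio_format(preferred, ffmpeg_available=None):
    """Return `preferred`, or "wav" if it needs ffmpeg and ffmpeg is missing.

    When ffmpeg_available is None, the PATH is searched for an
    executable named "ffmpeg".
    """
    if ffmpeg_available is None:
        ffmpeg_available = shutil.which("ffmpeg") is not None
    fmt = preferred.lower()
    if fmt in FFMPEG_FORMATS and not ffmpeg_available:
        return "wav"
    return fmt


print(pick_audio_format("mp3"))  # "mp3" if ffmpeg is installed, else "wav"
```

The explicit `ffmpeg_available` argument makes the fallback easy to exercise without changing the system.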
## License
[MIT License](LICENSE)
## Citation
```bibtex
@misc{mlx-audio,
author = {Canuma, Prince},
title = {MLX Audio},
year = {2025},
howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
```
## Acknowledgements
- [Apple MLX Team](https://github.com/ml-explore/mlx) for the MLX framework