https://github.com/Blaizzy/mlx-audio
A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.
apple-silicon audio-processing mlx multimodal speech-recognition speech-synthesis speech-to-text text-to-speech transformers
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/Blaizzy/mlx-audio
- Owner: Blaizzy
- License: mit
- Created: 2024-11-27T21:14:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-01T19:20:06.000Z (about 1 year ago)
- Last Synced: 2025-05-01T20:29:14.916Z (about 1 year ago)
- Topics: apple-silicon, audio-processing, mlx, multimodal, speech-recognition, speech-synthesis, speech-to-text, text-to-speech, transformers
- Language: Python
- Homepage:
- Size: 3.44 MB
- Stars: 723
- Watchers: 12
- Forks: 65
- Open Issues: 36
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- awesome-rainmana - Blaizzy/mlx-audio - A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. (Python)
- AiTreasureBox - Blaizzy/mlx-audio - A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. (Repos)
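- StarryDivineSky - Blaizzy/mlx-audio - audio offers three distinguishing advantages. First, it builds natively on the MLX framework for Apple Silicon, which can directly use the GPU/Neural Engine of M-series chips to accelerate computation; compared with traditional solutions that rely on general-purpose deep learning frameworks (such as PyTorch), inference is significantly faster and memory usage is lower. Second, the project uses a modular design that decouples TTS, STT, and STS into independent interfaces, so developers can combine them flexibly, for example converting speech input to text in real time and then generating multilingual speech output, without handling low-level compute-graph optimization. Third, thanks to MLX's dynamic-graph design, models support just-in-time compilation and hardware-aware scheduling, avoiding the performance overhead of the Python interpreter while remaining easy to use; this matters most when processing long audio streams. From a technical standpoint, mlx-audio's efficiency gains come from the vertical integration of the Apple ecosystem. The MLX framework acts like a "translator" between Apple hardware and algorithms: it dynamically converts the matrix operations of speech models (such as attention mechanisms and convolution layers) into Metal Shading Language instructions that are handed directly to the GPU's unified memory architecture. The design is comparable to "writing in your native language rather than translating from a foreign one": traditional cross-platform frameworks communicate with the hardware through several abstraction layers, whereas MLX lets models access the chip's compute units directly, like a native application. In addition, the project integrates quantized models by default (such as 4-bit weight compression), trading a small amount of precision for a severalfold reduction in memory bandwidth, which makes it possible to run speech models of hundreds of megabytes on thin-and-light devices such as the MacBook Air. Overall, mlx-audio fills a gap in efficient speech tooling in the Apple ecosystem, and its technical approach reflects a co-design philosophy of "dedicated hardware + lean software stack". For iOS/macOS developers who need to balance performance and privacy, the project offers an alternative that is more controllable than cloud APIs and lighter than general-purpose frameworks, and it may become the de facto standard for on-device speech interaction development on Apple devices. (Speech recognition and synthesis_Other / Resource transfer and download)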
- awesome-github-projects - mlx-audio - A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. ⭐7,377 `Python` 🔥 (🤖 AI & Machine Learning)
- awesome-mlx - mlx-audio - A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. (Audio & Speech)
- stars - Blaizzy/mlx-audio - A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. (Python)
README
# MLX-Audio
A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.
## Features
- Fast inference on Apple Silicon (M series chips)
- Multiple language support
- Voice customization options
- Adjustable speech speed control (0.5x to 2.0x)
- Interactive web interface with 3D audio visualization
- REST API for TTS generation
- Quantization support for optimized performance
- Direct access to output files via Finder/Explorer integration
## Installation
```bash
# Install the package
pip install mlx-audio
# For web interface and API dependencies
pip install -r requirements.txt
```
### Quick Start
To generate audio from the command line:
```bash
# Basic usage
mlx_audio.tts.generate --text "Hello, world"
# Specify prefix for output file
mlx_audio.tts.generate --text "Hello, world" --file_prefix hello
# Adjust speaking speed (0.5-2.0)
mlx_audio.tts.generate --text "Hello, world" --speed 1.4
```
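For scripting, the same commands can be assembled programmatically. The following is a minimal sketch and not part of MLX-Audio: `build_tts_command` is a hypothetical helper that uses only the flags shown above (`--text`, `--file_prefix`, `--speed`) and assumes the module can be invoked as `python -m mlx_audio.tts.generate` in your environment.

```python
import sys


def build_tts_command(text, file_prefix=None, speed=None):
    """Return an argv list for the mlx_audio.tts.generate CLI (hypothetical helper)."""
    # The speed range mirrors the documented 0.5-2.0 limits.
    if speed is not None and not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate", "--text", text]
    if file_prefix is not None:
        cmd += ["--file_prefix", file_prefix]
    if speed is not None:
        cmd += ["--speed", str(speed)]
    return cmd


# Build (but do not run) the command for the third example above.
print(build_tts_command("Hello, world", speed=1.4))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it, provided `mlx-audio` is installed.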
### How to call from Python
To generate audio from Python, call `generate_audio`:
```python
from mlx_audio.tts.generate import generate_audio
# Example: Generate an audiobook chapter as WAV audio
generate_audio(
    text=("In the beginning, the universe was created...\n"
          "...or the simulation was booted up."),
    model_path="prince-canuma/Kokoro-82M",
    voice="af_heart",
    speed=1.2,
    lang_code="a",  # Kokoro: (a)f_heart, or comment out for auto
    file_prefix="audiobook_chapter1",
    audio_format="wav",
    sample_rate=24000,
    join_audio=True,
    verbose=True,  # Set to False to disable print messages
)
print("Audiobook chapter successfully generated!")
```
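To produce several chapters in one run, the call can be wrapped in a loop. This is a sketch under stated assumptions: `generate_chapters` is a hypothetical helper, it takes the generation function as an argument (so it can be tried without loading a model), and it passes only keyword arguments that appear in the example above.

```python
def generate_chapters(chapters, generate_fn, voice="af_heart", speed=1.2):
    """Call generate_fn once per chapter and return the file prefixes used."""
    prefixes = []
    for number, text in enumerate(chapters, start=1):
        prefix = f"audiobook_chapter{number}"
        generate_fn(
            text=text,
            model_path="prince-canuma/Kokoro-82M",
            voice=voice,
            speed=speed,
            file_prefix=prefix,
            audio_format="wav",
            join_audio=True,
            verbose=False,
        )
        prefixes.append(prefix)
    return prefixes
```

With the library installed, `generate_chapters(["Chapter one text.", "Chapter two text."], generate_audio)` would be the intended call, using `generate_audio` as imported above.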
### Web Interface & API Server
MLX-Audio includes a web interface with a 3D visualization that reacts to audio frequencies. The interface allows you to:
1. Generate TTS with different voices and speed settings
2. Upload and play your own audio files
3. Visualize audio with an interactive 3D orb
4. Automatically save generated audio files to the outputs directory in the current working folder
5. Open the output folder directly from the interface (when running locally)
#### Features
- **Multiple Voice Options**: Choose from different voice styles (AF Heart, AF Nova, AF Bella, BF Emma)
- **Adjustable Speech Speed**: Control the speed of speech generation with an interactive slider (0.5x to 2.0x)
- **Real-time 3D Visualization**: A responsive 3D orb that reacts to audio frequencies
- **Audio Upload**: Play and visualize your own audio files
- **Auto-play Option**: Automatically play generated audio
- **Output Folder Access**: Convenient button to open the output folder in your system's file explorer
To start the web interface and API server:
```bash
# Using the command-line interface
mlx_audio.server
# With custom host and port
mlx_audio.server --host 0.0.0.0 --port 9000
# With verbose logging
mlx_audio.server --verbose
```
Available command line arguments:
- `--host`: Host address to bind the server to (default: 127.0.0.1)
- `--port`: Port to bind the server to (default: 8000)
- `--verbose`: Enable verbose logging
Then open your browser and navigate to:
```
http://127.0.0.1:8000
```
#### API Endpoints
The server provides the following REST API endpoints:
- `POST /tts`: Generate TTS audio
  - Parameters (form data):
    - `text`: The text to convert to speech (required)
    - `voice`: Voice to use (default: "af_heart")
    - `speed`: Speech speed from 0.5 to 2.0 (default: 1.0)
  - Returns: JSON with filename of generated audio
- `GET /audio/{filename}`: Retrieve generated audio file
- `POST /play`: Play audio directly from the server
  - Parameters (form data):
    - `filename`: The filename of the audio to play (required)
  - Returns: JSON with status and filename
- `POST /stop`: Stop any currently playing audio
  - Returns: JSON with status
- `POST /open_output_folder`: Open the output folder in the system's file explorer
  - Returns: JSON with status and path
  - Note: This feature only works when running the server locally
> Note: Generated audio files are stored in `~/.mlx_audio/outputs` by default, or in a fallback directory if that location is not writable.
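A client for these endpoints might look like the sketch below, which uses only the Python standard library. The helper names are hypothetical, and the `"filename"` key in the JSON response is an assumption: the endpoint description says the response contains the filename but does not name the field.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default host and port of mlx_audio.server


def build_tts_request(text, voice="af_heart", speed=1.0, base_url=BASE_URL):
    """Build a form-encoded POST /tts request without sending it."""
    if not text:
        raise ValueError("text is required")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    form = urllib.parse.urlencode({"text": text, "voice": voice, "speed": speed})
    return urllib.request.Request(
        f"{base_url}/tts", data=form.encode("utf-8"), method="POST"
    )


def synthesize(text, base_url=BASE_URL, **kwargs):
    """Send the request and download the audio (needs a running server)."""
    request = build_tts_request(text, base_url=base_url, **kwargs)
    with urllib.request.urlopen(request) as response:
        # Assumption: the filename is stored under a "filename" key.
        filename = json.load(response)["filename"]
    audio_url = f"{base_url}/audio/{urllib.parse.quote(filename)}"
    with urllib.request.urlopen(audio_url) as audio:
        return filename, audio.read()
```

Separating request construction from sending keeps the parameter checks usable without a server; `synthesize("Hello, world", speed=1.4)` would return the filename and the audio bytes once the server is running.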
## Models
### Kokoro
Kokoro is a multilingual TTS model that supports various languages and voice styles.
#### Example Usage
```python
from mlx_audio.tts.models.kokoro import KokoroPipeline
from mlx_audio.tts.utils import load_model
from IPython.display import Audio, display
import soundfile as sf
# Initialize the model
model_id = 'prince-canuma/Kokoro-82M'
model = load_model(model_id)
# Create a pipeline with American English
pipeline = KokoroPipeline(lang_code='a', model=model, repo_id=model_id)
# Generate audio
text = "The MLX King lives. Let him cook!"
for _, _, audio in pipeline(text, voice='af_heart', speed=1, split_pattern=r'\n+'):
    # Display audio in notebook (if applicable)
    display(Audio(data=audio, rate=24000, autoplay=0))
    # Save audio to file (each segment overwrites the previous file)
    sf.write('audio.wav', audio[0], 24000)
```
#### Language Options
- 🇺🇸 `'a'` - American English
- 🇬🇧 `'b'` - British English
- 🇯🇵 `'j'` - Japanese (requires `pip install misaki[ja]`)
- 🇨🇳 `'z'` - Mandarin Chinese (requires `pip install misaki[zh]`)
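The table above can be kept as a small lookup in code. This is an illustrative sketch, not part of the library: `KOKORO_LANGUAGES`, `lang_code_for_voice`, and `describe_language` are hypothetical names, and deriving the language code from the first letter of a voice name (as in `af_heart` and `'a'`) is an assumption based on the naming pattern in the examples.

```python
# Language codes and optional extras, copied from the list above.
KOKORO_LANGUAGES = {
    "a": ("American English", None),
    "b": ("British English", None),
    "j": ("Japanese", "misaki[ja]"),
    "z": ("Mandarin Chinese", "misaki[zh]"),
}


def lang_code_for_voice(voice):
    """Guess the language code from a voice name (assumed naming pattern)."""
    code = voice[:1]
    if code not in KOKORO_LANGUAGES:
        raise ValueError(f"unknown language prefix in voice {voice!r}")
    return code


def describe_language(code):
    """Return the language name, plus the extra package it needs, if any."""
    name, extra = KOKORO_LANGUAGES[code]
    return name if extra is None else f"{name} (requires pip install {extra})"


print(describe_language(lang_code_for_voice("af_heart")))  # American English
```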
### CSM (Conversational Speech Model)
CSM is a model from Sesame that supports text-to-speech and lets you customize voices using reference audio samples.
#### Example Usage
```bash
# Generate speech using CSM-1B model with reference audio
python -m mlx_audio.tts.generate --model mlx-community/csm-1b --text "Hello from Sesame." --play --ref_audio ./conversational_a.wav
```
You can pass any audio file to clone a voice from, or download a sample audio file from [here](https://huggingface.co/mlx-community/csm-1b/tree/main/prompts).
## Advanced Features
### Quantization
You can quantize models for improved performance:
```python
import json
from pathlib import Path

import mlx.core as mx
from mlx_audio.tts.utils import quantize_model, load_model

model = load_model(repo_id='prince-canuma/Kokoro-82M')
config = model.config
# Quantize to 8-bit
group_size = 64
bits = 8
weights, config = quantize_model(model, config, group_size, bits)
# Save quantized model (create the output directory first)
out_dir = Path('./8bit')
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / 'config.json', 'w') as f:
    json.dump(config, f)
mx.save_safetensors(str(out_dir / 'kokoro-v1_0.safetensors'), weights, metadata={"format": "mlx"})
```
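To get a feel for what `group_size` and `bits` mean for storage, the back-of-the-envelope sketch below estimates the cost per weight. It rests on assumptions that are not stated in this README: that each group of `group_size` weights stores one 16-bit scale and one 16-bit bias alongside the quantized values, that every weight is quantized (in practice some layers may be left in higher precision), and that the model has roughly 82 million parameters, as its name suggests.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Quantized bits plus the per-group scale/bias overhead, per weight."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_mb(num_params, bits, group_size):
    """Rough weight storage in megabytes under the assumptions above."""
    return num_params * effective_bits_per_weight(bits, group_size) / 8 / 1e6


print(effective_bits_per_weight(8, 64))  # 8.5
print(effective_bits_per_weight(4, 64))  # 4.5
print(estimated_size_mb(82_000_000, 8, 64))
```

Under these assumptions, 8-bit quantization with a group size of 64 costs about 8.5 bits per weight, roughly half of what 16-bit weights need.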
## Requirements
- MLX
- Python 3.8+
- Apple Silicon Mac (for optimal performance)
- For the web interface and API:
  - FastAPI
  - Uvicorn
## License
[MIT License](LICENSE)
## Acknowledgements
- Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.
- This project uses the Kokoro model architecture for text-to-speech synthesis.
- The 3D visualization uses Three.js for rendering.