{"id":29111046,"url":"https://github.com/blaizzy/mlx-audio","last_synced_at":"2026-01-23T00:49:56.307Z","repository":{"id":279991963,"uuid":"895253710","full_name":"Blaizzy/mlx-audio","owner":"Blaizzy","description":"A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.","archived":false,"fork":false,"pushed_at":"2025-06-10T13:14:57.000Z","size":91665,"stargazers_count":2422,"open_issues_count":58,"forks_count":178,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-06-26T18:12:37.235Z","etag":null,"topics":["apple-silicon","audio-processing","mlx","multimodal","speech-recognition","speech-synthesis","speech-to-text","text-to-speech","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Blaizzy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"Blaizzy","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2024-11-27T21:14:34.000Z","updated_at":"2025-06-26T16:42:58.000Z","dependencies_parsed_at":"2025-02-28T20:58:18.657Z","dependency_job_id":"132bf025-700f-438e-8ada-e3465789b1c6","html_url":"https://github.com/Blaizzy/mlx-audio","commit_stats":null,"previous_names":["blaizzy/mlx-audio"],"tags_count":9,"template":false,"template
_full_name":null,"purl":"pkg:github/Blaizzy/mlx-audio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Blaizzy","download_url":"https://codeload.github.com/Blaizzy/mlx-audio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262566829,"owners_count":23329681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","audio-processing","mlx","multimodal","speech-recognition","speech-synthesis","speech-to-text","text-to-speech","transformers"],"created_at":"2025-06-29T09:05:25.978Z","updated_at":"2026-01-23T00:49:56.301Z","avatar_url":"https://github.com/Blaizzy.png","language":"Python","funding_links":["https://github.com/sponsors/Blaizzy"],"categories":[],"sub_categories":[],"readme":"# MLX-Audio\n\nThe best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.\n\n## Features\n\n- Fast inference optimized for Apple Silicon (M series chips)\n- Multiple model architectures for 
TTS, STT, and STS\n- Multilingual support across models\n- Voice customization and cloning capabilities\n- Adjustable speech speed control\n- Interactive web interface with 3D audio visualization\n- OpenAI-compatible REST API\n- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance\n- Swift package for iOS/macOS integration\n\n## Installation\n\n```bash\npip install mlx-audio\n```\n\nFor development or web interface:\n\n```bash\ngit clone https://github.com/Blaizzy/mlx-audio.git\ncd mlx-audio\npip install -e \".[dev]\"\n```\n\n## Quick Start\n\n### Command Line\n\n```bash\n# Basic TTS generation\nmlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text \"Hello, world!\"\n\n# With voice selection and speed adjustment\nmlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text \"Hello!\" --voice af_heart --speed 1.2\n\n# Play audio immediately\nmlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text \"Hello!\" --play\n\n# Save to a specific directory\nmlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text \"Hello!\" --output_path ./my_audio\n```\n\n### Python API\n\n```python\nfrom mlx_audio.tts.utils import load_model\n\n# Load model\nmodel = load_model(\"mlx-community/Kokoro-82M-bf16\")\n\n# Generate speech\nfor result in model.generate(\"Hello from MLX-Audio!\", voice=\"af_heart\"):\n    print(f\"Generated {result.audio.shape[0]} samples\")\n    # result.audio contains the waveform as mx.array\n```\n\n## Supported Models\n\n### Text-to-Speech (TTS)\n\n| Model | Description | Languages | Repo |\n|-------|-------------|-----------|------|\n| **Kokoro** | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | [mlx-community/Kokoro-82M-bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16) |\n| **Qwen3-TTS** | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | 
[mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16](https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16) |\n| **CSM** | Conversational Speech Model with voice cloning | EN | [mlx-community/csm-1b](https://huggingface.co/mlx-community/csm-1b) |\n| **Dia** | Dialogue-focused TTS | EN | [mlx-community/Dia-1.6B-bf16](https://huggingface.co/mlx-community/Dia-1.6B-bf16) |\n| **OuteTTS** | Efficient TTS model | EN | [mlx-community/OuteTTS-0.2-500M](https://huggingface.co/mlx-community/OuteTTS-0.2-500M) |\n| **Spark** | SparkTTS model | EN, ZH | [mlx-community/SparkTTS-0.5B-bf16](https://huggingface.co/mlx-community/SparkTTS-0.5B-bf16) |\n| **Chatterbox** | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | [mlx-community/Chatterbox-bf16](https://huggingface.co/mlx-community/Chatterbox-bf16) |\n| **Soprano** | High-quality TTS | EN | [mlx-community/Soprano-bf16](https://huggingface.co/mlx-community/Soprano-bf16) |\n\n### Speech-to-Text (STT)\n\n| Model | Description | Languages | Repo |\n|-------|-------------|-----------|------|\n| **Whisper** | OpenAI's robust STT model | 99+ languages | [mlx-community/whisper-large-v3-turbo-asr-fp16](https://huggingface.co/mlx-community/whisper-large-v3-turbo-asr-fp16) |\n| **Parakeet** | NVIDIA's accurate STT | EN | [mlx-community/parakeet-tdt-0.6b-v2](https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v2) |\n| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |\n\n### Speech-to-Speech (STS)\n\n| Model | Description | Use Case | Repo |\n|-------|-------------|----------|------|\n| **SAM-Audio** | Text-guided source separation | Extract specific sounds | [mlx-community/sam-audio-large](https://huggingface.co/mlx-community/sam-audio-large) |\n| **Liquid2.5-Audio*** | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | 
[mlx-community/LFM2.5-Audio-1.5B-8bit](https://huggingface.co/mlx-community/LFM2.5-Audio-1.5B-8bit) |\n| **MossFormer2 SE** | Speech enhancement | Noise removal | [starkdmi/MossFormer2_SE_48K_MLX](https://huggingface.co/starkdmi/MossFormer2_SE_48K_MLX) |\n\n## Model Examples\n\n### Kokoro TTS\n\nKokoro is a fast, multilingual TTS model with 54 voice presets.\n\n```python\nfrom mlx_audio.tts.utils import load_model\n\nmodel = load_model(\"mlx-community/Kokoro-82M-bf16\")\n\n# Generate with different voices\nfor result in model.generate(\n    text=\"Welcome to MLX-Audio!\",\n    voice=\"af_heart\",  # American female\n    speed=1.0,\n    lang_code=\"a\"  # American English\n):\n    audio = result.audio\n```\n\n**Available Voices:**\n- American English: `af_heart`, `af_bella`, `af_nova`, `af_sky`, `am_adam`, `am_echo`, etc.\n- British English: `bf_alice`, `bf_emma`, `bm_daniel`, `bm_george`, etc.\n- Japanese: `jf_alpha`, `jm_kumo`, etc.\n- Chinese: `zf_xiaobei`, `zm_yunxi`, etc.\n\n**Language Codes:**\n| Code | Language | Note |\n|------|----------|------|\n| `a` | American English | Default |\n| `b` | British English | |\n| `j` | Japanese | Requires `pip install misaki[ja]` |\n| `z` | Mandarin Chinese | Requires `pip install misaki[zh]` |\n| `e` | Spanish | |\n| `f` | French | |\n\n### Qwen3-TTS\n\nAlibaba's state-of-the-art multilingual TTS with three model variants:\n\n```python\nfrom mlx_audio.tts.utils import load_model\n\n# Base model with predefined voices\nmodel = load_model(\"mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16\")\nresults = list(model.generate(\n    text=\"Hello, welcome to MLX-Audio!\",\n    voice=\"Chelsie\",\n    language=\"English\",\n))\n\n# CustomVoice model - predefined voices with emotion control\nmodel = load_model(\"mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16\")\nresults = list(model.generate_custom_voice(\n    text=\"I'm so excited to meet you!\",\n    speaker=\"Vivian\",\n    language=\"English\",\n    instruct=\"Very happy and 
excited.\",\n))\n\n# VoiceDesign model - create any voice from text description\nmodel = load_model(\"mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16\")\nresults = list(model.generate_voice_design(\n    text=\"Big brother, you're back!\",\n    language=\"English\",\n    instruct=\"A cheerful young female voice with high pitch and energetic tone.\",\n))\n\n# Access generated audio\naudio = results[0].audio  # mx.array\n```\n\n**Available Models:**\n| Model | Method | Description |\n|-------|--------|-------------|\n| `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16` | `generate()` | Fast, predefined voices |\n| `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` | `generate()` | Higher quality |\n| `mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16` | `generate_custom_voice()` | Voices + emotion |\n| `mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16` | `generate_custom_voice()` | Better emotion control |\n| `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16` | `generate_voice_design()` | Create any voice |\n\n**Speakers (Base/CustomVoice):** `Chelsie`, `Ethan`, `Serena`, `Vivian`, `Ryan`, `Aiden`, `Eric`, `Dylan`\n\n### CSM (Voice Cloning)\n\nClone any voice using a reference audio sample:\n\n```bash\nmlx_audio.tts.generate \\\n    --model mlx-community/csm-1b \\\n    --text \"Hello from Sesame.\" \\\n    --ref_audio ./reference_voice.wav \\\n    --play\n```\n\n### Whisper STT\n\n```python\nfrom mlx_audio.stt.utils import load_model, transcribe\n\nmodel = load_model(\"mlx-community/whisper-large-v3-turbo-asr-fp16\")\nresult = transcribe(\"audio.wav\", model=model)\nprint(result[\"text\"])\n```\n\n### SAM-Audio (Source Separation)\n\nSeparate specific sounds from audio using text prompts:\n\n```python\nfrom mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio\n\nmodel = SAMAudio.from_pretrained(\"mlx-community/sam-audio-large\")\nprocessor = SAMAudioProcessor.from_pretrained(\"mlx-community/sam-audio-large\")\n\nbatch = processor(\n    descriptions=[\"A person 
speaking\"],\n    audios=[\"mixed_audio.wav\"],\n)\n\nresult = model.separate_long(\n    batch.audios,\n    descriptions=batch.descriptions,\n    anchors=batch.anchor_ids,\n    chunk_seconds=10.0,\n    overlap_seconds=3.0,\n    ode_opt={\"method\": \"midpoint\", \"step_size\": 2/32},\n)\n\nsave_audio(result.target[0], \"voice.wav\")\nsave_audio(result.residual[0], \"background.wav\")\n```\n\n### MossFormer2 (Speech Enhancement)\n\nRemove noise from speech recordings:\n\n```python\nfrom mlx_audio.sts import MossFormer2SEModel, save_audio\n\nmodel = MossFormer2SEModel.from_pretrained(\"starkdmi/MossFormer2_SE_48K_MLX\")\nenhanced = model.enhance(\"noisy_speech.wav\")\nsave_audio(enhanced, \"clean.wav\", 48000)\n```\n\n## Web Interface \u0026 API Server\n\nMLX-Audio includes a modern web interface and OpenAI-compatible API.\n\n### Starting the Server\n\n```bash\n# Start API server\nmlx_audio.server --host 0.0.0.0 --port 8000\n\n# Start web UI (in another terminal)\ncd mlx_audio/ui\nnpm install \u0026\u0026 npm run dev\n```\n\n### API Endpoints\n\n**Text-to-Speech** (OpenAI-compatible):\n```bash\ncurl -X POST http://localhost:8000/v1/audio/speech \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"mlx-community/Kokoro-82M-bf16\", \"input\": \"Hello!\", \"voice\": \"af_heart\"}' \\\n  --output speech.wav\n```\n\n**Speech-to-Text**:\n```bash\ncurl -X POST http://localhost:8000/v1/audio/transcriptions \\\n  -F \"file=@audio.wav\" \\\n  -F \"model=mlx-community/whisper-large-v3-turbo-asr-fp16\"\n```\n\n## Quantization\n\nReduce model size and improve performance with quantization using the convert script:\n\n```bash\n# Convert and quantize to 4-bit\n# (--upload-repo is optional: pass it only if you want to upload the model to Hugging Face)\npython -m mlx_audio.convert \\\n    --hf-path prince-canuma/Kokoro-82M \\\n    --mlx-path ./Kokoro-82M-4bit \\\n    --quantize \\\n    --q-bits 4 \\\n    --upload-repo username/Kokoro-82M-4bit\n\n# Convert with specific dtype (bfloat16)\n# (--upload-repo is optional here as well)\npython -m mlx_audio.convert \\\n    --hf-path prince-canuma/Kokoro-82M \\\n    --mlx-path ./Kokoro-82M-bf16 \\\n    --dtype bfloat16 \\\n    --upload-repo username/Kokoro-82M-bf16\n```\n\n**Options:**\n| Flag | Description |\n|------|-------------|\n| `--hf-path` | Source Hugging Face model or local path |\n| `--mlx-path` | Output directory for converted model |\n| `-q, --quantize` | Enable quantization |\n| `--q-bits` | Bits per weight (4, 6, or 8) |\n| `--q-group-size` | Group size for quantization (default: 64) |\n| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |\n| `--upload-repo` | Upload converted model to HF Hub |\n\n## Swift\n\nLooking for Swift/iOS support? Check out [mlx-audio-swift](https://github.com/Blaizzy/mlx-audio-swift) for on-device TTS using MLX on macOS and iOS.\n\n## Requirements\n\n- Python 3.10+\n- Apple Silicon Mac (M1/M2/M3/M4)\n- MLX framework\n- **ffmpeg** (required for MP3/FLAC audio encoding)\n- For the web interface and API:\n  - FastAPI\n  - Uvicorn\n\n### Installing ffmpeg\n\nffmpeg is required for saving audio in MP3 or FLAC format. Install it using:\n\n```bash\n# macOS (using Homebrew)\nbrew install ffmpeg\n\n# Ubuntu/Debian\nsudo apt install ffmpeg\n```\n\nWAV format works without ffmpeg.\n\n## License\n\n[MIT License](LICENSE)\n\n## Citation\n\n```bibtex\n@misc{mlx-audio,\n  author = {Canuma, Prince},\n  title = {MLX Audio},\n  year = {2025},\n  howpublished = {\\url{https://github.com/Blaizzy/mlx-audio}},\n  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}\n}\n```\n\n## Acknowledgements\n\n- [Apple MLX Team](https://github.com/ml-explore/mlx) for the MLX framework\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblaizzy%2Fmlx-audio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblaizzy%2Fmlx-audio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblaizzy%2Fmlx-audio/lists"}