https://github.com/soniqo/speech-swift
AI speech toolkit for Apple Silicon – ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML
apple-silicon asr coreml ios macos mlx neural-engine on-device speaker-diarization speech-enhancement speech-recognition speech-to-speech swift text-to-speech tts voice-activity-detection
- Host: GitHub
- URL: https://github.com/soniqo/speech-swift
- Owner: soniqo
- License: apache-2.0
- Created: 2026-02-04T00:52:46.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-30T20:57:05.000Z (24 days ago)
- Last Synced: 2026-03-30T21:06:43.672Z (24 days ago)
- Topics: apple-silicon, asr, coreml, ios, macos, mlx, neural-engine, on-device, speaker-diarization, speech-enhancement, speech-recognition, speech-to-speech, swift, text-to-speech, tts, voice-activity-detection
- Language: Swift
- Homepage: https://soniqo.audio
- Size: 1.41 MB
- Stars: 496
- Watchers: 7
- Forks: 57
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
- awesome-mlx - speech-swift - ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML (Core MLX & Examples)
README
# Speech Swift
AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
Read in: [English](README.md) · [中文](README_zh.md) · [日本語](README_ja.md) · [한국어](README_ko.md) · [Español](README_es.md) · [Deutsch](README_de.md) · [Français](README_fr.md) · [हिन्दी](README_hi.md) · [Português](README_pt.md) · [Русский](README_ru.md)
On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon – no cloud, no API keys, no data leaves your device.
**[Full Documentation →](https://soniqo.audio)** · **[HuggingFace Models](https://huggingface.co/aufklarer)** · **[Blog](https://blog.ivan.digital)**
- **[Qwen3-ASR](https://soniqo.audio/guides/transcribe)** – Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
- **[Parakeet TDT](https://soniqo.audio/guides/parakeet)** – Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- **[Omnilingual ASR](https://soniqo.audio/guides/omnilingual)** – Speech-to-text (Meta wav2vec2 + CTC, **1,672 languages** across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
- **[Streaming Dictation](https://soniqo.audio/guides/dictate)** – Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
- **[Qwen3-ForcedAligner](https://soniqo.audio/guides/align)** – Word-level timestamp alignment (audio + text → timestamps)
- **[Qwen3-TTS](https://soniqo.audio/guides/speak)** – Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
- **[CosyVoice TTS](https://soniqo.audio/guides/cosyvoice)** – Streaming TTS with voice cloning, multi-speaker dialogue, and emotion tags (9 languages)
- **[Kokoro TTS](https://soniqo.audio/guides/kokoro)** – On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
- **[Qwen3.5-Chat](https://soniqo.audio/guides/chat)** – On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
- **[PersonaPlex](https://soniqo.audio/guides/respond)** – Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
- **[DeepFilterNet3](https://soniqo.audio/guides/denoise)** – Real-time noise suppression (2.1M params, 48 kHz)
- **[VAD](https://soniqo.audio/guides/vad)** – Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD with 100+ languages)
- **[Speaker Diarization](https://soniqo.audio/guides/diarize)** – Who spoke when (Pyannote pipeline, Sortformer end-to-end on the Neural Engine)
- **[Speaker Embeddings](https://soniqo.audio/guides/embed-speaker)** – WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)
Papers: [Qwen3-ASR](https://arxiv.org/abs/2601.21337) (Alibaba) · [Qwen3-TTS](https://arxiv.org/abs/2601.15621) (Alibaba) · [Omnilingual ASR](https://arxiv.org/abs/2511.09690) (Meta) · [Parakeet TDT](https://arxiv.org/abs/2304.06795) (NVIDIA) · [CosyVoice 3](https://arxiv.org/abs/2505.17589) (Alibaba) · [Kokoro](https://arxiv.org/abs/2301.01695) (StyleTTS 2) · [PersonaPlex](https://arxiv.org/abs/2602.06053) (NVIDIA) · [Mimi](https://arxiv.org/abs/2410.00037) (Kyutai) · [Sortformer](https://arxiv.org/abs/2409.06656) (NVIDIA)
## News
- **20 Mar 2026** – [We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac](https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174)
- **26 Feb 2026** – [Speaker Diarization and Voice Activity Detection on Apple Silicon: Native Swift with MLX](https://blog.ivan.digital/speaker-diarization-and-voice-activity-detection-on-apple-silicon-native-swift-with-mlx-92ea0c9aca0f)
- **23 Feb 2026** – [NVIDIA PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Native Swift with MLX](https://blog.ivan.digital/nvidia-personaplex-7b-on-apple-silicon-full-duplex-speech-to-speech-in-native-swift-with-mlx-0aa5276f2e23)
- **12 Feb 2026** – [Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon: Architecture and Benchmarks](https://blog.ivan.digital/qwen3-asr-swift-on-device-asr-tts-for-apple-silicon-architecture-and-benchmarks-27cbf1e4463f)
## Quick start
Add the package to your `Package.swift`:
```swift
.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
```
Import only the modules you need – every model is its own SPM library, so you don't pay for what you don't use:
```swift
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI views
```
**Transcribe an audio buffer in 3 lines:**
```swift
import ParakeetStreamingASR
let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
**Live streaming with partials:**
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```
**SwiftUI dictation view in ~10 lines:**
```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI
@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                // `samples` is your captured microphone audio (Float32, 16 kHz)
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```
`SpeechUI` ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
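For playback, standard AVFoundation is enough. Below is a minimal sketch using `AVAudioEngine` (`SamplePlayer` is an illustrative wrapper, not part of this package; it assumes mono Float32 samples such as the 24 kHz TTS output shown later in this README):

```swift
import AVFoundation

// Minimal playback sketch for mono Float32 samples, e.g. TTS output at 24 kHz.
// Keep the player object alive for the duration of playback.
final class SamplePlayer {
    private let engine = AVAudioEngine()
    private let node = AVAudioPlayerNode()

    func play(samples: [Float], sampleRate: Double = 24_000) throws {
        let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!
        engine.attach(node)
        engine.connect(node, to: engine.mainMixerNode, format: format)
        try engine.start()

        // Copy the samples into a PCM buffer and schedule it.
        let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                      frameCapacity: AVAudioFrameCount(samples.count))!
        buffer.frameLength = buffer.frameCapacity
        samples.withUnsafeBufferPointer {
            buffer.floatChannelData![0].update(from: $0.baseAddress!, count: samples.count)
        }

        node.scheduleBuffer(buffer, completionHandler: nil)
        node.play()
    }
}
```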
Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.
## Models
Compact view below. **[Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture](https://soniqo.audio/architecture)**.
| Model | Task | Backends | Sizes | Languages |
|-------|------|----------|-------|-----------|
| [Qwen3-ASR](https://soniqo.audio/guides/transcribe) | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| [Parakeet TDT](https://soniqo.audio/guides/parakeet) | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| [Parakeet EOU](https://soniqo.audio/guides/dictate) | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| [Omnilingual ASR](https://soniqo.audio/guides/omnilingual) | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | **[1,672](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py)** |
| [Qwen3-ForcedAligner](https://soniqo.audio/guides/align) | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| [Qwen3-TTS](https://soniqo.audio/guides/speak) | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| [CosyVoice3](https://soniqo.audio/guides/cosyvoice) | Text → Speech | MLX | 0.5B | 9 |
| [Kokoro-82M](https://soniqo.audio/guides/kokoro) | Text → Speech | CoreML (ANE) | 82M | 10 |
| [Qwen3.5-Chat](https://soniqo.audio/guides/chat) | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| [PersonaPlex](https://soniqo.audio/guides/respond) | Speech → Speech | MLX | 7B | EN |
| [Silero VAD](https://soniqo.audio/guides/vad) | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| [Pyannote](https://soniqo.audio/guides/diarize) | VAD + Diarization | MLX | 1.5M | Agnostic |
| [Sortformer](https://soniqo.audio/guides/diarize) | Diarization (E2E) | CoreML (ANE) | – | Agnostic |
| [DeepFilterNet3](https://soniqo.audio/guides/denoise) | Speech Enhancement | CoreML | 2.1M | Agnostic |
| [WeSpeaker](https://soniqo.audio/guides/embed-speaker) | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |
## Installation
### Homebrew
Requires native ARM Homebrew (`/opt/homebrew`). Rosetta/x86_64 Homebrew is not supported.
```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```
Then:
```bash
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
```
**[Full CLI reference →](https://soniqo.audio/cli)**
### Swift Package Manager
```swift
dependencies: [
.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]
```
Import only what you need – every model is its own SPM target:
```swift
import Qwen3ASR // Speech recognition (MLX)
import ParakeetASR // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import OmnilingualASR // 1,672 languages (CoreML + MLX)
import Qwen3TTS // Text-to-speech
import CosyVoiceTTS // Text-to-speech with voice cloning
import KokoroTTS // Text-to-speech (iOS-ready)
import Qwen3Chat // On-device LLM chat
import PersonaPlex // Full-duplex speech-to-speech
import SpeechVAD // VAD + speaker diarization + embeddings
import SpeechEnhancement // Noise suppression
import SpeechUI // SwiftUI components for streaming transcripts
import AudioCommon // Shared protocols and utilities
```
### Requirements
- Swift 5.9+, Xcode 15+ (with Metal Toolchain)
- macOS 14+ or iOS 17+, Apple Silicon (M1/M2/M3/M4)
### Build from source
```bash
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```
`make build` compiles the Swift package **and** the MLX Metal shader library. The Metal library is required for GPU inference – without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.
**[Full build and install guide →](https://soniqo.audio/getting-started)**
## Demo apps
- **[DictateDemo](Examples/DictateDemo/)** ([docs](https://soniqo.audio/guides/dictate)) – macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- **[iOSEchoDemo](Examples/iOSEchoDemo/)** – iOS echo demo (Parakeet ASR + Kokoro TTS). Runs on device and simulator.
- **[PersonaPlexDemo](Examples/PersonaPlexDemo/)** – Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
- **[SpeechDemo](Examples/SpeechDemo/)** – Dictation and TTS synthesis in a tabbed interface. macOS.
Each demo's README has build instructions.
## Code examples
The snippets below show the minimal path for each domain. Every section links to a full guide on [soniqo.audio](https://soniqo.audio) with configuration options, multiple backends, streaming patterns, and CLI recipes.
### Speech-to-Text – [full guide →](https://soniqo.audio/guides/transcribe)
```swift
import Qwen3ASR
let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```
Alternative backends: [Parakeet TDT](https://soniqo.audio/guides/parakeet) (CoreML, 32× realtime), [Omnilingual ASR](https://soniqo.audio/guides/omnilingual) (1,672 languages, CoreML or MLX), [Streaming dictation](https://soniqo.audio/guides/dictate) (live partials).
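Swapping backends is essentially an import change. As a hypothetical sketch for `OmnilingualASR` (the module name comes from this package's product list, but the type and method names below are assumptions; check the guide for the real API):

```swift
import OmnilingualASR

// Hypothetical sketch: assumes OmnilingualASR mirrors the fromPretrained()
// pattern used by the other modules in this README.
let model = try await OmnilingualASRModel.fromPretrained()
let text = try model.transcribe(audio: audioSamples, sampleRate: 16000)
print(text)
```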
### Forced Alignment – [full guide →](https://soniqo.audio/guides/align)
```swift
import Qwen3ASR
let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```
### Text-to-Speech – [full guide →](https://soniqo.audio/guides/speak)
```swift
import Qwen3TTS
import AudioCommon
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```
Alternative TTS engines: [CosyVoice3](https://soniqo.audio/guides/cosyvoice) (streaming + voice cloning + emotion tags), [Kokoro-82M](https://soniqo.audio/guides/kokoro) (iOS-ready, 54 voices), [Voice cloning](https://soniqo.audio/guides/voice-cloning).
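As a hedged sketch of what voice cloning looks like in code: it assumes `CosyVoiceTTSModel` exposes a `fromPretrained()` entry point like the other modules, and the `referenceAudio:`/`referenceText:` labels are illustrative rather than the package's confirmed signature; see the CosyVoice guide for the real one.

```swift
import CosyVoiceTTS
import AudioCommon

// Illustrative only: type and parameter names are assumptions based on this
// README's conventions; consult the CosyVoice guide for the actual API.
let tts = try await CosyVoiceTTSModel.fromPretrained()
let cloned = tts.synthesize(
    text: "Nice to meet you.",
    referenceAudio: referenceSamples,  // a few seconds of the target speaker
    referenceText: "Transcript of the reference clip."
)
try WAVWriter.write(samples: cloned, sampleRate: 24000, to: outputURL)
```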
### Speech-to-Speech – [full guide →](https://soniqo.audio/guides/respond)
```swift
import PersonaPlex
let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```
### LLM Chat – [full guide →](https://soniqo.audio/guides/chat)
```swift
import Qwen3Chat
let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```
### Voice Activity Detection – [full guide →](https://soniqo.audio/guides/vad)
```swift
import SpeechVAD
let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s – \(s.endTime)s") }
```
### Speaker Diarization – [full guide →](https://soniqo.audio/guides/diarize)
```swift
import SpeechVAD
let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
```
### Speech Enhancement – [full guide →](https://soniqo.audio/guides/denoise)
```swift
import SpeechEnhancement
let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```
### Voice Pipeline (ASR → LLM → TTS) – [full guide →](https://soniqo.audio/api)
```swift
import SpeechCore
let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```
`VoicePipeline` is the real-time voice-agent state machine (powered by [speech-core](https://github.com/soniqo/speech-core)) with VAD-driven turn detection, interruption handling, and eager STT. It connects any `SpeechRecognitionModel` + `SpeechGenerationModel` + `StreamingVADProvider`.
### HTTP API server
```bash
audio-server --port 8080
```
Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. See [`Sources/AudioServer/`](Sources/AudioServer/).
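For example, a client can attach to the realtime socket with Foundation's `URLSessionWebSocketTask`; the port and path below come from the command above, and the message schema is whatever the OpenAI Realtime API defines:

```swift
import Foundation

// Connect to the local realtime endpoint and print the first server event.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

socket.receive { result in
    switch result {
    case .success(.string(let text)): print("event: \(text)")
    case .success(.data(let data)):   print("binary frame: \(data.count) bytes")
    case .failure(let error):         print("socket error: \(error)")
    default:                          break
    }
}
```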
## Architecture
speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in `AudioCommon` (protocols, audio I/O, HuggingFace downloader, `SentencePieceModel`) and `MLXCommon` (weight loading, `QuantizedLinear` helpers, `SDPA` multi-head attention helper).
**[Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture](https://soniqo.audio/architecture)** · **[API reference → soniqo.audio/api](https://soniqo.audio/api)** · **[Benchmarks → soniqo.audio/benchmarks](https://soniqo.audio/benchmarks)**
Local docs (repo):
- **Models:** [Qwen3-ASR](docs/models/asr-model.md) · [Qwen3-TTS](docs/models/tts-model.md) · [CosyVoice](docs/models/cosyvoice-tts.md) · [Kokoro](docs/models/kokoro-tts.md) · [Parakeet TDT](docs/models/parakeet-asr.md) · [Parakeet Streaming](docs/models/parakeet-streaming-asr.md) · [Omnilingual ASR](docs/models/omnilingual-asr.md) · [PersonaPlex](docs/models/personaplex.md) · [FireRedVAD](docs/models/fireredvad.md)
- **Inference:** [Qwen3-ASR](docs/inference/qwen3-asr-inference.md) · [Parakeet TDT](docs/inference/parakeet-asr-inference.md) · [Parakeet Streaming](docs/inference/parakeet-streaming-asr-inference.md) · [Omnilingual ASR](docs/inference/omnilingual-asr-inference.md) · [TTS](docs/inference/qwen3-tts-inference.md) · [Forced Aligner](docs/inference/forced-aligner.md) · [Silero VAD](docs/inference/silero-vad.md) · [Speaker Diarization](docs/inference/speaker-diarization.md) · [Speech Enhancement](docs/inference/speech-enhancement.md)
- **Reference:** [Shared Protocols](docs/shared-protocols.md)
## Cache configuration
Model weights download from HuggingFace on first use and cache to `~/Library/Caches/qwen3-speech/`. Override with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip network when weights are already cached.
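A sketch of the Swift side, using the two options named above (shown on `Qwen3ASRModel`, but per the text every `fromPretrained()` accepts them; whether `cacheDir:` takes a `URL` or a path string is an assumption here):

```swift
import Foundation
import Qwen3ASR

// Custom cache location + no network: both options come from the section above.
let model = try await Qwen3ASRModel.fromPretrained(
    cacheDir: URL(fileURLWithPath: "/Volumes/Models/speech-cache"),
    offlineMode: true  // fail fast instead of downloading if weights are absent
)
```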
See [`docs/inference/cache-and-offline.md`](docs/inference/cache-and-offline.md) for full details including sandboxed iOS container paths.
## MLX Metal library
If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build` or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:
```bash
xcodebuild -downloadComponent MetalToolchain
```
## Testing
```bash
make test # full suite (unit + E2E with model downloads)
swift test --skip E2E # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests # specific module
```
E2E test classes use the `E2E` prefix so CI can filter them out with `--skip E2E`. See [CLAUDE.md](CLAUDE.md#testing) for the full testing convention.
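A minimal sketch of the convention (class names here are illustrative, not actual test classes from this repo):

```swift
import XCTest

// Unit test: no model downloads, safe to run in CI.
final class Qwen3ASRTokenizerTests: XCTestCase {
    func testTokenizerRoundTrip() { /* ... */ }
}

// End-to-end test: downloads weights; the E2E prefix lets CI skip it
// with `swift test --skip E2E`.
final class E2EQwen3ASRTranscriptionTests: XCTestCase {
    func testTranscribesFixtureAudio() async throws { /* ... */ }
}
```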
## Contributing
PRs welcome – bug fixes, new model integrations, and documentation. Fork, create a feature branch, run `make build && make test`, then open a PR against `main`.
## License
Apache 2.0