# Speech Swift

AI speech models for Apple Silicon, powered by MLX Swift and CoreML.

📖 Read in: [English](README.md) · [中文](README_zh.md) · [日本語](README_ja.md) · [한국어](README_ko.md) · [Español](README_es.md) · [Deutsch](README_de.md) · [Français](README_fr.md) · [हिन्दी](README_hi.md) · [Português](README_pt.md) · [Русский](README_ru.md)

On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.

**[📚 Full Documentation →](https://soniqo.audio)** · **[🤗 HuggingFace Models](https://huggingface.co/aufklarer)** · **[📝 Blog](https://blog.ivan.digital)**

- **[Qwen3-ASR](https://soniqo.audio/guides/transcribe)** — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
- **[Parakeet TDT](https://soniqo.audio/guides/parakeet)** — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- **[Omnilingual ASR](https://soniqo.audio/guides/omnilingual)** — Speech-to-text (Meta wav2vec2 + CTC, **1,672 languages** across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
- **[Streaming Dictation](https://soniqo.audio/guides/dictate)** — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
- **[Qwen3-ForcedAligner](https://soniqo.audio/guides/align)** — Word-level timestamp alignment (audio + text → timestamps)
- **[Qwen3-TTS](https://soniqo.audio/guides/speak)** — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
- **[CosyVoice TTS](https://soniqo.audio/guides/cosyvoice)** — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
- **[Kokoro TTS](https://soniqo.audio/guides/kokoro)** — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
- **[Qwen3.5-Chat](https://soniqo.audio/guides/chat)** — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
- **[PersonaPlex](https://soniqo.audio/guides/respond)** — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
- **[DeepFilterNet3](https://soniqo.audio/guides/denoise)** — Real-time noise suppression (2.1M params, 48 kHz)
- **[VAD](https://soniqo.audio/guides/vad)** — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
- **[Speaker Diarization](https://soniqo.audio/guides/diarize)** — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
- **[Speaker Embeddings](https://soniqo.audio/guides/embed-speaker)** — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)

Papers: [Qwen3-ASR](https://arxiv.org/abs/2601.21337) (Alibaba) · [Qwen3-TTS](https://arxiv.org/abs/2601.15621) (Alibaba) · [Omnilingual ASR](https://arxiv.org/abs/2511.09690) (Meta) · [Parakeet TDT](https://arxiv.org/abs/2304.06795) (NVIDIA) · [CosyVoice 3](https://arxiv.org/abs/2505.17589) (Alibaba) · [Kokoro](https://arxiv.org/abs/2301.01695) (StyleTTS 2) · [PersonaPlex](https://arxiv.org/abs/2602.06053) (NVIDIA) · [Mimi](https://arxiv.org/abs/2410.00037) (Kyutai) · [Sortformer](https://arxiv.org/abs/2409.06656) (NVIDIA)

## News

- **20 Mar 2026** — [We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac](https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174)
- **26 Feb 2026** — [Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX](https://blog.ivan.digital/speaker-diarization-and-voice-activity-detection-on-apple-silicon-native-swift-with-mlx-92ea0c9aca0f)
- **23 Feb 2026** — [NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX](https://blog.ivan.digital/nvidia-personaplex-7b-on-apple-silicon-full-duplex-speech-to-speech-in-native-swift-with-mlx-0aa5276f2e23)
- **12 Feb 2026** — [Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks](https://blog.ivan.digital/qwen3-asr-swift-on-device-asr-tts-for-apple-silicon-architecture-and-benchmarks-27cbf1e4463f)

## Quick start

Add the package to your `Package.swift`:

```swift
.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
```

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

```swift
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI views
```

**Transcribe an audio buffer in 3 lines:**

```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```

**Live streaming with partials:**

```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```

**SwiftUI dictation view in ~10 lines:**

```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```

`SpeechUI` ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
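Since `SpeechUI` leaves playback to AVFoundation, here is a minimal sketch for playing back mono `Float32` samples (such as TTS output) with `AVAudioEngine`. It uses only AVFoundation APIs and assumes nothing about speech-swift itself:

```swift
import AVFoundation

// Play mono Float32 samples through the default output device.
// Pure AVFoundation; no speech-swift types involved.
func play(samples: [Float], sampleRate: Double) throws {
    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!
    let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                  frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = buffer.frameCapacity
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }

    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    engine.attach(player)
    engine.connect(player, to: engine.mainMixerNode, format: format)
    try engine.start()
    player.scheduleBuffer(buffer, completionHandler: nil)
    player.play()
    // Note: keep `engine` and `player` alive for the duration of playback
    // (e.g. store them in your view model); this function returns immediately.
}
```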

Available SPM products: `Qwen3ASR`, `Qwen3TTS`, `Qwen3TTSCoreML`, `ParakeetASR`, `ParakeetStreamingASR`, `OmnilingualASR`, `KokoroTTS`, `CosyVoiceTTS`, `PersonaPlex`, `SpeechVAD`, `SpeechEnhancement`, `Qwen3Chat`, `SpeechCore`, `SpeechUI`, `AudioCommon`.

## Models

Compact view below. **[Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture](https://soniqo.audio/architecture)**.

| Model | Task | Backends | Sizes | Languages |
|-------|------|----------|-------|-----------|
| [Qwen3-ASR](https://soniqo.audio/guides/transcribe) | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| [Parakeet TDT](https://soniqo.audio/guides/parakeet) | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| [Parakeet EOU](https://soniqo.audio/guides/dictate) | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| [Omnilingual ASR](https://soniqo.audio/guides/omnilingual) | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | **[1,672](https://github.com/facebookresearch/omnilingual-asr/blob/main/src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py)** |
| [Qwen3-ForcedAligner](https://soniqo.audio/guides/align) | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| [Qwen3-TTS](https://soniqo.audio/guides/speak) | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| [CosyVoice3](https://soniqo.audio/guides/cosyvoice) | Text → Speech | MLX | 0.5B | 9 |
| [Kokoro-82M](https://soniqo.audio/guides/kokoro) | Text → Speech | CoreML (ANE) | 82M | 10 |
| [Qwen3.5-Chat](https://soniqo.audio/guides/chat) | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| [PersonaPlex](https://soniqo.audio/guides/respond) | Speech → Speech | MLX | 7B | EN |
| [Silero VAD](https://soniqo.audio/guides/vad) | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| [Pyannote](https://soniqo.audio/guides/diarize) | VAD + Diarization | MLX | 1.5M | Agnostic |
| [Sortformer](https://soniqo.audio/guides/diarize) | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| [DeepFilterNet3](https://soniqo.audio/guides/denoise) | Speech Enhancement | CoreML | 2.1M | Agnostic |
| [WeSpeaker](https://soniqo.audio/guides/embed-speaker) | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |

## Installation

### Homebrew

Requires native ARM Homebrew (`/opt/homebrew`). Rosetta/x86_64 Homebrew is not supported.

```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```

Then:

```bash
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
```

**[Full CLI reference →](https://soniqo.audio/cli)**

### Swift Package Manager

```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]
```

Import only what you need — every model is its own SPM target:

```swift
import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import KokoroTTS            // Text-to-speech (iOS-ready)
import Qwen3Chat            // On-device LLM chat
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities
```

### Requirements

- Swift 5.9+, Xcode 15+ (with Metal Toolchain)
- macOS 14+ or iOS 17+, Apple Silicon (M1/M2/M3/M4)

### Build from source

```bash
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```

`make build` compiles the Swift package **and** the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.

**[Full build and install guide →](https://soniqo.audio/getting-started)**

## Demo apps

- **[DictateDemo](Examples/DictateDemo/)** ([docs](https://soniqo.audio/guides/dictate)) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- **[iOSEchoDemo](Examples/iOSEchoDemo/)** — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
- **[PersonaPlexDemo](Examples/PersonaPlexDemo/)** — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
- **[SpeechDemo](Examples/SpeechDemo/)** — Dictation and TTS synthesis in a tabbed interface. macOS.

Each demo's README has build instructions.

## Code examples

The snippets below show the minimal path for each domain. Every section links to a full guide on [soniqo.audio](https://soniqo.audio) with configuration options, multiple backends, streaming patterns, and CLI recipes.

### Speech-to-Text — [full guide →](https://soniqo.audio/guides/transcribe)

```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```

Alternative backends: [Parakeet TDT](https://soniqo.audio/guides/parakeet) (CoreML, 32× realtime), [Omnilingual ASR](https://soniqo.audio/guides/omnilingual) (1,672 languages, CoreML or MLX), [Streaming dictation](https://soniqo.audio/guides/dictate) (live partials).

### Forced Alignment — [full guide →](https://soniqo.audio/guides/align)

```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```

### Text-to-Speech — [full guide →](https://soniqo.audio/guides/speak)

```swift
import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```

Alternative TTS engines: [CosyVoice3](https://soniqo.audio/guides/cosyvoice) (streaming + voice cloning + emotion tags), [Kokoro-82M](https://soniqo.audio/guides/kokoro) (iOS-ready, 54 voices), [Voice cloning](https://soniqo.audio/guides/voice-cloning).

### Speech-to-Speech — [full guide →](https://soniqo.audio/guides/respond)

```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```

### LLM Chat — [full guide →](https://soniqo.audio/guides/chat)

```swift
import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```

### Voice Activity Detection — [full guide →](https://soniqo.audio/guides/vad)

```swift
import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
```

### Speaker Diarization — [full guide →](https://soniqo.audio/guides/diarize)

```swift
import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
```

### Speech Enhancement — [full guide →](https://soniqo.audio/guides/denoise)

```swift
import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```

### Voice Pipeline (ASR → LLM → TTS) — [full guide →](https://soniqo.audio/api)

```swift
import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```

`VoicePipeline` is the real-time voice-agent state machine (powered by [speech-core](https://github.com/soniqo/speech-core)) with VAD-driven turn detection, interruption handling, and eager STT. It connects any `SpeechRecognitionModel` + `SpeechGenerationModel` + `StreamingVADProvider`.

### HTTP API server

```bash
audio-server --port 8080
```

Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. See [`Sources/AudioServer/`](Sources/AudioServer/).
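Only the `/v1/realtime` WebSocket path is documented here; a quick smoke test, assuming the server listens on localhost, might look like this (`websocat` is a generic WebSocket client, any client works):

```bash
# Start the server on the port from the section above
audio-server --port 8080

# Connect to the OpenAI Realtime API-compatible WebSocket endpoint.
# The REST routes are not listed here; see Sources/AudioServer/ for those.
websocat ws://localhost:8080/v1/realtime
```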

## Architecture

speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in `AudioCommon` (protocols, audio I/O, HuggingFace downloader, `SentencePieceModel`) and `MLXCommon` (weight loading, `QuantizedLinear` helpers, `SDPA` multi-head attention helper).

**[Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture](https://soniqo.audio/architecture)** · **[API reference → soniqo.audio/api](https://soniqo.audio/api)** · **[Benchmarks → soniqo.audio/benchmarks](https://soniqo.audio/benchmarks)**

Local docs (repo):
- **Models:** [Qwen3-ASR](docs/models/asr-model.md) · [Qwen3-TTS](docs/models/tts-model.md) · [CosyVoice](docs/models/cosyvoice-tts.md) · [Kokoro](docs/models/kokoro-tts.md) · [Parakeet TDT](docs/models/parakeet-asr.md) · [Parakeet Streaming](docs/models/parakeet-streaming-asr.md) · [Omnilingual ASR](docs/models/omnilingual-asr.md) · [PersonaPlex](docs/models/personaplex.md) · [FireRedVAD](docs/models/fireredvad.md)
- **Inference:** [Qwen3-ASR](docs/inference/qwen3-asr-inference.md) · [Parakeet TDT](docs/inference/parakeet-asr-inference.md) · [Parakeet Streaming](docs/inference/parakeet-streaming-asr-inference.md) · [Omnilingual ASR](docs/inference/omnilingual-asr-inference.md) · [TTS](docs/inference/qwen3-tts-inference.md) · [Forced Aligner](docs/inference/forced-aligner.md) · [Silero VAD](docs/inference/silero-vad.md) · [Speaker Diarization](docs/inference/speaker-diarization.md) · [Speech Enhancement](docs/inference/speech-enhancement.md)
- **Reference:** [Shared Protocols](docs/shared-protocols.md)

## Cache configuration

Model weights download from HuggingFace on first use and cache to `~/Library/Caches/qwen3-speech/`. Override with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip network when weights are already cached.
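Both knobs in one place: a sketch assuming the `cacheDir:` and `offlineMode:` parameters described above apply to `fromPretrained()` as shown (the exact signature may differ per model; check the API reference):

```swift
import Qwen3ASR

// Sketch only: cacheDir and offlineMode are documented above, but the
// precise fromPretrained signature is an assumption for this example.
let customCache = URL(fileURLWithPath: "/Volumes/Models/speech-cache")
let model = try await Qwen3ASRModel.fromPretrained(
    cacheDir: customCache,
    offlineMode: true   // skip the network; fail if weights aren't cached
)
```

The CLI equivalent sets the environment variable instead: `QWEN3_CACHE_DIR=/Volumes/Models/speech-cache audio transcribe recording.wav`.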

See [`docs/inference/cache-and-offline.md`](docs/inference/cache-and-offline.md) for full details including sandboxed iOS container paths.

## MLX Metal library

If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build` or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:

```bash
xcodebuild -downloadComponent MetalToolchain
```

## Testing

```bash
make test                          # full suite (unit + E2E with model downloads)
swift test --skip E2E              # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests  # specific module
```

E2E test classes use the `E2E` prefix so CI can filter them out with `--skip E2E`. See [CLAUDE.md](CLAUDE.md#testing) for the full testing convention.

## Contributing

PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, `make build && make test`, open a PR against `main`.

## License

Apache 2.0