- Host: GitHub
- URL: https://github.com/ekhodzitsky/polyvoice
- Owner: ekhodzitsky
- License: mit
- Created: 2026-05-05T13:25:51.000Z (8 days ago)
- Default Branch: master
- Last Pushed: 2026-05-10T12:19:27.000Z (4 days ago)
- Last Synced: 2026-05-10T18:02:41.775Z (3 days ago)
- Topics: audio, diarization, machine-learning, onnx, python-bindings, rust, speaker-diarization, speech, vad, voice
- Language: Rust
- Homepage: https://docs.rs/polyvoice
- Size: 85.9 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: docs/security/audit-2026-05-08.md
- Agents: AGENTS.md
# polyvoice
[CI](https://github.com/ekhodzitsky/polyvoice/actions/workflows/ci.yml) · [crates.io](https://crates.io/crates/polyvoice) · [PyPI](https://pypi.org/project/polyvoice) · [docs.rs](https://docs.rs/polyvoice) · [License](LICENSE)
> **Speaker diarization for Rust — who spoke when, without Python.**
>
> Silero VAD + WeSpeaker embeddings + AHC clustering in a single `Pipeline::run()` call.

```
Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s - 7.60s
        SPEAKER_01: 8.10s - 14.10s
```
## Quick start
### 1. Add the dependency
```toml
[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }
```
### 2. Download models
```bash
bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/
```
### 3. Run the pipeline
```rust,no_run
use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
```
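If you want RTTM output straight from the Rust API (the CLI already provides `--format rttm`), a minimal helper built only on the speaker label and start/end seconds shown above might look like this. This is an illustrative sketch, not part of polyvoice's API; the tuple type stands in for the library's own turn type:

```rust
use std::io::Write;

/// Write diarization turns as NIST RTTM lines:
/// SPEAKER <file-id> 1 <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
fn write_rttm<W: Write>(
    out: &mut W,
    file_id: &str,
    turns: &[(String, f64, f64)], // (speaker label, start secs, end secs)
) -> std::io::Result<()> {
    for (speaker, start, end) in turns {
        writeln!(
            out,
            "SPEAKER {file_id} 1 {start:.3} {:.3} <NA> <NA> {speaker} <NA> <NA>",
            end - start
        )?;
    }
    Ok(())
}
```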
## Python
```bash
pip install polyvoice
```
Or build from source:
```bash
cd python
maturin develop --release
```
```python
import polyvoice
pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")
for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```
## CLI
```bash
cargo install polyvoice --features cli
polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
```
## How it works
```
WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
| Silero VAD  |---->| WeSpeaker        |---->| AHC     |---> Speaker turns
| (speech     |     | ResNet34         |     | cluster |
|  regions)   |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                    fbank + CMVN             cosine similarity
                    lock-free pool           threshold merging
```
**VAD** detects speech regions, skipping silence. **WeSpeaker** extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). **AHC** clusters embeddings by cosine similarity into speaker groups. The `Pipeline` wires it all together.
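To make the clustering step concrete, here is a self-contained sketch of agglomerative clustering over cosine similarity. It illustrates the idea only and is not polyvoice's internal implementation (the library's linkage and optimizations may differ):

```rust
// Illustrative average-linkage AHC over cosine similarity.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-10)
}

/// Greedily merge the two most similar clusters until the best pairwise
/// similarity drops below `threshold`. Returns a cluster (speaker) id per embedding.
fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Each cluster starts as a single embedding index.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        let mut best = (0usize, 0usize, f32::MIN);
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                // Average-linkage: mean pairwise similarity between the two clusters.
                let (mut sum, mut n) = (0.0, 0.0);
                for &a in &clusters[i] {
                    for &b in &clusters[j] {
                        sum += cosine(&embeddings[a], &embeddings[b]);
                        n += 1.0;
                    }
                }
                if sum / n > best.2 {
                    best = (i, j, sum / n);
                }
            }
        }
        if clusters.len() < 2 || best.2 < threshold {
            break;
        }
        let merged = clusters.remove(best.1);
        clusters[best.0].extend(merged);
    }
    // Map embedding index -> cluster id.
    let mut labels = vec![0; embeddings.len()];
    for (id, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = id;
        }
    }
    labels
}
```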
## Comparison with pyannote
| | polyvoice | pyannote |
|---|---|---|
| Language | Rust | Python |
| Runtime | ONNX Runtime | PyTorch |
| GIL-free | Yes | No |
| Binary size | ~30 MB (with models) | ~2 GB (torch + models) |
| Deploy | Single binary / C FFI | Python env + pip |
| Concurrent sessions | Lock-free session pool | Thread-limited |
| Streaming | `OnlineDiarizer` built-in | Third-party wrappers |
pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.
## Minimum Supported Rust Version (MSRV)
1.85 (Rust 2024 edition).
## Accuracy (DER benchmarks)
Evaluated with 0.25s collar on standard diarization benchmarks:
### VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)
| System | DER | Miss | FA | Confusion | Speed |
|--------|-----|------|-----|-----------|-------|
| **polyvoice** (AHC, t=0.45, me=2) | **~15%** | 3.9% | 3.2% | 7.9% | **10.6x RT (CPU)** |
| pyannote 3.0 | ~11% | — | — | — | ~1x RT (GPU) |
### AMI (16 meetings, 9 hours — meeting room recordings)
| System | DER | Miss | FA | Confusion | Speed |
|--------|-----|------|-----|-----------|-------|
| **polyvoice** (AHC, t=0.45, me=2) | **~23%** | 15.4% | 3.5% | 4.1% | 7x RT (CPU) |
| pyannote 3.0 | ~18% | — | — | — | ~1x RT (GPU) |
| Simple i-vector + AHC | ~33% | — | — | — | — |
polyvoice delivers **~80% of pyannote's accuracy at 10x the speed on CPU alone** — no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.
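For reference, DER is the sum of the three error components in the tables above, each a percentage of scored speech time: for the VoxConverse row, 3.9% miss + 3.2% false alarm + 7.9% confusion ≈ 15% DER. A trivial sketch of that decomposition:

```rust
/// DER = missed speech + false alarm + speaker confusion,
/// each in % of scored speech time (0.25 s collar applied).
fn der(miss: f64, false_alarm: f64, confusion: f64) -> f64 {
    miss + false_alarm + confusion
}

fn main() {
    // VoxConverse row above: 3.9 + 3.2 + 7.9 ≈ 15.0
    assert!((der(3.9, 3.2, 7.9) - 15.0).abs() < 1e-6);
}
```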
```bash
# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test
bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4
```
## Features
- **Pipeline API** — `Pipeline::run()` for one-call diarization with VAD + embeddings + clustering.
- **Online & Offline** — `OnlineDiarizer` for real-time streaming, `OfflineDiarizer` for batch files.
- **ONNX-powered** — WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
- **Lock-free session pool** — `crossbeam-queue` backed pool for concurrent ONNX inference.
- **Silero VAD** — integrated voice activity detection with stateful LSTM context.
- **Overlap detection** — find regions where multiple speakers talk simultaneously.
- **Word alignment** — assign speaker IDs to transcript words by timestamp (see the sketch after this list).
- **Python bindings** — `pip install polyvoice`, 3-line API via PyO3/maturin.
- **CLI** — `polyvoice diarize meeting.wav` with text/json/rttm output.
- **C FFI** — drop-in `.so`/`.dylib`/`.dll` for Go, Node.js, C++ callers.
- **Safety verified** — Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
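As an example of what word alignment involves: each transcript word gets the speaker whose turn overlaps its timestamp the most. The sketch below illustrates the idea with stand-in types; it is not polyvoice's actual API:

```rust
// Illustrative word-to-speaker alignment by maximum temporal overlap.
// `Word` and `Turn` are stand-ins, not polyvoice's types.
struct Word { text: String, start: f32, end: f32 }
struct Turn { speaker: usize, start: f32, end: f32 }

fn assign_speakers(words: &[Word], turns: &[Turn]) -> Vec<Option<usize>> {
    words
        .iter()
        .map(|w| {
            turns
                .iter()
                // Overlap between the word and each speaker turn, in seconds.
                .map(|t| (t.speaker, (w.end.min(t.end) - w.start.max(t.start)).max(0.0)))
                .filter(|&(_, overlap)| overlap > 0.0)
                // Pick the turn with the largest overlap; None if no turn overlaps.
                .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .map(|(speaker, _)| speaker)
        })
        .collect()
}
```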
## Configuration
```rust
use polyvoice::{DiarizationConfig, VadConfig, SampleRate};
let config = DiarizationConfig {
    threshold: 0.45,               // cosine similarity threshold
    max_speakers: 64,              // hard speaker limit
    window_secs: 1.5,              // analysis window
    hop_secs: 0.75,                // sliding step
    min_speech_secs: 0.25,         // discard shorter segments
    max_gap_secs: 0.5,             // merge same-speaker gaps under 500 ms
    min_turn_duration_secs: 1.0,   // filter turns shorter than 1 s
    min_embeddings_per_speaker: 2, // merge speakers with <2 embeddings
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,       // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,        // speech probability threshold
    min_silence_ms: 300.0, // minimum silence to split segments
};
```
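To make `window_secs` and `hop_secs` concrete: a 1.5 s window slides over each speech region in 0.75 s steps, and each window position yields one embedding. The arithmetic below is illustrative only; the library's exact segmentation may differ:

```rust
/// Rough number of analysis windows a speech region of `dur_secs` yields
/// (assuming it passes min_speech_secs).
fn num_windows(dur_secs: f32, window_secs: f32, hop_secs: f32) -> usize {
    if dur_secs < window_secs {
        return 1; // shorter regions still produce a single embedding
    }
    ((dur_secs - window_secs) / hop_secs).floor() as usize + 1
}

fn main() {
    // A 7.5 s speech region: (7.5 - 1.5) / 0.75 + 1 = 9 overlapping windows.
    assert_eq!(num_windows(7.5, 1.5, 0.75), 9);
}
```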
## Streaming (real-time)
```rust,no_run
use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback:
# let chunk = vec![0.0f32; 4800];
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
```
## Verification
| Check | Tool |
|-------|------|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
## Roadmap
- [x] WeSpeaker + ECAPA-TDNN ONNX extractors
- [x] Silero VAD integration
- [x] Agglomerative hierarchical clustering (AHC)
- [x] Pipeline API (VAD + embeddings + AHC)
- [x] C FFI bindings
- [x] Miri / Loom / fuzz verification
- [x] Cross-platform CI
- [x] Python bindings (PyO3 / maturin)
- [x] CLI tool (`polyvoice diarize` / `download-models`)
- [x] DER benchmarks on AMI (~23%) and VoxConverse (~15%), 0.25s collar
- [x] Spectral clustering backend (experimental)
- [x] Merge-small-speakers post-processing
- [ ] PLDA scoring backend
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).
## Changelog
See [CHANGELOG.md](CHANGELOG.md).
## License
MIT