- Host: GitHub
- URL: https://github.com/ekhodzitsky/polyvoice
- Owner: ekhodzitsky
- License: mit
- Created: 2026-05-05T13:25:51.000Z (8 days ago)
- Default Branch: master
- Last Pushed: 2026-05-10T12:19:27.000Z (4 days ago)
- Last Synced: 2026-05-10T18:02:41.775Z (3 days ago)
- Topics: audio, diarization, machine-learning, onnx, python-bindings, rust, speaker-diarization, speech, vad, voice
- Language: Rust
- Homepage: https://docs.rs/polyvoice
- Size: 85.9 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: docs/security/audit-2026-05-08.md
- Agents: AGENTS.md
# polyvoice
[CI](https://github.com/ekhodzitsky/polyvoice/actions/workflows/ci.yml) · [crates.io](https://crates.io/crates/polyvoice) · [PyPI](https://pypi.org/project/polyvoice) · [docs.rs](https://docs.rs/polyvoice) · [License](LICENSE)
> **Speaker diarization for Rust — who spoke when, without Python.**
>
> Silero VAD + WeSpeaker embeddings + AHC clustering in a single `Pipeline::run()` call.

```
Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s - 7.60s
        SPEAKER_01: 8.10s - 14.10s
```
## Quick start
### 1. Add the dependency
```toml
[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }
```
### 2. Download models
```bash
bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/
```
### 3. Run the pipeline
```rust,no_run
use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
```
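If you want RTTM output straight from the Rust API (the CLI already provides `--format rttm`), a minimal helper built only on the speaker label and start/end seconds shown above might look like this. This is an illustrative sketch, not part of polyvoice's API; the tuple type stands in for the library's own turn type:

```rust
use std::io::Write;

/// Write diarization turns as NIST RTTM lines:
/// SPEAKER <file-id> 1 <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
fn write_rttm<W: Write>(
    out: &mut W,
    file_id: &str,
    turns: &[(String, f64, f64)], // (speaker label, start secs, end secs)
) -> std::io::Result<()> {
    for (speaker, start, end) in turns {
        writeln!(
            out,
            "SPEAKER {file_id} 1 {start:.3} {:.3} <NA> <NA> {speaker} <NA> <NA>",
            end - start
        )?;
    }
    Ok(())
}
```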
## Python
```bash
pip install polyvoice
```
Or build from source:
```bash
cd python
maturin develop --release
```
```python
import polyvoice
pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")
for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```
## CLI
```bash
cargo install polyvoice --features cli
polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
```
## How it works
```
WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
| Silero VAD  |---->| WeSpeaker        |---->| AHC     |---> Speaker turns
| (speech     |     | ResNet34         |     | cluster |
|  regions)   |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                    fbank + CMVN             cosine similarity
                    lock-free pool           threshold merging
```
**VAD** detects speech regions, skipping silence. **WeSpeaker** extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). **AHC** clusters embeddings by cosine similarity into speaker groups. The `Pipeline` wires it all together.
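To make the clustering step concrete, here is a self-contained sketch of agglomerative clustering over cosine similarity. It illustrates the idea only and is not polyvoice's internal implementation (the library's linkage and optimizations may differ):

```rust
// Illustrative average-linkage AHC over cosine similarity.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-10)
}

/// Greedily merge the two most similar clusters until the best pairwise
/// similarity drops below `threshold`. Returns a cluster (speaker) id per embedding.
fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Each cluster starts as a single embedding index.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        let mut best = (0usize, 0usize, f32::MIN);
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                // Average-linkage: mean pairwise similarity between the two clusters.
                let (mut sum, mut n) = (0.0, 0.0);
                for &a in &clusters[i] {
                    for &b in &clusters[j] {
                        sum += cosine(&embeddings[a], &embeddings[b]);
                        n += 1.0;
                    }
                }
                if sum / n > best.2 {
                    best = (i, j, sum / n);
                }
            }
        }
        if clusters.len() < 2 || best.2 < threshold {
            break;
        }
        let merged = clusters.remove(best.1);
        clusters[best.0].extend(merged);
    }
    // Map embedding index -> cluster id.
    let mut labels = vec![0; embeddings.len()];
    for (id, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = id;
        }
    }
    labels
}
```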
## Comparison with pyannote
| | polyvoice | pyannote |
|---|---|---|
| Language | Rust | Python |
| Runtime | ONNX Runtime | PyTorch |
| GIL-free | Yes | No |
| Binary size | ~30 MB (with models) | ~2 GB (torch + models) |
| Deploy | Single binary / C FFI | Python env + pip |
| Concurrent sessions | Lock-free session pool | Thread-limited |
| Streaming | `OnlineDiarizer` built-in | Third-party wrappers |
pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.
## Minimum Supported Rust Version (MSRV)
1.85 (Rust 2024 edition).
## Accuracy (DER benchmarks)
Evaluated with 0.25s collar on standard diarization benchmarks:
### VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)
| System | DER | Miss | FA | Confusion | Speed |
|--------|-----|------|-----|-----------|-------|
| **polyvoice** (AHC, t=0.45, me=2) | **~15%** | 3.9% | 3.2% | 7.9% | **10.6x RT (CPU)** |
| pyannote 3.0 | ~11% | — | — | — | ~1x RT (GPU) |
### AMI (16 meetings, 9 hours — meeting room recordings)
| System | DER | Miss | FA | Confusion | Speed |
|--------|-----|------|-----|-----------|-------|
| **polyvoice** (AHC, t=0.45, me=2) | **~23%** | 15.4% | 3.5% | 4.1% | 7x RT (CPU) |
| pyannote 3.0 | ~18% | — | — | — | ~1x RT (GPU) |
| Simple i-vector + AHC | ~33% | — | — | — | — |
polyvoice delivers **~80% of pyannote's accuracy at 10x the speed on CPU alone** — no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.
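For reference, DER is the sum of the three error components in the tables above, each a percentage of scored speech time: for the VoxConverse row, 3.9% miss + 3.2% false alarm + 7.9% confusion ≈ 15% DER. A trivial sketch of that decomposition:

```rust
/// DER = missed speech + false alarm + speaker confusion,
/// each in % of scored speech time (0.25 s collar applied).
fn der(miss: f64, false_alarm: f64, confusion: f64) -> f64 {
    miss + false_alarm + confusion
}

fn main() {
    // VoxConverse row above: 3.9 + 3.2 + 7.9 ≈ 15.0
    assert!((der(3.9, 3.2, 7.9) - 15.0).abs() < 1e-6);
}
```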
```bash
# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test
bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4
```
## Features
- **Pipeline API** — `Pipeline::run()` for one-call diarization with VAD + embeddings + clustering.
- **Online & Offline** — `OnlineDiarizer` for real-time streaming, `OfflineDiarizer` for batch files.
- **ONNX-powered** — WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
- **Lock-free session pool** — `crossbeam-queue` backed pool for concurrent ONNX inference.
- **Silero VAD** — integrated voice activity detection with stateful LSTM context.
- **Overlap detection** — find regions where multiple speakers talk simultaneously.
- **Word alignment** — assign speaker IDs to transcript words by timestamp (see the sketch after this list).
- **Python bindings** — `pip install polyvoice`, 3-line API via PyO3/maturin.
- **CLI** — `polyvoice diarize meeting.wav` with text/json/rttm output.
- **C FFI** — drop-in `.so`/`.dylib`/`.dll` for Go, Node.js, C++ callers.
- **Safety verified** — Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
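As an example of what word alignment involves: each transcript word gets the speaker whose turn overlaps its timestamp the most. The sketch below illustrates the idea with stand-in types; it is not polyvoice's actual API:

```rust
// Illustrative word-to-speaker alignment by maximum temporal overlap.
// `Word` and `Turn` are stand-ins, not polyvoice's types.
struct Word { text: String, start: f32, end: f32 }
struct Turn { speaker: usize, start: f32, end: f32 }

fn assign_speakers(words: &[Word], turns: &[Turn]) -> Vec<Option<usize>> {
    words
        .iter()
        .map(|w| {
            turns
                .iter()
                // Overlap between the word and each speaker turn, in seconds.
                .map(|t| (t.speaker, (w.end.min(t.end) - w.start.max(t.start)).max(0.0)))
                .filter(|&(_, overlap)| overlap > 0.0)
                // Pick the turn with the largest overlap; None if no turn overlaps.
                .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .map(|(speaker, _)| speaker)
        })
        .collect()
}
```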
## Configuration
```rust
use polyvoice::{DiarizationConfig, VadConfig, SampleRate};
let config = DiarizationConfig {
    threshold: 0.45,               // cosine similarity threshold
    max_speakers: 64,              // hard speaker limit
    window_secs: 1.5,              // analysis window
    hop_secs: 0.75,                // sliding step
    min_speech_secs: 0.25,         // discard shorter segments
    max_gap_secs: 0.5,             // merge same-speaker gaps under 500 ms
    min_turn_duration_secs: 1.0,   // filter turns shorter than 1 s
    min_embeddings_per_speaker: 2, // merge speakers with <2 embeddings
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,       // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,        // speech probability threshold
    min_silence_ms: 300.0, // minimum silence to split segments
};
```
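To make `window_secs` and `hop_secs` concrete: a 1.5 s window slides over each speech region in 0.75 s steps, and each window position yields one embedding. The arithmetic below is illustrative only; the library's exact segmentation may differ:

```rust
/// Rough number of analysis windows a speech region of `dur_secs` yields
/// (assuming it passes min_speech_secs).
fn num_windows(dur_secs: f32, window_secs: f32, hop_secs: f32) -> usize {
    if dur_secs < window_secs {
        return 1; // shorter regions still produce a single embedding
    }
    ((dur_secs - window_secs) / hop_secs).floor() as usize + 1
}

fn main() {
    // A 7.5 s speech region: (7.5 - 1.5) / 0.75 + 1 = 9 overlapping windows.
    assert_eq!(num_windows(7.5, 1.5, 0.75), 9);
}
```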
## Streaming (real-time)
```rust,no_run
use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback:
# let chunk = vec![0.0f32; 4800];
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
```
## Verification
| Check | Tool |
|-------|------|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
## Roadmap
- [x] WeSpeaker + ECAPA-TDNN ONNX extractors
- [x] Silero VAD integration
- [x] Agglomerative hierarchical clustering (AHC)
- [x] Pipeline API (VAD + embeddings + AHC)
- [x] C FFI bindings
- [x] Miri / Loom / fuzz verification
- [x] Cross-platform CI
- [x] Python bindings (PyO3 / maturin)
- [x] CLI tool (`polyvoice diarize` / `download-models`)
- [x] DER benchmarks on AMI (~23%) and VoxConverse (~15%), 0.25s collar
- [x] Spectral clustering backend (experimental)
- [x] Merge-small-speakers post-processing
- [ ] PLDA scoring backend
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).
## Changelog
See [CHANGELOG.md](CHANGELOG.md).
## License
MIT