An open API service indexing awesome lists of open source software.

https://github.com/Gr122lyBr/voicetag

Speaker identification powered by pyannote and resemblyzer
https://github.com/Gr122lyBr/voicetag

audio-transcription deep-learning deepgram diarization groq machine-learning nlp pyannote python resemblyzer speaker-diarization speaker-identification speaker-recognition speech-processing speech-to-text transcription voice-recognition whisper whisper-ai

Last synced: about 2 months ago
JSON representation

Speaker identification powered by pyannote and resemblyzer

Awesome Lists containing this project

README

          

voicetag

voicetag

Know who said what. Automatically.


PyPI version
Python versions
License
CI status
Downloads

---

## What is voicetag?

voicetag is a Python library for **speaker diarization and named speaker identification**. It combines [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization with [resemblyzer](https://github.com/resemble-ai/Resemblyzer) for speaker embeddings, giving you a single interface to answer: *who is speaking, and when?*

Enroll speakers once with a few audio samples, then identify them in any recording -- meetings, podcasts, interviews, phone calls.

## Features

- :zap: **Dead-simple API** -- enroll speakers and identify them in three lines of code
- :globe_with_meridians: **Language agnostic** -- works with Hebrew, English, Mandarin, or any spoken language
- :busts_in_silhouette: **Built-in overlap detection** -- flags regions where multiple speakers talk simultaneously
- :rocket: **Fast parallel processing** -- concurrent embedding computation with configurable thread pools
- :keyboard: **CLI tool included** -- enroll, identify, and manage profiles from the terminal
- :floppy_disk: **Save/load speaker profiles** -- persist enrolled speakers to disk and reuse across sessions
- :white_check_mark: **Pydantic result models** -- fully typed, validated, immutable result objects
- :speech_balloon: **Built-in transcription** -- plug in OpenAI, Groq, Fireworks, Whisper, or Deepgram to get "who said what"

## Quick Start

```python
from voicetag import VoiceTag

vt = VoiceTag()
vt.enroll("Christie", ["christie1.flac", "christie2.flac", "christie3.flac"])
vt.enroll("Mark", ["mark1.flac", "mark2.flac"])

# Identify who spoke when
result = vt.identify("audiobook.flac")
for seg in result.segments:
print(f"{seg.speaker}: {seg.start:.1f}s - {seg.end:.1f}s (confidence: {seg.confidence:.2f})")

# Transcribe: who said what
transcript = vt.transcribe("audiobook.flac", provider="whisper")
print(transcript.full_transcript)
```

Output:

```
Christie: 0.0s - 2.6s (confidence: 0.85)
Christie: 2.6s - 6.7s (confidence: 0.88)
Christie: 7.0s - 8.1s (confidence: 0.78)

[Christie] Gentlemen, he sat in a hoarse voice. Give me your
[Christie] word of honor that this horrible secret shall forever remain buried amongst ourselves.
[Christie] The two men drew back.
```

## Installation

```bash
pip install voicetag
```

For transcription support, install with a provider:

```bash
pip install voicetag[openai] # OpenAI Whisper API
pip install voicetag[groq] # Groq (fast Whisper)
pip install voicetag[whisper] # Local Whisper (no API key needed)
pip install voicetag[deepgram] # Deepgram
pip install voicetag[all-stt] # All providers
```

voicetag requires access to the [pyannote.audio](https://github.com/pyannote/pyannote-audio) speaker diarization model, which is gated behind a HuggingFace license agreement.

### Prerequisites

1. **Accept the pyannote model licenses** at:
- [hf.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
- [hf.co/pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
- [hf.co/pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
2. **Create a HuggingFace token** at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. **Set the token** via environment variable or config:

```bash
export HF_TOKEN="hf_your_token_here"
```

Or pass it directly:

```python
from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(hf_token="hf_your_token_here"))
```

### GPU Acceleration (optional)

For faster processing on CUDA or Apple Silicon:

```python
vt = VoiceTag(config=VoiceTagConfig(device="cuda")) # NVIDIA GPU
vt = VoiceTag(config=VoiceTagConfig(device="mps")) # Apple Silicon
```

## CLI Usage

voicetag ships with a full-featured command-line interface.

### Enroll a speaker

```bash
voicetag enroll "Christie" christie1.flac christie2.flac christie3.flac
voicetag enroll "Mark" mark1.flac mark2.flac
```

### Identify speakers

```bash
voicetag identify audiobook.flac
```

```
Speaker Timeline — audiobook.flac
+-----------+----------+----------+----------+------------+
| Speaker | Start | End | Duration | Confidence |
+-----------+----------+----------+----------+------------+
| Christie | 00:00.00 | 00:02.60 | 00:02.60 | 0.85 |
| Christie | 00:02.60 | 00:06.70 | 00:04.10 | 0.88 |
| Christie | 00:07.00 | 00:08.10 | 00:01.10 | 0.78 |
+-----------+----------+----------+----------+------------+

Summary
Total duration: 8.4s
Speakers: 1
Segments: 3
```

### Transcribe (speaker + text)

```bash
voicetag transcribe audiobook.flac --provider whisper --language en
```

```
Transcript — audiobook.flac
+-----------+----------+----------+--------------------------------------------------------------+
| Speaker | Start | End | Text |
+-----------+----------+----------+--------------------------------------------------------------+
| Christie | 00:00.00 | 00:02.60 | Gentlemen, he sat in a hoarse voice. Give me your |
| Christie | 00:02.60 | 00:06.70 | word of honor that this horrible secret shall forever remain |
| | | | buried amongst ourselves. |
| Christie | 00:07.00 | 00:08.10 | The two men drew back. |
+-----------+----------+----------+--------------------------------------------------------------+
```

Other providers:

```bash
voicetag transcribe call.wav --provider openai --language en
voicetag transcribe interview.wav --provider groq --language he
voicetag transcribe meeting.wav --provider deepgram
```

### Manage profiles

```bash
voicetag profiles list
voicetag profiles remove "Christie"
voicetag providers # list available STT providers
```

### All CLI options

```bash
voicetag --help
voicetag identify --help
```

| Option | Description |
|---|---|
| `--profiles PATH` | Path to speaker profiles file (default: `voicetag_profiles.json`) |
| `--output, -o PATH` | Save results as JSON |
| `--threshold FLOAT` | Similarity threshold override (0.0-1.0) |
| `--hf-token TEXT` | HuggingFace API token |
| `--device TEXT` | Torch device: `cpu`, `cuda`, `mps` |
| `--unknown-only` | Skip speaker matching, just diarize |

## API Reference

### `VoiceTag`

The main entry point. Wraps the full diarization + identification pipeline.

```python
from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(...))
```

| Method | Returns | Description |
|---|---|---|
| `enroll(name, audio_paths)` | `SpeakerProfile` | Register a speaker from one or more audio files |
| `identify(audio_path)` | `DiarizationResult` | Run full identification pipeline on an audio file |
| `save(path)` | `None` | Save enrolled speaker profiles to disk |
| `load(path)` | `None` | Load speaker profiles from disk |
| `remove_speaker(name)` | `None` | Remove an enrolled speaker by name |
| `enrolled_speakers` | `list[str]` | Property: list of enrolled speaker names |
| `transcribe(audio_path, provider, ...)` | `TranscriptResult` | Identify speakers and transcribe what they said |

#### Transcription example

```python
result = vt.transcribe("meeting.wav", provider="openai", language="en")

for seg in result.segments:
print(f"[{seg.speaker}] {seg.text}")

# Full transcript
print(result.full_transcript)

# Group by speaker
for speaker, segments in result.by_speaker.items():
print(f"\n{speaker}:")
for seg in segments:
print(f" {seg.text}")
```

Supported providers: `openai`, `groq`, `fireworks`, `whisper` (local), `deepgram`

### `VoiceTagConfig`

Configuration model (Pydantic v2, frozen/immutable).

```python
config = VoiceTagConfig(
hf_token="hf_...", # HuggingFace token (or set HF_TOKEN env var)
similarity_threshold=0.75, # min cosine similarity for a match
overlap_threshold=0.5, # min overlap ratio to flag
max_workers=4, # parallel embedding threads
min_segment_duration=0.5, # discard segments shorter than this (seconds)
device="cpu", # "cpu", "cuda", or "mps"
)
```

### Result Models

**`DiarizationResult`** -- returned by `identify()`:

| Field | Type | Description |
|---|---|---|
| `segments` | `list[SpeakerSegment \| OverlapSegment]` | Ordered timeline of speaker segments |
| `audio_duration` | `float` | Total audio length in seconds |
| `num_speakers` | `int` | Number of distinct speakers detected |
| `processing_time` | `float` | Wall-clock pipeline time in seconds |

**`SpeakerSegment`**:

| Field | Type | Description |
|---|---|---|
| `speaker` | `str` | Identified speaker name or `"UNKNOWN"` |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `confidence` | `float` | Cosine similarity score (0.0-1.0) |
| `duration` | `float` | Property: `end - start` |

**`OverlapSegment`**:

| Field | Type | Description |
|---|---|---|
| `speakers` | `list[str]` | Names of overlapping speakers |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `speaker` | `Literal["OVERLAP"]` | Always `"OVERLAP"` |
| `duration` | `float` | Property: `end - start` |

**`SpeakerProfile`**:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Speaker name |
| `embedding` | `list[float]` | 256-dimensional mean embedding vector |
| `num_samples` | `int` | Number of audio files used for enrollment |
| `created_at` | `datetime` | UTC timestamp of enrollment |

### Error Handling

All exceptions inherit from `VoiceTagError`:

```python
from voicetag import VoiceTagError

try:
result = vt.identify("audio.wav")
except VoiceTagError as e:
print(f"Error: {e}")
```

| Exception | When |
|---|---|
| `VoiceTagConfigError` | Invalid config or missing HuggingFace token |
| `EnrollmentError` | Enrollment fails (no audio, bad format) |
| `DiarizationError` | Pyannote processing failure |
| `AudioLoadError` | Audio file not found or unsupported format |

## Real-World Use Cases

- **Podcasts** -- automatically label host vs. guest segments for transcription
- **Interviews** -- separate interviewer and interviewee speech for analysis
- **Meeting recordings** -- identify who said what in team meetings, generate per-speaker summaries
- **Court recordings** -- tag judge, attorney, and witness speech segments
- **Call centers** -- distinguish agent from customer in call recordings for QA
- **Media monitoring** -- track specific speakers across broadcast recordings

## How It Works

voicetag runs a three-stage pipeline:

```
Audio File
|
v
1. DIARIZE (pyannote.audio)
"When does each speaker talk?"
-> segments: [(0.0-4.2, SPEAKER_00), (4.5-8.1, SPEAKER_01), ...]
|
v
2. EMBED (resemblyzer)
"What does each speaker sound like?"
-> 256-dim embedding vector per segment (computed in parallel)
|
v
3. MATCH (cosine similarity)
"Which enrolled speaker does this sound like?"
-> Alice (0.92), Bob (0.87), UNKNOWN (below threshold)
|
v
DiarizationResult with named speaker timeline
```

1. **Diarize** -- pyannote.audio segments the audio into speaker turns with anonymous labels (`SPEAKER_00`, `SPEAKER_01`, etc.)
2. **Embed** -- resemblyzer computes a 256-dimensional voice embedding for each segment, running in parallel via a thread pool
3. **Match** -- each embedding is compared against enrolled speaker profiles using cosine similarity. Matches above the threshold get assigned the speaker's name; others are labeled `"UNKNOWN"`

Overlap detection runs in parallel with matching, identifying regions where two or more speakers talk simultaneously.

## Comparison

| Feature | voicetag | pyannote alone | WhisperX | Manual labeling |
|---|:---:|:---:|:---:|:---:|
| Speaker diarization | Yes | Yes | Yes | N/A |
| Named speaker identification | Yes | No | No | Yes |
| Overlap detection | Yes | Yes | No | Varies |
| CLI tool | Yes | No | Yes | N/A |
| Save/load speaker profiles | Yes | N/A | N/A | N/A |
| Language agnostic | Yes | Yes | Yes | Yes |
| Typed result models | Yes (Pydantic) | No | No | N/A |
| Lines of code to identify | 3 | ~30 | ~20 | N/A |

## Configuration

`VoiceTagConfig` controls all tunable parameters:

| Field | Type | Default | Description |
|---|---|---|---|
| `hf_token` | `Optional[str]` | `None` | HuggingFace token. Falls back to `HF_TOKEN` env var. |
| `similarity_threshold` | `float` | `0.75` | Minimum cosine similarity for a match. Range: (0.0, 1.0). |
| `overlap_threshold` | `float` | `0.5` | Minimum overlap ratio to flag as overlapping speech. |
| `max_workers` | `int` | `4` | Thread count for parallel embedding computation. |
| `min_segment_duration` | `float` | `0.5` | Segments shorter than this (seconds) are discarded. |
| `device` | `str` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"`. |

**Token resolution order:**
1. `config.hf_token` (explicit)
2. `HF_TOKEN` environment variable
3. Raise `VoiceTagConfigError` with a link to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on setting up the development environment, running tests, and submitting pull requests.

## License

[MIT](LICENSE) -- Copyright (c) 2026 voicetag contributors