https://github.com/Gr122lyBr/voicetag

Speaker identification powered by pyannote and resemblyzer
Host: GitHub
URL: https://github.com/Gr122lyBr/voicetag
Owner: Gr122lyBr
License: mit
Created: 2026-03-16T21:08:33.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-16T22:22:19.000Z (3 months ago)
Last Synced: 2026-03-17T07:59:12.174Z (3 months ago)
Topics: audio-transcription, deep-learning, deepgram, diarization, groq, machine-learning, nlp, pyannote, python, resemblyzer, speaker-diarization, speaker-identification, speaker-recognition, speech-processing, speech-to-text, transcription, voice-recognition, whisper, whisper-ai
Language: Python
Homepage:
Size: 283 KB
Stars: 15
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

voicetag


Know who said what. Automatically.

---

## What is voicetag?

voicetag is a Python library for **speaker diarization and named speaker identification**. It combines [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization with [resemblyzer](https://github.com/resemble-ai/Resemblyzer) for speaker embeddings, giving you a single interface to answer: *who is speaking, and when?*

Enroll speakers once with a few audio samples, then identify them in any recording -- meetings, podcasts, interviews, phone calls.

## Features

- :zap: **Dead-simple API** -- enroll speakers and identify them in three lines of code

- :globe_with_meridians: **Language agnostic** -- works with Hebrew, English, Mandarin, or any spoken language

- :busts_in_silhouette: **Built-in overlap detection** -- flags regions where multiple speakers talk simultaneously

- :rocket: **Fast parallel processing** -- concurrent embedding computation with configurable thread pools

- :keyboard: **CLI tool included** -- enroll, identify, and manage profiles from the terminal

- :floppy_disk: **Save/load speaker profiles** -- persist enrolled speakers to disk and reuse across sessions

- :white_check_mark: **Pydantic result models** -- fully typed, validated, immutable result objects

- :speech_balloon: **Built-in transcription** -- plug in OpenAI, Groq, Fireworks, Whisper, or Deepgram to get "who said what"

## Quick Start

```python
from voicetag import VoiceTag

vt = VoiceTag()

vt.enroll("Christie", ["christie1.flac", "christie2.flac", "christie3.flac"])
vt.enroll("Mark", ["mark1.flac", "mark2.flac"])

# Identify who spoke when
result = vt.identify("audiobook.flac")
for seg in result.segments:
    print(f"{seg.speaker}: {seg.start:.1f}s - {seg.end:.1f}s (confidence: {seg.confidence:.2f})")

# Transcribe: who said what
transcript = vt.transcribe("audiobook.flac", provider="whisper")
print(transcript.full_transcript)
```

Output:

```
Christie: 0.0s - 2.6s (confidence: 0.85)
Christie: 2.6s - 6.7s (confidence: 0.88)
Christie: 7.0s - 8.1s (confidence: 0.78)
[Christie] Gentlemen, he sat in a hoarse voice. Give me your
[Christie] word of honor that this horrible secret shall forever remain buried amongst ourselves.
[Christie] The two men drew back.
```

## Installation

```bash
pip install voicetag
```

For transcription support, install with a provider:

```bash
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "voicetag[openai]"    # OpenAI Whisper API
pip install "voicetag[groq]"      # Groq (fast Whisper)
pip install "voicetag[whisper]"   # Local Whisper (no API key needed)
pip install "voicetag[deepgram]"  # Deepgram
pip install "voicetag[all-stt]"   # All providers
```

voicetag requires access to the [pyannote.audio](https://github.com/pyannote/pyannote-audio) speaker diarization model, which is gated behind a HuggingFace license agreement.

### Prerequisites

1. **Accept the pyannote model licenses** at:

   - [hf.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

   - [hf.co/pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)

   - [hf.co/pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)

2. **Create a HuggingFace token** at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

3. **Set the token** via environment variable or config:

```bash
export HF_TOKEN="hf_your_token_here"
```

Or pass it directly:

```python
from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(hf_token="hf_your_token_here"))
```

### GPU Acceleration (optional)

For faster processing on CUDA or Apple Silicon:

```python
from voicetag import VoiceTag, VoiceTagConfig

# Keep only the line that matches your hardware
vt = VoiceTag(config=VoiceTagConfig(device="cuda"))  # NVIDIA GPU
vt = VoiceTag(config=VoiceTagConfig(device="mps"))   # Apple Silicon
```

## CLI Usage

voicetag ships with a full-featured command-line interface.

### Enroll a speaker

```bash
voicetag enroll "Christie" christie1.flac christie2.flac christie3.flac
voicetag enroll "Mark" mark1.flac mark2.flac
```

### Identify speakers

```bash
voicetag identify audiobook.flac
```

```

Speaker Timeline — audiobook.flac

+-----------+----------+----------+----------+------------+
| Speaker   | Start    | End      | Duration | Confidence |
+-----------+----------+----------+----------+------------+
| Christie  | 00:00.00 | 00:02.60 | 00:02.60 | 0.85       |
| Christie  | 00:02.60 | 00:06.70 | 00:04.10 | 0.88       |
| Christie  | 00:07.00 | 00:08.10 | 00:01.10 | 0.78       |
+-----------+----------+----------+----------+------------+

Summary
  Total duration:  8.4s
  Speakers:        1
  Segments:        3

```

### Transcribe (speaker + text)

```bash
voicetag transcribe audiobook.flac --provider whisper --language en
```

```

Transcript — audiobook.flac

+-----------+----------+----------+--------------------------------------------------------------+
| Speaker   | Start    | End      | Text                                                         |
+-----------+----------+----------+--------------------------------------------------------------+
| Christie  | 00:00.00 | 00:02.60 | Gentlemen, he sat in a hoarse voice. Give me your            |
| Christie  | 00:02.60 | 00:06.70 | word of honor that this horrible secret shall forever remain |
|           |          |          | buried amongst ourselves.                                    |
| Christie  | 00:07.00 | 00:08.10 | The two men drew back.                                       |
+-----------+----------+----------+--------------------------------------------------------------+

```

Other providers:

```bash
voicetag transcribe call.wav --provider openai --language en
voicetag transcribe interview.wav --provider groq --language he
voicetag transcribe meeting.wav --provider deepgram
```

### Manage profiles

```bash
voicetag profiles list
voicetag profiles remove "Christie"
voicetag providers              # list available STT providers
```

### All CLI options

```bash
voicetag --help
voicetag identify --help
```

| Option | Description |
|---|---|
| `--profiles PATH` | Path to speaker profiles file (default: `voicetag_profiles.json`) |
| `--output, -o PATH` | Save results as JSON |
| `--threshold FLOAT` | Similarity threshold override (0.0-1.0) |
| `--hf-token TEXT` | HuggingFace API token |
| `--device TEXT` | Torch device: `cpu`, `cuda`, `mps` |
| `--unknown-only` | Skip speaker matching, just diarize |

## API Reference

### `VoiceTag`

The main entry point. Wraps the full diarization + identification pipeline.

```python
from voicetag import VoiceTag, VoiceTagConfig

# Pass any VoiceTagConfig fields here; see the Configuration section for the full list
vt = VoiceTag(config=VoiceTagConfig(device="cpu"))
```

| Method | Returns | Description |
|---|---|---|
| `enroll(name, audio_paths)` | `SpeakerProfile` | Register a speaker from one or more audio files |
| `identify(audio_path)` | `DiarizationResult` | Run full identification pipeline on an audio file |
| `save(path)` | `None` | Save enrolled speaker profiles to disk |
| `load(path)` | `None` | Load speaker profiles from disk |
| `remove_speaker(name)` | `None` | Remove an enrolled speaker by name |
| `enrolled_speakers` | `list[str]` | Property: list of enrolled speaker names |
| `transcribe(audio_path, provider, ...)` | `TranscriptResult` | Identify speakers and transcribe what they said |

#### Transcription example

```python
result = vt.transcribe("meeting.wav", provider="openai", language="en")
for seg in result.segments:
    print(f"[{seg.speaker}] {seg.text}")

# Full transcript
print(result.full_transcript)

# Group by speaker
for speaker, segments in result.by_speaker.items():
    print(f"\n{speaker}:")
    for seg in segments:
        print(f"  {seg.text}")
```

Supported providers: `openai`, `groq`, `fireworks`, `whisper` (local), `deepgram`

### `VoiceTagConfig`

Configuration model (Pydantic v2, frozen/immutable).

```python
from voicetag import VoiceTagConfig

config = VoiceTagConfig(
    hf_token="hf_...",          # HuggingFace token (or set HF_TOKEN env var)
    similarity_threshold=0.75,  # min cosine similarity for a match
    overlap_threshold=0.5,      # min overlap ratio to flag
    max_workers=4,              # parallel embedding threads
    min_segment_duration=0.5,   # discard segments shorter than this (seconds)
    device="cpu",               # "cpu", "cuda", or "mps"
)
```

### Result Models

**`DiarizationResult`** -- returned by `identify()`:

| Field | Type | Description |
|---|---|---|
| `segments` | `list[SpeakerSegment \| OverlapSegment]` | Ordered timeline of speaker segments |
| `audio_duration` | `float` | Total audio length in seconds |
| `num_speakers` | `int` | Number of distinct speakers detected |
| `processing_time` | `float` | Wall-clock pipeline time in seconds |

**`SpeakerSegment`**:

| Field | Type | Description |
|---|---|---|
| `speaker` | `str` | Identified speaker name or `"UNKNOWN"` |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `confidence` | `float` | Cosine similarity score (0.0-1.0) |
| `duration` | `float` | Property: `end - start` |

**`OverlapSegment`**:

| Field | Type | Description |
|---|---|---|
| `speakers` | `list[str]` | Names of overlapping speakers |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `speaker` | `Literal["OVERLAP"]` | Always `"OVERLAP"` |
| `duration` | `float` | Property: `end - start` |
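
Since the `OverlapSegment` table lists no `confidence` field, code that walks `segments` may want to branch on the segment type. The sketch below is only an illustration: the two dataclasses are stand-ins that mirror the documented fields (the real classes are Pydantic models), and `talk_time` is a hypothetical helper, not part of voicetag.

```python
from collections import defaultdict
from dataclasses import dataclass


# Stand-ins mirroring the documented fields; NOT the real voicetag classes.
@dataclass(frozen=True)
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    confidence: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass(frozen=True)
class OverlapSegment:
    speakers: tuple
    start: float
    end: float
    speaker: str = "OVERLAP"

    @property
    def duration(self) -> float:
        return self.end - self.start


def talk_time(segments):
    """Sum seconds per speaker; overlap time is credited to every speaker in it."""
    totals = defaultdict(float)
    for seg in segments:
        if isinstance(seg, OverlapSegment):
            for name in seg.speakers:
                totals[name] += seg.duration
        else:
            totals[seg.speaker] += seg.duration
    return dict(totals)


segments = [
    SpeakerSegment("Christie", 0.0, 2.5, 0.85),
    OverlapSegment(("Christie", "Mark"), 2.5, 3.0),
    SpeakerSegment("Mark", 3.0, 5.0, 0.90),
]
totals = talk_time(segments)
```

With real results, checking `seg.speaker == "OVERLAP"` should work equally well, given that the field is documented as always holding that value.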

**`SpeakerProfile`**:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Speaker name |
| `embedding` | `list[float]` | 256-dimensional mean embedding vector |
| `num_samples` | `int` | Number of audio files used for enrollment |
| `created_at` | `datetime` | UTC timestamp of enrollment |
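
To make "mean embedding vector" concrete, here is a rough sketch of averaging per-file embeddings into one profile vector. The function name `mean_embedding` is hypothetical, and the final scaling to unit length is an assumption; the README only states that the profile stores a mean.

```python
import math


def mean_embedding(embeddings):
    """Average equal-length vectors element-wise, then scale to unit length.

    The unit-length scaling is an assumption, not documented voicetag behavior.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


# Toy 3-dimensional vectors stand in for the 256-dimensional ones.
profile = mean_embedding([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Averaging several samples tends to smooth out recording-specific variation, which is presumably why enrollment accepts more than one file.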

### Error Handling

All exceptions inherit from `VoiceTagError`:

```python
from voicetag import VoiceTag, VoiceTagError

vt = VoiceTag()

try:
    result = vt.identify("audio.wav")
except VoiceTagError as e:
    print(f"Error: {e}")
```

| Exception | When |
|---|---|
| `VoiceTagConfigError` | Invalid config or missing HuggingFace token |
| `EnrollmentError` | Enrollment fails (no audio, bad format) |
| `DiarizationError` | Pyannote processing failure |
| `AudioLoadError` | Audio file not found or unsupported format |

## Real-World Use Cases

- **Podcasts** -- automatically label host vs. guest segments for transcription

- **Interviews** -- separate interviewer and interviewee speech for analysis

- **Meeting recordings** -- identify who said what in team meetings, generate per-speaker summaries

- **Court recordings** -- tag judge, attorney, and witness speech segments

- **Call centers** -- distinguish agent from customer in call recordings for QA

- **Media monitoring** -- track specific speakers across broadcast recordings

## How It Works

voicetag runs a three-stage pipeline:

```
Audio File
    |
    v
1. DIARIZE (pyannote.audio)
   "When does each speaker talk?"
   -> segments: [(0.0-4.2, SPEAKER_00), (4.5-8.1, SPEAKER_01), ...]
    |
    v
2. EMBED (resemblyzer)
   "What does each speaker sound like?"
   -> 256-dim embedding vector per segment (computed in parallel)
    |
    v
3. MATCH (cosine similarity)
   "Which enrolled speaker does this sound like?"
   -> Alice (0.92), Bob (0.87), UNKNOWN (below threshold)
    |
    v
DiarizationResult with named speaker timeline
```

1. **Diarize** -- pyannote.audio segments the audio into speaker turns with anonymous labels (`SPEAKER_00`, `SPEAKER_01`, etc.)

2. **Embed** -- resemblyzer computes a 256-dimensional voice embedding for each segment, running in parallel via a thread pool

3. **Match** -- each embedding is compared against enrolled speaker profiles using cosine similarity. Matches above the threshold get assigned the speaker's name; others are labeled `"UNKNOWN"`
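
The match step can be pictured with a small standalone sketch. This is not voicetag's internal code: `cosine_similarity` and `match_speaker` are hypothetical names, the vectors are toy 2-dimensional values, and whether a score exactly at the threshold counts as a match is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, profiles, threshold=0.75):
    """Return (name, score) for the best profile, or ("UNKNOWN", score) below threshold."""
    best_name, best_score = "UNKNOWN", 0.0
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return "UNKNOWN", best_score
    return best_name, best_score


profiles = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
close_to_alice = match_speaker([0.9, 0.1], profiles)
ambiguous = match_speaker([1.0, 1.0], profiles)  # equally far from both profiles
```

Raising the threshold trades missed matches for fewer false identifications, which is what the `similarity_threshold` setting and the `--threshold` CLI option expose.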

Overlap detection runs in parallel with matching, identifying regions where two or more speakers talk simultaneously.
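
As a rough illustration of what "regions where two or more speakers talk simultaneously" means, the sketch below sweeps over diarization turns and reports intervals with at least two active labels. `find_overlaps` is a hypothetical helper, it assumes a single label never overlaps itself, and it does not model how `overlap_threshold` is applied.

```python
def find_overlaps(turns):
    """Return (start, end, labels) for every interval with two or more active turns.

    `turns` is a list of (start, end, label) tuples.
    """
    events = []
    for start, end, label in turns:
        events.append((start, 1, label))  # turn opens
        events.append((end, 0, label))    # turn closes; sorts before opens at the same time
    events.sort()

    active = set()
    overlaps = []
    prev_time = None
    for time, kind, label in events:
        # The set of active labels is constant between consecutive event times.
        if prev_time is not None and time > prev_time and len(active) >= 2:
            overlaps.append((prev_time, time, sorted(active)))
        if kind == 1:
            active.add(label)
        else:
            active.discard(label)
        prev_time = time
    return overlaps


turns = [
    (0.0, 4.2, "SPEAKER_00"),
    (3.8, 8.1, "SPEAKER_01"),
    (9.0, 10.0, "SPEAKER_00"),
]
overlaps = find_overlaps(turns)
```

Turns that merely touch (one ends exactly when the next begins) are not reported, because close events are processed before open events at the same timestamp.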

## Comparison

| Feature | voicetag | pyannote alone | WhisperX | Manual labeling |
|---|:---:|:---:|:---:|:---:|
| Speaker diarization | Yes | Yes | Yes | N/A |
| Named speaker identification | Yes | No | No | Yes |
| Overlap detection | Yes | Yes | No | Varies |
| CLI tool | Yes | No | Yes | N/A |
| Save/load speaker profiles | Yes | N/A | N/A | N/A |
| Language agnostic | Yes | Yes | Yes | Yes |
| Typed result models | Yes (Pydantic) | No | No | N/A |
| Lines of code to identify | 3 | ~30 | ~20 | N/A |

## Configuration

`VoiceTagConfig` controls all tunable parameters:

| Field | Type | Default | Description |
|---|---|---|---|
| `hf_token` | `Optional[str]` | `None` | HuggingFace token. Falls back to `HF_TOKEN` env var. |
| `similarity_threshold` | `float` | `0.75` | Minimum cosine similarity for a match. Range: (0.0, 1.0). |
| `overlap_threshold` | `float` | `0.5` | Minimum overlap ratio to flag as overlapping speech. |
| `max_workers` | `int` | `4` | Thread count for parallel embedding computation. |
| `min_segment_duration` | `float` | `0.5` | Segments shorter than this (seconds) are discarded. |
| `device` | `str` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"`. |
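
To give a feel for how `max_workers` and `min_segment_duration` might interact, here is a minimal sketch using Python's standard `ThreadPoolExecutor`. It is an illustration under assumptions, not voicetag's implementation: `embed_segments` is a hypothetical helper and `fake_embed` is a placeholder for a real embedding call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embed(segment):
    """Placeholder for a real embedding call; returns a 1-element 'vector'."""
    start, end = segment
    return [end - start]


def embed_segments(segments, embed=fake_embed, max_workers=4, min_segment_duration=0.5):
    """Drop segments shorter than the minimum, then embed the rest concurrently."""
    kept = [s for s in segments if (s[1] - s[0]) >= min_segment_duration]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so pairs stay aligned.
        vectors = list(pool.map(embed, kept))
    return list(zip(kept, vectors))


# The 0.25 s segment in the middle is discarded before any embedding work runs.
results = embed_segments([(0.0, 2.0), (2.0, 2.25), (3.0, 4.5)])
```

Filtering before dispatch keeps very short segments, which carry little voice information, from occupying worker threads.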

**Token resolution order:**

1. `config.hf_token` (explicit)

2. `HF_TOKEN` environment variable

3. Raise `VoiceTagConfigError` with a link to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
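
The resolution order above can be sketched in a few lines. This is an assumption about the shape of the logic, not the library's code: `resolve_hf_token` is a hypothetical function, and the exception class here is a local stand-in for the library's `VoiceTagConfigError`.

```python
import os


class VoiceTagConfigError(Exception):
    """Local stand-in for the library exception of the same name."""


def resolve_hf_token(explicit_token=None, env=None):
    """Return the explicit token, else HF_TOKEN from the environment, else raise."""
    env = os.environ if env is None else env
    if explicit_token:
        return explicit_token
    token = env.get("HF_TOKEN")
    if token:
        return token
    raise VoiceTagConfigError(
        "No HuggingFace token found; create one at "
        "https://huggingface.co/settings/tokens"
    )


# An explicit token wins even when the environment variable is also set.
token = resolve_hf_token("hf_explicit", env={"HF_TOKEN": "hf_env"})
```

Passing `env` as a plain dict makes the function easy to exercise in tests without touching the real process environment.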

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on setting up the development environment, running tests, and submitting pull requests.

## License

[MIT](LICENSE) -- Copyright (c) 2026 voicetag contributors