# Senko
> 閃光 (senkō) - a flash of light

A very fast and accurate speaker diarization pipeline.

1 hour of audio processed in 5 seconds (RTX 4090 + Ryzen 9 7950X).

On an Apple M3, 1 hour is processed in 7.7 seconds.

The pipeline achieves a best score of 13.5% DER on VoxConverse, 13.3% on AISHELL-4, and 26.5% on AMI-IHM. See the [evaluation](/evaluation) directory for more benchmarks and comparison with other diarization systems.

Senko powers the [Zanshin](https://github.com/narcotic-sh/zanshin) media player.

## Usage
```python
import senko

diarizer = senko.Diarizer(device='auto', warmup=True, quiet=False)

wav_path = 'audio.wav' # 16kHz mono 16-bit wav
result = diarizer.diarize(wav_path, generate_colors=False)

senko.save_json(result["merged_segments"], 'audio_diarized.json')
senko.save_rttm(result["merged_segments"], wav_path, 'audio_diarized.rttm')
```
See `examples/diarize.py` for an interactive script, and read `DOCS.md` for more details.
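For a quick look at what you get back, here is a minimal sketch of consuming the output. The segment keys used (`start`, `end`, `speaker`) are assumptions based on typical diarization output; check `DOCS.md` for the exact schema.

```python
# Minimal sketch: print a simple speaker timeline from the diarization result.
# NOTE: the segment keys 'start', 'end', 'speaker' are assumed, not confirmed;
# see DOCS.md for the actual structure of result["merged_segments"].
for seg in result["merged_segments"]:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```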

## Installation
Senko has been tested on Linux, macOS, and WSL with Python `3.11.13`.

Prerequisites:
- `clang` - on Linux/WSL, a separate install; on macOS, have the Xcode Developer Tools installed
- [`uv`](https://docs.astral.sh/uv/#installation)

Create a Python virtual environment and activate it:
```bash
uv venv --python 3.11.13 .venv
source .venv/bin/activate
```
Then install Senko:
```bash
# For NVIDIA GPUs with CUDA compute capability >= 7.5 (~GTX 16 series and newer)
uv pip install "git+https://github.com/narcotic-sh/senko.git[nvidia]"

# For NVIDIA GPUs with CUDA compute capability < 7.5 (~GTX 10 series and older)
uv pip install "git+https://github.com/narcotic-sh/senko.git[nvidia-old]"

# For Mac (macOS 14+) and CPU execution on all other platforms
uv pip install "git+https://github.com/narcotic-sh/senko.git"
```
For NVIDIA, make sure the installed driver is CUDA 12 capable (should see `CUDA Version: 12.x` in `nvidia-smi`).
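After installing one of the `[nvidia]` extras, a quick way to double-check which bucket your GPU falls into is to query its compute capability. This relies on the PyTorch dependency that the Linux/WSL pipeline uses (see Technical Details below); it is just a sanity check, not part of Senko's API.

```python
# Quick post-install check of the GPU's CUDA compute capability.
# An RTX 4090 reports (8, 9); >= (7, 5) corresponds to the [nvidia] extra,
# anything lower to [nvidia-old].
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))
else:
    print("No CUDA device visible to PyTorch")
```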

For setting up Senko for development, see `DEV_SETUP.md`.

## Accuracy
See the [evaluation](/evaluation) directory.

## Technical Details
Senko is a modified version of the speaker diarization pipeline found in the excellent [3D-Speaker](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization) project.
It consists of four stages: VAD (voice activity detection), Fbank feature extraction, speaker embeddings generation, and clustering (spectral or UMAP+HDBSCAN).
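Fbank features here are standard Kaldi-style log mel filterbank features. Purely for illustration, an equivalent computation with torchaudio looks like the following; this is not Senko's multi-core C++ implementation, and the 80 mel bins are an assumption.

```python
# Illustration only: Kaldi-style Fbank features via torchaudio, roughly what
# Senko's (much faster, multi-core C++) extraction stage computes.
# Not Senko's code; the 80 mel bins are an assumption.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("audio.wav")  # 16 kHz mono, as in Usage
feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
print(feats.shape)  # (num_frames, 80)
```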

The following modifications have been made:
- The VAD model has been swapped from FSMN-VAD to either Pyannote [segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) or [Silero VAD](https://github.com/snakers4/silero-vad)
- Fbank feature extraction is done fully upfront, in C++, using all available CPU cores
- Batched inference of the CAM++ embedding model
- On NVIDIA GPUs with CUDA compute capability 7.0+, clustering can be done on the GPU through [RAPIDS](https://docs.rapids.ai/api/cuml/stable/zero-code-change/)

On Linux/WSL, both Pyannote segmentation-3.0 and CAM++ run using PyTorch, but on Mac, both models run through CoreML. The CAM++ CoreML conversion was done from scratch in this project (see [`tracing/coreml`](tracing/coreml)), but the converted segmentation-3.0 model and its interfacing code are taken from the excellent [FluidAudio](https://github.com/FluidInference/FluidAudio) project by Fluid Inference.

## FAQ

**Is there any way to visualize the output diarization data?**

Absolutely. The Zanshin media player was built for exactly this purpose. Zanshin is powered by Senko, so the easiest way to visualize the diarization data is simply to use it. A packaged build is currently available for Mac (Apple Silicon). It also works on Windows and Linux, but packaged builds for those platforms are coming soon; for now you'll need to clone the repo and launch it through the terminal. See the [Zanshin](https://github.com/narcotic-sh/zanshin) repository for instructions.

You can also manually load the diarization data that Senko generates into Zanshin. First, turn off diarization in Zanshin by going into Settings and disabling Identify Speakers. Then, after you add a media item, click on it, and on the player page press the `H` key. In the textbox that appears, paste the contents of the output JSON file that `examples/diarize.py` generates.

**What languages does Senko support?**

Generally, the pipeline should work for any language, as it relies on acoustic characteristics rather than words or speech patterns. That said, the embedding model used in this pipeline was trained on a mix of English and Mandarin Chinese, so it will likely work best on those two languages.

**Are overlapping speaker segments detected correctly?**

The current output will not contain any overlapping speaker segments; i.e. at most one speaker is reported as speaking at any given time. Despite this, the pipeline still does a great job of determining who the dominant speaker is at any given moment in chaotic audio with speakers talking over each other (casual podcasts, for example). That said, detecting overlapping speaker segments is a planned feature, since the Pyannote segmentation-3.0 model (which is currently only used for VAD) supports it.

**How fast is the pipeline on CPU (`device='cpu'`)?**

On a Ryzen 9 9950X, it takes 42 seconds to process 1 hour of audio.

**Does the entire pipeline run fully on the GPU, if available?**

With `device='cuda'`, all stages of the pipeline except Fbank feature extraction (which always runs on the CPU) run on the GPU by default (you can override this behaviour using the `vad` and `clustering` arguments of the `Diarizer` object, as sketched below). However, CPU performance still significantly impacts overall speed, even for the GPU-accelerated stages.
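For example, forcing a stage off the GPU might look like the sketch below. The string values are placeholders for illustration; see `DOCS.md` for the options Senko actually accepts.

```python
import senko

# Hypothetical sketch of overriding where individual stages run via the
# `vad` and `clustering` arguments. The 'cpu' values are placeholders;
# consult DOCS.md for the accepted options.
diarizer = senko.Diarizer(device='cuda', vad='cpu', clustering='cpu')
result = diarizer.diarize('audio.wav')
```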



During the embedding generation phase, for example, while the actual model inference happens on the GPU with minimal CPU-GPU memory transfers (just input and output), the CPU handles all of the orchestration work: Python loops for batching, tensor preparation, dispatching padding operations, and managing the inference pipeline. All of this orchestration runs single-threaded on the CPU, so a faster CPU will improve performance even when paired with a powerful GPU, since the CPU coordinates all of the GPU operations.



Therefore, for optimal performance, pair a fast GPU with a fast CPU. The CPU bottleneck becomes more noticeable with very fast GPUs (e.g. an RTX 4090), where the GPU can execute batch preparation and inference faster than the CPU can orchestrate and dispatch those operations.



As for Mac, both the VAD and embedding generation stages run on the ANE (Apple Neural Engine) and CPU through CoreML. The Fbank stage and clustering run purely on the CPU.

**Known limitations?**

- The pipeline works best when the audio recording quality is good. Ideal setting: a professional podcast studio. Heavy background noise, background music, or a generally low-fidelity recording will degrade diarization performance significantly. Note that it's also possible to have generally good recording quality but still low-fidelity recorded voice quality; an example is this.
- It is rare but possible for voices that sound very similar to get clustered as one voice. This can happen if the voices are genuinely extremely similar or, more commonly, if the audio recording fidelity is low.
- The same voice recorded with more than one microphone, or in more than one recording setting, within the same audio file will often get detected as multiple speakers.
- If a single person uses more than one voice in the same recording (i.e. changes the auditory texture/tone of their voice, for example by doing an impression of someone else), their speech will almost certainly get detected as multiple speakers.

## Community & Support
Join the [Discord](https://discord.gg/Nf7m5Ftk3c) server to ask questions, suggest features, and talk about Senko and Zanshin development.

## Future Improvements & Directions
- Overlapping speaker segments support
- Improve speaker colors generation algorithm
- Support for Intel and AMD GPUs
- Experiment with `torch.compile()`
- Experiment with Modular [MAX](https://www.modular.com/blog/bring-your-own-pytorch-model) engine (faster CPU inference speed?)
- Experiment with [Resonate](https://alexandrefrancois.org/Resonate/) for superior audio feature extraction
- Background noise removal ([DeepFilterNet](https://github.com/Rikorose/DeepFilterNet))
- Live progress reporting