
# Project vo-id
-----

## Project Description
* [Installation](#installation)
* [Voice Fingerprinting](#compute-the-voice-fingerprint-from-an-audio-file)
* [Speaker diarization](#perform-speaker-diarization)
* [Speaker recognition](#perform-speaker-recognition)
* [Speaker Verification](#perform-speaker-verification)
* [Voice Cloning](#voice-cloning)

All these functionalities are possible thanks to a neural model that converts audio into a **Voice Fingerprint**.

Below you can find examples of how to use the package as is.


We provide all the code and data for training the neural network, so if you have improvements to submit, please fork the repo, open pull requests, or file issues. :handshake:

------

## Installation

* Clone the repo:
```bash
git clone git@github.com:CiscoDevNet/vo-id.git
```
* Create a Python virtual environment:
```bash
mkdir ~/Envs/
python3.8 -m venv ~/Envs/vo-id
source ~/Envs/vo-id/bin/activate
```
* Install the package:
```bash
pip install -e .
```

------------

## Compute the voice fingerprint from an audio file
By default, the model creates a voice fingerprint, or voice vector, every 100 milliseconds.

**Example in Python:** :snake:
```python
from void.voicetools import ToolBox

audio_path = "tests/audio_samples/short_podcast.wav"
tb = ToolBox(use_cpu=True)  # Omit `use_cpu` to let the machine use the GPU if available
audio_vectors = tb.vectorize(audio_path)
print(audio_vectors.shape)
# (322, 128)
```
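
To compare two recordings, a common approach is to average the frame-level vectors into a single fingerprint and measure cosine similarity. Here is a minimal sketch (assuming NumPy; `fingerprint` and `cosine_similarity` are hypothetical helpers, not part of the package):

```python
import numpy as np

def fingerprint(audio_vectors: np.ndarray) -> np.ndarray:
    # Average the per-100ms voice vectors into one 128-dim fingerprint
    return audio_vectors.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the two fingerprints point in the same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the podcast with itself (trivially similar):
print(cosine_similarity(fingerprint(audio_vectors), fingerprint(audio_vectors)))
# 1.0 (within floating-point error)
```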

------

## Perform Speaker Diarization
**Speaker diarization** answers the question: ``"Who spoke when?"``.

If you run this tool on a meeting recording, for example, each spoken segment gets assigned an anonymous speaker ID.

The format in use is [RTTM](https://github.com/nryant/dscore#rttm).

Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields (a parsing sketch follows the list):

- ``Type`` -- segment type; should always be ``SPEAKER``
- ``File ID`` -- file name; basename of the recording minus extension (e.g., ``rec1_a``)
- ``Channel ID`` -- channel (1-indexed) that turn is on; should always be ``1``
- ``Turn Onset`` -- onset of turn in seconds from beginning of recording
- ``Turn Duration`` -- duration of turn in seconds
- ``Orthography Field`` -- should always be ``<NA>``
- ``Speaker Type`` -- should always be ``<NA>``
- ``Speaker Name`` -- name of speaker of turn; should be unique within scope of each file
- ``Confidence Score`` -- system confidence (probability) that information is correct; should always be ``<NA>``
- ``Signal Lookahead Time`` -- should always be ``<NA>``
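
As a quick illustration (not part of the package), here is a minimal sketch that parses one such line into its named fields:

```python
def parse_rttm_line(line: str) -> dict:
    # Split the ten space-delimited RTTM fields described above
    (seg_type, file_id, channel, onset, duration,
     ortho, spk_type, speaker, confidence, lookahead) = line.split()
    return {
        "type": seg_type,
        "file_id": file_id,
        "channel": int(channel),
        "onset": float(onset),
        "duration": float(duration),
        "speaker": speaker,
    }

print(parse_rttm_line("SPEAKER filename 1 0.39 3.96 <NA> <NA> speaker0 <NA> <NA>"))
# {'type': 'SPEAKER', 'file_id': 'filename', 'channel': 1,
#  'onset': 0.39, 'duration': 3.96, 'speaker': 'speaker0'}
```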

**Example in Python:** :snake:
```python
from pprint import pprint
from void.voicetools import ToolBox

audio_path = "tests/audio_samples/short_podcast.wav"
tb = ToolBox(use_cpu=True)  # Omit `use_cpu` to let the machine use the GPU if available
rttm = tb.diarize(audio_path)
pprint(rttm)
"""
['SPEAKER filename 1 0.39 3.96 <NA> <NA> speaker0 <NA> <NA>\n',
 'SPEAKER filename 1 4.71 2.85 <NA> <NA> speaker0 <NA> <NA>\n',
 'SPEAKER filename 1 8.19 5.97 <NA> <NA> speaker0 <NA> <NA>\n',
 'SPEAKER filename 1 14.28 1.32 <NA> <NA> speaker0 <NA> <NA>\n',
 'SPEAKER filename 1 15.63 0.93 <NA> <NA> speaker0 <NA> <NA>\n',
 'SPEAKER filename 1 16.71 0.54 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 17.31 2.58 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 19.95 2.61 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 22.65 1.14 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 23.88 1.89 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 25.83 0.60 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 26.52 1.44 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 27.99 0.15 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 28.47 3.48 <NA> <NA> speaker1 <NA> <NA>\n',
 'SPEAKER filename 1 32.04 0.09 <NA> <NA> speaker1 <NA> <NA>\n']
"""
```
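
The turn list is easy to post-process; for instance, a small sketch (using the `rttm` list from above) that totals speaking time per speaker:

```python
from collections import defaultdict

talk_time = defaultdict(float)
for turn in rttm:
    fields = turn.split()  # ten RTTM fields per turn
    talk_time[fields[7]] += float(fields[4])  # speaker name, turn duration

print(dict(talk_time))
# {'speaker0': 15.03, 'speaker1': 14.52}
```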

-----

## Perform Speaker Recognition
**Speaker Recognition** works very similarly to Speaker Diarization, with the difference that each voice segment gets assigned the name of the person the system thinks is speaking.

To do so, we need to provide ``enrollment files``: audio files with examples of the voices of the speakers present in the audio we are diarizing.

**Example in Python:** :snake:
```python
from pprint import pprint
from void.voicetools import ToolBox

audio_path = "tests/audio_samples/short_podcast.wav"
# Provide enrollment samples
enroll_f1_path = "tests/audio_samples/enroll_fridman_1.wav"
enroll_f2_path = "tests/audio_samples/enroll_fridman_2.wav"
enroll_c1_path = "tests/audio_samples/enroll_chomsky_1.wav"
enroll_c2_path = "tests/audio_samples/enroll_chomsky_2.wav"
enroll_d1_path = "tests/audio_samples/enroll_dario_1.wav"
enroll_d2_path = "tests/audio_samples/enroll_dario_2.wav"

tb = ToolBox(use_cpu=True)  # Omit `use_cpu` to let the machine use the GPU if available
rttm = tb.recognize(
    audio_path,
    enrollments=[
        (enroll_c1_path, "Chomsky"),
        (enroll_f1_path, "Fridman"),
        (enroll_d1_path, "Dario"),
        (enroll_c2_path, "Chomsky"),
        (enroll_f2_path, "Fridman"),
        (enroll_d2_path, "Dario"),
    ],
    max_num_speakers=10,
)
pprint(rttm)
"""
['SPEAKER filename 1 0.39 3.96 <NA> <NA> Chomsky <NA> <NA>\n',
 'SPEAKER filename 1 4.71 2.85 <NA> <NA> Chomsky <NA> <NA>\n',
 'SPEAKER filename 1 8.19 5.97 <NA> <NA> Chomsky <NA> <NA>\n',
 'SPEAKER filename 1 14.28 1.32 <NA> <NA> Chomsky <NA> <NA>\n',
 'SPEAKER filename 1 15.63 0.93 <NA> <NA> Chomsky <NA> <NA>\n',
 'SPEAKER filename 1 16.71 0.54 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 17.31 2.58 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 19.95 2.61 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 22.65 1.14 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 23.88 1.89 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 25.83 0.60 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 26.52 1.44 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 27.99 0.15 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 28.47 3.48 <NA> <NA> Fridman <NA> <NA>\n',
 'SPEAKER filename 1 32.04 0.09 <NA> <NA> Fridman <NA> <NA>\n']
"""
```

#### NB: We provided `3` enrollment speakers, but the recording only contains `2`. The system correctly outputs only `2` speakers in total.
---

## Perform Speaker Verification
We can use our voice similarly to how we use our fingerprints or faces on modern smartphones: to let only the right users access a system.

By storing enrollment examples of someone's voice, we can compare new audio samples against the ones we previously stored.

**Example in Python:** :snake:
```python
from void.voicetools import ToolBox

enroll_f1_path = "tests/audio_samples/enroll_fridman_1.wav"
enroll_f2_path = "tests/audio_samples/enroll_fridman_2.wav"
new_audio = "tests/audio_samples/verify_fridman.wav"

tb = ToolBox(use_cpu=True)  # Omit `use_cpu` to let the machine use the GPU if available
similarity = tb.verify(
    new_audio,
    enrollments=[
        (enroll_f1_path, "Fridman"),
        (enroll_f2_path, "Fridman"),
    ],
)
print(f"Same person probability: {similarity*100:.2f}%")
# Same person probability: 82.24%
```
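
To use this for access control, the probability needs a decision threshold. A minimal sketch, where the `0.7` cut-off is an arbitrary assumption you would tune on held-out recordings:

```python
THRESHOLD = 0.7  # hypothetical value; tune on real enrollment/impostor data

def is_same_person(similarity: float, threshold: float = THRESHOLD) -> bool:
    # Accept the claimed identity only if the similarity clears the threshold
    return similarity >= threshold

print(is_same_person(similarity))
# True (0.8224 >= 0.7)
```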

--------

## Voice Cloning
Work in Progress... :construction_worker:

--------

## Train the Vectorizer
The first time you run it, it might take a while to download all the training data.

Just hang on :hourglass_flowing_sand:
```bash
mkdir vectorizer/data
python vectorizer/train.py
```

### Notes
1. The training runs a classifier over the speakers, yet the raw `speaker_id` values are not consistent with how PyTorch expects classification labels:
```python
max(speaker_ids) > num_speakers - 1  # raw IDs are not contiguous class labels
```
2. For this reason, the file `vectorizer/speaker_ids_map.bin` stores a mapping that remaps the raw speaker IDs to contiguous labels from `0` to `num_speakers - 1`; a sketch of the idea follows.
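
A minimal sketch of how such a mapping can be built (this illustrates the idea only; it is an assumption, not the exact contents of `vectorizer/speaker_ids_map.bin`):

```python
# Raw, non-contiguous speaker IDs as they might appear in the dataset (hypothetical values)
raw_speaker_ids = [19, 26, 1089, 19, 4340, 26]

# Map each unique raw ID to a contiguous label in [0, num_speakers - 1],
# which is what PyTorch's CrossEntropyLoss expects as a class index
speaker_ids_map = {raw: label for label, raw in enumerate(sorted(set(raw_speaker_ids)))}
labels = [speaker_ids_map[raw] for raw in raw_speaker_ids]
print(labels)
# [0, 1, 2, 0, 3, 1]
```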
-----

## Citation

```bibtex
@software{Uno,
  author = {Dario Cazzani},
  title = {vo-id: VOice IDentification tools},
  url = {https://github.com/CiscoDevNet/vo-id},
  version = {0.1},
  year = {2021},
}
```