Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MahmoudAshraf97/whisper-diarization
Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
asr speaker-diarization speech speech-recognition speech-to-text whisper
- Host: GitHub
- URL: https://github.com/MahmoudAshraf97/whisper-diarization
- Owner: MahmoudAshraf97
- License: bsd-2-clause
- Created: 2023-01-25T23:18:09.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-27T17:38:46.000Z (3 months ago)
- Last Synced: 2024-10-29T14:55:28.338Z (3 months ago)
- Topics: asr, speaker-diarization, speech, speech-recognition, speech-to-text, whisper
- Language: Jupyter Notebook
- Homepage:
- Size: 455 KB
- Stars: 3,617
- Watchers: 46
- Forks: 318
- Open Issues: 38
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-whisper - whisper-diarization - Automatic speech recognition with speaker diarization. (CLI tools / Self-hosted)
- project-awesome - MahmoudAshraf97/whisper-diarization - Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper (Jupyter Notebook)
- AiTreasureBox - MahmoudAshraf97/whisper-diarization - Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper (Repos)
- StarryDivineSky - MahmoudAshraf97/whisper-diarization
README
# Speaker Diarization Using OpenAI Whisper

Speaker Diarization pipeline based on OpenAI Whisper

**Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!**
## What is it
This repository combines Whisper's ASR capabilities with Voice Activity Detection (VAD) and speaker embeddings to identify the speaker for each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated using Whisper, and the timestamps are corrected and aligned using `ctc-forced-aligner` to help minimize diarization error due to time shift. Next, the audio is passed into MarbleNet for VAD and segmentation to exclude silences, and TitaNet is used to extract speaker embeddings that identify the speaker for each segment. The result is then associated with the timestamps generated by `ctc-forced-aligner` to detect the speaker for each word, and finally realigned using punctuation models to compensate for minor time shifts.

Whisper and NeMo parameters are hard-coded in `diarize.py` and `helpers.py`; I will add CLI arguments to change them later.
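The step that associates speaker segments with word timestamps can be illustrated with a minimal sketch. This is not the code in `diarize.py` or `helpers.py`: the `Word` and `SpeakerTurn` types, the `assign_speakers` function, and the midpoint rule are all assumptions made for illustration, and the punctuation-based realignment is left out.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with aligned timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerTurn:
    """A diarization segment attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn contains the word's midpoint.

    Words whose midpoint falls in a gap between turns go to the nearer turn.
    """
    if not turns:
        raise ValueError("at least one speaker turn is required")
    turns = sorted(turns, key=lambda t: t.start)
    starts = [t.start for t in turns]
    labeled = []
    for word in words:
        mid = (word.start + word.end) / 2
        # Index of the last turn that starts at or before the midpoint.
        i = max(bisect_right(starts, mid) - 1, 0)
        # Midpoint lies after turn i ends: check whether the next turn is closer.
        if mid > turns[i].end and i + 1 < len(turns):
            if turns[i + 1].start - mid < mid - turns[i].end:
                i += 1
        labeled.append((word.text, turns[i].speaker))
    return labeled


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.2, 1.4)]
    turns = [SpeakerTurn("Speaker 0", 0.0, 1.0), SpeakerTurn("Speaker 1", 1.1, 2.0)]
    for text, speaker in assign_speakers(words, turns):
        print(f"{speaker}: {text}")
```

Using the midpoint rather than the start time makes the assignment less sensitive to small boundary shifts, which is the same kind of error the forced alignment step is meant to reduce.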
## Installation
Python >= `3.10` is needed; `3.9` will work, but you'll need to manually install the requirements one by one.

`FFMPEG` and `Cython` are needed as prerequisites to install the requirements.
```
pip install cython
```
or
```
sudo apt update && sudo apt install cython3
```
```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

# on Windows using WinGet (https://github.com/microsoft/winget-cli)
winget install ffmpeg
```
```
pip install -c constraints.txt -r requirements.txt
```
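Before installing the requirements, it can help to confirm that the prerequisites are in place. The following is a minimal sketch using only the Python standard library; the `check_prerequisites` function is a hypothetical helper and is not part of this repository.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # The README asks for Python >= 3.10.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python >= {wanted} is needed, found {sys.version.split()[0]}")
    # ffmpeg must be reachable as an executable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    # Cython must be importable in the current environment.
    if importlib.util.find_spec("Cython") is None:
        problems.append("Cython is not installed in this environment")
    return problems


if __name__ == "__main__":
    found = check_prerequisites()
    if found:
        for problem in found:
            print("Missing prerequisite:", problem)
    else:
        print("All prerequisites found.")
```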
## Usage
```
python diarize.py -a AUDIO_FILE_NAME
```
If your system has enough VRAM (>=10GB), you can use `diarize_parallel.py` instead. The difference is that it runs NeMo in parallel with Whisper, which can be beneficial in some cases; the result is the same since the two models do not depend on each other. This is still experimental, so expect errors and sharp edges. Your feedback is welcome.
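For batch jobs, the scripts can also be driven from Python. The sketch below only assembles the command line from the flags documented in this README and hands it to `subprocess`. The `build_diarize_command` and `run_diarization` helpers are hypothetical, and the sketch assumes `--no-stem` and `--suppress_numerals` are plain on/off switches and that `diarize_parallel.py` accepts the same options as `diarize.py`.

```python
import subprocess
import sys


def build_diarize_command(audio_path, *, parallel=False, whisper_model=None,
                          language=None, device=None, batch_size=None,
                          no_stem=False, suppress_numerals=False):
    """Build the argument list for diarize.py (or diarize_parallel.py)."""
    script = "diarize_parallel.py" if parallel else "diarize.py"
    cmd = [sys.executable, script, "-a", str(audio_path)]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if language is not None:
        cmd += ["--language", language]
    if device is not None:
        cmd += ["--device", device]
    # 0 is a meaningful value (non-batched inference), so compare against None.
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if no_stem:
        cmd.append("--no-stem")
    if suppress_numerals:
        cmd.append("--suppress_numerals")
    return cmd


def run_diarization(audio_path, **options):
    """Run the script from the repository root; raises CalledProcessError on failure."""
    return subprocess.run(build_diarize_command(audio_path, **options), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.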
## Command Line Options
- `-a AUDIO_FILE_NAME`: The name of the audio file to be processed
- `--no-stem`: Disables source separation
- `--whisper-model`: The model to be used for ASR, default is `medium.en`
- `--suppress_numerals`: Transcribes numbers as spelled-out words instead of digits, which improves alignment accuracy
- `--device`: Choose which device to use, defaults to "cuda" if available
- `--language`: Manually select language, useful if language detection failed
- `--batch-size`: Batch size for batched inference; reduce it if you run out of memory, or set it to 0 for non-batched inference

## Known Limitations
- Overlapping speakers are yet to be addressed. A possible approach would be to separate the audio file and isolate only one speaker, then feed it into the pipeline, but this will need much more computation.
- There might be some errors; please raise an issue if you encounter any.

## Future Improvements
- Implement a maximum length per sentence for SRT

## Acknowledgements
Special thanks to [@adamjonas](https://github.com/adamjonas) for supporting this project.
This work is based on [OpenAI's Whisper](https://github.com/openai/whisper), [Faster Whisper](https://github.com/guillaumekln/faster-whisper), [Nvidia NeMo](https://github.com/NVIDIA/NeMo), and [Facebook's Demucs](https://github.com/facebookresearch/demucs).

## Citation
If you use this in your research, please cite the project:
```bibtex
@unpublished{hassouna2024whisperdiarization,
title={Whisper Diarization: Speaker Diarization Using OpenAI Whisper},
author={Ashraf, Mahmoud},
year={2024}
}
```