# Diarization using Faster Whisper

### Input

https://github.com/user-attachments/assets/cddb4a38-6a41-4743-94a6-6c7a3fcfb329

### Outputs
```json
...
{
  "start": 10.24,
  "end": 11.4,
  "text": "Do you think we can get back to that?",
  "id": 2,
  "speaker": "SPEAKER_2"
},
{
  "start": 12.4,
  "end": 14.8,
  "text": "Not for now. We can, we have to try.",
  "id": 3,
  "speaker": "SPEAKER_1"
},
{
  "start": 14.96,
  "end": 21.66,
  "text": "I mean, I will certainly try because I'm a big believer, that football is changing and you need to adapt to every situation",
  "id": 4,
  "speaker": "SPEAKER_1"
},
...
```

| Start | End | Speaker | Transcription |
|---------|---------|-----------|---------------|
| 0:00:00 | 0:00:10 | SPEAKER_1 | "You could feel, you could smell, you could sense the next step, you could smell the play. You could smell if the other team was, you know, slowing down." |
| 0:00:10 | 0:00:11 | SPEAKER_2 | Do you think we can get back to that? |
| 0:00:12 | 0:00:28 | SPEAKER_1 | "Not for now. We can, we have to try. I mean, I will certainly try because I'm a big believer, that football is changing and you need to adapt to every situation and you need to adapt to what's happening now. I mean, you cannot get away from this and you cannot underestimate the..." |

## What it does

1) Converts video/audio files to WAV format
2) Transcribes speech using Faster Whisper (GPU-accelerated)
3) Identifies different speakers in the audio
4) Outputs formatted transcripts with speaker labels (sketched below)
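
For orientation, here's a minimal sketch of that pipeline. It is not the exact `diarize.py`: the `pyannote/speaker-diarization-3.1` checkpoint and the midpoint-based speaker matching below are illustrative assumptions, not necessarily what the repo uses.

```python
# Minimal sketch of the pipeline above -- not the exact diarize.py.
# Assumptions: the "pyannote/speaker-diarization-3.1" checkpoint and the
# midpoint-based speaker matching are illustrative choices, not the repo's.
import json
import subprocess

from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# 1) Convert the input to 16 kHz mono WAV with ffmpeg
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4",
                "-ar", "16000", "-ac", "1", "audio.wav"], check=True)

# 2) Transcribe with Faster Whisper on the GPU
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, _info = model.transcribe("audio.wav", vad_filter=True)

# 3) Diarize with pyannote, authenticating with the token from token.txt
token = open("token.txt").read().strip()
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=token)("audio.wav")

def speaker_at(t: float) -> str:
    """Label of the speaker whose turn covers time t, or UNKNOWN."""
    for turn, _track, label in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return label
    return "UNKNOWN"

# 4) Emit speaker-labelled segments shaped like the JSON output above
out = [{"start": round(seg.start, 2), "end": round(seg.end, 2),
        "text": seg.text.strip(), "id": i,
        "speaker": speaker_at((seg.start + seg.end) / 2)}
       for i, seg in enumerate(segments)]
print(json.dumps(out, indent=2))
```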

## Requirements
- Python 3.7+
- PyTorch with CUDA support (I personally used CUDA 12.6 on Windows)
- NVIDIA GPU (ran Whisper distil-large-v3 with ease on a GeForce GTX 1660)
- ffmpeg (the 🐐)

## Setup
- Install the dependencies from `requirements.txt`
- Create a `token.txt` file containing your HuggingFace token, used to pull any faster-whisper models that aren't cached on your system (you'll likely also be asked to accept the pyannote model license)
- (Optional) Create a `.prompt` file with a custom transcription prompt

## Usage
```shell
python diarize.py input_folder
# or, with options:
python diarize.py input_folder --num_speakers=2 --whisper_model=medium.en --vad_filter=True
```
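
The option names come from the invocation above; how `diarize.py` actually parses them is an assumption, but a minimal argparse version could look like this:

```python
# Sketch of the CLI shown above; diarize.py's real option handling may differ.
import argparse

parser = argparse.ArgumentParser(
    description="Transcribe and diarize every media file in a folder")
parser.add_argument("input_folder",
                    help="folder of audio/video files to process")
parser.add_argument("--num_speakers", type=int, default=None,
                    help="fix the speaker count instead of estimating it")
parser.add_argument("--whisper_model", default="medium.en",
                    help="faster-whisper model name (e.g. medium.en)")
parser.add_argument("--vad_filter", type=lambda s: s.lower() == "true",
                    default=False, help="enable faster-whisper's VAD filter")
args = parser.parse_args()
```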

## Model Size and Chunking
The tool adjusts processing settings based on model size:

### Large models (large-v1, large-v2, large-v3)
- Smallest chunks (45 seconds)
- Higher precision (float32)
- More memory-intensive but more accurate

### Medium models (medium, medium.en)
- Medium chunks (75 seconds)
- Higher precision (float32)
- Good balance of accuracy and resource usage

### Small models (small, small.en, base)
- Largest chunks (100 seconds)
- Lower precision (float16)
- Less resource-intensive but potentially less accurate

For very long audio (10+ minutes), chunk sizes are automatically reduced further to prevent memory issues.
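
A sketch of how that selection might be implemented: the per-size chunk lengths and precisions come from this README, while the long-audio reduction (halving, with a 30-second floor) is an illustrative assumption.

```python
# Sketch of the chunking policy above. Chunk lengths and precisions come
# from this README; the long-audio reduction (halving, floor of 30 s) is
# an illustrative assumption.
def processing_settings(model_name: str, audio_seconds: float) -> tuple[int, str]:
    if model_name.startswith("large"):
        chunk, compute_type = 45, "float32"   # smallest chunks, most accurate
    elif model_name.startswith("medium"):
        chunk, compute_type = 75, "float32"   # balanced
    else:                                     # small, small.en, base
        chunk, compute_type = 100, "float16"  # cheapest
    if audio_seconds > 600:                   # 10+ minutes: shrink chunks
        chunk = max(30, chunk // 2)
    return chunk, compute_type
```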

## Troubleshooting

If you encounter the "exclamation mark issue" (output text filled with `!!!`), try the following:
- Normalize the audio with the normalization script (it amplifies the volume level); a rough equivalent is sketched below
- This usually happens when the model is larger than your GPU's VRAM can handle, so lower the chunk size
- Try a smaller model
- Get a better GPU
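
If the repo's normalization script isn't handy, a rough stand-in is ffmpeg's `loudnorm` filter (file names below are placeholders):

```python
# Rough stand-in for the repo's normalization script: loudness-normalize
# the audio with ffmpeg's loudnorm filter. File names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "quiet.wav",
    "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # one-pass EBU R128 loudness
    "normalized.wav",
], check=True)
```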