https://github.com/adamelkholyy/whisper-yt

Toolkit for using Whisper to transcribe YouTube videos. Includes Whisper transcription of YouTube videos, conversion of YouTube video into HuggingFace dataset (using audio and subtitles) and evaluation of Whisper transcription against YouTube subtitles
https://github.com/adamelkholyy/whisper-yt

asr diarization huggingface-datasets pyannote transcription whisper word-error-rate youtube

Last synced: about 1 month ago
JSON representation

Host: GitHub
URL: https://github.com/adamelkholyy/whisper-yt
Owner: adamelkholyy
License: apache-2.0
Created: 2024-09-20T14:45:09.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-10-21T11:43:42.000Z (7 months ago)
Last Synced: 2025-03-26T09:21:18.552Z (about 2 months ago)
Topics: asr, diarization, huggingface-datasets, pyannote, transcription, whisper, word-error-rate, youtube
Language: Python
Homepage:
Size: 79.1 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # whisper-yt

## Overview

Toolkit for using Whisper to transcribe YouTube videos. Allows for the following functionality given only a YouTube video URL:

- Transcribe YouTube videos using Whisper, including optional diarization using pyannote

- Download and split YouTube audio into Whisper compatible segments (audio_1.mp3, audio_2.mp3, ..., audio_n.mp3) 

- Download and preprocess YouTube subtitles into a HuggingFace compatible timestamped transcript.json file

- Create and save a HuggingFace dataset from a YouTube video for use in Whisper fine tuning

- Evaluate the Word Error Rate of Whisper transcriptions against YouTube's subtitles

## Prerequisites: ffmpeg

Ensure you have the [ffmpeg](https://www.ffmpeg.org/download.html) installed for audio processing. Using chocolatey:  

```

choco install ffmpeg

```   

Or see more instructions for [ffmpeg installation](https://avpres.net/FFmpeg/install_Windows)

## Usage

NB: The following examples were run on a laptop using cuda with an NVIDIA GeForce GTX 1650 graphics card with the base Whisper model ('openai-whisper/base'). Better transcription results and faster runtimes would be seen using a larger Whisper model with a better graphics card.  

### 1.a. Downloading and Transcribing YouTube Videos (without Diarization)

You can download and transcribe a YouTube video with Whisper (and optionally save the transcript) without diarization as follows

```python

from whisper_yt.yt_downloader import download_mp3

from whisper_yt.whisper_utilities import transcribe_mp3

from whisper_yt.utilities import save_transcript

# YouTube video url: 40 second elevator pitch 

URL = "https://www.youtube.com/watch?v=4WEQtgnBu0I"

# download youtube mp3 from url: default location downloads/audio.mp3 

download_mp3(URL)

# transcribe using whisper 

transcription = transcribe_mp3(mp3_path="downloads/audio.mp3", model_type="openai/whisper-base")

# save transcript to file: defualt location transcripts/transcript.txt

save_transcript(transcription)

```

Outputs to transcripts/transcript.txt:

```plaintext

Hello, my name is Andrea Fitzgera. I am studying marketing at the University of Texas at Dallas.

I am a member of the American Marketing Association and AlphaCapas-Sci, both of which are dedicated to shaping future business leaders.

I hope to incorporate my business knowledge into consumer trend analysis and strengthening relationships among consumers, as well as other companies. I am savvy, social, and principled, and have exquisite interpersonal communication skills.

I know that I can be an asset in any company and or situation,

and I hope that you will consider me for an internship or job opportunity.

Thank you so much.

```

### 1.b. Downloading and Transcribing YouTube Videos (with Diarization)

You can download and transcribe a YouTube video with Whisper (and optionally save the transcript) with diarization using pyannote as follows. Note that pyannote requires a HuggingFace authorization token and accepted permissions on the pyannote page, see [here](https://github.com/pyannote/pyannote-audio) for more details.

```python

from whisper_yt.yt_downloader import download_mp3

from whisper_yt.whisper_utilities import transcribe_mp3

from whisper_yt.utilities import save_transcript

# YouTube video url: 2 minute two person job interview 

URL = "https://www.youtube.com/watch?v=naIkpQ_cIt0"

# replace with your HuggingFace authorization token

my_auth_token = "hf_my_huggingface_authtoken"

# download youtube mp3 from url: default location downloads/audio.mp3 

download_mp3(URL)

# transcribe and diarize using whisper 

transcription = transcribe_mp3(mp3_path="downloads/audio.mp3",

                               model_type="openai/whisper-base",

                               diarize=True,

                               auth_token=my_auth_token)

# save transcript to file: defualt location transcripts/transcript.txt

save_transcript(transcription)

```

Outputs to transcripts/transcript.txt:

```plaintext

...

SPEAKER_01: Mary, do you have any experience working in the kitchen?

SPEAKER_00: No, but I want to learn. I work hard and I cook a lot at home.

SPEAKER_01: Okay, well tell me about yourself.

SPEAKER_00: Well, I love to learn new things. I'm very organized.

SPEAKER_00: And I follow directions exactly.

SPEAKER_00: That's why my boss at my last job made me a trainer.

SPEAKER_00: And the company actually gave me a special certificate

SPEAKER_00: for coming to work on time every day for a year.

SPEAKER_01: That's great.

SPEAKER_01: Why did you leave your last job?

...

```

Note that the pyannote diarization can sometimes be faulty, double checking the diarized transcript is highly recommended

### 2. Creating and Saving a HuggingFace Dataset from a YouTube Video

You can create a HuggingFace dataset from a YouTube video (using its segmented audio as inputs and subtitles as ground truth transcriptions) for use in Whisper fine-tuning and/or evaluation by calling ```download_and_preprocess_yt()```, which performs the following operations:

  1) Downloads audio mp3 and subtitles raw transcript from YouTube video url

  2) Processess the raw transcript timestamps and text data

  3) Splits audio into segments from transcript 

  4) Creates transcript.json containing audio segment and clean timestamped transcript data

```python

from whisper_yt.yt_downloader import download_and_preprocess_yt

from whisper_yt.utilities import make_dataset

# YouTube video url: 40 second elevator pitch 

URL = "https://www.youtube.com/watch?v=4WEQtgnBu0I"

# downloads audio file and raw transcript to 'downloads/' and saves segmented audio to 'data/'

download_and_preprocess_yt(url)

# create huggingface dataset

ds = make_dataset(data_dir="data")

# save dataset to disk

ds.save_to_disk(dataset_path="datasets/elevator_pitch_ds")

```

This creates a directory, 'data/', structured as follows

```plaintext

data/ 

  audio_1.mp3

  audio_2.mp3 

  ...

  transcript.json

```

with transcript.json as follows 

```json

  {

    "start": 480,

    "end": 2629,

    "text": "hello my name is Andrea fitzer I am",

    "audio": "audio_1.mp3"

  },

  {

    "start": 2639,

    "end": 4430,

    "text": "studying marketing at the University of",

    "audio": "audio_2.mp3"

  },

```

The HuggingFace dataset is then created from this 'data/' directory, which is in the correct format for Whisper evaluation and fine tuning

### 3. Evaluating Whisper WER Against YouTube Subtitles

You can evalute the Word Error Rate of Whisper transcriptions using YouTube subtitles, which is especially useful in cases of manually transcribed YouTube subtitles.  

  

We will first quickly make a dataset using a video known to have manual transcriptions

```python

from whisper_yt.yt_downloader import download_and_preprocess_yt

from whisper_yt.utilities import make_dataset

# YouTube video url: 2 minute two person job interview 

URL = "https://www.youtube.com/watch?v=naIkpQ_cIt0"

download_and_preprocess_yt(URL)

ds = make_dataset(data_dir="data")

```

Now we evaluate the WER of Whisper's transcriptions against the manually transcribed subtitles

```python

from whisper_yt.whisper_utilities import get_whisper_transcription

from whisper_yt.utilities import filter_empty_references

from evaluate import load

# transcribed dataset using Whisper

transcribed_ds = get_whisper_transcription(ds)

references = transcribed_ds['reference']    # ground truth text

predictions = transcribed_ds['prediction']  # Whisper transcriptions

# filter out blocks of silence for more accurate WER calculation

references, predictions = filter_empty_references(references, predictions)

# calculate final WER

wer_function = load("wer")

wer_score = 100 * wer_function.compute(references=references, predictions=predictions)

print(f"Word Error Rate (WER) of Whisper transcriptions against youtube subtitles: {wer_score:.3f}%")

```

Which gives us a WER output as follows

```plaintext

Word Error Rate (WER) of Whisper transcriptions against youtube subtitles: 10.219%

```

### See Also

- main.py for more examples

- [pyannote](https://github.com/pyannote/pyannote-audio)

- [ffmpeg](https://www.ffmpeg.org/)

- [Whisper](https://github.com/openai/whisper)

- YouTube: (auto-generated subtitles) [40 second elevator pitch](https://www.youtube.com/watch?v=4WEQtgnBu0I)

- YouTube: (manually transcribed) [2 minute two person job interview](https://www.youtube.com/watch?v=naIkpQ_cIt0)

- YouTube: (manually transcribed) [3 minute tutorial video](https://www.youtube.com/watch?v=VatNBZh66Po)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/adamelkholyy/whisper-yt

Awesome Lists containing this project

README