# 📺 Auto-Subtitles


- [About](#about)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)

![banner](/assets/banner-dall-e.jpeg)

## About

**Auto-Subtitles** is a CLI tool that automatically generates and embeds subtitles for any YouTube video. It can also translate the generated transcript before the subtitles are embedded in the output video.

### Why Should You Use It?

Before the recent advances in automatic speech recognition (ASR), transcription was often seen as a tedious manual task that required carefully listening to and interpreting the given audio.

I studied and interned in the film and media industry before working as a Machine Learning/Platform Engineer, and was involved in several productions that required manually generating transcriptions and overlaying subtitles in video editing software for various advertisements and commercials.

With OpenAI's [Whisper](https://github.com/openai/whisper) models garnering favourable interest from developers due to the ease of local processing and [high](https://www.speechly.com/blog/analyzing-open-ais-whisper-asr-models-word-error-rates-across-languages) accuracy in languages such as English, they soon became a viable (free) drop-in replacement for professional (paid) transcription services.

While far from perfect, **Auto-Subtitles** generates transcriptions entirely on your local setup and is easy to set up and use from the get-go. The CLI tool can serve as the initial phase of a subtitling workflow: it produces a first draft of the transcript that a human can vet and edit before the final subtitles are embedded in the output. This can cut down the time-intensive process of scrubbing audio and typing every single word from scratch.

## Features

### Supported Models

Currently, the auto-subtitles workflow supports the following variant(s) of the Whisper model:

1. [@ggerganov/whisper.cpp](https://github.com/ggerganov/whisper.cpp):
- Provides the `whisper-cpp` backend for the workflow.
- A port of OpenAI's Whisper model in C/C++ that generates fast transcriptions on a local setup (esp. macOS via `MPS`).
2. [@jianfch/stable-ts](https://github.com/jianfch/stable-ts):
- Provides the [`faster-whisper`](https://github.com/SYSTRAN/faster-whisper) backend for the workflow, while producing more reliable and accurate timestamps for transcription.
- Also provides VAD filters to detect voice activity more accurately (see the sketch after this list).
3. [@Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) [`Experimental`]:
- Leverages Flash Attention 2 (or Scaled Dot Product Attention) and batching to improve transcription speed.
- Works only on GPU setups (`cuda` or `mps`) at the moment.
- Supports only the `large`, `large-v2`, and `large-v3` models.
- No built-in support for a maximum segment length; the workflow currently uses self-implemented heuristics to adjust segment lengths.
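
As a rough illustration of what the `faster-whisper` backend with a VAD filter does, below is a minimal sketch using the `faster-whisper` library directly (the workflow itself wires it through `stable-ts`, so this is illustrative only); the audio file name and model size are placeholders:

```python
# Minimal sketch: transcription with faster-whisper and its VAD filter.
# Illustrative only; the workflow routes faster-whisper through stable-ts.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

# vad_filter drops audio with no detected voice activity before decoding
segments, info = model.transcribe("audio.wav", vad_filter=True)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    # each segment carries start/end timestamps (in seconds) and decoded text
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text.strip()}")
```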

### Translation

In **Auto-Subtitles**, we also include the functionality to translate transcripts, e.g., from `English (en)` to `Chinese (zh)`, before embedding subtitles in the output video.

We did not opt to use the translation feature built into the Whisper model due to observed performance issues and hallucinations in the generated transcripts.

To support a more efficient and reliable translation process, we use Meta AI's family of models, [No Language Left Behind (NLLB)](https://ai.meta.com/research/no-language-left-behind/), to translate after transcription.

Currently, the following models are supported:

1. [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)
2. [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
3. [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
4. [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)

By default, the `facebook/nllb-200-distilled-600M` model is used.
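
For illustration, here is a minimal sketch of translating a transcript segment with the default NLLB model through the Hugging Face `transformers` translation pipeline; this shows the general approach, not necessarily how the workflow invokes the model:

```python
# Minimal sketch: NLLB translation via the Hugging Face transformers pipeline.
# Illustrative only; the workflow's own translation code may differ.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # FLORES-200 code for English
    tgt_lang="zho_Hans",  # FLORES-200 code for Simplified Chinese
)

result = translator("Subtitles make videos accessible to a wider audience.")
print(result[0]["translation_text"])
```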

## Installation

For this project, you can set up the dependencies and environment either locally or in a containerised environment with Docker.

### Local Setup

#### Pre-requisites

1. [ffmpeg](https://ffmpeg.org/download.html#build-mac)

> Alternatively, install it via a package manager (referenced from [@openai/whisper](https://github.com/openai/whisper)):

```shell
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```

2. [Python 3.9](https://www.python.org/downloads/)
3. [whisper.cpp](https://github.com/ggerganov/whisper.cpp)

```shell
# build the binary for usage
git clone https://github.com/ggerganov/whisper.cpp.git

cd whisper.cpp
make
```

- Please refer to the actual [repo](https://github.com/ggerganov/whisper.cpp.git) for all other build arguments relevant to your local setup for better performance.
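
After building, the `whisper.cpp` binary can be invoked directly once a `ggml` model has been downloaded (the repository ships a download script under `models/`). Below is a rough sketch of such an invocation from Python; the paths are placeholders (the binary path is the same one you would later pass via `-wbp`), and the flags should be verified against `./main --help` for your build:

```python
# Rough sketch: invoking a pre-built whisper.cpp binary; paths are placeholders.
import os
import subprocess

whisper_bin = os.path.expanduser("~/code/whisper.cpp/main")
model_path = os.path.expanduser("~/code/whisper.cpp/models/ggml-medium.bin")

subprocess.run(
    [
        whisper_bin,
        "-m", model_path,   # downloaded ggml model
        "-f", "audio.wav",  # 16 kHz WAV input
        "-t", "8",          # threads
        "-ml", "47",        # max characters per segment
        "-osrt",            # write an .srt file next to the input
    ],
    check=True,
)
```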

#### Python Dependencies

Install the dependencies in `requirements.txt` into a virtual environment (`virtualenv`):

```shell
python -m venv .venv

# activate the virtual environment (macOS/Linux)
source .venv/bin/activate

# install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```

### Docker Setup

To run the workflow using Docker, first build the image:

```bash
# build the image
docker buildx build -t auto-subs .
```

## Usage

### Transcribing

To run the automatic subtitling process on this [video](https://www.youtube.com/watch?v=fnvZJU5Fj3Q), simply run the following command (refer to [Detailed Options](#detailed-options) for advanced options):

#### Local

```shell
chmod +x ./workflow.sh

./workflow.sh -u https://www.youtube.com/watch?v=fnvZJU5Fj3Q \
-b faster-whisper \
-t 8 \
-m medium \
-ml 47
```

#### Docker

```bash
# run the image
# replace <host-output-dir> with a directory on your machine to collect the output
docker run \
--volume <host-output-dir>:/app/output \
auto-subs \
-u https://www.youtube.com/watch?v=fnvZJU5Fj3Q \
-b faster-whisper \
-t 8 \
-ml 47
```

The above commands run the workflow with the following settings:

1. Using the `faster-whisper` backend
- Produces more reliable and accurate timestamps than `whisper-cpp`, e.g., through the use of VAD.
2. Running on `8` threads for increased performance
3. Using the [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium) multi-lingual model
4. Limiting each transcription segment to a maximum of [`47`](https://www.capitalcaptions.com/services/subtitle-services-2/capital-captions-standard-subtitling-guidelines/) characters.

The following is the generated video:

### Transcribing + Translating

To run the automatic subtitling process for the following [video](https://www.youtube.com/watch?v=DtLJjNyl57M) and generate `Chinese (zh)` subtitles:

#### Local

```shell
chmod +x ./workflow.sh

./workflow.sh -u https://www.youtube.com/watch?v=DtLJjNyl57M \
-b whisper-cpp \
-wbp ~/code/whisper.cpp \
-t 8 \
-m medium \
-ml 47 \
-tf "eng_Latn" \
-tt "zho_Hans"
```

#### Docker

```bash
# run the image
# replace <host-output-dir> with a directory on your machine to collect the output
docker run \
--volume <host-output-dir>:/app/output \
auto-subs \
-u https://www.youtube.com/watch?v=DtLJjNyl57M \
-b whisper-cpp \
-t 8 \
-ml 47 \
-tf "eng_Latn" \
-tt "zho_Hans"
```

The above commands run the workflow with the following settings:

1. Using the `whisper-cpp` backend
- Faster transcription compared to `faster-whisper`.
- However, it may produce a degraded output video with inaccurate timestamps or subtitles appearing early where there is no noticeable voice activity.
2. Specifying the directory path to the pre-built `whisper.cpp` binary used for transcription (`-wbp`, shown in the local command above).
3. Running on `8` threads for increased performance
4. Using the [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium) multi-lingual model
5. Limiting each transcription segment to a maximum of [`47`](https://www.capitalcaptions.com/services/subtitle-services-2/capital-captions-standard-subtitling-guidelines/) characters.
6. Translating from (`-tf`) **English (eng_Latn)** to (`-tt`) **Chinese (zho_Hans)**, using the FLORES-200 codes found [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).

The following is the generated video:

### Detailed Options

To check all the available options, use the `--help` flag:

```shell
./workflow.sh --help

Usage: ./workflow.sh [-u <url>] [options]
Options:
-u, --url YouTube video URL
-o, --output-path Output path
-b, --backend Backend to use: whisper-cpp or faster-whisper
-wbp, --whisper-bin-path Path to whisper-cpp binary. Required if using [--backend whisper-cpp].
-ml, --max-length Maximum length of the generated transcript
-t, --threads Number of threads to use
-w, --workers Number of workers to use
-m, --model Model name to use
-tf, --translate-from Translate from language
-tt, --translate-to Translate to language
-f, --font Font to use for subtitles
```
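
The `-ml`/`--max-length` option caps the number of characters per subtitle segment. As a hypothetical sketch of the kind of greedy splitting heuristic this implies (the workflow's actual implementation may differ):

```python
# Hypothetical sketch of a greedy max-length heuristic for subtitle segments;
# the workflow's actual segment-splitting logic may differ.
def split_segment(text: str, max_length: int = 47) -> list[str]:
    """Greedily pack words into lines of at most `max_length` characters."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(split_segment("this is a long transcription segment that needs to be split into lines"))
```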

## [WIP] Performance

> For the `mps` device, performance testing is being run on an M2 Max (12 CPU / 30 GPU cores) MacBook Pro (14-inch, 2023).

### Transcription

| Model | Backend | Device | Threads | Time Taken |
| ------ | -------------- | ------ | ------- | ---------- |
| base | whisper-cpp | cpu | 4 | ~ |
| base | whisper-cpp | mps | 4 | ~ |
| base | faster-whisper | cpu | 4 | ~ |
| base | faster-whisper | mps | 4 | ~ |
| medium | whisper-cpp | cpu | 4 | ~ |
| medium | whisper-cpp | mps | 4 | ~ |
| medium | faster-whisper | cpu | 4 | ~ |
| medium | faster-whisper | mps | 4 | ~ |

### Transcription + Translation

| Model | Backend | Device | Threads | Time Taken |
| ------ | -------------- | ------ | ------- | ---------- |
| base | whisper-cpp | cpu | 4 | ~ |
| base | whisper-cpp | mps | 4 | ~ |
| base | faster-whisper | cpu | 4 | ~ |
| base | faster-whisper | mps | 4 | ~ |
| medium | whisper-cpp | cpu | 4 | ~ |
| medium | whisper-cpp | mps | 4 | ~ |
| medium | faster-whisper | cpu | 4 | ~ |
| medium | faster-whisper | mps | 4 | ~ |
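
The entries above are still to be measured. A rough sketch of how a timing could be collected, assuming the workflow is invoked as in the Usage section:

```python
# Rough sketch: timing a single workflow run (invocation mirrors the Usage section).
import subprocess
import time

url = "https://www.youtube.com/watch?v=fnvZJU5Fj3Q"

start = time.perf_counter()
subprocess.run(
    ["./workflow.sh", "-u", url, "-b", "faster-whisper", "-t", "4", "-m", "base", "-ml", "47"],
    check=True,
)
print(f"Time taken: {time.perf_counter() - start:.1f}s")
```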

## Known Issues

1. Korean subtitles are not supported at the moment.
- **Details**: The default font used to embed subtitles is `Arial Unicode MS`, which does not provide glyphs for Korean characters.
- **Potential Solution**: Add alternate fonts for Korean characters (see the sketch below).
- **Status**: ✅ `Done`
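
For reference, below is a hedged sketch of how subtitles can be burned in with an alternate font via ffmpeg's `subtitles` filter; the font name (`Noto Sans KR`) and file names are placeholders, and the fonts actually bundled under `./fonts` may differ:

```python
# Hedged sketch: burning subtitles with an alternate, Korean-capable font.
# "Noto Sans KR" and the file names are placeholders.
import subprocess

vf = "subtitles=output.srt:fontsdir=./fonts:force_style='FontName=Noto Sans KR'"
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vf", vf, "-c:a", "copy", "subtitled.mp4"],
    check=True,
)
```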

## Changelog

1. 🗓️ **[24/02/2024]**: Include a `./fonts` folder to host downloaded fonts to be copied into the Docker container. Once copied, users can specify their desired font with the `-f` or `--font` flag.