https://github.com/tsmdt/whisply

💬 Transcribe, translate, diarize, annotate and subtitle video (and audio) with Whisper on Win, Linux and Mac ... fast!

asr automatic-speech-recognition speech-recognition speech-to-text subtitles transcription-tool whisper-ai

Host: GitHub
URL: https://github.com/tsmdt/whisply
Owner: tsmdt
License: apache-2.0
Created: 2024-04-30T07:58:03.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-03-28T09:17:12.000Z (about 1 year ago)
Last Synced: 2025-04-02T10:14:43.875Z (about 1 year ago)
Topics: asr, automatic-speech-recognition, speech-recognition, speech-to-text, subtitles, transcription-tool, whisper-ai
Language: Python
Homepage:
Size: 4 MB
Stars: 38
Watchers: 1
Forks: 11
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-mlx - whisply - platform CLI and GUI for batch transcription, translation, speaker annotation and subtitle generation using OpenAI’s Whisper on CPU, Nvidia GPU and Apple MLX. (Audio & Speech)
Awesome-Whisper-Apps - whisply - Automatic subtitle generation (Linux) (By Use Case / Subtitles & Captioning)

README

# whisply

[![PyPI version](https://badge.fury.io/py/whisply.svg)](https://badge.fury.io/py/whisply)

*Transcribe, translate, annotate and subtitle audio and video files with OpenAI's [Whisper](https://github.com/openai/whisper) ... fast!*

`whisply` combines [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) to offer an easy-to-use solution for batch processing files on Windows, Linux and Mac. It also enables word-level speaker annotation by integrating [whisperX](https://github.com/m-bain/whisperX) and [pyannote](https://github.com/pyannote/pyannote-audio).

## Table of contents

- [Features](#features)
- [Requirements](#requirements)
- [Installation](#installation)
  - [Install `ffmpeg`](#install-ffmpeg)
  - [Installation with `pip`](#installation-with-pip)
  - [Installation from `source`](#installation-from-source)
  - [Nvidia GPU fix for Linux users (March 2025)](#nvidia-gpu-fix-for-linux-users-march-2025)
- [Usage](#usage)
  - [CLI](#cli)
  - [App](#app)
  - [Speaker annotation and diarization](#speaker-annotation-and-diarization)
    - [Requirements](#requirements-1)
    - [How speaker annotation works](#how-speaker-annotation-works)
  - [Post correction](#post-correction)
  - [Batch processing](#batch-processing)
    - [Using config files for batch processing](#using-config-files-for-batch-processing)

## Features

* 🚴‍♂️ **Performance**: `whisply` selects the fastest Whisper implementation based on your hardware:
  * CPU/GPU (Nvidia CUDA): `faster-whisper` or `whisperX`
  * MPS (Apple M1-M4): `insanely-fast-whisper`

* ⏩ **large-v3-turbo Ready**: Support for [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) on all devices. **Note**: Subtitling and annotations on CPU/GPU use `whisperX` for accurate timestamps, but `whisper-large-v3-turbo` isn’t currently available for `whisperX`.

* ✅ **Auto Device Selection**: `whisply` automatically chooses `faster-whisper` (CPU) or `insanely-fast-whisper` (MPS, Nvidia GPUs) for transcription and translation unless a specific `--device` option is passed.

* 🗣️ **Word-level Annotations**: Enabling `--subtitle` or `--annotate` uses `whisperX` or `insanely-fast-whisper` for word segmentation and speaker annotations. `whisply` approximates missing timestamps for numeric words.

* 💬 **Customizable Subtitles**: Specify words per subtitle block (e.g., "5") to generate `.srt` and `.webvtt` files with fixed word counts and timestamps.

* 🧺 **Batch Processing**: Handle single files, folders, URLs, or lists via `.list` documents. See the [Batch processing](#batch-processing) section for details.

* 👩‍💻 **CLI / App**: `whisply` can be run directly from CLI or as an app with a graphical user-interface (GUI).

* ⚙️ **Export Formats**: Supports `.json`, `.txt`, `.txt (annotated)`, `.srt`, `.webvtt`, `.vtt`, `.rttm` and `.html` (compatible with [noScribe's editor](https://github.com/kaixxx/noScribe)).
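
To make the fixed word-count subtitle blocks from the *Customizable Subtitles* feature more concrete, here is a minimal sketch that groups word-level timestamps into `.srt` cues. The `(word, start, end)` tuple layout and the helper names `format_srt_time` and `words_to_srt` are assumptions made for illustration; this is not `whisply`'s internal code.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds): assumed layout


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], sub_length: int = 5) -> str:
    """Group words into cues of `sub_length` words and render them as SRT."""
    if sub_length < 1:
        raise ValueError("sub_length must be at least 1")
    cues = []
    for index, offset in enumerate(range(0, len(words), sub_length), start=1):
        chunk = words[offset:offset + sub_length]
        # A cue spans from the first word's start to the last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        cues.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)
```

The last cue may contain fewer than `sub_length` words when the word count is not an exact multiple.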

## Requirements

* [FFmpeg](https://ffmpeg.org/)
* Python >= 3.10
* GPU processing requires:
  * Nvidia GPU (CUDA: cuBLAS and cuDNN for CUDA 12)
  * Apple Metal Performance Shaders (MPS) (Mac M1-M4)
* Speaker annotation requires a [HuggingFace Access Token](https://huggingface.co/docs/hub/security-tokens)

## Installation

### Install `ffmpeg`

```shell
# --- macOS ---
brew install ffmpeg

# --- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg

# --- Windows ---
winget install Gyan.FFmpeg
```

For more information you can visit the [FFmpeg website](https://ffmpeg.org/download.html).

### Installation with `pip`

1. Create a Python virtual environment

```shell
python3 -m venv venv
```

2. Activate the environment

```shell
# --- Linux & macOS ---
source venv/bin/activate

# --- Windows ---
venv\Scripts\activate
```

3. Install whisply

```shell
pip install whisply
```

### Installation from `source`

1. Clone this repository

```shell
git clone https://github.com/tsmdt/whisply.git
```

2. Change to project folder

```shell
cd whisply
```

3. Create a Python virtual environment

```shell
python3 -m venv venv
```

4. Activate the Python virtual environment

```shell
# --- Linux & macOS ---
source venv/bin/activate

# --- Windows ---
venv\Scripts\activate
```

5. Install whisply

```shell
pip install .
```

### Nvidia GPU fix for Linux users (March 2025)

If you use whisply on a Linux system with an Nvidia GPU and encounter this error:

```shell
"Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory"
```

Use the following steps to fix the issue:

1. In your activated Python environment, run `pip list` and check that `torch==2.4.0` and `torchaudio==2.4.0` are installed.
2. If they are, run `pip install ctranslate2==4.5.0`. Otherwise, install `torch==2.4.0` and `torchaudio==2.4.0` with pip first, then install `ctranslate2==4.5.0`.
3. Export the following environment variable in your shell:

```shell
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

4. To make the setting permanent, append the export line to your virtual environment's `activate` script:

```shell
echo "export LD_LIBRARY_PATH=\`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'\`" >> path/to/your/python/env/bin/activate
```

Find additional information on [faster-whisper's GitHub page](https://github.com/SYSTRAN/faster-whisper).
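
The version check in steps 1 and 2 can also be scripted rather than read off `pip list`. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_versions` are hypothetical, and the expected versions are simply the ones named in the steps above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Versions named in the fix described above.
EXPECTED = {"torch": "2.4.0", "torchaudio": "2.4.0", "ctranslate2": "4.5.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_versions(expected: Dict[str, str] = EXPECTED) -> Dict[str, bool]:
    """Map each package to True if its installed version matches the expected one."""
    results = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # A local build tag such as "2.4.0+cu121" still counts as a match.
        results[package] = found is not None and found.split("+")[0] == wanted
    return results


if __name__ == "__main__":
    for package, ok in check_versions().items():
        print(f"{package}: {'ok' if ok else 'missing or different version'}")
```

Run it inside the activated virtual environment so that it inspects the same packages `whisply` will use.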

## Usage

### CLI

```shell
$ whisply

 Usage: whisply [OPTIONS]

 WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

╭─ Options
│ --files               -f     TEXT                                     Path to file, folder, URL or .list to process. [default: None]
│ --output_dir          -o     DIRECTORY                                Folder where transcripts should be saved. [default: transcriptions]
│ --device              -d     [auto|cpu|gpu|mps]                       Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4). [default: auto]
│ --model               -m     TEXT                                     Whisper model to use (List models via --list_models). [default: large-v3-turbo]
│ --lang                -l     TEXT                                     Language of provided file(s) ("en", "de") (Default: auto-detection). [default: None]
│ --annotate            -a                                              Enable speaker annotation (Saves .rttm | Default: False).
│ --num_speakers        -num   INTEGER                                  Number of speakers to annotate (Default: auto-detection). [default: None]
│ --hf_token            -hf    TEXT                                     HuggingFace Access token required for speaker annotation. [default: None]
│ --subtitle            -s                                              Create subtitles (Saves .srt, .vtt and .webvtt | Default: False).
│ --sub_length                 INTEGER                                  Subtitle segment length in words. [default: 5]
│ --translate           -t                                              Translate transcription to English (Default: False).
│ --export              -e     [all|json|txt|rttm|vtt|webvtt|srt|html]  Choose the export format. [default: all]
│ --verbose             -v                                              Print text chunks during transcription (Default: False).
│ --del_originals       -del                                            Delete original input files after file conversion. (Default: False)
│ --config                     PATH                                     Path to configuration file. [default: None]
│ --post_correction     -post  PATH                                     Path to YAML file for post-correction. [default: None]
│ --launch_app          -app                                            Launch the web app instead of running standard CLI commands.
│ --list_models                                                         List available models.
│ --install-completion                                                  Install completion for the current shell.
│ --show-completion                                                     Show completion for the current shell, to copy it or customize the installation.
│ --help                                                                Show this message and exit.
╰─────
```
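
When `whisply` is called from a script rather than typed by hand, the options listed above can be assembled programmatically. The sketch below is one way to do that, using only flags that appear in the help output; the wrapper function `build_whisply_command` and the input file name are hypothetical.

```python
import shlex
from typing import List, Optional


def build_whisply_command(
    files: str,
    output_dir: str = "transcriptions",
    lang: Optional[str] = None,
    subtitle: bool = False,
    sub_length: int = 5,
    export: str = "all",
) -> List[str]:
    """Assemble a whisply invocation as an argument list for subprocess.run."""
    command = ["whisply", "--files", files, "--output_dir", output_dir, "--export", export]
    if lang is not None:
        command += ["--lang", lang]
    if subtitle:
        command += ["--subtitle", "--sub_length", str(sub_length)]
    return command


command = build_whisply_command(
    "video_01.mp4", lang="en", subtitle=True, sub_length=10, export="srt"
)
print(shlex.join(command))
# To run it (requires whisply on PATH): subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.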

### App

Instead of running `whisply` from the CLI, you can start the web app:

```shell
$ whisply --launch_app
```

or:

```shell
$ whisply -app
```

Open the local URL in your browser after starting the app (**Note**: The URL might differ from system to system):

```shell
* Running on local URL: http://127.0.0.1:7860
```

### Speaker annotation and diarization

#### Requirements

In order to annotate speakers using `--annotate` you need to provide a valid [HuggingFace](https://huggingface.co) access token using the `--hf_token` option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the `pyannote` segmentation model.

For detailed instructions, refer to the *Requirements* section on the [pyannote model page on HuggingFace](https://huggingface.co/pyannote/speaker-diarization-3.1#requirements) and make sure that you complete steps *"2. Accept pyannote/segmentation-3.0 user conditions"*, *"3. Accept pyannote/speaker-diarization-3.1 user conditions"* and *"4. Create access token at hf.co/settings/tokens"*.

#### How speaker annotation works

`whisply` uses [whisperX](https://github.com/m-bain/whisperX) for speaker diarization and annotation. Instead of returning chunk-level timestamps like the standard `Whisper` implementation, `whisperX` returns word-level timestamps and annotates speakers word by word, which yields much more precise annotations.

Out of the box `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"): `whisply` fixes those instances through timestamp approximation. Other known limitations of `whisperX` include:

* inaccurate speaker diarization if multiple speakers talk at the same time
* word-level timestamps and annotations depend on language-specific alignment models; out of the box, `whisperX` supports these languages: `en, fr, de, es, it, ja, zh, nl, uk, pt`.

Refer to the [whisperX GitHub page](https://github.com/m-bain/whisperX) for more information.
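
The README does not spell out how the timestamp approximation for numeric words is computed. One plausible approach is linear interpolation between the neighbouring words that do have timestamps, sketched below. The dictionary keys `word`, `start` and `end` and both function names are assumptions for illustration, not `whisply`'s actual implementation.

```python
from typing import Any, Dict, List

Word = Dict[str, Any]  # assumed keys: "word", plus "start"/"end" in seconds when known


def has_times(word: Word) -> bool:
    """Return True if the word carries both a start and an end time."""
    return "start" in word and "end" in word


def approximate_missing_timestamps(words: List[Word]) -> List[Word]:
    """Fill missing start/end times by dividing each gap evenly among its words."""
    result = [dict(word) for word in words]  # leave the input untouched
    i = 0
    while i < len(result):
        if has_times(result[i]):
            i += 1
            continue
        # result[i:j] is a run of consecutive words without timestamps.
        j = i
        while j < len(result) and not has_times(result[j]):
            j += 1
        # Anchor the run between the previous word's end and the next word's start.
        left = result[i - 1]["end"] if i > 0 else None
        right = result[j]["start"] if j < len(result) else None
        if left is None and right is None:
            break  # no timed neighbour at all; nothing to interpolate from
        left = right if left is None else left
        right = left if right is None else right
        step = (right - left) / (j - i)
        for k in range(i, j):
            result[k]["start"] = left + step * (k - i)
            result[k]["end"] = left + step * (k - i + 1)
        i = j
    return result
```

A run at the very beginning or end of the transcript has only one timed neighbour, so in this sketch its words collapse to zero-length intervals at that neighbour's boundary.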

### Post correction

The `--post_correction` option lets you correct recurring transcription errors in your files. It takes a `.yaml` file with the following structure as its argument:

```yaml
# Single word corrections
Gardamer: Gadamer

# Pattern-based corrections
patterns:
  - pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung'
    replacement: 'Klaus Tschira Stiftung'
```

- **Single word corrections**: match single words → `wrong word`: `correct word`
- **Pattern-based corrections**: match regular-expression patterns → `Klaus-(Cira|Cyra|Tira)-Stiftung` matches `Klaus-Cira-Stiftung`, `Klaus-Cyra-Stiftung` or `Klaus-Tira-Stiftung` and replaces each match with `Klaus Tschira Stiftung`

Post correction will be applied to **all** export file formats you choose.
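
As a rough sketch of how such a corrections file could be applied to a transcript, the example below mirrors the YAML structure as a plain Python dict (so no YAML parser is needed) and uses the standard `re` module. The function name `apply_corrections` and the whole-word matching for single-word entries are assumptions; `whisply`'s own matching rules may differ.

```python
import re
from typing import Any, Dict

# Mirrors the YAML structure above as a plain dict.
corrections: Dict[str, Any] = {
    "Gardamer": "Gadamer",
    "patterns": [
        {
            "pattern": r"Klaus-(Cira|Cyra|Tira)-Stiftung",
            "replacement": "Klaus Tschira Stiftung",
        }
    ],
}


def apply_corrections(text: str, corrections: Dict[str, Any]) -> str:
    """Apply single-word replacements first, then regex pattern replacements."""
    for wrong, right in corrections.items():
        if wrong == "patterns":
            continue
        # \b restricts the match to whole words (an assumption of this sketch).
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    for entry in corrections.get("patterns", []):
        text = re.sub(entry["pattern"], entry["replacement"], text)
    return text


print(apply_corrections("Gardamer sprach bei der Klaus-Cyra-Stiftung.", corrections))
```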

### Batch processing

Instead of passing a single file, folder or URL to the `--files` option, you can pass a `.list` document that contains a mix of files, folders and URLs, one entry per line.

Example:

```shell
$ cat my_files.list

video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
```
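
To illustrate what a mixed `.list` document implies for processing, here is a minimal sketch that sorts its lines into URLs, folders and files. The function name `classify_entries` and the classification rules (URL scheme check, trailing slash or existing directory) are assumptions for illustration and may not match how `whisply` resolves entries.

```python
from pathlib import Path
from typing import Dict, List
from urllib.parse import urlparse


def classify_entries(lines: List[str]) -> Dict[str, List[str]]:
    """Sort the lines of a .list document into URLs, folders and files."""
    groups: Dict[str, List[str]] = {"urls": [], "folders": [], "files": []}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        if urlparse(entry).scheme in ("http", "https"):
            groups["urls"].append(entry)
        elif entry.endswith(("/", "\\")) or Path(entry).is_dir():
            groups["folders"].append(entry)
        else:
            groups["files"].append(entry)
    return groups
```

With the example list above, the two `.mp4` entries would be grouped as files, `./my_files/` as a folder and the YouTube link as a URL.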

#### Using config files for batch processing

You can provide a `.json` config file by using the `--config` option which makes batch processing easy. An example config looks like this:

```markdown
{
  "files": "./files/my_files.list",            # Path to your files
  "output_dir": "./transcriptions",            # Output folder where transcriptions are saved
  "device": "auto",                            # AUTO, GPU, MPS or CPU
  "model": "large-v3-turbo",                   # Whisper model to use
  "lang": null,                                # Null for auto-detection or language codes ("en", "de", ...)
  "annotate": false,                           # Annotate speakers
  "num_speakers": null,                        # Number of speakers of the input file (null: auto-detection)
  "hf_token": "HuggingFace Access Token",      # Your HuggingFace Access Token (needed for annotations)
  "subtitle": false,                           # Subtitle file(s)
  "sub_length": 10,                            # Length of each subtitle block in number of words
  "translate": false,                          # Translate to English
  "export": "txt",                             # Export .txts only
  "verbose": false,                            # Print transcription segments while processing
  "del_originals": false,                      # Delete original input files after file conversion
  "post_correction": "my_corrections.yaml"     # Apply post correction with specified patterns in .yaml
}
```
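
Strict JSON has no comment syntax, so the `#` annotations in the example above are explanatory and would need to be removed for a standard JSON parser to accept the file. One way to avoid syntax slips such as a missing comma is to generate the config from Python, as in this minimal sketch. The output file name `whisply_config.json` is hypothetical; the keys and values are taken from the example above.

```python
import json
from pathlib import Path

config = {
    "files": "./files/my_files.list",
    "output_dir": "./transcriptions",
    "device": "auto",
    "model": "large-v3-turbo",
    "lang": None,  # serialized as null (auto-detection)
    "annotate": False,
    "num_speakers": None,  # serialized as null (auto-detection)
    "hf_token": "HuggingFace Access Token",
    "subtitle": False,
    "sub_length": 10,
    "translate": False,
    "export": "txt",
    "verbose": False,
    "del_originals": False,
    "post_correction": "my_corrections.yaml",
}

config_path = Path("whisply_config.json")  # hypothetical file name
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Round-trip check: the written file parses as strict JSON.
assert json.loads(config_path.read_text(encoding="utf-8")) == config
```

The resulting file can then be passed to the CLI with `whisply --config whisply_config.json`.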
