https://github.com/tsmdt/whisply
๐ฌ Transcribe, translate, diarize, annotate and subtitle video (and audio) with Whisper on Win, Linux and Mac ... fast!
https://github.com/tsmdt/whisply
asr automatic-speech-recognition speech-recognition speech-to-text subtitles transcription-tool whisper-ai
Last synced: about 1 year ago
JSON representation
๐ฌ Transcribe, translate, diarize, annotate and subtitle video (and audio) with Whisper on Win, Linux and Mac ... fast!
- Host: GitHub
- URL: https://github.com/tsmdt/whisply
- Owner: tsmdt
- License: apache-2.0
- Created: 2024-04-30T07:58:03.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-28T09:17:12.000Z (about 1 year ago)
- Last Synced: 2025-04-02T10:14:43.875Z (about 1 year ago)
- Topics: asr, automatic-speech-recognition, speech-recognition, speech-to-text, subtitles, transcription-tool, whisper-ai
- Language: Python
- Homepage:
- Size: 4 MB
- Stars: 38
- Watchers: 1
- Forks: 11
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-mlx - whisply - platform CLI and GUI for batch transcription, translation, speaker annotation and subtitle generation using OpenAIโs Whisper on CPU, Nvidia GPU and Apple MLX. (Audio & Speech)
- Awesome-Whisper-Apps - whisply - square) - Automatic subtitle generation (Linux) (By Use Case / Subtitles & Captioning)
README
# whisply
[](https://badge.fury.io/py/whisply)

*Transcribe, translate, annotate and subtitle audio and video files with OpenAI's [Whisper](https://github.com/openai/whisper) ... fast!*
`whisply` combines [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) to offer an easy-to-use solution for batch processing files on Windows, Linux and Mac. It also enables word-level speaker annotation by integrating [whisperX](https://github.com/m-bain/whisperX) and [pyannote](https://github.com/pyannote/pyannote-audio).
## Table of contents
- [Features](#features)
- [Requirements](#requirements)
- [Installation](#installation)
- [Install `ffmpeg`](#install-ffmpeg)
- [Installation with `pip`](#installation-with-pip)
- [Installation from `source`](#installation-from-source)
- [Nvidia GPU fix for Linux users (March 2025)](#nvidia-gpu-fix-for-linux-users-march-2025)
- [Usage](#usage)
- [CLI](#cli)
- [App](#app)
- [Speaker annotation and diarization](#speaker-annotation-and-diarization)
- [Requirements](#requirements-1)
- [How speaker annotation works](#how-speaker-annotation-works)
- [Post correction](#post-correction)
- [Batch processing](#batch-processing)
- [Using config files for batch processing](#using-config-files-for-batch-processing)
## Features
* ๐ดโโ๏ธ **Performance**: `whisply` selects the fastest Whisper implementation based on your hardware:
* CPU/GPU (Nvidia CUDA): `fast-whisper` or `whisperX`
* MPS (Apple M1-M4): `insanely-fast-whisper`
* โฉ **large-v3-turbo Ready**: Support for [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) on all devices. **Note**: Subtitling and annotations on CPU/GPU use `whisperX` for accurate timestamps, but `whisper-large-v3-turbo` isnโt currently available for `whisperX`.
* โ
**Auto Device Selection**: `whisply` automatically chooses `faster-whisper` (CPU) or `insanely-fast-whisper` (MPS, Nvidia GPUs) for transcription and translation unless a specific `--device` option is passed.
* ๐ฃ๏ธ **Word-level Annotations**: Enabling `--subtitle` or `--annotate` uses `whisperX` or `insanely-fast-whisper` for word segmentation and speaker annotations. `whisply` approximates missing timestamps for numeric words.
* ๐ฌ **Customizable Subtitles**: Specify words per subtitle block (e.g., "5") to generate `.srt` and `.webvtt` files with fixed word counts and timestamps.
* ๐งบ **Batch Processing**: Handle single files, folders, URLs, or lists via `.list` documents. See the [Batch processing](#batch-processing) section for details.
* ๐ฉโ๐ป **CLI / App**: `whisply` can be run directly from CLI or as an app with a graphical user-interface (GUI).
* โ๏ธ **Export Formats**: Supports `.json`, `.txt`, `.txt (annotated)`, `.srt`, `.webvtt`, `.vtt`, `.rttm` and `.html` (compatible with [noScribe's editor](https://github.com/kaixxx/noScribe)).
## Requirements
* [FFmpeg](https://ffmpeg.org/)
* \>= Python3.10
* GPU processing requires:
* Nvidia GPU (CUDA: cuBLAS and cuDNN for CUDA 12)
* Apple Metal Performance Shaders (MPS) (Mac M1-M4)
* Speaker annotation requires a [HuggingFace Access Token](https://huggingface.co/docs/hub/security-tokens)
## Installation
### Install `ffmpeg`
```shell
# --- macOS ---
brew install ffmpeg
# --- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg
# --- Windows ---
winget install Gyan.FFmpeg
```
For more information you can visit the [FFmpeg website](https://ffmpeg.org/download.html).
### Installation with `pip`
1. Create a Python virtual environment
```shell
python3 -m venv venv
```
2. Activate the environment
```shell
# --- Linux & macOS ---
source venv/bin/activate
# --- Windows ---
venv\Scripts\activate
```
3. Install whisply
```shell
pip install whisply
```
### Installation from `source`
1. Clone this repository
```shell
git clone https://github.com/tsmdt/whisply.git
```
2. Change to project folder
```shell
cd whisply
```
3. Create a Python virtual environment
```shell
python3 -m venv venv
```
4. Activate the Python virtual environment
```shell
# --- Linux & macOS ---
source venv/bin/activate
# --- Windows ---
venv\Scripts\activate
```
5. Install whisply
```shell
pip install .
```
### Nvidia GPU fix for Linux users (March 2025)
Could not load library libcudnn_ops_infer.so.8. (click to expand)
If you use whisply on a Linux system with a Nvidia GPU and encounter this error:
```shell
"Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory"
```
Use the following steps to fix the issue:
1. In your activated python environment run `pip list` and check that `torch==2.4.0` and `torchaudio==2.4.0` are installed.
2. If yes, run `pip install ctranslate2==4.5.0`. Otherwise install `torch==2.4.0` and `torchaudio==2.4.0` using pip first.
3. Export the following environment variable to your shell:
```shell
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
4. Add this line to your Python environment to make it permanent:
```shell
echo "export LD_LIBRARY_PATH=\`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'\`" >> path/to/your/python/env/bin/activate
```
Find additional information at faster-whisper's GitHub page.
## Usage
### CLI
```shell
$ whisply
Usage: whisply [OPTIONS]
WHISPLY ๐ฌ Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --files -f TEXT Path to file, folder, URL or .list to process. [default: None] โ
โ --output_dir -o DIRECTORY Folder where transcripts should be saved. [default: transcriptions] โ
โ --device -d [auto|cpu|gpu|mps] Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4). โ
โ [default: auto] โ
โ --model -m TEXT Whisper model to use (List models via --list_models). [default: large-v3-turbo] โ
โ --lang -l TEXT Language of provided file(s) ("en", "de") (Default: auto-detection). โ
โ [default: None] โ
โ --annotate -a Enable speaker annotation (Saves .rttm | Default: False). โ
โ --num_speakers -num INTEGER Number of speakers to annotate (Default: auto-detection). [default: None] โ
โ --hf_token -hf TEXT HuggingFace Access token required for speaker annotation. [default: None] โ
โ --subtitle -s Create subtitles (Saves .srt, .vtt and .webvtt | Default: False). โ
โ --sub_length INTEGER Subtitle segment length in words. [default: 5] โ
โ --translate -t Translate transcription to English (Default: False). โ
โ --export -e [all|json|txt|rttm|vtt|webvtt|srt|html] Choose the export format. [default: all] โ
โ --verbose -v Print text chunks during transcription (Default: False). โ
โ --del_originals -del Delete original input files after file conversion. (Default: False) โ
โ --config PATH Path to configuration file. [default: None] โ
โ --post_correction -post PATH Path to YAML file for post-correction. [default: None] โ
โ --launch_app -app Launch the web app instead of running standard CLI commands. โ
โ --list_models List available models. โ
โ --install-completion Install completion for the current shell. โ
โ --show-completion Show completion for the current shell, to copy it or customize the installation. โ
โ --help Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
```
### App
Instead of running `whisply` from the CLI you can start the web app instead:
```shell
$ whisply --launch_app
```
or:
```shell
$ whisply -app
```
Open the local URL in your browser after starting the app (**Note**: The URL might differ from system to system):
```shell
* Running on local URL: http://127.0.0.1:7860
```
### Speaker annotation and diarization
#### Requirements
In order to annotate speakers using `--annotate` you need to provide a valid [HuggingFace](https://huggingface.co) access token using the `--hf_token` option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the `pyannote` segmentation model.
For detailed instructions, refer to the *Requirements* section on the [pyannote model page on HuggingFace](https://huggingface.co/pyannote/speaker-diarization-3.1#requirements) and make sure that you complete steps *"2. Accept pyannote/segmentation-3.0 user conditions"*, *"3. Accept pyannote/speaker-diarization-3.1 user conditions"* and *"4. Create access token at hf.co/settings/tokens"*.
#### How speaker annotation works
`whisply` uses [whisperX](https://github.com/m-bain/whisperX) for speaker diarization and annotation. Instead of returning chunk-level timestamps like the standard `Whisper` implementation `whisperX` is able to return word-level timestamps as well as annotating speakers word by word, thus returning much more precise annotations.
Out of the box `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"): `whisply` fixes those instances through timestamp approximation. Other known limitations of `whisperX` include:
* inaccurate speaker diarization if multiple speakers talk at the same time
* to provide word-level timestamps and annotations `whisperX` uses language specific alignment models; out of the box `whisperX` supports these languages: `en, fr, de, es, it, ja, zh, nl, uk, pt`.
Refer to the [whisperX GitHub page](https://github.com/m-bain/whisperX) for more information.
### Post correction
The `--post_correction` option allows you to correct various transcription errors that you may find in your files. The option takes as argument a `.yaml` file with the following structure:
```yaml
# Single word corrections
Gardamer: Gadamer
# Pattern-based corrections
patterns:
- pattern: 'Klaus-(Cira|Cyra|Tira)-Stiftung'
replacement: 'Klaus Tschira Stiftung'
```
- **Single word corrections**: matches single words โ `wrong word`: `correct word`
- **Pattern-based corrections**: matches patterns โ `(Cira|Cyra|Tira)` will look for `Klaus-Cira-Stiftung`, `Klaus-Cyra-Stiftung` and / or `Klaus-Tira-Stiftung` and replaces it with `Klaus-Tschirra-Stiftung`
Post correction will be applied to **all** export file formats you choose.
### Batch processing
Instead of providing a file, folder or URL by using the `--files` option you can pass a `.list` with a mix of files, folders and URLs for processing.
Example:
```shell
$ cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
```
#### Using config files for batch processing
You can provide a `.json` config file by using the `--config` option which makes batch processing easy. An example config looks like this:
```markdown
{
"files": "./files/my_files.list", # Path to your files
"output_dir": "./transcriptions", # Output folder where transcriptions are saved
"device": "auto", # AUTO, GPU, MPS or CPU
"model": "large-v3-turbo", # Whisper model to use
"lang": null, # Null for auto-detection or language codes ("en", "de", ...)
"annotate": false, # Annotate speakers
"num_speakers": null, # Number of speakers of the input file (null: auto-detection)
"hf_token": "HuggingFace Access Token", # Your HuggingFace Access Token (needed for annotations)
"subtitle": false, # Subtitle file(s)
"sub_length": 10, # Length of each subtitle block in number of words
"translate": false, # Translate to English
"export": "txt", # Export .txts only
"verbose": false # Print transcription segments while processing
"del_originals": false, # Delete original input files after file conversion
"post_correction": "my_corrections.yaml" # Apply post correction with specified patterns in .yaml
}
```