Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/softcatala/whisper-ctranslate2
Whisper command line client compatible with original OpenAI client based on CTranslate2.
https://github.com/softcatala/whisper-ctranslate2
openai- openai-whisper speech-recognition speech-to-text whisper
Last synced: 5 days ago
JSON representation
Whisper command line client compatible with original OpenAI client based on CTranslate2.
- Host: GitHub
- URL: https://github.com/softcatala/whisper-ctranslate2
- Owner: Softcatala
- License: mit
- Created: 2023-03-17T10:19:45.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-25T20:42:13.000Z (3 months ago)
- Last Synced: 2024-10-29T15:24:52.819Z (3 months ago)
- Topics: openai-, openai-whisper, speech-recognition, speech-to-text, whisper
- Language: Python
- Homepage:
- Size: 1.12 MB
- Stars: 903
- Watchers: 24
- Forks: 76
- Open Issues: 12
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ChatGPT-repositories - whisper-ctranslate2 - Whisper command line client compatible with original OpenAI client based on CTranslate2. (CLIs)
README
[![PyPI version](https://img.shields.io/pypi/v/whisper-ctranslate2.svg?logo=pypi&logoColor=FFE873)](https://pypi.org/project/whisper-ctranslate2/)
[![PyPI downloads](https://img.shields.io/pypi/dm/whisper-ctranslate2.svg)](https://pypistats.org/packages/whisper-ctranslate2)# Introduction
Whisper command line client compatible with original [OpenAI client](https://github.com/openai/whisper) based on CTranslate2.
It uses [CTranslate2](https://github.com/OpenNMT/CTranslate2/) and [Faster-whisper](https://github.com/SYSTRAN/faster-whisper) Whisper implementation that is up to 4 times faster than openai/whisper for the same accuracy while using less memory.
Goals of the project:
* Provide an easy way to use the CTranslate2 Whisper implementation
* Ease the migration for people using OpenAI Whisper CLI# 🚀 **NEW PROJECT LAUNCHED!** 🚀
**Open dubbing** is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages ! 🎉
### **🔥 Check it out now: [*open-dubbing*](https://github.com/jordimas/open-dubbing) 🔥**
# Installation
To install the latest stable version, just type:
pip install -U whisper-ctranslate2
Alternatively, if you are interested in the latest development (non-stable) version from this repository, just type:
pip install git+https://github.com/Softcatala/whisper-ctranslate2
# Using prebuild Docker image
You can use build docker image. First pull the image:
docker pull ghcr.io/softcatala/whisper-ctranslate2:latest
The Docker image includes the small, medium" and large-v2.
To run it:
docker run --gpus "device=0" \
-v "$(pwd)":/srv/files/ \
-it ghcr.io/softcatala/whisper-ctranslate2:latest \
/srv/files/e2e-tests/gossos.mp3 \
--output_dir /srv/files/
Notes:
* _--gpus "device=0"_ gives access to the GPU. If you do not have a GPU, remove this.
* _"$(pwd)":/srv/files/_ maps your current directory to /srv/files/ inside the container# CPU and GPU support
GPU and CPU support are provided by [CTranslate2](https://github.com/OpenNMT/CTranslate2/).
It has compatibility with x86-64 and AArch64/ARM64 CPU and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.
GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the [CTranslate2 documentation](https://opennmt.net/CTranslate2/installation.html)
By default the best hardware available is selected for inference. You can use the options `--device` and `--device_index` to control manually the selection.
# UsageSame command line as OpenAI Whisper.
To transcribe:
whisper-ctranslate2 inaguracio2011.mp3 --model medium
To translate:
whisper-ctranslate2 inaguracio2011.mp3 --model medium --task translate
Whisper translate task translates the transcription from the source language to English (the only target language supported).
Additionally using:
whisper-ctranslate2 --help
All the supported options with their help are shown.
# CTranslate2 specific options
On top of the OpenAI Whisper command line options, there are some specific options provided by CTranslate2 or whiper-ctranslate2.
## Batched inference
Batched inference transcribes each segment in-dependently which can provide an additional 2x-4x speed increase:
whisper-ctranslate2 inaguracio2011.mp3 --batched True
You can additionally use the --batch_size to specify the maximum number of parallel requests to model for decoding.Batched inference uses Voice Activity Detection (VAD) filter and ignores the following paramters: compression_ratio_threshold, logprob_threshold,
no_speech_threshold, condition_on_previous_text, prompt_reset_on_temperature, prefix, hallucination_silence_threshold.## Quantization
`--compute_type` option which accepts _default,auto,int8,int8_float16,int16,float16,float32_ values indicates the type of [quantization](https://opennmt.net/CTranslate2/quantization.html) to use. On CPU _int8_ will give the best performance:
whisper-ctranslate2 myfile.mp3 --compute_type int8
## Loading the model from a directory
`--model_directory` option allows to specify the directory from which you want to load a CTranslate2 Whisper model. For example, if you want to load your own quantified [Whisper model](https://opennmt.net/CTranslate2/conversion.html) version or using your own [Whisper fine-tunned](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event) version. The model must be in CTranslate2 format.
## Using Voice Activity Detection (VAD) filter
`--vad_filter` option enables the voice activity detection (VAD) to filter out parts of the audio without speech. This step uses the [Silero VAD model](https://github.com/snakers4/silero-vad):
whisper-ctranslate2 myfile.mp3 --vad_filter True
The VAD filter accepts multiple additional options to determine the filter behavior:
--vad_onset VALUE (float)
Probabilities above this value are considered as speech.
--vad_min_speech_duration_ms (int)
Final speech chunks shorter min_speech_duration_ms are thrown out.
--vad_max_speech_duration_s VALUE (int)
Maximum duration of speech chunks in seconds. Longer will be split at the timestamp of the last silence.
## Print colors
`--print_colors True` options prints the transcribed text using an experimental color coding strategy based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to highlight words with high or low confidence:
whisper-ctranslate2 myfile.mp3 --print_colors True
## Live transcribe from your microphone
`--live_transcribe True` option activates the live transcription mode from your microphone:
whisper-ctranslate2 --live_transcribe True --language en
https://user-images.githubusercontent.com/309265/231533784-e58c4b92-e9fb-4256-b4cd-12f1864131d9.mov
## Diarization (speaker identification)
There is experimental diarization support using [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) to identify speakers. At the moment, the support is at segment level.
To enable diarization you need to follow these steps:
1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) with `pip install pyannote.audio`
2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3. Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).And then execute passing the HuggingFace API token as parameter to enable diarization:
whisper-ctranslate2 --hf_token YOUR_HF_TOKEN
and then the name of the speaker is added in the output files (e.g. JSON, VTT and STR files):
_[SPEAKER_00]: There is a lot of people in this room_
The option `--speaker_name SPEAKER_NAME` allows to use your own string to identify the speaker.
# Need help?
Check our [frequently asked questions](FAQ.md) for common questions.
# Contact
Jordi Mas