{"id":13456668,"url":"https://github.com/SYSTRAN/faster-whisper","last_synced_at":"2025-03-24T11:30:56.548Z","repository":{"id":65820003,"uuid":"600368121","full_name":"SYSTRAN/faster-whisper","owner":"SYSTRAN","description":"Faster Whisper transcription with CTranslate2","archived":false,"fork":false,"pushed_at":"2025-01-01T14:45:17.000Z","size":38359,"stargazers_count":14841,"open_issues_count":243,"forks_count":1253,"subscribers_count":135,"default_branch":"master","last_synced_at":"2025-03-18T00:37:23.419Z","etag":null,"topics":["deep-learning","inference","openai","quantization","speech-recognition","speech-to-text","transformer","whisper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SYSTRAN.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-11T09:17:27.000Z","updated_at":"2025-03-17T23:27:58.000Z","dependencies_parsed_at":"2024-04-02T17:34:29.572Z","dependency_job_id":"8642048b-fc76-4e7f-a4c9-90089faae8b9","html_url":"https://github.com/SYSTRAN/faster-whisper","commit_stats":{"total_commits":214,"total_committers":44,"mean_commits":4.863636363636363,"dds":0.5981308411214954,"last_synced_commit":"d57c5b40b06e59ec44240d93485a95799548af50"},"previous_names":["guillaumekln/faster-whisper"],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SYSTRAN%2Ffaster-whisper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SYSTRAN%2Ffaster-whisper/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SYSTRAN%2Ffaster-whisper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SYSTRAN%2Ffaster-whisper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SYSTRAN","download_url":"https://codeload.github.com/SYSTRAN/faster-whisper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245260738,"owners_count":20586450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","inference","openai","quantization","speech-recognition","speech-to-text","transformer","whisper"],"created_at":"2024-07-31T08:01:25.756Z","updated_at":"2025-03-24T11:30:56.533Z","avatar_url":"https://github.com/SYSTRAN.png","language":"Python","funding_links":[],"categories":["Python","Speech-to-Text (STT)","Repos","Tools \u0026 Frameworks","STT (Speech-to-Text) | 语音转文本","Related Projects","Artificial Intelligence","Speech Processing","Tools\u003ca id=\"tool\"\u003e\u003c/a\u003e","2. Open Foundation Models","1. Local Agents","Subtitle and Localization","AI Caption \u0026 Subtitle Tools","For Developers","Audio Transcription","3. 
Speech-to-text (STT / ASR)"],"sub_categories":["Open-Source Models \u0026 Libraries","Open-source projects","Open Source STT Models | 开源 STT 模型","Android Launcher","Speech-to-Text","Others\u003ca id=\"paper11\"\u003e\u003c/a\u003e","Audio / Voice Agents","Open-Source Transcription","Model Variants \u0026 Performance Optimizations","Context-Relevant MCP Servers","Open source"],"readme":"[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)\n\n# Faster Whisper transcription with CTranslate2\n\n**faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.\n\nThis implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper) for the same accuracy while using less memory. 
The efficiency can be further improved with 8-bit quantization on both CPU and GPU.\n\n## Benchmark\n\n### Whisper\n\nFor reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:\n\n* [openai/whisper](https://github.com/openai/whisper)@[v20240930](https://github.com/openai/whisper/tree/v20240930)\n* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[v1.7.2](https://github.com/ggerganov/whisper.cpp/tree/v1.7.2)\n* [transformers](https://github.com/huggingface/transformers)@[v4.46.3](https://github.com/huggingface/transformers/tree/v4.46.3)\n* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[v1.1.0](https://github.com/SYSTRAN/faster-whisper/tree/v1.1.0)\n\n### Large-v2 model on GPU\n\n| Implementation | Precision | Beam size | Time | VRAM Usage |\n| --- | --- | --- | --- | --- |\n| openai/whisper | fp16 | 5 | 2m23s | 4708MB |\n| whisper.cpp (Flash Attention) | fp16 | 5 | 1m05s | 4127MB |\n| transformers (SDPA)[^1] | fp16 | 5 | 1m52s | 4960MB |\n| faster-whisper | fp16 | 5 | 1m03s | 4525MB |\n| faster-whisper (`batch_size=8`) | fp16 | 5 | 17s | 6090MB |\n| faster-whisper | int8 | 5 | 59s | 2926MB |\n| faster-whisper (`batch_size=8`) | int8 | 5 | 16s | 4500MB |\n\n### distil-whisper-large-v3 model on GPU\n\n| Implementation | Precision | Beam size | Time | YT Commons WER |\n| --- | --- | --- | --- | --- |\n| transformers (SDPA) (`batch_size=16`) | fp16 | 5 | 46m12s | 14.801 |\n| faster-whisper (`batch_size=16`) | fp16 | 5 | 25m50s | 13.527 |\n\n*GPU Benchmarks are Executed with CUDA 12.4 on a NVIDIA RTX 3070 Ti 8GB.*\n[^1]: transformers OOM for any batch size \u003e 1\n\n### Small model on CPU\n\n| Implementation | Precision | Beam size | Time | RAM Usage |\n| --- | --- | --- | --- | --- |\n| openai/whisper | fp32 | 5 | 6m58s | 2335MB |\n| whisper.cpp | fp32 | 5 | 2m05s | 1049MB |\n| whisper.cpp (OpenVINO) | fp32 | 5 | 1m45s | 
1642MB |\n| faster-whisper | fp32 | 5 | 2m37s | 2257MB |\n| faster-whisper (`batch_size=8`) | fp32 | 5 | 1m06s | 4230MB |\n| faster-whisper | int8 | 5 | 1m42s | 1477MB |\n| faster-whisper (`batch_size=8`) | int8 | 5 | 51s | 3608MB |\n\n*Executed with 8 threads on an Intel Core i7-12700K.*\n\n\n## Requirements\n\n* Python 3.9 or greater\n\nUnlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV), which bundles the FFmpeg libraries in its package.\n\n### GPU\n\nGPU execution requires the following NVIDIA libraries to be installed:\n\n* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)\n* [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)\n\n**Note**: The latest versions of `ctranslate2` only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, the current workaround is downgrading to the `3.24.0` version of `ctranslate2`; for CUDA 12 and cuDNN 8, downgrade to the `4.4.0` version of `ctranslate2`. (This can be done with `pip install --force-reinstall ctranslate2==4.4.0` or by specifying the version in a `requirements.txt`.)\n\nThere are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below. \n\n\u003cdetails\u003e\n\u003csummary\u003eOther installation methods (click to expand)\u003c/summary\u003e\n\n\n**Note:** For all the methods below, keep in mind the above note regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of libraries that correspond to the CUDA 12 libraries listed in the instructions below.\n\n#### Use Docker\n\nThe libraries (cuBLAS, cuDNN) are installed in this official NVIDIA CUDA Docker image: `nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04`.\n\n#### Install with `pip` (Linux only)\n\nOn Linux these libraries can be installed with `pip`. 
Note that `LD_LIBRARY_PATH` must be set before launching Python.\n\n```bash\npip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*\n\nexport LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'`\n```\n\n#### Download the libraries from Purfview's repository (Windows \u0026 Linux)\n\nPurfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows \u0026 Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.\n\n\u003c/details\u003e\n\n## Installation\n\nThe module can be installed from [PyPI](https://pypi.org/project/faster-whisper/):\n\n```bash\npip install faster-whisper\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eOther installation methods (click to expand)\u003c/summary\u003e\n\n### Install the master branch\n\n```bash\npip install --force-reinstall \"faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz\"\n```\n\n### Install a specific commit\n\n```bash\npip install --force-reinstall \"faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz\"\n```\n\n\u003c/details\u003e\n\n## Usage\n\n### Faster-whisper\n\n```python\nfrom faster_whisper import WhisperModel\n\nmodel_size = \"large-v3\"\n\n# Run on GPU with FP16\nmodel = WhisperModel(model_size, device=\"cuda\", compute_type=\"float16\")\n\n# or run on GPU with INT8\n# model = WhisperModel(model_size, device=\"cuda\", compute_type=\"int8_float16\")\n# or run on CPU with INT8\n# model = WhisperModel(model_size, device=\"cpu\", compute_type=\"int8\")\n\nsegments, info = model.transcribe(\"audio.mp3\", beam_size=5)\n\nprint(\"Detected language '%s' with probability %f\" % 
(info.language, info.language_probability))\n\nfor segment in segments:\n    print(\"[%.2fs -\u003e %.2fs] %s\" % (segment.start, segment.end, segment.text))\n```\n\n**Warning:** `segments` is a *generator*, so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or iterating over them in a `for` loop:\n\n```python\nsegments, _ = model.transcribe(\"audio.mp3\")\nsegments = list(segments)  # The transcription will actually run here.\n```\n\n### Batched Transcription\nThe following code snippet illustrates how to run batched transcription on an example audio file. `BatchedInferencePipeline.transcribe` is a drop-in replacement for `WhisperModel.transcribe`.\n\n```python\nfrom faster_whisper import WhisperModel, BatchedInferencePipeline\n\nmodel = WhisperModel(\"turbo\", device=\"cuda\", compute_type=\"float16\")\nbatched_model = BatchedInferencePipeline(model=model)\nsegments, info = batched_model.transcribe(\"audio.mp3\", batch_size=16)\n\nfor segment in segments:\n    print(\"[%.2fs -\u003e %.2fs] %s\" % (segment.start, segment.end, segment.text))\n```\n\n### Faster Distil-Whisper\n\nThe Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)\ncheckpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. 
The following code snippet \ndemonstrates how to run inference with distil-large-v3 on a specified audio file:\n\n```python\nfrom faster_whisper import WhisperModel\n\nmodel_size = \"distil-large-v3\"\n\nmodel = WhisperModel(model_size, device=\"cuda\", compute_type=\"float16\")\nsegments, info = model.transcribe(\"audio.mp3\", beam_size=5, language=\"en\", condition_on_previous_text=False)\n\nfor segment in segments:\n    print(\"[%.2fs -\u003e %.2fs] %s\" % (segment.start, segment.end, segment.text))\n```\n\nFor more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).\n\n### Word-level timestamps\n\n```python\nsegments, _ = model.transcribe(\"audio.mp3\", word_timestamps=True)\n\nfor segment in segments:\n    for word in segment.words:\n        print(\"[%.2fs -\u003e %.2fs] %s\" % (word.start, word.end, word.word))\n```\n\n### VAD filter\n\nThe library integrates the [Silero VAD](https://github.com/snakers4/silero-vad) model to filter out parts of the audio without speech:\n\n```python\nsegments, _ = model.transcribe(\"audio.mp3\", vad_filter=True)\n```\n\nThe default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). 
They can be customized with the dictionary argument `vad_parameters`:\n\n```python\nsegments, _ = model.transcribe(\n    \"audio.mp3\",\n    vad_filter=True,\n    vad_parameters=dict(min_silence_duration_ms=500),\n)\n```\nThe VAD filter is enabled by default for batched transcription.\n\n### Logging\n\nThe library logging level can be configured like this:\n\n```python\nimport logging\n\nlogging.basicConfig()\nlogging.getLogger(\"faster_whisper\").setLevel(logging.DEBUG)\n```\n\n### Going further\n\nSee more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.\n\n## Community integrations\n\nHere is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!\n\n\n* [speaches](https://github.com/speaches-ai/speaches) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, and supports streaming and live transcription.\n* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment.\n* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.\n* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.\n* [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides standalone CLI executables of faster-whisper for Windows, Linux \u0026 macOS. 
 \n* [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end-to-end multi-speaker speech-to-text solution implemented using AzureML pipelines.\n* [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.\n* [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper. It can export word-level transcripts, which can then be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor).\n* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.\n* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. 
It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.\n* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.\n* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.\n* [Open-dubbing](https://github.com/softcatala/open-dubbing) is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages.\n\n## Model conversion\n\nWhen loading a model from its size such as `WhisperModel(\"large-v3\")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).\n\nWe also provide a script to convert any Whisper model compatible with the Transformers library. These can be the original OpenAI models or user fine-tuned models.\n\nFor example, the command below converts the [original \"large-v3\" Whisper model](https://huggingface.co/openai/whisper-large-v3) and saves the weights in FP16:\n\n```bash\n# Quote the requirement so the shell does not treat \"\u003e=\" as a redirection\npip install \"transformers[torch]\u003e=4.23\"\n\nct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \\\n    --copy_files tokenizer.json preprocessor_config.json --quantization float16\n```\n\n* The option `--model` accepts a model name on the Hub or a path to a model directory.\n* If the option `--copy_files tokenizer.json` is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.\n\nModels can also be converted from the code. See the [conversion API](https://opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html).\n\n### Load a converted model\n\n1. 
Directly load the model from a local directory:\n```python\nimport faster_whisper\n\nmodel = faster_whisper.WhisperModel(\"whisper-large-v3-ct2\")\n```\n\n2. [Upload your model to the Hugging Face Hub](https://huggingface.co/docs/transformers/model_sharing#upload-with-the-web-interface) and load it from its name:\n```python\nimport faster_whisper\n\nmodel = faster_whisper.WhisperModel(\"username/whisper-large-v3-ct2\")\n```\n\n## Comparing performance against other implementations\n\nIf you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:\n\n* Verify that the same transcription options are used, especially the same beam size. For example, in openai/whisper, `model.transcribe` uses a default beam size of 1 but here we use a default beam size of 5.\n* Transcription speed is strongly affected by the number of words in the transcript, so ensure that other implementations have a similar WER (Word Error Rate) to this one.\n* When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable `OMP_NUM_THREADS`, which can be set when running your script:\n\n```bash\nOMP_NUM_THREADS=4 python3 my_script.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSYSTRAN%2Ffaster-whisper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSYSTRAN%2Ffaster-whisper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSYSTRAN%2Ffaster-whisper/lists"}