{"id":20472017,"url":"https://github.com/wtlow003/auto-subtitles","last_synced_at":"2025-04-13T11:10:16.857Z","repository":{"id":223459151,"uuid":"760378341","full_name":"wtlow003/auto-subtitles","owner":"wtlow003","description":"CLI tool to transcribe (+ translate) videos and embed subtitles automatically.","archived":false,"fork":false,"pushed_at":"2024-06-12T14:12:13.000Z","size":51413,"stargazers_count":5,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T02:21:12.529Z","etag":null,"topics":["faster-whisper","nllb","subtitles","subtitles-generator","translation","whisper","whisper-cpp"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wtlow003.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-20T10:02:22.000Z","updated_at":"2025-02-04T10:21:19.000Z","dependencies_parsed_at":"2024-06-12T20:00:05.046Z","dependency_job_id":"5672b41f-5e6d-4811-99f1-a0bba7a1f674","html_url":"https://github.com/wtlow003/auto-subtitles","commit_stats":null,"previous_names":["wtlow003/auto-subtitles"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtlow003%2Fauto-subtitles","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtlow003%2Fauto-subtitles/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtlow003%2Fauto-subtitles/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/wtlow003%2Fauto-subtitles/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wtlow003","download_url":"https://codeload.github.com/wtlow003/auto-subtitles/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248703199,"owners_count":21148118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["faster-whisper","nllb","subtitles","subtitles-generator","translation","whisper","whisper-cpp"],"created_at":"2024-11-15T14:17:57.465Z","updated_at":"2025-04-13T11:10:16.837Z","avatar_url":"https://github.com/wtlow003.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e📺 Auto-Subtitles\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://img.shields.io/github/license/wtlow003/auto-subtitles\" alt=\"license\"\u003e\n      \u003cimg src=\"https://img.shields.io/github/last-commit/wtlow003/auto-subtitles\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/python-3.9.10-orange\" alt=\"python version\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=#about\u003eAbout\u003c/a\u003e •\n    \u003ca href=#features\u003eFeatures\u003c/a\u003e •\n    \u003ca href=#installation\u003eInstallation\u003c/a\u003e •\n    \u003ca href=#usage\u003eUsage\u003c/a\u003e\n\u003c/p\u003e\n\n![banner](/assets/banner-dall-e.jpeg)\n\n## About\n\n**Auto-Subtitles** is a CLI tool that automatically generates and embeds subtitles for any YouTube video. 
It can also translate the transcript before the subtitles are embedded in the output video.\n\n### Why Should You Use It?\n\nBefore advances in automatic speech recognition (ASR), transcription was often seen as a tedious manual task that required meticulous attention to the audio.\n\nI studied and interned in the film and media industry prior to working as a Machine Learning/Platform Engineer. I was involved in several productions that required manually transcribing audio and overlaying subtitles in video editing software for various advertisements and commercials.\n\nOpenAI's [Whisper](https://github.com/openai/whisper) models have attracted considerable interest from developers because they are easy to run locally and achieve [high](https://www.speechly.com/blog/analyzing-open-ais-whisper-asr-models-word-error-rates-across-languages) accuracy in languages such as English. They soon became a viable (free) drop-in replacement for professional (paid) transcription services.\n\nWhile far from perfect, **Auto-Subtitles** generates transcriptions on your local machine and is easy to set up and use from the get-go. The CLI tool can serve as a starting point in the subtitling process by generating a first draft of the transcription, which a human can vet and edit before the subtitles are used for the final output. This reduces the time spent scrubbing through audio and typing every word from scratch.\n\n## Features\n\n### Supported Models\n\nCurrently, the auto-subtitles workflow supports the following variant(s) of the Whisper model:\n\n1. [@ggerganov/whisper.cpp](https://github.com/ggerganov/whisper.cpp):\n   - Provides the `whisper-cpp` backend for the workflow.\n   - A C/C++ port of OpenAI's Whisper model. Generates fast transcriptions on a local setup (especially macOS via `MPS`).\n2. 
[@jianfch/stable-ts](https://github.com/jianfch/stable-ts):\n   - Provides the [`faster-whisper`](https://github.com/SYSTRAN/faster-whisper) backend for the workflow, while producing more reliable and accurate timestamps for transcription.\n   - Also includes VAD (voice activity detection) filters to detect voice activity more accurately.\n3. [@Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) [`Experimental`]:\n   - Leverages Flash Attention 2 (or Scaled Dot Product Attention) and batching to improve transcription speed.\n   - Currently works only on GPU setups (`cuda` or `mps`).\n   - Supports only `large`, `large-v2`, and `large-v3` models.\n   - No built-in support for a maximum segment length; the workflow currently uses self-implemented heuristics to adjust segment length.\n\n\n### Translation\n\n**Auto-Subtitles** can also translate transcripts, e.g., `english (en)` to `chinese (zh)`, before embedding subtitles in the output video.\n\nWe opted not to use the Whisper model's built-in translation feature because of observed performance issues and hallucinations in the generated transcripts.\n\nFor a more efficient and reliable translation process, we use Meta AI's [No Language Left Behind (NLLB)](https://ai.meta.com/research/no-language-left-behind/) family of models to translate after transcription.\n\nCurrently, the following models are supported:\n\n1. [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)\n2. [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)\n3. [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)\n4. 
[facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)\n\nBy default, the `facebook/nllb-200-distilled-600M` model is used.\n\n## Installation\n\nYou can set up the dependencies and environment for this project either locally or in a containerised environment with Docker.\n\n### Local Setup\n\n#### Pre-requisites\n\n1. [ffmpeg](https://ffmpeg.org/download.html#build-mac)\n\n   \u003e Alternatively, as referenced from [@openai/whisper](https://github.com/openai/whisper):\n\n   ```shell\n   # on Ubuntu or Debian\n   sudo apt update \u0026\u0026 sudo apt install ffmpeg\n\n   # on Arch Linux\n   sudo pacman -S ffmpeg\n\n   # on MacOS using Homebrew (https://brew.sh/)\n   brew install ffmpeg\n\n   # on Windows using Chocolatey (https://chocolatey.org/)\n   choco install ffmpeg\n\n   # on Windows using Scoop (https://scoop.sh/)\n   scoop install ffmpeg\n   ```\n\n2. [Python 3.9](https://www.python.org/downloads/)\n3. [whisper.cpp](https://github.com/ggerganov/whisper.cpp)\n\n   ```shell\n   # build the binary for usage\n   git clone https://github.com/ggerganov/whisper.cpp.git\n\n   cd whisper.cpp\n   make\n   ```\n\n   - Refer to the [repo](https://github.com/ggerganov/whisper.cpp.git) for other build arguments that may improve performance on your local setup.\n\n#### Python Dependencies\n\nInstall the dependencies in `requirements.txt` into a virtual environment (`venv`):\n\n```shell\npython -m venv .venv\n\n# macOS / Linux\nsource .venv/bin/activate\n\n# install dependencies\npip install --upgrade pip setuptools wheel\npip install -r requirements.txt\n```\n\n### Docker Setup\n\nTo run the workflow using Docker, first build the image:\n\n```bash\n# build the image\ndocker buildx build -t auto-subs .\n```\n\n## Usage\n\n### 
Transcribing\n\nTo run the automatic subtitling process for this [video](https://www.youtube.com/watch?v=fnvZJU5Fj3Q), run the following command (see [Detailed Options](#detailed-options) for all available options):\n\n#### Local\n\n```shell\nchmod +x ./workflow.sh\n\n# quote the URL so the shell does not treat `?` as a glob character\n./workflow.sh -u \"https://www.youtube.com/watch?v=fnvZJU5Fj3Q\" \\\n    -b faster-whisper \\\n    -t 8 \\\n    -m medium \\\n    -ml 47\n```\n\n#### Docker\n\n```bash\n# run the image\ndocker run \\\n   --volume \u003cabsolute-path\u003e:/app/output \\\n   auto-subs \\\n   -u \"https://www.youtube.com/watch?v=fnvZJU5Fj3Q\" \\\n   -b faster-whisper \\\n   -t 8 \\\n   -ml 47\n```\n\nThe above commands run the workflow with the following settings:\n\n1. Using the `faster-whisper` backend\n   - Produces more reliable and accurate timestamps than `whisper-cpp`, using techniques such as `VAD`.\n2. Running on `8` threads for increased performance\n3. Using the [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium) multi-lingual model\n4. 
Limiting each transcription segment to a maximum of [`47`](https://www.capitalcaptions.com/services/subtitle-services-2/capital-captions-standard-subtitling-guidelines/) characters.\n\nThe following is the generated video:\n\u003cvideo src=\"https://github.com/wtlow003/auto-subtitles/assets/61908161/52f3cc5d-d130-4b3b-87ba-acea39a25349\"\u003e\u003c/video\u003e\n\n### Transcribing + Translating\n\nTo run the automatic subtitling process for this [video](https://www.youtube.com/watch?v=DtLJjNyl57M) and generate `Chinese (zh)` subtitles:\n\n#### Local\n\n```shell\nchmod +x ./workflow.sh\n\n./workflow.sh -u \"https://www.youtube.com/watch?v=DtLJjNyl57M\" \\\n    -b whisper-cpp \\\n    -wbp ~/code/whisper.cpp \\\n    -t 8 \\\n    -m medium \\\n    -ml 47 \\\n    -tf \"eng_Latn\" \\\n    -tt \"zho_Hans\"\n```\n\n#### Docker\n\n```bash\n# run the image\ndocker run \\\n   --volume \u003cabsolute-path\u003e:/app/output \\\n   auto-subs \\\n   -u \"https://www.youtube.com/watch?v=DtLJjNyl57M\" \\\n   -b whisper-cpp \\\n   -t 8 \\\n   -ml 47 \\\n   -tf \"eng_Latn\" \\\n   -tt \"zho_Hans\"\n```\n\nThe above commands run the workflow with the following settings:\n\n1. Using the `whisper-cpp` backend\n   - Faster transcription process compared to `faster-whisper`.\n   - However, it may produce a degraded output video, with inaccurate timestamps or subtitles that appear early, before any noticeable voice activity.\n2. Specifying the directory path to the pre-built `whisper.cpp` binary used for transcription.\n3. Running on `8` threads for increased performance\n4. Using the [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium) multi-lingual model\n5. Limiting each transcription segment to a maximum of [`47`](https://www.capitalcaptions.com/services/subtitle-services-2/capital-captions-standard-subtitling-guidelines/) characters.\n6. 
Translating from (`-tf`) **English (eng_Latn)** to (`-tt`) **Chinese (zho_Hans)**, using the `FLORES-200` codes listed [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).\n\nThe following is the generated video:\n\u003cvideo src=\"https://github.com/wtlow003/auto-subtitles/assets/61908161/5787ab5a-da3f-40de-ab11-f898caa1ac2a\"\u003e\u003c/video\u003e\n\n### Detailed Options\n\nTo check all the available options, use the `--help` flag:\n\n```shell\n./workflow.sh --help\n\nUsage: ./workflow.sh [-u \u003cyoutube_video_url\u003e] [options]\nOptions:\n  -u, --url \u003cyoutube_video_url\u003e                       YouTube video URL\n  -o, --output-path \u003coutput_path\u003e                     Output path\n  -b, --backend \u003cbackend\u003e                             Backend to use: whisper-cpp or faster-whisper\n  -wbp, --whisper-bin-path \u003cwhisper_bin_path\u003e         Path to whisper-cpp binary. Required if using [--backend whisper-cpp].\n  -ml, --max-length \u003cmax_length\u003e                      Maximum length of the generated transcript\n  -t, --threads \u003cthreads\u003e                             Number of threads to use\n  -w, --workers \u003cworkers\u003e                             Number of workers to use\n  -m, --model \u003cmodel\u003e                                 Model name to use\n  -tf, --translate-from \u003ctranslate_from\u003e              Translate from language\n  -tt, --translate-to \u003ctranslate_to\u003e                  Translate to language\n  -f, --font \u003cfont\u003e                                   Font to use for subtitles\n```\n\n## [WIP] Performance\n\n\u003e For the `mps` device, performance testing is run on an M2 Max 12/30 (cpu/gpu) cores MacBook Pro (14-inch, 2023).\n\n### Transcription\n\n| Model  | Backend        | Device | Threads | Time Taken |\n| ------ | -------------- | ------ | ------- | ---------- |\n| base   | whisper-cpp    | cpu    | 4       | ~    
      |\n| base   | whisper-cpp    | mps    | 4       | ~          |\n| base   | faster-whisper | cpu    | 4       | ~          |\n| base   | faster-whisper | mps    | 4       | ~          |\n| medium | whisper-cpp    | cpu    | 4       | ~          |\n| medium | whisper-cpp    | mps    | 4       | ~          |\n| medium | faster-whisper | cpu    | 4       | ~          |\n| medium | faster-whisper | mps    | 4       | ~          |\n\n### Transcription + Translation\n\n| Model  | Backend        | Device | Threads | Time Taken |\n| ------ | -------------- | ------ | ------- | ---------- |\n| base   | whisper-cpp    | cpu    | 4       | ~          |\n| base   | whisper-cpp    | mps    | 4       | ~          |\n| base   | faster-whisper | cpu    | 4       | ~          |\n| base   | faster-whisper | mps    | 4       | ~          |\n| medium | whisper-cpp    | cpu    | 4       | ~          |\n| medium | whisper-cpp    | mps    | 4       | ~          |\n| medium | faster-whisper | cpu    | 4       | ~          |\n| medium | faster-whisper | mps    | 4       | ~          |\n\n## Known Issues\n\n1. Korean subtitles are not supported at the moment.\n   - **Details**: The default font used to embed subtitles is `Arial Unicode MS`, which does not provide glyphs for Korean characters.\n   - **Potential Solution**: Add alternate fonts for Korean characters.\n   - **Status**: ✅ `Done`\n\n## Changelog\n\n1. 🗓️ **[24/02/2024]**: Added a `./fonts` folder to host downloaded fonts, which are copied into the Docker container. Once copied, users can specify their desired font with the `-f` or `--font` flag.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwtlow003%2Fauto-subtitles","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwtlow003%2Fauto-subtitles","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwtlow003%2Fauto-subtitles/lists"}