{"id":20981514,"url":"https://github.com/abhirooptalasila/autosub","last_synced_at":"2025-04-04T20:14:31.496Z","repository":{"id":36958383,"uuid":"287200791","full_name":"abhirooptalasila/AutoSub","owner":"abhirooptalasila","description":"A CLI script to generate subtitle files (SRT/VTT/TXT) for any video using either DeepSpeech or Coqui","archived":false,"fork":false,"pushed_at":"2023-12-24T09:57:22.000Z","size":94,"stargazers_count":596,"open_issues_count":10,"forks_count":103,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-03-28T19:08:18.944Z","etag":null,"topics":["asr","autosub","coqui-ai","deepspeech","ffmpeg","mozilla-deepspeech","python","sox","speech-to-text","srt","subtitle","video"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhirooptalasila.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-08-13T06:35:52.000Z","updated_at":"2025-03-26T01:24:46.000Z","dependencies_parsed_at":"2024-04-22T20:15:31.596Z","dependency_job_id":null,"html_url":"https://github.com/abhirooptalasila/AutoSub","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhirooptalasila%2FAutoSub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhirooptalasila%2FAutoSub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhirooptalasila%2FAutoSub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/abhirooptalasila%2FAutoSub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abhirooptalasila","download_url":"https://codeload.github.com/abhirooptalasila/AutoSub/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242680,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","autosub","coqui-ai","deepspeech","ffmpeg","mozilla-deepspeech","python","sox","speech-to-text","srt","subtitle","video"],"created_at":"2024-11-19T05:38:47.765Z","updated_at":"2025-04-04T20:14:31.471Z","avatar_url":"https://github.com/abhirooptalasila.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AutoSub\n\n- [AutoSub](#autosub)\n  - [About](#about)\n  - [Installation](#installation)\n  - [Docker](#docker)\n  - [How-to example](#how-to-example)\n  - [How it works](#how-it-works)\n  - [Motivation](#motivation)\n  - [Contributing](#contributing)\n  - [References](#references)\n\n## About\n\nAutoSub is a CLI application to generate subtitle files (.srt, .vtt, and .txt transcript) for any video file using either [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) or [Coqui STT](https://github.com/coqui-ai/STT). 
I use their open-source models to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio at silent regions, producing multiple smaller files that are easier to run inference on.\n\n⭐ Featured in [DeepSpeech Examples](https://github.com/mozilla/DeepSpeech-examples) by Mozilla\n\n## Installation\n\n* Clone the repo\n    ```bash\n    $ git clone https://github.com/abhirooptalasila/AutoSub\n    $ cd AutoSub\n    ```\n* [OPTIONAL] Create a virtual environment to install the required packages. By default, AutoSub will be installed globally. All further steps should be performed while in the `AutoSub/` directory\n    ```bash\n    $ python3 -m pip install --user virtualenv\n    $ virtualenv -p python3 sub\n    $ source sub/bin/activate\n    ```\n* Use the corresponding requirements file depending on whether you have a GPU or not. If you want to install for a GPU, replace `requirements.txt` with `requirements-gpu.txt`. Make sure you have the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version\n    ```bash\n    $ pip install .\n    ```\n* Install FFMPEG. If you're on Ubuntu, this should work fine\n    ```bash\n    $ sudo apt-get install ffmpeg\n    $ ffmpeg -version               # I'm running 4.1.4\n    ```\n* By default, if no model files are found in the root directory, the script will download the v0.9.3 models for DeepSpeech, or the TFLITE model and Huge Vocab for Coqui. Use `getmodels.sh` to download the DeepSpeech model and scorer files, passing the version number as an argument. 
For Coqui, download from [here](https://coqui.ai/models)\n    ```bash\n    $ ./getmodels.sh 0.9.3\n    ```\n* For .tflite models with DeepSpeech, follow [this](https://github.com/abhirooptalasila/AutoSub/issues/41#issuecomment-968847604)\n\n\n## Docker\n\n* If you don't have the model files, get them\n    ```bash\n    $ ./getmodels.sh 0.9.3\n    ```\n* For a CPU build\n    ```bash\n    $ docker build -t autosub .\n    $ docker run --volume=`pwd`/input:/input --name autosub autosub --file /input/video.mp4\n    $ docker cp autosub:/output/ .\n    ```\n* For a GPU build that is reusable (saving time on instantiating the program)\n    ```bash\n    $ docker build --build-arg BASEIMAGE=nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --build-arg DEPSLIST=requirements-gpu.txt -t autosub-base . \u0026\u0026 \\\n    docker run --gpus all --name autosub-base autosub-base --dry-run || \\\n    docker commit --change 'CMD []' autosub-base autosub-instance\n    ```\n* Finally\n    ```bash\n    $ docker run --volume=`pwd`/input:/input --name autosub autosub-instance --file ~/video.mp4\n    $ docker cp autosub:/output/ .\n    ```\n\n## How-to example\n\n* The model files should be in the repo root directory and will be loaded/downloaded automatically. In case you have multiple versions, use the `--model` and `--scorer` arguments when running the script\n* By default, Coqui is used for inference. You can change this by using the `--engine` argument with value `\"ds\"` for DeepSpeech\n* For languages other than English, you'll need to manually download the model and scorer files. Check [here](https://discourse.mozilla.org/t/links-to-pretrained-models/62688) for DeepSpeech and [here](https://coqui.ai/models) for Coqui.\n* After following the installation instructions, you can run `autosub/main.py` as given below. 
The `--file` argument is the video file for which subtitles are to be generated\n    ```bash\n    $ python3 autosub/main.py --file ~/movie.mp4\n    ```\n* After the script finishes, the SRT file is saved in `output/`\n* The optional `--split-duration` argument sets the maximum number of seconds any given subtitle is displayed. The default is 5 seconds\n    ```bash\n    $ python3 autosub/main.py --file ~/movie.mp4 --split-duration 8\n    ```\n* By default, AutoSub outputs SRT, VTT and TXT files. To only produce the file formats you want, use the `--format` argument\n    ```bash\n    $ python3 autosub/main.py --file ~/movie.mp4 --format srt txt\n    ```\n* Open the video file and add this SRT file as a subtitle. You can just drag and drop in VLC.\n\n\n\n## How it works\n\nMozilla DeepSpeech is an open-source speech-to-text engine with support for fine-tuning using custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. When you run the script, I use FFMPEG to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech is configured to accept 16 kHz audio samples for inference, so I make FFMPEG use a 16 kHz sampling rate while extracting. \n\nThen, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal, which takes the large audio file initially extracted and splits it wherever silent regions are encountered, resulting in smaller audio segments that are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio files are stored in `audio/`. Then, for each audio segment, I perform DeepSpeech inference and write the inferred text to an SRT file. 
After all files are processed, the final SRT file is stored in `output/`.\n\nWhen I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video file**. My config is an i5 dual-core @ 2.5 GHz and 8 GB RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file. \n\n\n## Motivation\n\nIn the age of OTT platforms, there are still some who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream. I am one of them and on one such occasion, I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me and since I had worked with DeepSpeech previously, I decided to use it. \n\n\n## Contributing\n\nI would love to follow up on any suggestions/issues you find :)\n\n\n## References\n1. https://github.com/mozilla/DeepSpeech/\n2. https://github.com/tyiannak/pyAudioAnalysis\n3. https://deepspeech.readthedocs.io/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhirooptalasila%2Fautosub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhirooptalasila%2Fautosub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhirooptalasila%2Fautosub/lists"}