{"id":13571897,"url":"https://github.com/pszemraj/vid2cleantxt","last_synced_at":"2025-05-16T08:04:12.886Z","repository":{"id":43674300,"uuid":"346152074","full_name":"pszemraj/vid2cleantxt","owner":"pszemraj","description":"Python API \u0026 command-line tool to easily transcribe speech-based video files into clean text","archived":false,"fork":false,"pushed_at":"2024-10-29T19:07:08.000Z","size":758180,"stargazers_count":212,"open_issues_count":1,"forks_count":29,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-05-09T22:02:46.299Z","etag":null,"topics":["audio","audio-processing","keyword","keyword-extraction","nlp","python","sentence","sentence-boundary-detection","speech","speech-recognition","speech-to-text","spelling-correction","transcription","transformer","video","video-processing","video-summarisation","video-summarization","wav2vec2","whisper"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pszemraj.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-09T21:41:27.000Z","updated_at":"2025-05-04T12:09:40.000Z","dependencies_parsed_at":"2024-11-14T05:02:20.959Z","dependency_job_id":"7c4b3055-d5b1-4c66-8f95-1ea7e419d2a2","html_url":"https://github.com/pszemraj/vid2cleantxt","commit_stats":{"total_commits":200,"total_committers":5,"mean_commits":40.0,"dds":0.495,"last_synced_commit":"2034489f538e22d62e508add517c45cc6093d85f"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Fvid2cleantxt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Fvid2cleantxt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Fvid2cleantxt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Fvid2cleantxt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pszemraj","download_url":"https://codeload.github.com/pszemraj/vid2cleantxt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254493379,"owners_count":22080126,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","audio-processing","keyword","keyword-extraction","nlp","python","sentence","sentence-boundary-detection","speech","speech-recognition","speech-to-text","spelling-correction","transcription","transformer","video","video-processing","video-summarisation","video-summarization","wav2vec2","whisper"],"created_at":"2024-08-01T14:01:08.004Z","updated_at":"2025-05-16T08:04:07.871Z","avatar_url":"https://github.com/pszemraj.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# vid2cleantxt\n\n![vid2cleantext simple](https://user-images.githubusercontent.com/74869040/131500291-ed0a9d7f-8be7-4f4b-9acf-c360cfd46f1f.png)\n\n[Jump to Quickstart](#quickstart)\n\n**vid2cleantxt**: a [transformers-based](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) pipeline for turning 
heavily speech-based video files into clean, readable text from the audio. Robust speech transcription is now possible like never before with [OpenAI's whisper model](https://openai.com/blog/whisper/).\n\nTL;DR check out [this Colab notebook](https://colab.research.google.com/gist/pszemraj/9678129fe0b552e114e3576606446dee/vid2cleantxt-minimal-example.ipynb) for a transcription and keyword extraction of a speech by John F. Kennedy by simply running all cells.\n\n* * *\n\n**Table of Contents**\n\n\u003c!-- TOC --\u003e\n\n-   [Motivation](#motivation)\n-   [Overview](#overview)\n    -   [Example Output](#example-output)\n    -   [Pipeline Intro](#pipeline-intro)\n-   [Quickstart](#quickstart)\n    -   [Installation](#installation)\n        -   [As a Python package](#as-a-python-package)\n        -   [Install from source](#install-from-source)\n        -   [install details \u0026 gotchas](#install-details--gotchas)\n    -   [example usage](#example-usage)\n-   [Notebooks on Colab](#notebooks-on-colab)\n-   [Details \u0026 Application](#details--application)\n    -   [How long does this take to run?](#how-long-does-this-take-to-run)\n    -   [Now I have a bunch of long text files. 
How are these useful?](#now-i-have-a-bunch-of-long-text-files-how-are-these-useful)\n        -   [Visualization and Analysis](#visualization-and-analysis)\n        -   [Text Extraction / Manipulation](#text-extraction--manipulation)\n    -   [Text Summarization](#text-summarization)\n        -   [TextHero example use case](#texthero-example-use-case)\n-   [ScatterText example use case](#scattertext-example-use-case)\n-   [Design Choices \u0026 Troubleshooting](#design-choices--troubleshooting)\n    -   [What python package dependencies does this repo have?](#what-python-package-dependencies-does-this-repo-have)\n    -   [My computer crashes once it starts running the wav2vec2 model](#my-computer-crashes-once-it-starts-running-the-wav2vec2-model)\n    -   [The transcription is not perfect, and therefore I am mad](#the-transcription-is-not-perfect-and-therefore-i-am-mad)\n    -   [How can I improve the performance of the model from a word-error-rate perspective?](#how-can-i-improve-the-performance-of-the-model-from-a-word-error-rate-perspective)\n    -   [Why use transformer models instead of SpeechRecognition or other transcription methods?](#why-use-transformer-models-instead-of-speechrecognition-or-other-transcription-methods)\n    -   [Errors](#errors)\n-   [Examples](#examples)\n-   [Future Work, Collaboration, \u0026 Citations](#future-work-collaboration--citations)\n    -   [Project Updates](#project-updates)\n    -   [Future Work](#future-work)\n    -   [I've found x repo / script / concept that I think you should incorporate or collaborate with the author](#ive-found-x-repo--script--concept-that-i-think-you-should-incorporate-or-collaborate-with-the-author)\n    -   [Citations](#citations)\n        -   [Video Citations](#video-citations)\n\n\u003c!-- /TOC --\u003e\n\n* * *\n\n## Motivation\n\nVideo, specifically audio, is inefficient in conveying dense or technical information. 
The viewer has to sit through the whole thing, while only part of the video may be relevant to them. If you don't understand a statement or concept, you must search through the video or re-watch it. This project attempts to help solve that problem by converting long video files into text that can be easily searched and summarized.\n\n## Overview\n\n### Example Output\n\nExample output text of a video transcription of [JFK's speech on going to the moon](https://www.c-span.org/classroom/document/?7986):\n\n\u003chttps://user-images.githubusercontent.com/74869040/151491511-7486c34b-d1ed-4619-9902-914996e85125.mp4\u003e\n\n**vid2cleantxt output:**\n\n\u003e Now look into space to the moon and to the planets beyond and we have vowed that we shall not see it governed by a hostile flag of conquest but by a banner of freedom and peace we have vowed that we shall not see space filled with weapons of mass destruction but with instruments of knowledge and understanding yet the vow. In short our leadership in science and industry our hopes for peace and security our obligations to ourselves as well as others all require a. To solve these mysteries to solve them for the good of all men and to become the worlds leading space faring nation we set sail on this new sea because there is new knowledge to be gained and new rights to be won and they must be won and used for the progress of all people for space science like nuclear science and all technology. 
Has no conscience of its own whether it will become a force for good or ill depends on man and only if the united states occupies a position of preeminence can we help decide whether this new ocean will be a sea of peace\n\nmodel = `openai/whisper-medium.en`\n-\n\nSee the [demo notebook](https://colab.research.google.com/gist/pszemraj/9678129fe0b552e114e3576606446dee/vid2cleantxt-minimal-example.ipynb) for the full-text output.\n\n### Pipeline Intro\n\n![vid2cleantxt detailed](https://user-images.githubusercontent.com/74869040/131499569-c894c096-b6b8-4d17-b99c-a4cfce395ea8.png)\n\n1.  The `transcribe.py` script uses `audio2text_functions.py` to convert video files to `.wav` format audio chunks of duration X\\* seconds\n2.  transcribe all X audio chunks through a pretrained transformer model\n3.  Write all list results into a text file, store various runtime metrics into a separate text list, and delete the `.wav` audio chunk directory after using them.\n4.  (Optional) create two new text files: one with all transcriptions appended and one with all metadata appended.\n5.  FOR each transcription text file:\n    -   Passes the 'base' transcription text through a spell checker (_Neuspell_) and auto-correct spelling. Saves as a new text file.\n    -   Uses _pySBD_ to infer sentence boundaries on the spell-corrected text and add periods to delineate sentences. Saves as a new file.\n    -   Runs essential keyword extraction (via _YAKE_) on spell-corrected file. All keywords per file are stored in one data frame for comparison and exported to the `.xlsx` format\n\n_\\*\\* (where X is some duration that does not overload your computer/runtime)_\n\nGiven `INPUT_DIRECTORY`:\n\n-   _final_ transcriptions in`.txt` will be in `INPUT_DIRECTORY/v2clntxt_transcriptions/results_SC_pipeline/`\n-   metadata about transcription process will be in `INPUT_DIRECTORY/v2clntxt_transc_metadata`\n\n* * *\n\n## Quickstart\n\nInstall, then you can use `vid2cleantxt` in two ways:\n\n1.  
CLI via `transcribe.py` script from the command line (`python vid2cleantxt/transcribe.py --input-dir \"path/to/video/files\" --output-dir \"path/to/output/dir\"`)\n2.  As a python package, import `vid2cleantxt` and use the `transcribe` module to transcribe videos (`vid2cleantxt.transcribe.transcribe_dir()`)\n\nDon't want to use it locally or don't have a GPU? You may be interested in the [demo notebook](https://colab.research.google.com/gist/pszemraj/9678129fe0b552e114e3576606446dee/vid2cleantxt-minimal-example.ipynb) on Google Colab.\n\n### Installation\n\n#### As a Python package\n\n-   (recommended) Create a new virtual environment with `python3 -m venv venv`\n    -   Activate the virtual environment with `source venv/bin/activate`\n-   Install the repo with pip:\n\n```bash\npip install git+https://github.com/pszemraj/vid2cleantxt.git\n```\n\nThe library is now installed and ready to use in your Python scripts.\n\n```python\nimport vid2cleantxt\n\ntext_output_dir, metadata_output_dir = vid2cleantxt.transcribe.transcribe_dir(\n    input_dir=\"path/to/video/files\",\n    model_id=\"openai/whisper-base.en\",\n    chunk_length=30,\n)\n\n# do things with text files in text_output_dir\n```\n\nSee below for more details on the `transcribe_dir` function.\n\n#### Install from source\n\n1.  `git clone https://github.com/pszemraj/vid2cleantxt.git`\n    -   use the `--depth=1` switch to clone only the latest master (_faster_)\n2.  `cd vid2cleantxt/`\n3.  `pip install -e .`\n\nAs a shell block:\n\n```bash\ngit clone https://github.com/pszemraj/vid2cleantxt.git --depth=1\ncd vid2cleantxt/\npip install -e .\n```\n\n#### install details \u0026 gotchas\n\n-   This should be automatically completed upon installation/import, but a spacy model may need to be downloaded for post-processing transcribed audio. This can be completed with `spacy download en_core_web_sm`\n-   `FFMPEG` is required as a base system dependency to do anything with video/audio. 
This should be already installed on your system; otherwise see [the FFmpeg site](https://ffmpeg.org/).\n-   We've added an implementation for whisper to the repo. Until further tests are completed, it's recommended to stick with the default 30s chunk length for these models. (_plus, they are fairly compute-efficient for the resulting quality_)\n\n### example usage\n\n**CLI example:** transcribe a directory of example videos in `./examples/` with the `whisper-small` model (not trained purely english) and print the transcriptions with the `cat` command:\n\n```bash\npython examples/TEST_folder_edition/dl_src_videos.py\npython vid2cleantxt/transcribe.py -i ./examples/TEST_folder_edition/ -m openai/whisper-small\nfind ./examples/TEST_folder_edition/v2clntxt_transcriptions/results_SC_pipeline -name \"*.txt\" -exec cat {} +\n```\n\nRun `python vid2cleantxt/transcribe.py --help` for more details on the CLI.\n\n**Python API example:** transcribe an input directory of user-specified videos using `whisper-tiny.en`, a smaller but faster model than the default.\n\n```python\nimport vid2cleantxt\n\n_my_input_dir = \"path/to/video/files\"\ntext_output_dir, metadata_output_dir = vid2cleantxt.transcribe.transcribe_dir(\n    input_dir=_my_input_dir,\n    model_id=\"openai/whisper-tiny.en\",\n    chunk_length=30,\n)\n```\n\nTranscribed files can then be interacted with for whatever purpose (see [Visualization and Analysis](#visualization-and-analysis) and below for ideas).\n\n```python\nfrom pathlib import Path\n\nv2ct_output_dir = Path(text_output_dir)\ntranscriptions = [f for f in v2ct_output_dir.iterdir() if f.suffix == \".txt\"]\n\n# read in the first transcription\nwith open(transcriptions[0], \"r\") as f:\n    first_transcription = f.read()\nprint(\n    f\"The first 1000 characters of the first transcription are:\\n{first_transcription[:1000]}\"\n)\n```\n\nSee the docstrings of `transcribe_dir()` for more details on the arguments. 
One way you can do this is with `inspect`:\n\n```python\nimport inspect\nimport vid2cleantxt\n\nprint(inspect.getdoc(vid2cleantxt.transcribe.transcribe_dir))\n```\n\n## Notebooks on Colab\n\nNotebook versions are available on Google Colab as they offer accessible GPUs which makes vid2cleantxt _much_ faster.\n\nAs `vid2cleantxt` is now available as a package with python API, there is no longer a need for long, complicated notebooks. See [this notebook](https://colab.research.google.com/gist/pszemraj/9678129fe0b552e114e3576606446dee/vid2cleantxt-minimal-example.ipynb) for a relatively simple example - copy it to your drive and adjust as needed.\n\n⚠️ The notebooks in `./colab_notebooks` are now deprecated and **not recommended to be used**. ⚠️ TODO: remove in a future PR.\n\n**Resources for those new to Colab**\n\nIf you like the benefits Colab/cloud notebooks offer but haven't used them before, it's recommended to read the [Colab Quickstart](https://colab.research.google.com/notebooks/intro.ipynb), and some of the below resources as things like file I/O are different than your PC.\n\n-   [Google's FAQ](https://research.google.com/colaboratory/faq.html)\n-   [Google's Demo Notebook on I/O](https://colab.research.google.com/notebooks/io.ipynb)\n-   [A better Colab Experience](https://towardsdatascience.com/10-tips-for-a-better-google-colab-experience-33f8fe721b82)\n\n* * *\n\n## Details \u0026 Application\n\n### How long does this take to run?\n\nOn Google Colab with a 16 GB GPU (available to free Colab accounts): **approximately 8 minutes to transcribe ~90 minutes of audio**. CUDA is supported - if you have an NVIDIA graphics card, you may see runtimes closer to that estimate on your local machine.\n\nOn my machine (CPU only due to Windows + AMD GPU), it takes approximately 30-70% of the total duration of input video files. 
You can also look at the \"console printout\" text files in `example_JFK_speech/TEST_singlefile`.\n\n-   with model = `facebook/wav2vec2-base-960h` approx 30% of original video RT\n-   with model = `facebook/hubert-xlarge-ls960-ft` (\\_perhaps the best pre-whisper model anecdotally) approx 70-80% of original video RT\n-   timing the whisper models is a TODO, but current estimate would be in between the above two models for `openai/whisper-base.en` on CPU.\n\n**Specs:**\n\n```text\nProcessor Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz\nSpeed 4.8 GHz\nNumber of Cores 8\nMemory RAM 32 GB\nVideo Card #1 Intel(R) UHD Graphics 620\nDedicated Memory 128 MB\nTotal Memory 16 GB\nVideo Card #2 AMD Radeon Pro WX3200 Graphics\nDedicated Memory 4.0 GB\nTotal Memory 20 GB\nOperating System  Windows 10 64-bit\n```\n\n\u003e _NOTE:_ that the default model is `openai/whisper-base.en`. See the [model card](https://huggingface.co/openai/whisper-base.en) for details.\n\n### Now I have a bunch of long text files. How are these useful?\n\nshort answer: `noam_chomsky.jpeg`\n\nMore comprehensive answer:\n\nWith natural language processing and machine learning algorithms, text data can be visualized, summarized, or reduced in many ways. For example, you can use TextHero or ScatterText to compare audio transcriptions with written documents or use topic models or statistical models to extract key topics from each file. Comparing text data can help you understand how similar they are or identify vital differences.\n\n#### Visualization and Analysis\n\n1.  [TextHero](https://github.com/jbesomi/texthero) - cleans text, allows for visualization / clustering (k-means) / dimensionality reduction (PCA, TSNE)\n    -   Use case here: I want to see how _this speaker_'s speeches differ from each other. Which are \"the most related\"?\n2.  
[Scattertext](https://github.com/JasonKessler/scattertext) - allows for comparisons of one corpus of text to another via various methods and visualizes them.\n    -   Use case here: I want to see how the speeches by _this speaker_ compare to speeches by _speaker B_ in terms of topics, word frequency… so on\n\nSome examples from my usage are illustrated below from both packages.\n\n#### Text Extraction / Manipulation\n\n1.  [Textract](https://textract.readthedocs.io/)\n2.  [Textacy](https://github.com/chartbeat-labs/textacy)\n3.  [YAKE](https://github.com/LIAAD/yake)\n    -   A brief YAKE analysis is completed in this pipeline after transcribing the audio.\n\n### Text Summarization\n\nSeveral options are available on the [HuggingFace website](https://huggingface.co/models?pipeline_tag=summarization). To create a better, more general model for summarization, I have fine-tuned [this model](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum) on a [book summary dataset](https://arxiv.org/abs/2105.08209) which I find provides the best results for \"lecture-esque\" video conversion. I wrote a little about this and compared it to other models _WARNING: satire/sarcasm inside_ [here](https://www.dropbox.com/s/fsz9u4yk3hf9fak/A%20new%20benchmark%20for%20the%20generalizability%20of%20summarization%20models.pdf?dl=0).\n\nI use several similar methods in combination with the transcription script. However, it isn't in a place to be officially posted yet. It will be posted to a public repo on this account when ready. 
You can now check out [this Colab notebook](https://colab.research.google.com/drive/1BSIsYHH0w5pdVxqo_nK5vHgMeBiJKKGm?usp=sharing) using the same example text that is output when the JFK speeches are transcribed.\n\n#### TextHero example use case\n\nClustering vectorized text files into k-means groups:\n\n![iml Plotting with TSNE + USE, Colored on Directory Name](https://user-images.githubusercontent.com/74869040/110546335-a0baaf80-812e-11eb-8d7d-48da00989dce.png)\n\n![iml Plotting with TSNE + USE, Colored on K-Means Cluster](https://user-images.githubusercontent.com/74869040/110546452-c6e04f80-812e-11eb-9a4b-03213ec4a63b.png)\n\n## ScatterText example use case\n\nComparing the frequency of terms in one body of text vs. another\n\n![ST P 1 term frequency I ML 2021 Docs I ML Prior Exams_072122_](https://user-images.githubusercontent.com/74869040/110546149-69e49980-812e-11eb-9c94-81fcb395b907.png)\n\n* * *\n\n## Design Choices \u0026 Troubleshooting\n\n### What python package dependencies does this repo have?\n\nUpon cloning the repo, run the command `pip install -e .` (or`pip install -r requirements.txt` works too) in a terminal opened in the project directory. Requirements (upd. 
Oct 10, 2022) are:\n\n```text\nclean-text\nGPUtil\nhumanize\njoblib\nlibrosa\nmoviepy~=1.0.3\nnatsort\u003e=7.1.1\nneuspell\u003e=1.0.0\nnumpy\npackaging\npandas\u003e=1.3.0\npsutil\u003e=5.9.2\npydub\u003e=0.24.1\npysbd\u003e=0.3.4\nrequests\nsetuptools\u003e=58.1.0\nspacy\u003e=3.0.0,\u003c4.0.0\nsymspellpy~=6.7.0\ntorch\u003e=1.8.2\ntqdm\ntransformers\u003e=4.23.0\nwordninja==2.0.0\nwrapt\nyake\u003e=0.4.8\n```\n\nIf you encounter warnings/errors that mention FFmpeg, please download the latest version of FFmpeg from their website [here](https://www.ffmpeg.org/download.html) and ensure it is added to PATH.\n\n### My computer crashes once it starts running the wav2vec2 model\n\nFirst, try a smaller model: pass `-m openai/whisper-tiny.en` in CLI or `model_id=\"openai/whisper-tiny.en\"` in python.\n\nIf that doesn't help, reducing the `chunk_length` duration can reduce computational intensity, at the cost of some accuracy: use `--chunk-len \u003cINT\u003e` when calling `vid2cleantxt/transcribe.py` or `chunk_length=INT` in python.\n\n### The transcription is not perfect, and therefore I am mad\n\nPerfect transcripts are not always possible, especially when the audio is not clean. For example, audio recorded with a microphone that is not always perfectly tuned to the speaker can cause the model to have issues. Additionally, the default models are not trained on specific speakers, so the model may struggle with a particular speaker or their accent.\n\nDespite a small number of errors, the model should still capture the vast majority of the text. This should still save you a lot of time and effort.\n\n### How can I improve the performance of the model from a word-error-rate perspective?\n\n\u003e As of Oct 2022: there really shouldn't be much to complain about given what we had before whisper. That said, there may be some bugs or issues with the new model. 
Please report them in the issues section :)\n\nThe neural ASR model that transcribes the audio is typically the most crucial element to choose/tune. You can use **any whisper, wav2vec2, or wavLM model** from the [huggingface hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition\u0026sort=downloads); pass the model ID string with `--model` in CLI and `model_id=\"my-cool-model\"` in python.\n\n_Note: It's recommended to experiment with the different variants of whisper first, as they are the most performant for the vast majority of \"long speech\" transcription use cases._\n\nYou can also train your own model, but that requires you to have a transcription of that person's speech. As you may find, manual transcription is a bit of a pain; therefore, transcripts are rarely provided - hence this repo. If interested, see [this notebook](https://github.com/huggingface/notebooks/blob/master/examples/speech_recognition.ipynb).\n\n### Why use transformer models instead of SpeechRecognition or other transcription methods?\n\nGoogle's SpeechRecognition (with the free API) requires optimization of three unknown parameters\\*, which, in my experience, can vary widely among English as a second language speakers. With wav2vec2, the base model is pretrained, so a 'decent transcription' can be made without spending a lot of time testing and optimizing parameters.\n\nAlso, because it's an API, you can't train it even if you wanted to, you have to be online for most of the script's runtime, and, of course, sending data off your machine raises privacy concerns.\n\n_`*` these statements reflect the assessment completed around project inception in early 2021._\n\n### Errors\n\n-   \\_pickle.UnpicklingError: invalid load key, '\u0026lt;' --\u003e Neuspell model was not downloaded correctly. 
Try re-downloading it.\n-   manually open /Users/yourusername/.local/share/virtualenvs/vid2cleantxt-vMRD7uCV/lib/python3.8/site-packages/neuspell/../data\n-   download the model from \u003chttps://github.com/neuspell/neuspell#Download-Checkpoints\u003e\n-   import neuspell\n-   neuspell.seq_modeling.downloads.download_pretrained_model(\"scrnnelmo-probwordnoise\")\n\n## Examples\n\n-   Two examples are available in the `examples/` directory. One example is a single video (another speech), and the other is multiple videos (MIT OpenCourseWare). Citations are in the respective folders.\n-   Note that the videos first need to be downloaded via the respective scripts in each folder, i.e., run: `python examples/TEST_singlefile/dl_src_video.py`\n\n## Future Work, Collaboration, \u0026 Citations\n\n### Project Updates\n\nA _rough_ timeline of what has been going on in the repo:\n\n-   Oct 2022 Part 2 - Initial integration of [whisper](https://openai.com/blog/whisper/) model!\n-   Oct 2022 - Redesign as Python package instead of an assortment of python scripts/notebooks that share a repository and do similar things.\n-   Feb 2022 - Add backup functions for spell correction in case of NeuSpell failure (which is a known issue at the time of writing).\n-   Jan 2022 - add huBERT support, abstract the boilerplate out of Colab Notebooks. Starting work on the PDF generation w/ results.\n-   Dec 2021 - greatly improved script runtime, and added more features (command line, docstring, etc.)\n-   Sept-Oct 2021: Fixing bugs, and formatting code.\n-   July 12, 2021 - sync work from Colab notebooks: add CUDA support for PyTorch in the `.py` versions, added Neuspell as a spell checker. 
General organization and formatting improvements.\n-   July 8, 2021 - python scripts cleaned and updated.\n-   April - June: Work done mostly on Colab, improving saving, grammar correction, etc.\n-   March 2021: public repository added\n\n### Future Work\n\n\u003e Note: these are largely not in order of priority.\n\n0.  ~~add OpenAI's [whisper](https://github.com/openai/whisper) through integration with the transformers lib.~~\n1.  Unfortunately, trying to use the [Neuspell](https://github.com/neuspell/neuspell) package is still not possible as the default package etc, has still not been fixed. I will add a permanent workaround to load/use with vid2cleantxt.\n2.  ~~syncing improvements currently in the existing **Google Colab** notebooks (links) above, such as [NeuSpell](https://github.com/neuspell/neuspell)~~\n\n    -   ~~this will include support for CUDA automatically when running the code (currently just on Colab)~~\n\n3.  ~~clean up the code, add more features, and make it more robust.~~\n4.  add a script to convert `.txt` files to a clean PDF report, [example here](https://www.dropbox.com/s/fpqq2qw7txbkujq/ACE%20NLP%20Workshop%20-%20Session%20II%20-%20Dec%202%202021%20-%20full%20transcription%20-%20txt2pdf%2012.05.2021%20%20Standard.pdf?dl=1)\n5.  add summarization script/module\n6.  further expand the functionality of the `vid2cleantxt` module\n7.  Add support for transcribing the other languages in the whisper model (e.g., French, German, Spanish, etc.). This will require synchronized API changes to ensure that English spell correction is only applied to English transcripts, etc.\n\n### I've found x repo / script / concept that I think you should incorporate or collaborate with the author\n\nCould you send me a message / start a discussion? Always looking to improve. 
Or create an issue that works too.\n\n### Citations\n\n**whisper (OpenAI)**\n\n    @report{,\n       abstract = {We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.},\n       author = {Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine Mcleavey and Ilya Sutskever},\n       title = {Robust Speech Recognition via Large-Scale Weak Supervision},\n       url = {https://github.com/openai/},\n    }\n\n**wav2vec2 (fairseq)**\n\n\u003e Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. 
In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.\n\n-   repo [link](https://github.com/pytorch/fairseq)\n\n**HuBERT (fairseq)**\n\n    @article{Hsu2021,\n       author = {Wei Ning Hsu and Benjamin Bolte and Yao Hung Hubert Tsai and Kushal Lakhotia and Ruslan Salakhutdinov and Abdelrahman Mohamed},\n       doi = {10.1109/TASLP.2021.3122291},\n       issn = {23299304},\n       journal = {IEEE/ACM Transactions on Audio Speech and Language Processing},\n       keywords = {BERT,Self-supervised learning},\n       month = {6},\n       pages = {3451-3460},\n       publisher = {Institute of Electrical and Electronics Engineers Inc.},\n       title = {HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},\n       volume = {29},\n       url = {\u003chttps://arxiv.org/abs/2106.07447v1\u003e},\n       year = {2021},\n    }\n\n**MoviePy**\n\n-   [link](https://github.com/Zulko/moviepy) to repo as no citation info given\n\n**symspellpy / symspell**\n\n-   repo [link](https://github.com/mammothb/symspellpy/tree/e7a91a88f45dc4051b28b83e990fe072cabf0595)\n-   copyright:\n    \u003e Copyright (c) 2020 Wolf Garbe Version: 6.7 Author: Wolf Garbe \u003cmailto:wolf.garbe@seekstorm.com\u003e\n    \u003e Maintainer: Wolf Garbe \u003cmailto:wolf.garbe@seekstorm.com\u003e\n    \u003e URL: \u003chttps://github.com/wolfgarbe/symspell\u003e\n    \u003e Description: \u003chttps://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f\u003e\n    \u003e\n    \u003e MIT License\n    \u003e\n    \u003e Copyright (c) 2020 Wolf Garbe\n    \u003e\n    \u003e Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated\n    \u003e documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the\n    \u003e rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit\n    
\u003e persons to whom the Software is furnished to do so, subject to the following conditions:\n    \u003e\n    \u003e The above copyright notice and this permission notice shall be included in all copies or substantial portions of the\n    \u003e Software.\n    \u003e\n    \u003e \u003chttps://opensource.org/licenses/MIT\u003e\n\n**YAKE (yet another keyword extractor)**\n\n-   repo [link](https://github.com/LIAAD/yake)\n-   relevant citations:\n    \u003e In-depth journal paper at Information Sciences Journal\n    \u003e\n    \u003e Campos, R., Mangaravite, V., Pasquali, A., Jatowt, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword\n    \u003e Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp\n    \u003e 257-289. pdf\n    \u003e\n    \u003e ECIR'18 Best Short Paper\n    \u003e\n    \u003e Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic\n    \u003e Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances\n    \u003e in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772,\n    \u003e pp. 684 - 691. pdf\n    \u003e\n    \u003e Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE!\n    \u003e Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds).\n    \u003e Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol\n    \u003e 10772, pp. 806 - 810. pdf\n\n#### Video Citations\n\n-   \u003cdiv class=\"csl-entry\"\u003e\u003ci\u003ePresident Kennedy’s 1962 Speech on the US Space Program | C-SPAN Classroom\u003c/i\u003e. (n.d.). 
Retrieved January 28, 2022, from https://www.c-span.org/classroom/document/?7986\u003c/div\u003e\n\n-   _Note: example videos are cited in respective `Examples/` directories_\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Fvid2cleantxt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpszemraj%2Fvid2cleantxt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Fvid2cleantxt/lists"}