{"id":21222087,"url":"https://github.com/adamelkholyy/nemo","last_synced_at":"2025-04-13T23:03:51.166Z","repository":{"id":261030394,"uuid":"883044104","full_name":"adamelkholyy/nemo","owner":"adamelkholyy","description":"Fork for running Whisper transcriptions with Nemo diarization on University of Exeter's ISCA Supercomputer. Includes slurm scripts and custom environment for HPC compatability.","archived":false,"fork":false,"pushed_at":"2025-01-13T10:38:57.000Z","size":158,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-13T11:40:57.134Z","etag":null,"topics":["asr","gpu-computing","hpc-clusters"],"latest_commit_sha":null,"homepage":"https://www.exeter.ac.uk","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adamelkholyy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-04T09:30:41.000Z","updated_at":"2025-01-13T10:39:01.000Z","dependencies_parsed_at":"2024-11-20T22:48:41.372Z","dependency_job_id":"e0e170db-0d4f-44ce-9fcd-3e3ce2ebbe5d","html_url":"https://github.com/adamelkholyy/nemo","commit_stats":null,"previous_names":["adamelkholyy/nemo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adamelkholyy%2Fnemo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adamelkholyy%2Fnemo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adamelkholyy%2Fnemo/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adamelkholyy%2Fnemo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adamelkholyy","download_url":"https://codeload.github.com/adamelkholyy/nemo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234988228,"owners_count":18918097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","gpu-computing","hpc-clusters"],"created_at":"2024-11-20T22:39:37.074Z","updated_at":"2025-01-21T17:13:50.930Z","avatar_url":"https://github.com/adamelkholyy.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eSpeaker Diarization Using OpenAI Whisper - HPC Fork by @adamelkholyy\u003c/h1\u003e\n\n\n# HPC Fork \nFor use on the University of Exeter's ISCA supercomputer.\n\n   \nBuilding the environment using conda:  \n```conda env create -f env.yaml```  \n\n\nBuilding the environment using mamba:  \n```mamba env create -f env.yaml```  \n\n\nHPC Compatibility changelog\n- Edited requirements.txt to add all dependencies \n- Included Perl and C++ build tools in env.yaml \n- ```.SUBKILL``` changed to ```.SUBTERM``` to avoid errors       \n- Added Word Error Rate calculations   \n- Added SBATCH scripts (```diarize_test.sh```) for running jobs with Slurm\n- Added error and logging output for HPC GPUs    \n\n# Original README\n\u003cp align=\"center\"\u003e\n  Credit to @MahmoudAshraf97\n  \u003ca href=\"https://github.com/MahmoudAshraf97/whisper-diarization/stargazers\"\u003e\n    
\u003cimg src=\"https://img.shields.io/github/stars/MahmoudAshraf97/whisper-diarization.svg?colorA=orange\u0026colorB=orange\u0026logo=github\"\n         alt=\"GitHub stars\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/MahmoudAshraf97/whisper-diarization/issues\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/issues/MahmoudAshraf97/whisper-diarization.svg\"\n             alt=\"GitHub issues\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/MahmoudAshraf97/whisper-diarization/blob/master/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/MahmoudAshraf97/whisper-diarization.svg\"\n             alt=\"GitHub license\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://twitter.com/intent/tweet?text=\u0026url=https%3A%2F%2Fgithub.com%2FMahmoudAshraf97%2Fwhisper-diarization\"\u003e\n  \u003cimg src=\"https://img.shields.io/twitter/url/https/github.com/MahmoudAshraf97/whisper-diarization.svg?style=social\" alt=\"Twitter\"\u003e\n  \u003c/a\u003e \n  \u003c/a\u003e\n  \u003ca href=\"https://colab.research.google.com/github/MahmoudAshraf97/whisper-diarization/blob/main/Whisper_Transcription_%2B_NeMo_Diarization.ipynb\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"\u003e\n  \u003c/a\u003e\n \n\u003c/p\u003e\n\nSpeaker Diarization pipeline based on OpenAI Whisper\nI'd like to thank [@m-bain](https://github.com/m-bain) for Batched Whisper Inference, [@mu4farooqi](https://github.com/mu4farooqi) for punctuation realignment algorithm\n\n\u003cimg src=\"https://github.blog/wp-content/uploads/2020/09/github-stars-logo_Color.png\" alt=\"drawing\" width=\"25\"/\u003e **Please, star the project on github (see top-right corner) if you appreciate my contribution to the community!**\n\n## What is it\nThis repository combines Whisper ASR capabilities with Voice Activity Detection (VAD) and Speaker Embedding to identify the speaker for each sentence in the 
transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated using Whisper, and the timestamps are corrected and aligned using WhisperX to help minimize diarization error due to time shift. Next, the audio is passed into MarbleNet for VAD and segmentation to exclude silences, and TitaNet is used to extract speaker embeddings that identify the speaker for each segment. The result is then associated with the timestamps generated by WhisperX to detect the speaker for each word, and finally realigned using punctuation models to compensate for minor time shifts.\n\n\nWhisperX and NeMo parameters are hard-coded in diarize.py and helpers.py; I will add CLI arguments to change them later.\n\n## Installation\n`FFMPEG` and `Cython` are needed as prerequisites to install the requirements\n```\npip install cython\n```\nor\n```\nsudo apt update \u0026\u0026 sudo apt install cython3\n```\n```\n# on Ubuntu or Debian\nsudo apt update \u0026\u0026 sudo apt install ffmpeg\n\n# on Arch Linux\nsudo pacman -S ffmpeg\n\n# on macOS using Homebrew (https://brew.sh/)\nbrew install ffmpeg\n\n# on Windows using Chocolatey (https://chocolatey.org/)\nchoco install ffmpeg\n\n# on Windows using Scoop (https://scoop.sh/)\nscoop install ffmpeg\n\n# on Windows using WinGet (https://github.com/microsoft/winget-cli)\nwinget install ffmpeg\n```\n```\npip install -r requirements.txt\n```\n## Usage \n\n```\npython diarize.py -a AUDIO_FILE_NAME\n\n# for example:\npython diarize.py -a YB_exit_interview_from_Pilot_RC_position.MP3\npython diarize.py -a YM_RC_exit_Interview_Part_II.MP3\n```\n\nIf your system has enough VRAM (\u003e=10GB), you can use `diarize_parallel.py` instead. It runs NeMo in parallel with Whisper, which can be beneficial in some cases; the result is the same since the two models are independent of each other. 
This is still experimental, so expect errors and sharp edges. Your feedback is welcome.\n\n## Command Line Options\n\n- `-a AUDIO_FILE_NAME`: The name of the audio file to be processed\n- `--no-stem`: Disables source separation\n- `--whisper-model`: The model to be used for ASR, default is `medium.en`\n- `--suppress_numerals`: Transcribes numbers as spelled-out words instead of digits, which improves alignment accuracy\n- `--device`: Choose which device to use, defaults to \"cuda\" if available\n- `--language`: Manually select the language, useful if language detection fails\n- `--batch-size`: Batch size for batched inference; reduce it if you run out of memory, or set it to 0 for non-batched inference\n\n## Known Limitations\n- Overlapping speakers are not yet handled. A possible approach would be to separate the audio file and isolate one speaker at a time before feeding it into the pipeline, but this would need much more computation\n- There might be some errors; please raise an issue if you encounter any.\n\n## Future Improvements\n- Implement a maximum length per sentence for SRT\n\n## Acknowledgements\nSpecial thanks to [@adamjonas](https://github.com/adamjonas) for supporting this project.\nThis work is based on [OpenAI's Whisper](https://github.com/openai/whisper), [Faster Whisper](https://github.com/guillaumekln/faster-whisper), [Nvidia NeMo](https://github.com/NVIDIA/NeMo), and [Facebook's Demucs](https://github.com/facebookresearch/demucs)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadamelkholyy%2Fnemo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadamelkholyy%2Fnemo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadamelkholyy%2Fnemo/lists"}