https://github.com/adamelkholyy/hpc-nemo

Fork for running Whisper transcription with NeMo diarization on the University of Exeter's ISCA supercomputer. Includes Slurm scripts and a custom environment for HPC compatibility.

# ISCA Transcription + Diarization Tool

This project is a fork of @MahmoudAshraf97's [whisper diarization tool](https://github.com/MahmoudAshraf97/whisper-diarization), heavily modified for use on the University of Exeter's ISCA High Performance Computing (HPC) cluster. All due credit goes to the authors of the original project, without whom this would not have been possible. This fork adds extra CLI arguments, transcript anonymisation, major refactoring and several other features on top of the original project.

## Setup
To install the required packages, use either pip
```
pip install -r requirements.txt -c constraints.txt
```
or Poetry
```
poetry install
```
Poetry is highly recommended, as all package versions are locked and known to work together. The `requirements.txt` file depends on other Git repositories that do not pin their own dependencies, so dependency conflicts are likely.
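If you do install with pip, it can help to check the environment against `constraints.txt` before submitting a long job. The sketch below is not part of this repository: it assumes `constraints.txt` contains simple `name==version` pins, skips any other specifiers, and compares version strings literally.

```python
# Sketch (not part of this repository): compare installed packages with the pins
# in constraints.txt. Assumes simple `name==version` lines; other specifiers are skipped.
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def parse_pins(lines):
    """Return {name: version} for every `name==version` line."""
    pins = {}
    for raw in lines:
        # Drop comments and environment markers such as `; python_version >= "3.9"`.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # blank lines, ranges (>=, ~=) and URLs are ignored
        name, pinned = (part.strip() for part in line.split("==", 1))
        pins[name] = pinned
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for unmet pins; installed is None if missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Literal string comparison: "1.0" and "1.0.0" are treated as different.
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


def report(path="constraints.txt"):
    """Print one line per package that does not match its pin."""
    pins = parse_pins(Path(path).read_text().splitlines())
    for name, (pinned, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'missing'}")
```

Saved under a hypothetical name such as `check_pins.py` in the repository root, `python -c "from check_pins import report; report()"` prints any drift from the pinned versions.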

## Usage
To transcribe and diarize a single audio file, run
```
python diarize.py --audio my_audiofile.mp3
```
The tool writes two files to the same directory as the audio file:
- a timestamped .srt file
- a diarized .txt file

You can also pass a directory to `--audio` to diarize every file in that folder.
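On a shared cluster, a long directory run may be cut off by the job's time limit. The sketch below lists audio files that still lack outputs so they can be rerun individually with `--audio`. It is not part of this repository and assumes that outputs are named after the audio file (for example `interview.mp3` produces `interview.srt` and `interview.txt`) and that the extension list matches your files; check both against your own runs.

```python
# Sketch (not part of this repository): find audio files in a folder that are
# missing outputs, then run diarize.py on each one.
# Assumption: outputs are written next to the audio as <stem>.srt and <stem>.txt.
import subprocess
import sys
from pathlib import Path

# Hypothetical list of extensions; adjust to the formats in your folder.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def pending_audio(folder):
    """Return audio files (sorted) missing either the .srt or the .txt output."""
    pending = []
    for audio in sorted(Path(folder).iterdir()):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        outputs = (audio.with_suffix(".srt"), audio.with_suffix(".txt"))
        if not all(out.exists() for out in outputs):
            pending.append(audio)
    return pending


def diarize_pending(folder, extra_args=()):
    """Run diarize.py once per pending file; stop at the first failure."""
    for audio in pending_audio(folder):
        cmd = [sys.executable, "diarize.py", "--audio", str(audio), *extra_args]
        subprocess.run(cmd, check=True)
```

Running each file in its own `diarize.py` call means a crash or time-out only loses the file in progress, at the cost of reloading the models for every file.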

## Optional arguments
`diarize.py` accepts the following optional arguments:
```
usage: diarize.py [-h] -a AUDIO [--no-stem] [--suppress_numerals] [--whisper-model MODEL_NAME]
                  [--batch-size BATCH_SIZE] [--language LANGUAGE] [--device DEVICE] [--parallel]
                  [--anonymise] [--num-speakers NUM_SPEAKERS]
                  [--domain-type {telephonic,meeting,general}]

options:
  -h, --help            Show this help message and exit.
  -a AUDIO, --audio AUDIO
                        Name of the target audio file.
  --no-stem             Disables source separation. This helps with long files that don't
                        contain a lot of music.
  --suppress_numerals   Suppresses numerical digits. This helps the diarization accuracy but
                        converts all digits into written text.
  --whisper-model MODEL_NAME
                        Select which Whisper model to use. Default is large-v3.
  --batch-size BATCH_SIZE
                        Batch size for batched inference. Reduce if you run out of memory;
                        set to 0 for non-batched inference.
  --language LANGUAGE   Language spoken in the audio. Specify None to perform language
                        detection.
  --device DEVICE       If you have a GPU use 'cuda', otherwise 'cpu'. Leave blank for
                        automatic detection.
  --parallel            Enable parallel NeMo diarization during Whisper transcription.
  --anonymise           Anonymise files after diarization and transcription. Default is False.
  --num-speakers NUM_SPEAKERS
                        Specify number of speakers in audio. Default is 0 for automatic
                        detection.
  --domain-type {telephonic,meeting,general}
                        Type of diarization model to use (default is 'telephonic'):
                        - 'telephonic': suitable for telephone recordings involving 2-8
                          speakers in a session; may not show the best performance on other
                          acoustic conditions or dialogues.
                        - 'meeting': suitable for 3-5 speakers participating in a meeting;
                          may not show the best performance on other types of dialogue.
                        - 'general': optimized to show balanced performance across various
                          domains. VAD is optimized on multilingual ASR datasets and the
                          diarizer is optimized on the DIHARD3 development set.
```
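The `--batch-size` advice can be automated when jobs run unattended. The sketch below is not part of this repository: it reruns `diarize.py` with a halved batch size after each failed attempt, finishing with 0 (non-batched inference). It assumes an out-of-memory error makes the script exit with a non-zero status and cannot tell that apart from other failures; the starting size of 8 is an arbitrary choice, not the tool's documented default.

```python
# Sketch (not part of this repository): retry diarize.py with smaller batch sizes.
# Assumption: an out-of-memory error makes diarize.py exit with a non-zero status.
import subprocess
import sys


def batch_sizes(start):
    """Yield start, start // 2, ... down to 1, then 0 (non-batched inference)."""
    size = start
    while size >= 1:
        yield size
        size //= 2
    yield 0


def run_with_fallback(audio, start=8, run=subprocess.run):
    """Run diarize.py, halving --batch-size after each failure; return the size that worked."""
    for size in batch_sizes(start):
        cmd = [sys.executable, "diarize.py", "--audio", audio, "--batch-size", str(size)]
        if run(cmd).returncode == 0:
            return size
    raise RuntimeError(f"diarize.py failed for {audio} at every batch size")
```

Each retry restarts the whole pipeline, so this trades extra compute time for not having to resubmit the job by hand.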
The most useful optional argument is perhaps `--anonymise`, which anonymises the diarized transcript by redacting names, dates, places and other identifying details from the text. `--no-stem` and `--suppress_numerals` are generally recommended unless the audio contains a lot of background music. Specifying the language with `--language` is also highly recommended: if it is omitted, NeMo and Whisper will auto-detect the language, but this increases computation time.
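These recommendations can be captured in a small helper for scripts or job files. This is a hypothetical sketch, not part of the repository: the flag names are copied from the help text, while the defaults (anonymisation on, language `en`) are assumptions, as is the idea that `--language` accepts a Whisper language code such as `en`.

```python
# Hypothetical helper (not part of this repository): build a diarize.py command that
# follows the README's recommendations. Flag names are taken from the help text.
import shlex
import sys


def build_command(audio, language="en", anonymise=True, music=False,
                  num_speakers=0, domain_type="telephonic", python=sys.executable):
    """Return the argument list for one diarize.py run."""
    cmd = [python, "diarize.py", "--audio", audio]
    if language:
        # Assumes a Whisper language code; omitting it triggers slower auto-detection.
        cmd += ["--language", language]
    if not music:
        # Recommended unless the audio contains a lot of background music.
        cmd += ["--no-stem", "--suppress_numerals"]
    if anonymise:
        cmd.append("--anonymise")
    if num_speakers:
        # 0 (the tool's default) means automatic speaker-count detection.
        cmd += ["--num-speakers", str(num_speakers)]
    cmd += ["--domain-type", domain_type]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line, e.g. for pasting into a job script.
    print(shlex.join(build_command("interview.mp3", python="python")))
```

Returning a list rather than a string lets the same helper feed `subprocess.run` directly, while `shlex.join` produces a safely quoted line for shell scripts.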