
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]




# whisper-prep


Data preparation utility for fine-tuning OpenAI's Whisper model.


Table of Contents

  1. About The Project
  2. Data Preparation Guide
  3. Contact
  4. License

## About The Project
This package helps generate training data for fine-tuning Whisper. It synthesizes `.srt` files from individual sentences, concatenating them so that the result mimics real data.
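To make the idea concrete, here is a minimal sketch of how sentences could be concatenated into a single SRT transcript. It is not the package's implementation: the function names `format_srt_time` and `sentences_to_srt`, the sample sentences, and the clip durations are all assumptions made for illustration.

```python
def format_srt_time(seconds: float) -> str:
    """Format a number of seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def sentences_to_srt(clips):
    """Build SRT text from (sentence, duration_in_seconds) pairs.

    The clips are treated as if their audio were concatenated back to back,
    so each cue starts where the previous one ended.
    """
    blocks = []
    start = 0.0
    for index, (sentence, duration) in enumerate(clips, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{sentence}\n"
        )
        start = end
    # Each block ends with a newline, so joining with "\n" leaves a blank line between cues.
    return "\n".join(blocks)


# Hypothetical clips: sentences and durations are made up for illustration.
clips = [("Hello there.", 1.5), ("How are you today?", 2.25)]
srt_text = sentences_to_srt(clips)
print(srt_text)
```

In this sketch the second cue begins at `00:00:01,500`, exactly where the first one ends, which is what a back-to-back concatenation of the two audio clips would require.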


## Data Preparation Guide
1. **Data File (.tsv):**
- Create a `.tsv` file with two required columns:
  - `path`: The relative path to the `.mp3` file.
  - `sentence`: The text corresponding to the audio file.
- Optional: If a `client_id` column is included, it can be used to increase the probability that consecutive sentences come from the same speaker. Refer to `generate_fold` in `src/whisper_prep/generation/generate.py` for additional features.
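As an illustration of the layout described above, the following sketch writes such a file with Python's standard `csv` module. The helper name `write_sentence_tsv`, the clip paths, the sentences, and the speaker ids are hypothetical, and the exact quoting rules the package expects are an assumption.

```python
import csv
import io

# Hypothetical rows: paths, sentences, and speaker ids are made up.
rows = [
    {"path": "clips/0001.mp3", "sentence": "Hello there.", "client_id": "speaker_a"},
    {"path": "clips/0002.mp3", "sentence": "How are you today?", "client_id": "speaker_a"},
    {"path": "clips/0003.mp3", "sentence": "Fine, thanks.", "client_id": "speaker_b"},
]


def write_sentence_tsv(rows, handle):
    """Write rows as tab-separated values with a header line."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["path", "sentence", "client_id"],  # client_id is the optional column
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; pass an opened file to write to disk instead.
buffer = io.StringIO()
write_sentence_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To produce a file on disk, the same function can be called with a handle such as `open("data.tsv", "w", newline="", encoding="utf-8")`.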

1a. **Timestamp-based TSV (.tsv):**
- Create a `.tsv` file with four required columns:
  - `srt_path`: Path to the `.srt` file containing subtitles.
  - `language`: ISO language code for the subtitles (e.g., `de`, `en`).
  - `id`: Unique identifier for the audio/transcript pair.
  - `audio_path`: Path to the corresponding `.mp3` file.
- This TSV can be used to process existing SRT transcripts and audio files without directory globbing.
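Before running the tool, it can be useful to check that a timestamp-based TSV actually has the four required columns. The sketch below is one way to do that with the standard library; the helper `missing_columns` and the sample paths and ids are hypothetical and not part of whisper-prep.

```python
import csv
import io

REQUIRED_COLUMNS = {"srt_path", "language", "id", "audio_path"}


def missing_columns(handle):
    """Return the required columns absent from a TSV's header row, sorted."""
    reader = csv.DictReader(handle, delimiter="\t")
    present = set(reader.fieldnames or [])  # fieldnames is None for an empty file
    return sorted(REQUIRED_COLUMNS - present)


# Hypothetical TSV content; the paths and ids are made up.
good = io.StringIO(
    "srt_path\tlanguage\tid\taudio_path\n"
    "subs/talk_01.srt\tde\ttalk_01\taudio/talk_01.mp3\n"
)
bad = io.StringIO("srt_path\tid\nsubs/talk_01.srt\ttalk_01\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['audio_path', 'language']
```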

2. **Configuration File (.yaml):**
- Set up a `.yaml` configuration file. An example can be found at `example.yaml`.

- (Optional) To load data directly from a Hugging Face dataset with `audio` and `srt` columns, set the `hu_dataset` field to the dataset identifier. This bypasses TSV-based generation and processes the existing subtitles. For sentence-based datasets without an `srt` column, synthetic SRT files are generated from the sentences.

- (Optional) To process existing SRT files and audio paths without directory globbing, specify a TSV via `transcripts_tsv`. The TSV must include columns `srt_path`, `audio_path`, `language`, and `id` to map each transcript to its audio file and language.
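For readers unfamiliar with the SRT format that these options consume, the sketch below parses SRT text into timed cues. It only illustrates the file layout and is not how whisper-prep reads subtitles; `parse_srt`, `to_seconds`, and the sample transcript are assumptions introduced here.

```python
import re

# Matches a cue timing line such as "00:00:01,500 --> 00:00:03,750".
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp parts (as strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed or empty blocks
        match = CUE_TIME.match(lines[1].strip())
        if match is None:
            continue  # second line is not a timing line
        parts = match.groups()
        cues.append(
            (to_seconds(*parts[:4]), to_seconds(*parts[4:]), " ".join(lines[2:]))
        )
    return cues


# Hypothetical transcript, made up for illustration.
sample = """1
00:00:00,000 --> 00:00:01,500
Hello there.

2
00:00:01,500 --> 00:00:03,750
How are you today?
"""
print(parse_srt(sample))
```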

3. **Running the Generation Script:**
- Run `whisper_prep -c <path-to-config.yaml>`, passing the path to your `.yaml` configuration file.

4. **Upload a TSV as an ASR Dataset:**
- A helper script `upload_asr_dataset.py` can convert a `.tsv` file (with at least `path` and `sentence` columns) into a Hugging Face ASR dataset and push it to the Hub:
```bash
python upload_asr_dataset.py --tsv path/to/data.tsv \
--repo_id username/dataset_name --split train
```

5. **Upload to the Hugging Face Hub:**
- See the Hugging Face guide to uploading a dataset: https://huggingface.co/docs/datasets/v1.16.0/upload_dataset.html


## Contact

Vincenzo Timmel - vincenzo.timmel@fhnw.ch


## License

Distributed under the MIT License. See `LICENSE` for more information.


[issues-shield]: https://img.shields.io/github/issues/i4Ds/whisper-prep.svg?style=for-the-badge
[issues-url]: https://github.com/i4Ds/whisper-prep/issues
[license-shield]: https://img.shields.io/github/license/i4Ds/whisper-prep.svg?style=for-the-badge
[license-url]: https://github.com/i4Ds/whisper-prep/blob/main/LICENSE