https://github.com/i4ds/whisper-prep
Data preparation utility for the finetuning of OpenAI's Whisper model.
fine-tuning nlp speech-to-text whisper
- Host: GitHub
- URL: https://github.com/i4ds/whisper-prep
- Owner: i4Ds
- License: mit
- Created: 2024-05-14T11:35:20.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-09-18T07:59:37.000Z (9 months ago)
- Last Synced: 2025-10-12T13:03:12.598Z (8 months ago)
- Topics: fine-tuning, nlp, speech-to-text, whisper
- Language: Python
- Homepage:
- Size: 443 KB
- Stars: 9
- Watchers: 2
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: agents.MD
README
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
whisper-prep
Data preparation utility for the finetuning of OpenAI's Whisper model.
## About The Project
This package helps generate training data for fine-tuning Whisper. It synthesizes `.srt` files from individual sentences, concatenating them to mimic real data.
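To make the idea of sentence concatenation concrete, here is a minimal sketch of how a list of sentences with known audio durations could be turned into a single SRT document. It is an illustration only, not the package's implementation: `format_srt_time`, `sentences_to_srt`, and the sample sentences and durations are all hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def sentences_to_srt(clips: list[tuple[str, float]]) -> str:
    """Concatenate (sentence, duration_in_seconds) pairs into one SRT document.

    Assumption: clips are played back to back, so each subtitle starts
    exactly where the previous one ended.
    """
    blocks = []
    start = 0.0
    for index, (sentence, duration) in enumerate(clips, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{sentence}\n"
        )
        start = end  # the next clip begins where this one ends
    return "\n".join(blocks)


# Made-up example clips: (sentence, duration in seconds).
srt_text = sentences_to_srt([("Hello there.", 1.5), ("How are you?", 2.25)])
print(srt_text)
```

Each subtitle block carries a running index, a start and end timestamp, and the sentence text, which is the structure Whisper training data with timestamps relies on.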
## Data Preparation Guide
1. **Data File (.tsv):**
- Create a `.tsv` file with two required columns:
  - `path`: The relative path to the `.mp3` file.
  - `sentence`: The text corresponding to the audio file.
- Optional: If a `client_id` column is included, it can be used to increase the probability that consecutive sentences come from the same speaker. See `generate_fold` in `src/whisper_prep/generation/generate.py` for additional features.
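As an illustration of the sentence-based layout from step 1, the following sketch writes such a file with Python's standard `csv` module. The file names, sentences, and `client_id` values are made-up examples, and `write_sentence_tsv` is a hypothetical helper that is not part of `whisper-prep`.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical example rows; paths, sentences, and speaker IDs are invented.
rows = [
    {"path": "clips/sample_0001.mp3", "sentence": "Hello there.", "client_id": "speaker_a"},
    {"path": "clips/sample_0002.mp3", "sentence": "How are you?", "client_id": "speaker_a"},
    {"path": "clips/sample_0003.mp3", "sentence": "Fine, thanks.", "client_id": "speaker_b"},
]


def write_sentence_tsv(rows: list[dict], tsv_path: Path) -> None:
    """Write rows as a tab-separated file with a header line."""
    fieldnames = ["path", "sentence", "client_id"]  # client_id is optional
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Written to the system temp directory here; use your own location in practice.
out_path = Path(tempfile.gettempdir()) / "whisper_prep_example.tsv"
write_sentence_tsv(rows, out_path)
print(out_path.read_text(encoding="utf-8"))
```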
1a. **Timestamp-based TSV (.tsv):**
- Create a `.tsv` file with four required columns:
  - `srt_path`: Path to the `.srt` file containing subtitles.
  - `language`: ISO language code for the subtitles (e.g., `de`, `en`).
  - `id`: Unique identifier for the audio/transcript pair.
  - `audio_path`: Path to the corresponding `.mp3` file.
- This TSV can be used to process existing SRT transcripts and audio files without directory globbing.
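Before pointing the tool at a timestamp-based TSV, it can help to verify that the four required columns are present and that every `id` is unique. The sketch below is one possible pre-flight check using only the standard library; `check_transcripts_tsv` and the sample rows are hypothetical and not part of `whisper-prep`.

```python
import csv
import io

REQUIRED_COLUMNS = {"srt_path", "language", "id", "audio_path"}


def check_transcripts_tsv(tsv_file) -> list[str]:
    """Return a list of human-readable problems found in an open TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {', '.join(sorted(missing))}"]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["id"] in seen_ids:
            problems.append(f"line {line_number}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
    return problems


# Made-up example content; an in-memory file stands in for a real TSV on disk.
example = io.StringIO(
    "srt_path\tlanguage\tid\taudio_path\n"
    "subs/talk_01.srt\tde\ttalk_01\taudio/talk_01.mp3\n"
    "subs/talk_02.srt\ten\ttalk_02\taudio/talk_02.mp3\n"
)
print(check_transcripts_tsv(example))
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the function.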
2. **Configuration File (.yaml):**
- Set up a `.yaml` configuration file. An example can be found at `example.yaml`.
- (Optional) To load data directly from a Hugging Face dataset with `audio` and `srt` columns, set the `hu_dataset` field to the dataset identifier. This bypasses TSV-based generation and processes the existing subtitles. For sentence-based datasets without an `srt` column, synthetic SRT files are generated from the sentences.
- (Optional) To process existing SRT files and audio paths without directory globbing, specify a TSV via `transcripts_tsv`. The TSV must include the columns `srt_path`, `audio_path`, `language`, and `id` (see step 1a), which map each transcript to its audio file and language.
3. **Running the Generation Script:**
- Run `whisper_prep -c` followed by the path to your `.yaml` configuration file.
4. **Upload a TSV as an ASR Dataset:**
- A helper script `upload_asr_dataset.py` can convert a `.tsv` file (with at least `path` and `sentence` columns) into a Hugging Face ASR dataset and push it to the Hub:
```bash
python upload_asr_dataset.py --tsv path/to/data.tsv \
--repo_id username/dataset_name --split train
```
5. **Upload to the Hugging Face Hub:**
- See the Hugging Face upload guide: https://huggingface.co/docs/datasets/v1.16.0/upload_dataset.html
## Contact
Vincenzo Timmel - vincenzo.timmel@fhnw.ch
## License
Distributed under the MIT License. See `LICENSE` for more information.
[issues-shield]: https://img.shields.io/github/issues/i4Ds/whisper-prep.svg?style=for-the-badge
[issues-url]: https://github.com/i4Ds/whisper-prep/issues
[license-shield]: https://img.shields.io/github/license/i4Ds/whisper-prep.svg?style=for-the-badge
[license-url]: https://github.com/i4Ds/whisper-prep/blob/main/LICENSE