{"id":19120235,"url":"https://github.com/i4ds/whisper-prep","last_synced_at":"2025-10-12T13:06:32.326Z","repository":{"id":239753160,"uuid":"800462352","full_name":"i4Ds/whisper-prep","owner":"i4Ds","description":"Data preparation utility for the finetuning of OpenAI's Whisper model.","archived":false,"fork":false,"pushed_at":"2025-09-18T07:59:37.000Z","size":454,"stargazers_count":9,"open_issues_count":3,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-12T13:03:12.598Z","etag":null,"topics":["fine-tuning","nlp","speech-to-text","whisper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/i4Ds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"agents.MD","dco":null,"cla":null}},"created_at":"2024-05-14T11:35:20.000Z","updated_at":"2025-09-18T07:59:42.000Z","dependencies_parsed_at":"2024-05-28T13:51:31.362Z","dependency_job_id":"c0aafbf6-cbb8-4546-a45d-167ceb9f8c1b","html_url":"https://github.com/i4Ds/whisper-prep","commit_stats":null,"previous_names":["i4ds/whisper-prep"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/i4Ds/whisper-prep","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/i4Ds%2Fwhisper-prep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/i4Ds%2Fwhisper-prep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/i4Ds%2Fwhisper-prep/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/i4Ds%2Fwhisper-prep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/i4Ds","download_url":"https://codeload.github.com/i4Ds/whisper-prep/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/i4Ds%2Fwhisper-prep/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279011468,"owners_count":26084947,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fine-tuning","nlp","speech-to-text","whisper"],"created_at":"2024-11-09T05:13:24.560Z","updated_at":"2025-10-12T13:06:32.288Z","avatar_url":"https://github.com/i4Ds.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca id=\"readme-top\"\u003e\u003c/a\u003e\n\u003c!-- PROJECT SHIELDS --\u003e\n\u003c!--\n*** I'm using markdown \"reference style\" links for readability.\n*** Reference links are enclosed in brackets [ ] instead of parentheses ( ).\n*** See the bottom of this document for the declaration of the reference variables\n*** for contributors-url, forks-url, etc. 
This is an optional, concise syntax you may use.\n*** https://www.markdownguide.org/basic-syntax/#reference-style-links\n--\u003e\n[![Issues][issues-shield]][issues-url]\n[![MIT License][license-shield]][license-url]\n\n\u003c!-- PROJECT LOGO --\u003e\n\u003cbr /\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003ch3 align=\"center\"\u003ewhisper-prep\u003c/h3\u003e\n\n  \u003cp align=\"center\"\u003e\n    Data preparation utility for the finetuning of OpenAI's Whisper model.\n  \u003c/p\u003e\n\u003c/div\u003e\n\n\u003c!-- TABLE OF CONTENTS --\u003e\n\u003cdetails\u003e\n  \u003csummary\u003eTable of Contents\u003c/summary\u003e\n  \u003col\u003e\n    \u003cli\u003e\u003ca href=\"#about-the-project\"\u003eAbout The Project\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#license\"\u003eLicense\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#contact\"\u003eContact\u003c/a\u003e\u003c/li\u003e\n  \u003c/ol\u003e\n\u003c/details\u003e\n\n\u003c!-- ABOUT THE PROJECT --\u003e\n## About The Project\nThis package assists in generating training data for fine-tuning Whisper by synthesizing .srt files from sentences, mimicking real data through sentence concatenation.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- Guide --\u003e\n## Data Preparation Guide\n1. **Data File (.tsv):**\n   - Create a `.tsv` file with two required columns:\n     - `path`: The relative path to the `.mp3` file.\n     - `sentence`: The text corresponding to the audio file.\n   - Optional: If a `client_id` is included, it can be used to increase the probability that following sentences are from the same speaker. Refer to `generate_fold` in `src/whisper_prep/generation/generate.py` for additional features.\n\n1a. 
**Timestamp-based TSV (.tsv):**\n   - Create a `.tsv` file with four required columns:\n     - `srt_path`: Path to the `.srt` file containing subtitles.\n     - `language`: ISO language code for the subtitles (e.g., `de`, `en`).\n     - `id`: Unique identifier for the audio/transcript pair.\n     - `audio_path`: Path to the corresponding `.mp3` file.\n   - This TSV can be used to process existing SRT transcripts and audio files without directory globbing.\n\n2. **Configuration File (.yaml):**\n   - Set up a `.yaml` configuration file. An example can be found at `example.yaml`.\n  \n   - (Optional) To load data directly from a HuggingFace dataset with `audio` and `srt` columns, set the `hu_dataset` field to the dataset identifier; this will bypass TSV-based generation and process existing subtitles. For sentence-based datasets without an `srt` column, synthetic SRT files will be generated from the sentences.\n  \n   - (Optional) To process existing SRT files and audio paths without directory globbing, specify a TSV via `transcripts_tsv`. The TSV must include columns `srt_path`, `audio_path`, `language`, and `id` to map each transcript to its audio file and language.\n\n3. **Running the Generation Script:**\n   - Run `whisper_prep -c \u003cpath_to_your_yaml_file\u003e`.\n\n4. **Upload a TSV as an ASR Dataset:**\n   - A helper script `upload_asr_dataset.py` can convert a `.tsv` file (with at least `path` and `sentence` columns) into a Hugging Face ASR dataset and push it to the Hub:\n     ```bash\n     python upload_asr_dataset.py --tsv path/to/data.tsv \\\n         --repo_id username/dataset_name --split train\n     ```\n\n5. 
**Upload to the Hugging Face Hub:**\n   - See the upload guide in the Hugging Face Datasets documentation: https://huggingface.co/docs/datasets/v1.16.0/upload_dataset.html\n\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\n\u003c!-- CONTACT --\u003e\n## Contact\n\nVincenzo Timmel - vincenzo.timmel@fhnw.ch\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- LICENSE --\u003e\n## License\n\nDistributed under the MIT License. See `LICENSE` for more information.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\n\u003c!-- MARKDOWN LINKS \u0026 IMAGES --\u003e\n\u003c!-- https://www.markdownguide.org/basic-syntax/#reference-style-links --\u003e\n[issues-shield]: https://img.shields.io/github/issues/i4Ds/whisper-prep.svg?style=for-the-badge\n[issues-url]: https://github.com/i4Ds/whisper-prep/issues\n[license-shield]: https://img.shields.io/github/license/i4Ds/whisper-prep.svg?style=for-the-badge\n[license-url]: https://github.com/i4Ds/whisper-prep/blob/main/LICENSE","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fi4ds%2Fwhisper-prep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fi4ds%2Fwhisper-prep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fi4ds%2Fwhisper-prep/lists"}