{"id":30066683,"url":"https://github.com/d1pankarmedhi/tiny-whisper","last_synced_at":"2026-04-20T19:06:58.387Z","repository":{"id":306882219,"uuid":"1026870116","full_name":"d1pankarmedhi/tiny-whisper","owner":"d1pankarmedhi","description":"A small, tiny Whisper like encoder-decoder transformer model for speech-to-text tasks. ","archived":false,"fork":false,"pushed_at":"2025-07-28T06:41:52.000Z","size":14,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-28T08:38:21.269Z","etag":null,"topics":["automatic-speech-recognition","encoder-decoder-model","pytorch","speech-to-text","transformer-architecture","whisper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/d1pankarmedhi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-26T19:40:57.000Z","updated_at":"2025-07-28T06:46:06.000Z","dependencies_parsed_at":"2025-07-28T08:38:22.614Z","dependency_job_id":"b951e9f6-3243-4987-9dbc-a007cdda21b2","html_url":"https://github.com/d1pankarmedhi/tiny-whisper","commit_stats":null,"previous_names":["d1pankarmedhi/tiny-whisper"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/d1pankarmedhi/tiny-whisper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d1pankarmedhi%2Ftiny-whisper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d1pankarmedhi%2Ftiny-whisper/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/d1pankarmedhi%2Ftiny-whisper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d1pankarmedhi%2Ftiny-whisper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/d1pankarmedhi","download_url":"https://codeload.github.com/d1pankarmedhi/tiny-whisper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d1pankarmedhi%2Ftiny-whisper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269385801,"owners_count":24408432,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-08T02:00:09.200Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-speech-recognition","encoder-decoder-model","pytorch","speech-to-text","transformer-architecture","whisper"],"created_at":"2025-08-08T08:00:47.883Z","updated_at":"2026-04-20T19:06:53.348Z","avatar_url":"https://github.com/d1pankarmedhi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003eTinyWhisper\u003c/h1\u003e\n  \u003cp\u003e A minimal, efficient encoder-decoder transformer model for speech-to-text (ASR) tasks. 
Inspired by OpenAI's Whisper, designed for research and educational purposes.\u003c/p\u003e\n\n![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat\u0026logo=PyTorch\u0026logoColor=white) ![Python](https://img.shields.io/badge/Python-blue.svg?style=flat\u0026logo=python\u0026logoColor=white)\n\n\u003c/div\u003e\n\nTinyWhisper is a lightweight automatic speech recognition (ASR) system. It follows the encoder-decoder transformer paradigm, processing audio features and generating transcriptions. The project aims to provide a simple, readable codebase for understanding and experimenting with modern ASR techniques.\n\n## Model Architecture\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg width=\"400\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/9d29d0ab-cf38-4fd1-8fca-0faaeeb972cc\" /\u003e\n\u003cp\u003eFig: Encoder-Decoder ASR Model Architecture\u003c/p\u003e\n\u003c/div\u003e\n\n- **Encoder**: Processes input audio features (e.g., log-mel spectrograms) and produces hidden, contextual representations.\n- **Decoder**: Autoregressively generates text tokens from the encoder's output.\n- **Positional Encoding**: Used in both encoder and decoder to provide sequence order information.\n- **Downsampler**: Reduces the temporal resolution of input features for efficiency.\n\n## Tokenizer\n\nThe tokenizer is based on Byte Pair Encoding (BPE), similar to Whisper. It converts text to token IDs and vice versa, supporting multilingual and special tokens as needed.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/dbcad90f-0d78-4407-a48d-4973027fb9b2\" width=\"400\" /\u003e\n\u003cp\u003eFig: Tokenization process\u003c/p\u003e\n\u003c/div\u003e\n\n## Data Preprocessing\n\n### Audio Processing\n\nAudio, or sound, is basically air pressure that varies over time. It is the change in atmospheric pressure caused by the vibration of air molecules. 
These fluctuations create regions of high and low pressure, which we perceive as sound waves. The frequency of these fluctuations determines the pitch of the sound, while the amplitude determines its loudness.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg height=\"200\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/5f0da975-5dc4-46e9-ab1a-6e7899fef32b\" /\u003e\n\u003cp\u003eFig: Waveform of a sound signal\u003c/p\u003e\n\u003c/div\u003e\n\nFor ease of processing, these audio signals are converted into a spectrogram, more precisely a log-mel spectrogram. It captures the frequency-time-intensity representation of the audio signal, making it suitable for input to the model.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg height=\"200\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/ccd6f8ec-a574-4668-92e9-4e03984e793b\" /\u003e\n\u003cp\u003eFig: Log-Mel Spectrogram of a sound signal\u003c/p\u003e\n\u003c/div\u003e\n\nThis helps filter out noise and irrelevant sounds from audio sources. It also helps ensure that words spoken by different people, whether men or women, produce similar spectrograms, making it easier for the model to learn and generalize.\n\n### Text Processing\n\nThe corresponding audio transcript is tokenized into a sequence of tokens. For tokenization, we use a Byte Pair Encoding (BPE) tokenizer, which is efficient for handling large vocabularies and multilingual text.\n\nFor example, the **Start-of-Sequence (SOS)** token is used to indicate the beginning of a transcription, and the **End-of-Sequence (EOS)** token indicates its end. 
The tokenizer also handles special tokens like padding and unknown words.\n\n```\nlabels: [50257, 32, 1862, 2576, 12049, 477, 287, 11398, 318, 5055, 319, 257, 13990, 290, 2045, 379, 257, 8223, 50258]\ntext: \u003cSOS\u003eA young girl dressed all in pink is standing on a fence and looking at a horse\u003cEOS\u003e\n```\n\n## Training Process\n\nTraining scripts and utilities are provided in the `tinywhisper/train/` directory:\n\n- `train.py`: Main training loop, data loading, and optimization\n- Supports custom datasets and data augmentation\n- Configurable via `tinywhisper/config/config.py`\n\n### Steps:\n\n1. Prepare your dataset (audio files and transcripts)\n2. Configure training parameters in `config.py`\n3. Run the training script:\n   ```bash\n   python -m tinywhisper.train.train\n   ```\n\n## Evaluation\n\nEvaluation scripts are in `tinywhisper/eval/`:\n\n- `evaluation.py`: Computes WER/CER and other metrics on test data\n\n## Usage\n\nYou can use the model for inference after training:\n\n- Load a trained checkpoint\n- Use the inference utilities in `tinywhisper/inference/`\n\n## License\n\nThis project is licensed under the MIT License. See [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd1pankarmedhi%2Ftiny-whisper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fd1pankarmedhi%2Ftiny-whisper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd1pankarmedhi%2Ftiny-whisper/lists"}