https://github.com/d1pankarmedhi/tiny-whisper
A tiny, Whisper-like encoder-decoder transformer model for speech-to-text tasks.
- Host: GitHub
- URL: https://github.com/d1pankarmedhi/tiny-whisper
- Owner: d1pankarmedhi
- License: MIT
- Created: 2025-07-26T19:40:57.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-07-28T06:41:52.000Z (6 months ago)
- Last Synced: 2025-07-28T08:38:21.269Z (6 months ago)
- Topics: automatic-speech-recognition, encoder-decoder-model, pytorch, speech-to-text, transformer-architecture, whisper
- Language: Python
- Size: 13.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TinyWhisper
A minimal, efficient encoder-decoder transformer model for speech-to-text (ASR) tasks, inspired by OpenAI's Whisper and designed for research and educational purposes.
 
TinyWhisper is a lightweight automatic speech recognition (ASR) system. It follows the encoder-decoder transformer paradigm, processing audio features with the encoder and generating transcriptions with the decoder. The project aims to provide a simple, readable codebase for understanding and experimenting with modern ASR techniques.
## Model Architecture
Fig: Encoder-Decoder ASR Model Architecture
- **Encoder**: Processes input audio features (e.g., log-mel spectrograms) and produces hidden, contextual representations.
- **Decoder**: Autoregressively generates text tokens from the encoder's output.
- **Positional Encoding**: Used in both encoder and decoder to provide sequence order information.
- **Downsampler**: Reduces the temporal resolution of input features for efficiency.
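A minimal sketch of how these pieces fit together in PyTorch is shown below. All module names, dimensions, and the stride-2 convolutional downsampler are illustrative assumptions, not the repository's actual code:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to a sequence of embeddings."""
    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, time, d_model)
        return x + self.pe[: x.size(1)]

class TinyASR(nn.Module):
    """Hypothetical encoder-decoder ASR model: conv downsampler + transformer."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4, vocab_size=50259):
        super().__init__()
        # Downsampler: two stride-2 convs reduce temporal resolution 4x overall.
        self.downsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (batch, n_mels, time); tokens: (batch, seq)
        src = self.pos_enc(self.downsample(mels).transpose(1, 2))
        tgt = self.pos_enc(self.token_emb(tokens))
        # Causal mask makes decoding autoregressive: each position only sees the past.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)  # (batch, seq, vocab_size)
```

Using `nn.Transformer` keeps the sketch short; the causal mask on the decoder side is what makes text generation autoregressive.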
## Tokenizer
The tokenizer is based on Byte Pair Encoding (BPE), similar to Whisper. It converts text to token IDs and back, supporting multilingual text and special tokens as needed.
Fig: Tokenization process
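To illustrate the round trip, here is a small example using the GPT-2 BPE encoding from the `tiktoken` package as a stand-in; the project's actual vocabulary and tokenizer may differ:

```python
import tiktoken  # pip install tiktoken

# GPT-2 BPE used as a stand-in for the project's tokenizer (an assumption).
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("standing on a fence")
print(ids)              # a list of integer token IDs
print(enc.decode(ids))  # "standing on a fence" -- decoding inverts encoding
```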
## Data Preprocessing
### Audio Processing
Audio, or sound, is essentially air pressure that varies over time: a change in atmospheric pressure caused by the vibration of air molecules. These fluctuations create regions of high and low pressure, which we perceive as sound waves. The frequency of the fluctuations determines the pitch of the sound, while their amplitude determines its loudness.
Fig: Waveform of a sound signal
For ease of processing, these audio signals are converted into a spectrogram, more precisely a log-mel spectrogram. It captures a time-frequency-intensity representation of the audio signal, making it suitable as input to the model.
Fig: Log-Mel Spectrogram of a sound signal
The mel scale and log compression emphasize perceptually relevant frequency content, which helps filter out noise and irrelevant detail. Words spoken by different people produce similar spectrograms, making it easier for the model to learn and generalize.
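A minimal sketch of this conversion using `torchaudio`; the parameter values (80 mel bins, 25 ms windows, 10 ms hops at 16 kHz) follow common ASR practice and are assumptions here, as is the input file name:

```python
import torch
import torchaudio

# Load a waveform, mix down to mono, and resample to 16 kHz.
wav, sr = torchaudio.load("speech.wav")  # hypothetical input file
wav = torchaudio.functional.resample(wav.mean(dim=0, keepdim=True), sr, 16000)

# 80-bin mel spectrogram: 25 ms windows (n_fft=400) with 10 ms hops (hop_length=160).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)(wav)

# Log compression (clamped to avoid log(0)) yields the log-mel spectrogram.
log_mel = torch.log10(mel.clamp(min=1e-10))
print(log_mel.shape)  # (1, 80, n_frames)
```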
### Text Processing
The corresponding audio transcript is tokenized into a sequence of tokens. For tokenization, we use a Byte Pair Encoding (BPE) tokenizer, which is efficient for handling large vocabularies and multilingual text.
For example, the **Start-of-Sequence (SOS)** token is used to indicate the beginning of a transcription, and the **End-of-Sequence (EOS)** token indicates its end. The tokenizer also handles special tokens like padding and unknown words.
```
labels: [50257, 32, 1862, 2576, 12049, 477, 287, 11398, 318, 5055, 319, 257, 13990, 290, 2045, 379, 257, 8223, 50258]
text: A young girl dressed all in pink is standing on a fence and looking at a horse
```
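In the example, 50257 and 50258 act as the SOS and EOS markers. A label sequence is typically assembled by wrapping the encoded text in these special tokens; the helper below is an illustrative sketch, not the project's API:

```python
SOS, EOS = 50257, 50258  # special-token IDs taken from the example above

def make_labels(text: str, encode) -> list[int]:
    """Wrap BPE token IDs in start/end-of-sequence markers."""
    return [SOS] + encode(text) + [EOS]

# e.g. with the tiktoken encoder from the Tokenizer section:
# labels = make_labels("A young girl dressed all in pink ...", enc.encode)
```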
## Training Process
Training scripts and utilities are provided in the `tinywhisper/train/` directory:
- `train.py`: Main training loop, data loading, and optimization
- Supports custom datasets and data augmentation
- Configurable via `tinywhisper/config/config.py`
### Steps:
1. Prepare your dataset (audio files and transcripts)
2. Configure training parameters in `config.py`
3. Run the training script:
```bash
python -m tinywhisper.train.train
```
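As a rough schematic of what such a training loop does, here is a teacher-forced step using the hypothetical `TinyASR` sketch from the architecture section; the actual `train.py`, data loading, padding conventions, and hyperparameters may differ:

```python
import torch
import torch.nn.functional as F

model = TinyASR()  # hypothetical model from the architecture sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# `loader` is assumed to yield (mels, labels) batches:
#   mels:   (B, 80, T)  log-mel spectrograms
#   labels: (B, S)      token IDs wrapped in SOS/EOS, padded with 0
for mels, labels in loader:
    logits = model(mels, labels[:, :-1])      # teacher forcing: inputs shifted right
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*(S-1), vocab)
        labels[:, 1:].reshape(-1),            # targets: the next token at each step
        ignore_index=0,                       # assumed padding token ID
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```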
## Evaluation
Evaluation scripts are in `tinywhisper/eval/`:
- `evaluation.py`: Computes WER/CER and other metrics on test data
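Word error rate (WER) is the word-level edit distance between reference and hypothesis, normalized by the reference length (CER is the same computation over characters). A small self-contained implementation, independent of the repository's `evaluation.py`:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a young girl on a fence", "a girl on the fence"))  # 2 edits / 6 words = 0.333...
```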
## Usage
You can use the model for inference after training:
- Load a trained checkpoint
- Use the inference utilities in `tinywhisper/inference/`
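Inference is autoregressive: encode the log-mel features, then grow the token sequence one step at a time until EOS. A greedy-decoding sketch reusing the hypothetical `TinyASR` and the special-token IDs from earlier sections (the actual utilities in `tinywhisper/inference/` may differ):

```python
import torch

@torch.no_grad()
def transcribe(model, log_mel, sos=50257, eos=50258, max_len=128):
    """Greedy autoregressive decoding: always take the most likely next token."""
    model.eval()
    tokens = torch.tensor([[sos]])               # (1, 1), start with SOS
    for _ in range(max_len):
        logits = model(log_mel, tokens)          # (1, seq, vocab); re-encodes each step
        next_id = logits[0, -1].argmax().item()  # most likely next token
        if next_id == eos:
            break
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return tokens[0, 1:].tolist()                # token IDs, without the SOS marker

# decoded_text = enc.decode(transcribe(model, log_mel))  # with the BPE tokenizer
```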
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.