GPT From Scratch | FDS
https://github.com/michaelgathara/gpt_f
- Host: GitHub
- URL: https://github.com/michaelgathara/gpt_f
- Owner: Michaelgathara
- License: mit
- Created: 2025-02-25T12:48:32.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-04-29T19:28:28.000Z (21 days ago)
- Last Synced: 2025-05-11T22:19:35.665Z (9 days ago)
- Language: Python
- Size: 4.51 MB
- Stars: 2
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GPT From Scratch
A PyTorch implementation of a GPT-style transformer language model trained from scratch, featuring modern training optimizations, a custom BPE tokenizer, and inference tooling.
[GibberishGPT on Hugging Face](https://huggingface.co/purelyunfunctionalai/gibberishgpt)
## Features
- **Modern Transformer Architecture**: Implementation based on the GPT architecture with SwiGLU activation functions (see the sketch after this list)
- **Advanced Optimizations**:
- Flash Attention for improved performance on compatible hardware
- Mixed precision training (FP16)
- Gradient checkpointing (optional)
  - Distributed training with DDP (DistributedDataParallel)
- **Custom BPE Tokenization**: Uses a custom-trained BPE tokenizer with 25K vocab size
- **Performance Monitoring**: TensorBoard integration for tracking training metrics
- **Flexible Training**: Supports dataset options including WikiText-103 and FineWeb-Edu
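
The SwiGLU block itself is defined in `transformer_setup/transformer.py`; the sketch below is only a generic illustration of how such a feed-forward layer is commonly written. The class name, hidden size, and bias choices are assumptions, not copied from the repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: W_down(SiLU(W_gate(x)) * W_up(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated projection, then project back to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: a batch of 512-token sequences at embedding dim 768 (sizes illustrative)
ffn = SwiGLU(dim=768, hidden_dim=2048)
out = ffn(torch.randn(4, 512, 768))   # -> (4, 512, 768)
```

SwiGLU replaces the usual GELU MLP with a SiLU-gated pair of projections, which is the main block-level difference from a vanilla GPT-2 layer.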
## Model Architecture

Several models have been trained in this repo. The model currently live on Hugging Face has the following characteristics:
- Embedding dimension: 768
- Number of attention heads: 12
- Number of transformer layers: 12
- Context size: 512 tokens
- Total parameters: ~152M

We have also trained models of up to 1 billion parameters.
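
For reference, a minimal configuration object capturing the hyperparameters above might look like the sketch below; the repo keeps its actual configuration in `transformer_setup/params.py`, and the field names there may differ (the dropout value is purely illustrative).

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hyperparameters of the model currently published on Hugging Face
    vocab_size: int = 25_000   # custom BPE vocabulary size
    n_embd: int = 768          # embedding dimension
    n_head: int = 12           # attention heads
    n_layer: int = 12          # transformer layers
    block_size: int = 512      # context size in tokens
    dropout: float = 0.1       # illustrative only, not taken from the repo

config = GPTConfig()
assert config.n_embd % config.n_head == 0   # per-head dimension of 64
```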
## Requirements
1. Install `uv` from [here](https://docs.astral.sh/uv/getting-started/installation/)
2. Sync the dependencies:
```bash
uv sync
```

```bash
# Core dependencies, if they did not sync
uv pip install torch
uv sync --no-build-isolation # Optional, for hardware that supports it

# Hugging Face access (for datasets)
huggingface-cli login
```

## Project Structure
- **transformer_setup/**: Core transformer model implementation
- `params.py`: Model configuration
- `transformer.py`: Transformer model classes including attention mechanisms
- **tokenization/**: Tokenizer training and utilities (see the sketch after this list)
- `custom_tokenizer/`: BPE tokenizer implementation
- **data/**: Dataset loading and preprocessing
- `wikitext_data.py`: Load and process WikiText-103
- `fineweb_data.py`: Load and process FineWeb datasets
- `clean_text.py`: Text cleaning utilities
- **models/**: Training and inference scripts
- `gpt_custom_BPE.py`: Main training script
- `inference.py`: Text generation script for trained models
- **evaluation/**: Testing and evaluation scripts
- `eval_perplexity.py`: Calculate the perplexity score for a saved model
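
As a rough illustration of what `tokenization/custom_tokenizer/` does, training a 25K-vocabulary BPE tokenizer with Hugging Face's `tokenizers` library (acknowledged below) looks roughly like this. The file path and special tokens are placeholders, not the repo's actual choices.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with a 25K vocabulary, trained on plain-text files
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    special_tokens=["<unk>", "<pad>", "<eos>"],  # placeholder special tokens
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # "corpus.txt" is a placeholder path
tokenizer.save("custom_bpe.json")

# Quick round-trip check
ids = tokenizer.encode("GPT from scratch").ids
print(tokenizer.decode(ids))
```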
## Usage

### Training
To train the model from scratch:
```bash
source .venv/bin/activate
cd models/
python gpt_custom_BPE.py
```

This will:
1. Download and preprocess the WikiText dataset (see the sketch after this list)
2. Train (or load) a BPE tokenizer
3. Tokenize the dataset
4. Train the transformer model using distributed data parallel
5. Save checkpoints to the `checkpoints/` directory
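
Step 1 typically goes through Hugging Face's `datasets` library (acknowledged below). The sketch shows the common way to pull WikiText-103; the configuration name is an assumption rather than something taken from `data/wikitext_data.py`.

```python
from datasets import load_dataset

# "wikitext-103-raw-v1" is the usual Hub configuration name; the repo's
# data/wikitext_data.py may use a different one.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
train_lines = [row["text"] for row in dataset["train"] if row["text"].strip()]
print(f"{len(train_lines):,} non-empty training lines")
```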
### Inference

To generate text with a trained model:
```bash
python inference.py
```

This will:
1. Load a checkpoint (you'll be prompted to choose one)
2. Allow you to enter prompts and generate continuations
3. Exit when you type 'exit'
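
The actual prompt loop lives in `inference.py`; the following is only a minimal sketch of the standard autoregressive sampling pattern it implements in some form. The temperature, top-k values, and the assumption that the model returns raw logits are illustrative choices, not the repo's defaults.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=0.8, top_k=50, block_size=512):
    """Sample tokens autoregressively; idx is a (batch, seq_len) LongTensor of token ids."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the 512-token context window
        logits = model(idx_cond)                  # assumes (batch, seq_len, vocab) logits
        logits = logits[:, -1, :] / temperature   # keep only the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")   # mask everything below the top-k
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```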
## Training Details

- **Optimizer**: AdamW with weight decay and cosine learning rate scheduling (see the sketch after this list)
- **Batch size**: 72 per GPU (configurable)
- **Learning rate**: 1e-3 with warmup
- **Gradient accumulation**: 4 steps
- **Mixed precision**: FP16 training enabled
- **Evaluation**: Every 100 iterations on the validation set
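
A hedged sketch of how these settings typically combine in a PyTorch training step follows. The toy model, random data, weight-decay value, and the omission of the warmup phase are all simplifications; the real loop lives in `models/gpt_custom_BPE.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch is self-contained; assumes a CUDA device.
vocab_size, n_embd, block_size = 25_000, 768, 512
max_iters, accum_steps, batch_size = 1_000, 4, 8   # the repo uses a batch size of 72 per GPU

model = nn.Sequential(nn.Embedding(vocab_size, n_embd),
                      nn.Linear(n_embd, vocab_size)).cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # decay value illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)  # warmup omitted here
scaler = torch.cuda.amp.GradScaler()   # FP16 mixed-precision loss scaling

for step in range(max_iters):
    x = torch.randint(vocab_size, (batch_size, block_size), device="cuda")
    y = torch.randint(vocab_size, (batch_size, block_size), device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)                                        # (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    scaler.scale(loss / accum_steps).backward()                  # accumulate gradients over 4 micro-steps
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                                   # unscales gradients, then optimizer step
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```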
## Acknowledgements

This implementation draws inspiration from:
- The GPT architecture by OpenAI
- "Attention Is All You Need" (Vaswani et al., 2017)
- The Flash Attention implementation
- Hugging Face's tokenizers and datasets libraries

## License
MIT