https://github.com/kjpou1/llm-zero-to-trained
Building a Large Language Model from scratch for deep understanding — inspired by Sebastian Raschka’s book, implemented entirely by hand.
- Host: GitHub
- URL: https://github.com/kjpou1/llm-zero-to-trained
- Owner: kjpou1
- Created: 2025-07-14T06:24:51.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-14T09:02:15.000Z (11 months ago)
- Last Synced: 2025-07-14T09:41:42.227Z (11 months ago)
- Topics: attention, deep-learning, educational, from-scratch, from-scratch-in-python, llm, machine-learning, natural-language-processing, python, tokenizer, torch, training-loop, transformers, uv-pm
- Language: Jupyter Notebook
- Homepage:
- Size: 154 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# **LLM: Zero to Trained**
> A fully manual, from-scratch implementation of a Large Language Model (LLM), guided by the book [**Build a Large Language Model from Scratch**](https://github.com/rasbt/LLMs-from-scratch) by Sebastian Raschka.
Unlike simply running code from the [reference repo](https://github.com/rasbt/LLMs-from-scratch.git), this project is about **understanding and re-implementing** every major component — tokenizer, model, data loader, training loop — by hand.
---
- [**LLM: Zero to Trained**](#llm-zero-to-trained)
- [🎯 Objective](#-objective)
- [📁 Project Structure](#-project-structure)
- [🧠 Learning Sources](#-learning-sources)
- [🗒️ Progress Log](#️-progress-log)
- [🔧 Getting Started](#-getting-started)
- [🧰 Environment Setup (with uv)](#-environment-setup-with-uv)
- [🚀 CLI: Start Building](#-cli-start-building)
- [📚 References \& Inspirations](#-references--inspirations)
- [📄 Internal Documentation](#-internal-documentation)
- [📜 License](#-license)
---
## 🎯 Objective
To develop a working LLM from first principles by:
* Writing each component from scratch in Python
* Following the structure and logic from Raschka’s book
* Validating ideas through notebooks and experiments
* Building modular, reusable, CLI-driven code
---
## 📁 Project Structure
```
llm-zero-to-trained/
├── src/llmscratch/ ← Modular CLI-driven Python package
│ ├── config/ ← Config loader with .env, CLI, YAML support
│ ├── models/ ← Core dataclasses and SingletonMeta
│ ├── runtime/ ← CLI argument parsing and dispatch
│ ├── launch_host.py ← Entry point for all commands (e.g., preprocess)
│ └── host.py ← Command execution coordinator
├── notebooks/ ← Book-aligned exploration notebooks
├── configs/ ← YAML configs for datasets, vocab, etc.
├── datasets/ ← Raw and processed tokenized data
├── pyproject.toml ← Project metadata and CLI definition
├── README.md ← This file
└── PROGRESS.md ← Running log of milestones
```
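
The `models/` entry above mentions a `SingletonMeta`. The repository's own implementation is not shown in this README, so the following is only a minimal sketch of the common metaclass-based singleton pattern. The `Settings` class and its fields are hypothetical and exist purely for illustration.

```python
import threading
from dataclasses import dataclass


class SingletonMeta(type):
    """Metaclass that hands back the same instance on every instantiation."""

    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        # Hold a lock so two threads cannot both create the first instance.
        with cls._lock:
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


@dataclass
class Settings(metaclass=SingletonMeta):
    # Hypothetical fields, not taken from the repository.
    config_path: str = "configs/data_config.yaml"
    debug: bool = False


first = Settings()
second = Settings(debug=True)  # arguments are ignored once an instance exists
print(first is second)
```

One consequence of this design is that constructor arguments only matter on the first call, which suits a process-wide configuration object but would be surprising for ordinary data classes.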
---
## 🧠 Learning Sources
* 📘 *Build a Large Language Model from Scratch* – Sebastian Raschka (2024)
* 💻 [LLMs-from-scratch GitHub Repo](https://github.com/rasbt/LLMs-from-scratch)
* 🧠 Karpathy’s [minGPT](https://github.com/karpathy/minGPT) and [nanoGPT](https://github.com/karpathy/nanoGPT) (inspirational, but not reused)
---
## 🗒️ Progress Log
See [PROGRESS.md](./PROGRESS.md) for completed milestones, model checkpoints, and active development notes.
---
## 🔧 Getting Started
```bash
git clone https://github.com/kjpou1/llm-zero-to-trained.git
cd llm-zero-to-trained
```
---
## 🧰 Environment Setup (with [uv](https://docs.astral.sh/uv/getting-started/installation/))
```bash
uv venv
source .venv/bin/activate   # macOS/Linux
# On Windows, run this instead:
# .venv\Scripts\activate
uv pip install --editable .
uv sync
```
> 🧪 This installs the `llmscratch` CLI into the active virtual environment in editable mode and installs all dependencies, with `uv` handling reproducible locking.
---
## 🚀 CLI: Start Building
Use the CLI to run modular LLM pipelines:
```bash
llmscratch preprocess --config configs/data_config.yaml
```
More commands like `train`, `sample`, and `evaluate` will follow as the project evolves.
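
For readers curious how a subcommand-style CLI like this can be wired up, here is a minimal sketch using the standard library's `argparse`. It is an assumption about the general shape, not the project's actual `runtime/` code. The handler `run_preprocess` and the `HANDLERS` table are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(prog="llmscratch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser("preprocess", help="Prepare a dataset")
    preprocess.add_argument("--config", required=True, help="Path to a YAML config")
    return parser


def run_preprocess(args: argparse.Namespace) -> str:
    # Placeholder handler: a real one would load the config and process data.
    return f"preprocess with {args.config}"


# Map each subcommand name to the function that executes it.
HANDLERS = {"preprocess": run_preprocess}


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return HANDLERS[args.command](args)


print(main(["preprocess", "--config", "configs/data_config.yaml"]))
```

With a dispatch table like this, adding `train`, `sample`, or `evaluate` would mean registering one more subparser and one more handler.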
---
## 📚 References & Inspirations
The architecture is influenced by the projects below, but all code is original and written from scratch:
* [Raschka’s LLM Book](https://leanpub.com/llms-from-scratch)
* [minGPT](https://github.com/karpathy/minGPT)
* [nanoGPT](https://github.com/karpathy/nanoGPT)
* [Hugging Face Transformers](https://github.com/huggingface/transformers)
---
## 📄 Internal Documentation
This project includes implementation-focused documentation aligned with academic papers and architectural design:
* [`bpe_implementation.md`](docs/tokenizers/bpe_implementation.md) — Byte Pair Encoding (BPE) training process, aligned with Sennrich et al. (2015)
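
As a rough illustration of the BPE training loop described by Sennrich et al., the sketch below repeatedly merges the most frequent adjacent symbol pair in a word-frequency table. It is a simplified, character-level version written for this README and is not the code documented in `bpe_implementation.md`. The function names and the `</w>` end-of-word marker are assumptions.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency dict."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
```

The returned list of merges, applied in order, is what a tokenizer would later replay to encode new text.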
---
## 📜 License
This project is MIT licensed.