https://github.com/kjpou1/llm-zero-to-trained

Building a Large Language Model from scratch for deep understanding — inspired by Sebastian Raschka’s book, implemented entirely by hand.

attention deep-learning educational from-scratch from-scratch-in-python llm machine-learning natural-language-processing python tokenizer torch training-loop transformers uv-pm

Host: GitHub
URL: https://github.com/kjpou1/llm-zero-to-trained
Owner: kjpou1
Created: 2025-07-14T06:24:51.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-14T09:02:15.000Z (about 1 year ago)
Last Synced: 2025-07-14T09:41:42.227Z (about 1 year ago)
Topics: attention, deep-learning, educational, from-scratch, from-scratch-in-python, llm, machine-learning, natural-language-processing, python, tokenizer, torch, training-loop, transformers, uv-pm
Language: Jupyter Notebook
Homepage:
Size: 154 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **LLM: Zero to Trained**

> A fully manual, from-scratch implementation of a Large Language Model (LLM), guided by the book [**Build a Large Language Model from Scratch**](https://github.com/rasbt/LLMs-from-scratch) by Sebastian Raschka.

Rather than simply running code from the [reference repo](https://github.com/rasbt/LLMs-from-scratch.git), this project focuses on **understanding and re-implementing** every major component (tokenizer, model, data loader, training loop) by hand.

---

- [**LLM: Zero to Trained**](#llm-zero-to-trained)

  - [🎯 Objective](#-objective)

  - [📁 Project Structure](#-project-structure)

  - [🧠 Learning Sources](#-learning-sources)

  - [🗒️ Progress Log](#️-progress-log)

  - [🔧 Getting Started](#-getting-started)

  - [🧰 Environment Setup (with uv)](#-environment-setup-with-uv)

  - [🚀 CLI: Start Building](#-cli-start-building)

  - [📚 References \& Inspirations](#-references--inspirations)

  - [📄 Internal Documentation](#-internal-documentation)

  - [📜 License](#-license)

---

## 🎯 Objective

To develop a working LLM from first principles by:

* Writing each component from scratch in Python

* Following the structure and logic from Raschka’s book

* Validating ideas through notebooks and experiments

* Building modular, reusable, CLI-driven code

---

## 📁 Project Structure

```
llm-zero-to-trained/
├── src/llmscratch/        ← Modular CLI-driven Python package
│   ├── config/            ← Config loader with .env, CLI, YAML support
│   ├── models/            ← Core dataclasses and SingletonMeta
│   ├── runtime/           ← CLI argument parsing and dispatch
│   ├── launch_host.py     ← Entry point for all commands (e.g., preprocess)
│   └── host.py            ← Command execution coordinator
├── notebooks/             ← Book-aligned exploration notebooks
├── configs/               ← YAML configs for datasets, vocab, etc.
├── datasets/              ← Raw and processed tokenized data
├── pyproject.toml         ← Project metadata and CLI definition
├── README.md              ← This file
└── PROGRESS.md            ← Running log of milestones
```
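
The tree above mentions a `SingletonMeta` and a config loader that combines `.env`, CLI, and YAML sources. The following is a minimal sketch of how those two pieces could fit together; it is not the package's actual code. The `Config` class, the `resolve_setting` helper, and the precedence order (CLI over environment over YAML file over default) are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional


class SingletonMeta(type):
    """Metaclass that hands back the same instance on every instantiation."""

    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        # Create the instance only the first time the class is called.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


def resolve_setting(
    name: str,
    cli_args: Mapping[str, Any],
    env: Mapping[str, str],
    file_values: Mapping[str, Any],
    default: Optional[Any] = None,
) -> Any:
    """Hypothetical helper. Assumed precedence: CLI > env > YAML file > default."""
    if cli_args.get(name) is not None:
        return cli_args[name]
    env_key = name.upper()  # assumption: env vars are upper-cased setting names
    if env_key in env:
        return env[env_key]
    if name in file_values:
        return file_values[name]
    return default


class Config(metaclass=SingletonMeta):
    """Hypothetical process-wide settings holder."""

    def __init__(self) -> None:
        self.settings: dict = {}

    def load(self, cli_args, env, file_values, defaults) -> None:
        # Resolve every known setting through the layered sources.
        for name, default in defaults.items():
            self.settings[name] = resolve_setting(
                name, cli_args, env, file_values, default
            )
```

In practice `env` would typically be `os.environ` and `file_values` a dictionary parsed from one of the YAML files in `configs/`. Plain mappings are used here so the sketch needs no third-party dependencies.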

---

## 🧠 Learning Sources

* 📘 *Build a Large Language Model from Scratch* – Sebastian Raschka (2024)

* 💻 [LLMs-from-scratch GitHub Repo](https://github.com/rasbt/LLMs-from-scratch)

* 🧠 Karpathy’s [minGPT](https://github.com/karpathy/minGPT) and [nanoGPT](https://github.com/karpathy/nanoGPT) (inspirational, but not reused)

---

## 🗒️ Progress Log

See [PROGRESS.md](./PROGRESS.md) for completed milestones, model checkpoints, and active development notes.

---

## 🔧 Getting Started

```bash
git clone https://github.com/kjpou1/llm-zero-to-trained.git
cd llm-zero-to-trained
```

---

## 🧰 Environment Setup (with [uv](https://docs.astral.sh/uv/getting-started/installation/))

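```bash
# Create the virtual environment
uv venv

# Activate it (macOS/Linux)
source .venv/bin/activate
# OR, on Windows, run this instead:
# .venv\Scripts\activate

# Install the project in editable mode, then sync dependencies
uv pip install --editable .
uv sync
```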

> 🧪 This installs the project in editable mode, which makes the `llmscratch` CLI available whenever the virtual environment is active, and installs all dependencies with reproducible locking via `uv`.

---

## 🚀 CLI: Start Building

Use the CLI to run modular LLM pipelines:

```bash
llmscratch preprocess --config configs/data_config.yaml
```

More commands like `train`, `sample`, and `evaluate` will follow as the project evolves.
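
To make the command structure concrete, here is a minimal sketch of subcommand parsing and dispatch with the standard-library `argparse` module. It is not the project's actual `runtime/` code. The handler `run_preprocess`, the `HANDLERS` table, and the choice to make `--config` required are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(prog="llmscratch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser("preprocess", help="Prepare datasets")
    # Assumption: the config path is mandatory for this command.
    preprocess.add_argument("--config", required=True, help="Path to a YAML config")
    return parser


def run_preprocess(args: argparse.Namespace) -> str:
    """Hypothetical handler; a real one would load the config and process data."""
    return f"preprocess using {args.config}"


# Map each subcommand name to its handler (hypothetical dispatch table).
HANDLERS = {"preprocess": run_preprocess}


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return HANDLERS[args.command](args)
```

With this layout, adding `train`, `sample`, or `evaluate` would mean registering one more subparser and one more entry in the dispatch table.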

---

## 📚 References & Inspirations

While the architecture is influenced by great projects, all code is original and written from scratch:

* [Raschka’s LLM Book](https://leanpub.com/llms-from-scratch)

* [minGPT](https://github.com/karpathy/minGPT)

* [nanoGPT](https://github.com/karpathy/nanoGPT)

* [Hugging Face Transformers](https://github.com/huggingface/transformers)

---

## 📄 Internal Documentation

This project includes implementation-focused documentation aligned with academic papers and architectural design:

* [`bpe_implementation.md`](docs/tokenizers/bpe_implementation.md) — Byte Pair Encoding (BPE) training process, aligned with Sennrich et al. (2015)
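
For orientation, the BPE training loop described by Sennrich et al. can be sketched in a few lines: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. This is a simplified illustration, not the code documented in `bpe_implementation.md`. The function names, the `</w>` end-of-word marker, and the tie-breaking rule (first pair encountered wins) are assumptions.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Split each word into characters and append an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Ties are resolved in favor of the pair seen first (an assumption).
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

A call such as `train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)` returns the learned merge rules in order, together with the segmented vocabulary after those merges.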

---

## 📜 License

This project is MIT licensed.
