https://github.com/huggon1/llm-from-scratch

Small, readable experiments from tokenizer training to LoRA and DPO.
https://github.com/huggon1/llm-from-scratch

dpo llm lora pretraining tokenizer

Last synced: 6 days ago
JSON representation

Small, readable experiments from tokenizer training to LoRA and DPO.

Host: GitHub
URL: https://github.com/huggon1/llm-from-scratch
Owner: huggon1
Created: 2026-03-14T13:49:47.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-14T14:00:37.000Z (4 months ago)
Last Synced: 2026-03-15T01:12:38.803Z (4 months ago)
Topics: dpo, llm, lora, pretraining, tokenizer
Language: Python
Size: 176 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# llm-from-scratch

Small, readable experiments that walk through a mini LLM pipeline:

1. train a tokenizer
2. pretrain a compact language model
3. run supervised fine-tuning
4. try DPO
5. try LoRA fine-tuning

## Highlights

- Follows a clear learning path from tokenizer training to preference optimization
- Keeps each stage in a separate folder so experiments stay easy to inspect
- Uses tiny public-safe samples so the repository remains runnable and lightweight

## Layout

```text
llm-from-scratch/
tokenizer/
pretrain/
sft/
dpo/
lora/
docs/
```

## Requirements

- Python 3.10+
- PyTorch
- Transformers
- Tokenizers
- Pandas and NumPy

Install:

```bash
pip install -r requirements.txt
```

## What's Included

- Training scripts and model definitions for each stage
- Small sample datasets for demonstration
- Notes copied from the original study folders under `docs/`
- Minimal public-safe data samples that keep the repository lightweight

## What's Omitted

- Large checkpoints and `.pth` weights
- Full training corpora and large raw datasets
- Temporary training outputs
- Cache files and Python bytecode

Some scripts expect base model weights produced by an earlier stage. Those weights are intentionally not committed, so you should place them in the expected local path before training or inference.

## Suggested Order

### 1. Tokenizer

```bash
cd tokenizer
python main.py
```

### 2. Pretrain

Use the tokenizer artifacts under `pretrain/tokenizer/`, then run:

```bash
cd pretrain
python main.py
```

### 3. SFT / DPO / LoRA

These stages depend on locally available base weights from prior training.

```bash
cd sft
python main.py
```

```bash
cd dpo
python main.py
```

```bash
cd lora
python main.py
```

In practice, the easiest way to explore the repo is to read `docs/` first, then run the tokenizer and pretrain stages before looking at SFT, DPO, and LoRA.

## Notes

- The repository keeps each stage self-contained so the training flow stays easy to follow.
- The original Chinese notes are preserved as markdown files in `docs/`.
- Sample datasets are intentionally small and mainly serve as runnable examples.
- The original larger datasets used during experimentation came from open-source or publicly available materials.
- Only a few sanitized sample rows are committed here so the repository stays lightweight and easy to share.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/huggon1/llm-from-scratch

Awesome Lists containing this project

README