https://github.com/toy-gpt/train-300-context-2
Demonstrates, at very small scale, how a language model is trained (two-context model).
- Host: GitHub
- URL: https://github.com/toy-gpt/train-300-context-2
- Owner: toy-gpt
- License: mit
- Created: 2026-01-16T19:21:29.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-01-17T01:17:03.000Z (5 months ago)
- Last Synced: 2026-01-18T06:47:59.868Z (5 months ago)
- Topics: computer-science-education, dual-context, educational, inspectable-models, language-model, machine-learning-education, mkdocs, next-token-prediction, probabilistic-model, pyright, python, reproducibility, softmax-regression, src-layout, teaching, toy-gpt, toy-model, two-context-model, uv
- Language: Python
- Homepage: https://toy-gpt.github.io/train-300-context-2/
- Size: 706 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
README
# Toy-GPT: train-300-context-2
> Demonstrates, at very small scale, how a language model is trained.
This repository is part of a series of toy training repositories plus a companion client repository:
- [**Training repositories**](https://github.com/toy-gpt) produce pretrained artifacts (vocabulary, weights, metadata).
- A [**web app**](https://toy-gpt.github.io/toy-gpt-chat/) loads the artifacts and provides an interactive prompt.
## Contents
- a small, declared text corpus
- a tokenizer and vocabulary builder
- a simple next-token prediction model
- a repeatable training loop
- committed, inspectable artifacts for downstream use
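To make the first pieces concrete, here is a minimal sketch of how a corpus could be tokenized, turned into a vocabulary, and cut into two-token contexts. It is not the code in `src/toy_gpt_train/`: the whitespace tokenizer, the frequency-ordered vocabulary, the sample corpus, and the names `tokenize`, `build_vocab`, and `context2_examples` are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an id: most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(ordered)}


def context2_examples(ids: list[int]) -> list[tuple[tuple[int, int], int]]:
    """Pair every two consecutive token ids with the id of the token that follows them."""
    return [((ids[i], ids[i + 1]), ids[i + 2]) for i in range(len(ids) - 2)]


# Hypothetical corpus; the repository declares its real corpus elsewhere.
corpus = "the cat sat on the mat the cat ran"
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]
examples = context2_examples(ids)

print(vocab)
print(examples[:3])
```

Each training example is a `((previous-previous id, previous id), next id)` tuple, which is the shape a two-context next-token predictor would learn from.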
## Scope
This is:
- an intentionally inspectable training pipeline
- a next-token predictor trained on an explicit corpus
This is not:
- a production system
- a full Transformer implementation
- a chat interface
- a claim of semantic understanding
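As a rough picture of what "a next-token predictor trained on an explicit corpus" can mean at this scale, the sketch below trains one row of logits per two-token context using the softmax cross-entropy gradient. This is one plausible form of a two-context model, not necessarily the one implemented in this repository; the corpus, the hyperparameters, and the names `softmax`, `train`, and `predict` are assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(examples, vocab_size: int, epochs: int = 200, lr: float = 0.5):
    """Fit one row of logits per (prev2, prev1) context with plain gradient descent."""
    weights: dict[tuple[int, int], list[float]] = {}
    history: list[float] = []
    for _ in range(epochs):
        total_loss = 0.0
        for context, target in examples:
            row = weights.setdefault(context, [0.0] * vocab_size)
            probs = softmax(row)
            total_loss += -math.log(probs[target])
            # Gradient of cross-entropy with respect to the logits: probs - one_hot(target).
            for k in range(vocab_size):
                row[k] -= lr * (probs[k] - (1.0 if k == target else 0.0))
        history.append(total_loss / len(examples))
    return weights, history


def predict(weights, context: tuple[int, int], vocab_size: int) -> int:
    """Return the most probable next-token id (unseen contexts fall back to uniform logits)."""
    probs = softmax(weights.get(context, [0.0] * vocab_size))
    return max(range(vocab_size), key=lambda k: probs[k])


# Hypothetical corpus, chosen so that every context has a single continuation.
tokens = "the cat sat on the mat the cat sat".split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
id_to_token = {i: tok for tok, i in vocab.items()}
ids = [vocab[tok] for tok in tokens]
examples = [((ids[i], ids[i + 1]), ids[i + 2]) for i in range(len(ids) - 2)]

weights, history = train(examples, len(vocab))
predicted = id_to_token[predict(weights, (vocab["the"], vocab["cat"]), len(vocab))]
print(predicted)
print(history[0], history[-1])
```

Every number in `weights` can be printed and traced back to the corpus, which is the sense in which a model this small stays inspectable. The sketch memorizes context-to-token statistics; it contains no attention and no notion of meaning.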
## Outputs
This repository produces and commits pretrained artifacts under `artifacts/`.
Training logs and evidence are written under `outputs/`
(for example, `outputs/train_log.csv`).
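A quick way to look at such a log without assuming anything about its columns is to read it with the standard library and print the header that the file itself declares. The helper name `read_log` is hypothetical, and the column names are whatever the training script actually wrote.

```python
import csv
from pathlib import Path


def read_log(path: Path) -> tuple[list[str], list[dict[str, str]]]:
    """Return the column names and all rows of a CSV log, with values kept as strings."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


log_path = Path("outputs/train_log.csv")
if log_path.exists():
    columns, rows = read_log(log_path)
    print("columns:", columns)
    print("rows:", len(rows))
    if rows:
        print("last row:", rows[-1])
else:
    print(f"{log_path} not found; run the training script first")
```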
## Quick start
See `SETUP.md` for full setup and workflow instructions.
Run the full training script:
```shell
uv run python src/toy_gpt_train/d_train.py
```
Run the steps individually:
- `a_tokenizer`, `b_vocab`, and `c_model` are demos; each can be run on its own
- `d_train` trains the model and produces the artifacts
- `e_infer` consumes the artifacts
```shell
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
```
## Provenance and Purpose
The primary corpus used for training is declared in `SE_MANIFEST.toml`.
This repository commits pretrained artifacts so the client can run
without retraining.
## Annotations
[ANNOTATIONS.md](./ANNOTATIONS.md) - the REQ/WHY/OBS annotations used in this repository
## Citation
[CITATION.cff](./CITATION.cff)
## License
[MIT](./LICENSE)
## SE Manifest
[SE_MANIFEST.toml](./SE_MANIFEST.toml) - project intent, scope, and role