https://github.com/toy-gpt/train-300-context-2

Demonstrates, at very small scale, how a language model is trained (two-context model).
Host: GitHub
URL: https://github.com/toy-gpt/train-300-context-2
Owner: toy-gpt
License: mit
Created: 2026-01-16T19:21:29.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-01-17T01:17:03.000Z (6 months ago)
Last Synced: 2026-01-18T06:47:59.868Z (6 months ago)
Topics: computer-science-education, dual-context, educational, inspectable-models, language-model, machine-learning-education, mkdocs, next-token-prediction, probabilistic-model, pyright, python, reproducibility, softmax-regression, src-layout, teaching, toy-gpt, toy-model, two-context-model, uv
Language: Python
Homepage: https://toy-gpt.github.io/train-300-context-2/
Size: 706 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff

README

# Toy-GPT: train-300-context-2

[![PyPI version](https://img.shields.io/pypi/v/toy-gpt-train-300-context-2)](https://pypi.org/project/toy-gpt-train-300-context-2/)

[![Latest Release](https://img.shields.io/github/v/release/toy-gpt/train-300-context-2)](https://github.com/toy-gpt/train-300-context-2/releases)

[![Docs](https://img.shields.io/badge/docs-live-blue)](https://toy-gpt.github.io/train-300-context-2/)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/MIT)

[![CI](https://github.com/toy-gpt/train-300-context-2/actions/workflows/ci-python-mkdocs-shared.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/ci-python-mkdocs-shared.yml)

[![Deploy-Docs](https://github.com/toy-gpt/train-300-context-2/actions/workflows/deploy-mkdocs-shared.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/deploy-mkdocs-shared.yml)

[![Check Links](https://github.com/toy-gpt/train-300-context-2/actions/workflows/links.yml/badge.svg)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/links.yml)

[![Dependabot](https://img.shields.io/badge/Dependabot-enabled-brightgreen.svg)](https://github.com/toy-gpt/train-300-context-2/security)

> Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

- [**Training repositories**](https://github.com/toy-gpt) produce pretrained artifacts (vocabulary, weights, metadata).

- A [**web app**](https://toy-gpt.github.io/toy-gpt-chat/) loads the artifacts and provides an interactive prompt.

## Contents

- a small, declared text corpus

- a tokenizer and vocabulary builder

- a simple next-token prediction model

- a repeatable training loop

- committed, inspectable artifacts for downstream use
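
The pieces listed above can be illustrated with a minimal sketch. This is not the repository's code: the corpus string, the whitespace tokenizer, the function names, and the hyperparameters are all hypothetical, and it assumes that "two-context" means the model conditions on the two preceding tokens. Under those assumptions, it trains a softmax regression with one row of logits per context pair.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus; the real corpus is declared by the repository itself.
CORPUS = "the cat sat on the mat the cat sat on the rug"


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, minimal tokenizer)."""
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an id; sorting keeps the mapping repeatable."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def softmax(scores: list[float]) -> list[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(ids: list[int], vocab_size: int, epochs: int = 200, lr: float = 0.5):
    """Fit one row of logits per (prev2, prev1) context with gradient descent."""
    weights = defaultdict(lambda: [0.0] * vocab_size)
    examples = [((ids[i], ids[i + 1]), ids[i + 2]) for i in range(len(ids) - 2)]
    for _ in range(epochs):
        for context, target in examples:
            row = weights[context]
            probs = softmax(row)
            for j in range(vocab_size):
                # Cross-entropy gradient w.r.t. logit j: p_j - 1[j == target]
                row[j] -= lr * (probs[j] - (1.0 if j == target else 0.0))
    return dict(weights)


def predict(weights, vocab: dict[str, int], prev2: str, prev1: str):
    """Return the most probable next token, or None for an unseen context."""
    row = weights.get((vocab[prev2], vocab[prev1]))
    if row is None:
        return None
    id_to_token = {i: t for t, i in vocab.items()}
    probs = softmax(row)
    return id_to_token[max(range(len(probs)), key=probs.__getitem__)]


if __name__ == "__main__":
    tokens = tokenize(CORPUS)
    vocab = build_vocab(tokens)
    weights = train([vocab[t] for t in tokens], len(vocab))
    print(predict(weights, vocab, "the", "cat"))
```

Weights start at zero and examples are visited in a fixed order, so repeated runs give identical results without a random seed.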

## Scope

This is:

- an intentionally inspectable training pipeline

- a next-token predictor trained on an explicit corpus

This is not:

- a production system

- a full Transformer implementation

- a chat interface

- a claim of semantic understanding

## Outputs

This repository produces and commits pretrained artifacts under `artifacts/`.

Training logs and evidence are written under `outputs/` (for example, `outputs/train_log.csv`).
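
One way to inspect the training log after a run is sketched below. The helper name `read_train_log` is hypothetical, and the sketch makes no assumption about which columns the log contains; it reads them from the file's header row.

```python
import csv
from pathlib import Path


def read_train_log(path):
    """Return (column_names, rows) from a CSV log.

    Column names come from the file's header row; none are assumed here.
    Each row is a dict of strings keyed by column name.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    log_path = Path("outputs/train_log.csv")
    if log_path.exists():
        columns, rows = read_train_log(log_path)
        print("columns:", columns)
        print("rows:", len(rows))
        if rows:
            print("last row:", rows[-1])
    else:
        print(f"{log_path} not found; run the training script first.")
```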

## Quick start

See `SETUP.md` for full setup and workflow instructions.

Run the full training script:

```shell
uv run python src/toy_gpt_train/d_train.py
```

Or run the steps individually:

- `a_tokenizer`, `b_vocab`, and `c_model` are demos; each can be run on its own

- `d_train` produces the artifacts

- `e_infer` consumes the artifacts

```shell
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
```

## Provenance and Purpose

The primary corpus used for training is declared in `SE_MANIFEST.toml`.

This repository commits pretrained artifacts so the client can run without retraining.

## Annotations

[ANNOTATIONS.md](./ANNOTATIONS.md) - the REQ/WHY/OBS annotations used in this repository

## Citation

[CITATION.cff](./CITATION.cff)

## License

[MIT](./LICENSE)

## SE Manifest

[SE_MANIFEST.toml](./SE_MANIFEST.toml) - project intent, scope, and role
