An open API service indexing awesome lists of open source software.

https://github.com/toy-gpt/train-300-context-2

Demonstrates, at very small scale, how a language model is trained (two-context model).
https://github.com/toy-gpt/train-300-context-2

computer-science-education dual-context educational inspectable-models language-model machine-learning-education mkdocs next-token-prediction probabilistic-model pyright python reproducibility softmax-regression src-layout teaching toy-gpt toy-model two-context-model uv

Last synced: 2 months ago
JSON representation

Demonstrates, at very small scale, how a language model is trained (two-context model).

Awesome Lists containing this project

README

          

# Toy-GPT: train-300-context-2

[![PyPI version](https://img.shields.io/pypi/v/toy-gpt-train-300-context-2)](https://pypi.org/project/toy-gpt-train-300-context-2/)
[![Latest Release](https://img.shields.io/github/v/release/toy-gpt/train-300-context-2)](https://github.com/toy-gpt/train-300-context-2/releases)
[![Docs](https://img.shields.io/badge/docs-live-blue)](https://toy-gpt.github.io/train-300-context-2/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/MIT)
[![CI](https://github.com/toy-gpt/train-300-context-2/actions/workflows/ci-python-mkdocs-shared.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/ci-python-mkdocs-shared.yml)
[![Deploy-Docs](https://github.com/toy-gpt/train-300-context-2/actions/workflows/deploy-mkdocs-shared.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/deploy-mkdocs-shared.yml)
[![Check Links](https://github.com/toy-gpt/train-300-context-2/actions/workflows/links.yml/badge.svg)](https://github.com/toy-gpt/train-300-context-2/actions/workflows/links.yml)
[![Dependabot](https://img.shields.io/badge/Dependabot-enabled-brightgreen.svg)](https://github.com/toy-gpt/train-300-context-2/security)

> Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

- [**Training repositories**](https://github.com/toy-gpt) produce pretrained artifacts (vocabulary, weights, metadata).
- A [**web app**](https://toy-gpt.github.io/toy-gpt-chat/) loads the artifacts and provides an interactive prompt.

## Contents

- a small, declared text corpus
- a tokenizer and vocabulary builder
- a simple next-token prediction model
- a repeatable training loop
- committed, inspectable artifacts for downstream use

## Scope

This is:

- an intentionally inspectable training pipeline
- a next-token predictor trained on an explicit corpus

This is not:

- a production system
- a full Transformer implementation
- a chat interface
- a claim of semantic understanding

## Outputs

This repository produces and commits pretrained artifacts under `artifacts/`.

Training logs and evidence are written under `outputs/`
(for example, `outputs/train_log.csv`).

## Quick start

See `SETUP.md` for full setup and workflow instructions.

Run the full training script:

```shell
uv run python src/toy_gpt_train/d_train.py
```

Run individually:

- a/b/c are demos (can be run alone if desired)
- d_train produces artifacts
- e_infer consumes artifacts

```shell
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
```

## Provenance and Purpose

The primary corpus used for training is declared in `SE_MANIFEST.toml`.

This repository commits pretrained artifacts so the client can run
without retraining.

## Annotations

[ANNOTATIONS.md](./ANNOTATIONS.md) - REQ/WHY/OBS annotations used

## Citation

[CITATION.cff](./CITATION.cff)

## License

[MIT](./LICENSE)

## SE Manifest

[SE_MANIFEST.toml](./SE_MANIFEST.toml) - project intent, scope, and role