https://github.com/muhammad-fiaz/gpt-2-rust

A native Rust implementation of the GPT-2 transformer architecture built from scratch using the Burn deep-learning framework.

Host: GitHub
URL: https://github.com/muhammad-fiaz/gpt-2-rust
Owner: muhammad-fiaz
License: mit
Created: 2026-06-12T08:46:38.000Z (14 days ago)
Default Branch: main
Last Pushed: 2026-06-12T10:30:49.000Z (14 days ago)
Last Synced: 2026-06-12T11:06:21.118Z (14 days ago)
Topics: gpt, gpt-2-rust, gpt-rust, gpt-rust-implementation, gpt2, rust, rust-bot, rust-cli, rust-lang
Language: Rust
Homepage: https://muhammad-fiaz.github.io/gpt-2-rust/
Size: 81.1 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff

# GPT-2 Rust 🦀🔥

A native Rust implementation of the GPT-2 transformer architecture built from scratch using the Burn deep-learning framework.

---

## Overview

This project implements, in native Rust, the Transformer model introduced in [Attention Is All You Need (arXiv)](https://arxiv.org/abs/1706.03762) / [(Hugging Face)](https://huggingface.co/papers/1706.03762) and the GPT-2 architecture described in OpenAI's [Language Models are Unsupervised Multitask Learners](https://d4mucfpruptmv.cloudfront.net/better-language-models/language-models_are_unsupervised_multitask_learners.pdf).

Built using the [Burn deep-learning framework](https://burn.dev), it includes:

- ✅ **Custom Causal Self-Attention:** Combined Q/K/V projection with causal masking.

- ✅ **Embeddings:** Learned token and absolute positional embeddings.

- ✅ **Pre-norm Block Architecture:** Layer normalization applied before self-attention and MLP blocks.

- ✅ **GELU-activated MLP:** Standard GPT-2 feedforward network.

- ✅ **LM Head with Weight Tying:** Shares parameters with the token embedding table.

- ✅ **Pure-Rust BPE Tokenizer:** Fast `tiktoken-rs` wrapper using GPT-2's vocabulary.

- ✅ **Real-Time Word Streaming:** Outputs newly generated words instantly to the console.

- ✅ **GPU Model Weight Offloading:** Automatically drops weights from GPU VRAM immediately after inference finishes.

- ✅ **CUDA Execution:** GPU-accelerated by default using native Rust CUDA support (via CubeCL/cudarc) across Windows and Linux.

- ✅ **Single CLI Entry Point:** Reuses dependencies and compiles quickly into a single main executable.
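
To make the causal masking in the feature list concrete, here is a minimal single-head sketch in plain Rust with no Burn dependency. It is not the repository's code: the function name `causal_attention` and the nested `Vec` layout are assumptions for illustration, and a real implementation would use batched tensors and multiple heads.

```rust
/// Single-head scaled dot-product attention with a causal mask.
/// `q`, `k`, `v` hold one row per position; all rows of `q` and `k`
/// are assumed to share one length. Hypothetical helper for illustration.
fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    if q.is_empty() {
        return Vec::new();
    }
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for t in 0..q.len() {
        // Causal mask: position t only scores positions 0..=t.
        let scores: Vec<f32> = (0..=t)
            .map(|s| q[t].iter().zip(&k[s]).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Softmax over the visible positions, shifted by the max for stability.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        // Weighted sum of the visible value rows.
        let mut row = vec![0.0; v[0].len()];
        for (s, e) in exps.iter().enumerate() {
            let w = e / sum;
            for (o, x) in row.iter_mut().zip(&v[s]) {
                *o += w * x;
            }
        }
        out.push(row);
    }
    out
}
```

Because the first position can only attend to itself, the first output row always equals the first row of `v`, which is a quick sanity check for any causal attention implementation.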

---

## Model Sizes & Parameters

| Variant  | Params | n_layer | n_head | n_embd |
|----------|--------|---------|--------|--------|
| `small`  | 117 M  | 12      | 12     | 768    |
| `medium` | 345 M  | 24      | 16     | 1024   |
| `large`  | 762 M  | 36      | 20     | 1280   |
| `xl`     | 1.5 B  | 48      | 25     | 1600   |
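
As a rough cross-check of the table, the parameter count can be derived from `n_layer` and `n_embd` alone (`n_head` does not change it). The sketch below assumes the released GPT-2 settings of a 50,257-token vocabulary and a 1,024-token context, a 4x MLP expansion, biases on every linear layer, and an LM head tied to the token embeddings. These constants and the function name are assumptions for illustration and are not read from this repository's code.

```rust
/// Assumed GPT-2 vocabulary size and context length.
const VOCAB_SIZE: u64 = 50_257;
const N_CTX: u64 = 1_024;

/// Counts trainable parameters of a GPT-2 style decoder with a tied LM head.
/// Hypothetical helper for illustration.
fn gpt2_param_count(n_layer: u64, n_embd: u64) -> u64 {
    let d = n_embd;
    // Token embeddings plus learned positional embeddings.
    let embeddings = VOCAB_SIZE * d + N_CTX * d;
    // Layer norm: scale and bias vectors.
    let layer_norm = 2 * d;
    // Fused Q/K/V projection plus the output projection, each with bias.
    let attention = (d * 3 * d + 3 * d) + (d * d + d);
    // MLP: expand to 4 * n_embd, then project back, each with bias.
    let mlp = (d * 4 * d + 4 * d) + (4 * d * d + d);
    // Pre-norm block: two layer norms, attention, MLP.
    let block = 2 * layer_norm + attention + mlp;
    // Final layer norm; the tied LM head adds no parameters.
    embeddings + n_layer * block + layer_norm
}
```

Under these assumptions `gpt2_param_count(12, 768)` gives 124,439,808, somewhat above the 117 M listed for `small`. The table follows the sizes reported in the GPT-2 paper, while counts taken from the released checkpoints come out slightly higher.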

---

## Command-Line Interface (CLI) Usage

Everything compiles into a single, unified binary. Select the operation mode with one of the flags below:

### 1. Download Model Parameters (`--download`)

Downloads the safetensors weights, model configuration, and vocabulary mapping for one variant, or for all variants, from Hugging Face:

```bash
# Download small variant parameters into weights/small/
cargo run --release -- --download --size small --weights-dir weights

# Download all 4 variant parameters (small, medium, large, xl)
cargo run --release -- --download --size all --weights-dir weights
```

* **Arguments:**
  * `--size`: The GPT-2 variant to fetch (`small`, `medium`, `large`, `xl`, or `all`).
  * `--weights-dir`: Directory to save downloaded files.
  * `--force`: Re-download files even if they already exist.


### 2. Generate Text / Run Inference (`--generate`)

Generates text autoregressively with real-time word streaming:

```bash
cargo run --release -- --generate --model weights/small/model.safetensors --prompt "The future of artificial intelligence is" --size small --max-new-tokens 100 --temperature 0.8 --top-k 50
```

* **Arguments:**
  * `--model`: Path to the model weights file.
  * `--size`: Model size configuration to construct.
  * `--prompt`: Conditioning prompt text, in quotes.
  * `--max-new-tokens`: Maximum number of new tokens to produce.
  * `--temperature`: Softmax temperature scaling (0 = greedy).
  * `--top-k`: Top-K cutoff value (0 to disable).
  * `--top-p`: Top-P (nucleus) value (0.0 to disable).
  * `--seed`: Random seed for reproducibility.
  * `--device`: Compute backend device (default: `cuda`).
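
The sampling flags above can be read as successive filters on the next-token distribution. The following is a minimal sketch of one common way to combine them, not the repository's implementation: the function names are hypothetical, and the uniform random draw `u` is passed in by the caller because the Rust standard library has no random number generator.

```rust
/// Index of the largest value (first one wins on ties).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Turns logits into a probability distribution after temperature scaling
/// and optional top-k / top-p filtering. Hypothetical helper.
fn filtered_probs(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<f32> {
    if logits.is_empty() {
        return Vec::new();
    }
    // Temperature 0 means greedy decoding: all mass on the argmax.
    if temperature <= 0.0 {
        let mut p = vec![0.0; logits.len()];
        p[argmax(logits)] = 1.0;
        return p;
    }
    // Softmax of logits / temperature, shifted by the max for stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = probs.iter().sum();
    probs.iter_mut().for_each(|p| *p /= sum);
    // Token indices ordered from most to least likely.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    // Top-k keeps the k most likely tokens (0 disables the filter).
    let mut keep = order.len();
    if top_k > 0 {
        keep = keep.min(top_k);
    }
    // Top-p keeps the shortest prefix whose cumulative mass reaches top_p.
    if top_p > 0.0 && top_p < 1.0 {
        let mut cum = 0.0;
        for (i, &idx) in order.iter().take(keep).enumerate() {
            cum += probs[idx];
            if cum >= top_p {
                keep = i + 1;
                break;
            }
        }
    }
    // Zero out everything else and renormalize the survivors.
    let kept_mass: f32 = order[..keep].iter().map(|&i| probs[i]).sum();
    let mut out = vec![0.0; probs.len()];
    for &i in &order[..keep] {
        out[i] = probs[i] / kept_mass;
    }
    out
}

/// Inverse-CDF sampling: `u` is a uniform draw in [0, 1) supplied by the caller.
fn sample_from(probs: &[f32], u: f32) -> usize {
    let mut cum = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cum += p;
        if u < cum {
            return i;
        }
    }
    // Rounding fallback: the last token with non-zero probability.
    probs.iter().rposition(|&p| p > 0.0).unwrap_or(0)
}
```

In a generation loop, a seeded random number generator would supply `u` at each step, which is what makes a fixed seed reproduce the same output.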


### 3. Evaluate Perplexity (`--evaluate`)

Evaluates cross-entropy loss and perplexity on a test dataset:

```bash
cargo run --release -- --evaluate --model weights/small/model.safetensors --format safetensors --data data/input.txt --seq-len 128 --batch-size 4
```

* **Arguments:**
  * `--model`: Path to the weights file.
  * `--format`: Weights file format (for example `safetensors`).
  * `--data`: Path to the validation dataset file.
  * `--seq-len`: Sliding window sequence length.
  * `--batch-size`: Evaluation batch size.
  * `--device`: Compute backend device (default: `cuda`).
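
The two reported metrics are related by a simple formula: perplexity is the exponential of the mean per-token cross-entropy loss. The sketch below illustrates that relationship in plain Rust. The function names and the nested `Vec` layout are assumptions for illustration, and the repository's evaluator may batch and window the data differently.

```rust
/// Negative log-likelihood of `target` under softmax(`logits`),
/// computed with the log-sum-exp trick for numerical stability.
fn cross_entropy(logits: &[f32], target: usize) -> f32 {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = max + logits.iter().map(|&l| (l - max).exp()).sum::<f32>().ln();
    log_sum_exp - logits[target]
}

/// Returns (mean loss, perplexity) over all positions.
/// Assumes one logits row per target and at least one target.
fn evaluate(all_logits: &[Vec<f32>], targets: &[usize]) -> (f32, f32) {
    let total: f32 = all_logits
        .iter()
        .zip(targets)
        .map(|(logits, &t)| cross_entropy(logits, t))
        .sum();
    let mean = total / targets.len() as f32;
    (mean, mean.exp())
}
```

A model that assigns equal probability to each of N tokens has a loss of ln N and a perplexity of exactly N, so perplexity can be read as the effective number of choices the model is hesitating between.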

### 4. Pre-Train or Fine-Tune (`--train`)

Pre-trains a model from scratch or fine-tunes an existing one on a plain-text dataset:

```bash
cargo run --release -- --train --data data/input.txt --artifact-dir artifacts/ --size small --epochs 3 --batch-size 4 --seq-len 128 --lr 3e-4 --dropout 0.1
```

* **Arguments:**
  * `--data`: Path to the plain-text training file.
  * `--artifact-dir`: Output directory for training checkpoints.
  * `--epochs`: Number of training epochs.
  * `--lr`: AdamW peak learning rate.
  * `--dropout`: Dropout probability.
  * `--device`: Compute backend device (default: `cuda`).
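
For next-token prediction, a tokenized text file is typically cut into fixed-length windows in which the target sequence is the input shifted by one position. The sketch below shows that preparation step under the assumption of non-overlapping windows of `--seq-len` tokens. The function name is hypothetical, and the repository may use a different stride or batching scheme.

```rust
/// Splits a token stream into (input, target) pairs of length `seq_len`,
/// where each target is the input shifted left by one token.
/// Trailing tokens that do not fill a whole window are dropped.
fn make_examples(tokens: &[u32], seq_len: usize) -> Vec<(Vec<u32>, Vec<u32>)> {
    let mut examples = Vec::new();
    if seq_len == 0 {
        return examples;
    }
    let mut start = 0;
    // Each window needs seq_len inputs plus one extra token for the last target.
    while start + seq_len + 1 <= tokens.len() {
        let input = tokens[start..start + seq_len].to_vec();
        let target = tokens[start + 1..start + seq_len + 1].to_vec();
        examples.push((input, target));
        start += seq_len;
    }
    examples
}
```

With ten tokens and a window of four, this yields two examples. Batches of `--batch-size` examples would then be stacked and fed to the model, with the cross-entropy loss computed against the shifted targets.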

---

## Tech Stack

- **Deep Learning Framework:** [Burn](https://burn.dev) (v0.21)

- **Compute Backends:** CUDA (default, native Rust GPU), WGPU (optional WebGPU), or NdArray (CPU fallback)

- **Tokenizer:** `tiktoken-rs` (v0.5, BPE encoding)

- **Serialization:** `safetensors` (v0.4) & `memmap2` (v0.9)

---

## Citations & References

- **Burn Framework:**

  ```bibtex
  @misc{burn2024,
    title = {Burn: A Flexible and Modern Deep Learning Framework in Rust},
    howpublished = {\url{https://burn.dev/}},
    year = {2024}
  }
  ```

- **Attention Is All You Need:**

  - **arXiv:** [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

  - **Hugging Face Papers:** [https://huggingface.co/papers/1706.03762](https://huggingface.co/papers/1706.03762)

  ```bibtex
  @inproceedings{vaswani2017attention,
    author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
    title     = {Attention is all you need},
    booktitle = {Advances in neural information processing systems},
    pages     = {5998--6008},
    year      = {2017},
    url       = {https://arxiv.org/abs/1706.03762}
  }
  ```

- **GPT-2 (Language Models are Unsupervised Multitask Learners):**

  ```bibtex
  @article{radford2019language,
    title   = {Language models are unsupervised multitask learners},
    author  = {Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
    journal = {OpenAI blog},
    volume  = {1},
    number  = {8},
    pages   = {9},
    year    = {2019},
    url     = {https://d4mucfpruptmv.cloudfront.net/better-language-models/language-models_are_unsupervised_multitask_learners.pdf}
  }
  ```

- **GPT-2 Rust (This repository):**

  ```bibtex
  @misc{gpt2rust2026,
    author       = {Muhammad Fiaz},
    title        = {GPT-2 Rust: A native Rust implementation of the GPT-2 transformer architecture built from scratch using the Burn deep-learning framework},
    howpublished = {\url{https://github.com/muhammad-fiaz/gpt-2-rust}},
    year         = {2026},
    note         = {Licensed under MIT License}
  }
  ```

---

## Contributing

Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on code style, linting, and pull requests.

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
