https://github.com/michaelellis003/lmt
PyTorch implementation of transformer-based language models (GPT) for pretraining and fine-tuning
https://github.com/michaelellis003/lmt
deep-learning fine-tuning gpt language-model machine-learning pretraining python pytorch transformer
Last synced: 4 months ago
JSON representation
PyTorch implementation of transformer-based language models (GPT) for pretraining and fine-tuning
- Host: GitHub
- URL: https://github.com/michaelellis003/lmt
- Owner: michaelellis003
- License: apache-2.0
- Created: 2025-07-11T01:52:37.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2026-03-06T03:31:19.000Z (4 months ago)
- Last Synced: 2026-03-06T05:33:09.193Z (4 months ago)
- Topics: deep-learning, fine-tuning, gpt, language-model, machine-learning, pretraining, python, pytorch, transformer
- Language: Python
- Homepage:
- Size: 1.16 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Language Modeling using Transformers (LMT)
[](https://github.com/michaelellis003/LMT/actions/workflows/ci-cd.yml)
[](https://www.python.org/downloads/)
[](https://pytorch.org/)
[](LICENSE)
[](https://github.com/astral-sh/ruff)
An educational PyTorch library for understanding how modern transformer architectures work -- from attention mechanisms to full language models. Every component is written to be **understood**, with clear code, detailed docstrings, and mathematical notation that maps directly to the papers.
## Features
**Attention Mechanisms**
| Component | Description | Paper |
|-----------|-------------|-------|
| Multi-Head Attention | Standard scaled dot-product attention | Vaswani et al., 2017 |
| Grouped Query Attention | Shared KV heads for efficiency | Ainslie et al., 2023 |
| Sliding Window Attention | Local attention with fixed window | Beltagy et al., 2020 |
| Multi-Head Latent Attention | KV compression + decoupled RoPE | DeepSeek-AI, 2024 |
**Feed-Forward Networks**
| Component | Description |
|-----------|-------------|
| SwiGLU | Gated FFN with Swish activation (LLaMA, Mixtral) |
| Mixture of Experts | Top-k sparse routing with load balancing loss |
**Other Components**: RMSNorm, Rotary Position Embedding (RoPE)
**Model Architectures**
| Model | Key Components |
|-------|---------------|
| GPT | Multi-head attention + GELU FFN + learned position embeddings |
| LLaMA | RMSNorm + RoPE + SwiGLU + GQA |
| Mixtral | LLaMA + MoE FFN + sliding window attention |
## Installation
```bash
pip install pylmt
```
Or install from source for development:
```bash
git clone https://github.com/michaelellis003/LMT.git
cd LMT
pip install uv
uv sync
```
## Quick Start
```python
import torch
from lmt.models.config import ModelConfig
from lmt.models.llama import LLaMA
config = ModelConfig(
vocab_size=32000,
embed_dim=512,
num_heads=8,
num_kv_heads=4, # GQA: 4 KV heads shared across 8 query heads
num_layers=6,
context_length=1024,
dropout=0.0,
)
model = LLaMA(config)
x = torch.randint(0, config.vocab_size, (1, 128))
logits = model(x) # [1, 128, 32000]
```
### Using Individual Layers
```python
from lmt.layers.attention import GroupedQueryAttention
from lmt.layers.ffn import SwiGLU
from lmt.layers.normalization import RMSNorm
norm = RMSNorm(d_model=512)
attn = GroupedQueryAttention(config)
ffn = SwiGLU(d_model=512)
x = torch.randn(1, 64, 512)
x = x + attn(norm(x)) # Pre-norm attention
x = x + ffn(norm(x)) # Pre-norm FFN
```
### Mixture of Experts
```python
from lmt.models.mixtral import Mixtral
config = ModelConfig(
vocab_size=32000, embed_dim=512, num_heads=8,
num_kv_heads=4, num_layers=8,
context_length=2048, window_size=256, dropout=0.0,
)
model = Mixtral(config, num_experts=8, top_k=2)
logits = model(x)
aux_loss = model.aux_loss # load balancing loss for training
```
## Project Structure
```
src/lmt/
layers/
attention/ # MHA, GQA, Sliding Window, MLA
ffn/ # SwiGLU, MoE (Router + Experts)
normalization/ # RMSNorm
positional/ # RoPE
models/
gpt/ # GPT (original decoder-only transformer)
llama/ # LLaMA (RMSNorm + RoPE + SwiGLU + GQA)
mixtral/ # Mixtral (LLaMA + MoE + sliding window)
training/ # Trainer, configs, dataloaders
tokenizer/ # BPE, naive tokenizers
```
## Development
```bash
uv run pytest tests/ -v # Run tests (171 passing)
uv run ruff check src/ tests/ # Lint
uv run ruff format src/ tests/ # Format
uv run pyright src/ # Type check
uv run mkdocs serve # Local docs server
```
## License
Apache License 2.0. See [LICENSE](LICENSE) for details.