https://github.com/seanwevans/pg_gpt2

gpt2 in postgres
database-experiment gpt gpt2 transformer
Host: GitHub
URL: https://github.com/seanwevans/pg_gpt2
Owner: seanwevans
Created: 2025-10-04T17:39:42.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-10-21T00:22:55.000Z (7 months ago)
Last Synced: 2025-10-21T02:29:16.012Z (7 months ago)
Topics: database-experiment, gpt, gpt2, transformer
Language: C
Homepage:
Size: 288 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
README

# pg_gpt2



**pg_gpt2** is a complete implementation of the GPT-2 architecture *entirely inside PostgreSQL*.

It extends the database with tensor algebra, automatic differentiation, AdamW optimization, checkpointing, and a Byte-Pair Encoding tokenizer — allowing end-to-end training and text generation purely through SQL and C extensions.

---

## Overview

PostgreSQL is used as both the **storage** and **execution environment** for a large-scale transformer model.

Each layer, weight, and intermediate activation lives in relational tables; tensor operations are implemented as `C` functions returning `BYTEA` buffers.

Every forward pass, gradient computation, and parameter update is a deterministic SQL transaction.

The project demonstrates that a relational database can serve as a full numerical engine, state store, and model runtime — no Python, PyTorch, or external ML stack required.
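
As a rough illustration of the storage format described above, the sketch below packs a row-major `float32` matrix into a byte string and reads it back, much as a client might prepare or inspect a `BYTEA` tensor. The little-endian layout and the helper names `pack_tensor` and `unpack_tensor` are assumptions made for this example; the extension's actual buffer layout may differ.

```python
import struct

def pack_tensor(rows):
    """Flatten a list of rows (row-major) into little-endian float32 bytes."""
    flat = [value for row in rows for value in row]
    return struct.pack(f"<{len(flat)}f", *flat)

def unpack_tensor(blob, n_rows, n_cols):
    """Rebuild an n_rows x n_cols matrix from a float32 byte string."""
    if len(blob) != 4 * n_rows * n_cols:
        raise ValueError("blob size does not match the requested shape")
    flat = struct.unpack(f"<{n_rows * n_cols}f", blob)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

# Hypothetical 2x3 tensor; every value is exactly representable in float32.
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
blob = pack_tensor(matrix)
```

Because the shape is not stored in the bytes themselves in this sketch, the caller has to supply it when unpacking.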

---

## Prerequisites

Building the extension requires the PostgreSQL server development headers and build tooling so that `pg_config --pgxs` resolves to the `pgxs.mk` makefile. On Debian/Ubuntu systems install the package:

```bash
sudo apt-get install postgresql-server-dev-16
```

If PostgreSQL is installed somewhere custom, set the `PG_CONFIG` environment variable to point at the desired `pg_config` binary before running `make`.

---

## Getting Started

1. **Compile and install the extension.** From the repository root run `make install`. The build uses PGXS and copies the `pg_llm` shared library into the directory reported by `pg_config --pkglibdir`, and the extension control and SQL files into the `extension` directory under `pg_config --sharedir`.

2. **Load the extension in a database.** Connect with `psql` and execute `CREATE EXTENSION pg_llm;` in the target database. This initializes all required tables, functions, and SQL entry points.

3. **Verify availability.** Confirm the extension is active with either `\dx pg_llm` in `psql` or a query such as `SELECT * FROM pg_extension WHERE extname = 'pg_llm';`. Successful output indicates the extension is ready for the workflow described below.

## Reproducing GPT-2 Results

Follow the [reproduction playbook](docs/reproducing_gpt2.md) for a step-by-step guide that mirrors the original GPT-2 training, evaluation, and sampling pipeline entirely within PostgreSQL.

### Docker Image

To simplify evaluation and demos, you can run PostgreSQL with the `pg_llm` extension (built from this `pg_gpt2` repository) pre-installed using the provided Dockerfile.

```bash
# Build the image locally
docker build -t pg-gpt2-demo .

# Start PostgreSQL with pg_llm already installed in the default database
docker run --rm -e POSTGRES_PASSWORD=secret -p 5432:5432 --name pg-gpt2 pg-gpt2-demo
```

The container reuses the official `postgres:16` entrypoint. On first start it creates the default database and automatically enables the `pg_llm` extension so that `psql` connections can immediately run the SQL workflows described below.

---

## Core Design Principles

1. **Postgres as OS** — All computation and persistence live in SQL schemas and C extensions.

2. **Full Reproducibility** — Every step, gradient, and checkpoint is a logged transaction.

3. **Numerical Fidelity** — Bit-level parity with PyTorch’s GPT-2 (`float32`, row-major, GELU, LayerNorm, AdamW).

4. **Composability** — Every tensor op is an SQL function; model architectures are relational graphs.

5. **Auditable Learning** — Because gradients and weights are rows, the entire training process is queryable and replayable.

---

## Architecture Summary

| Component | Description |
|------------|-------------|
| **Tensor Engine** | C implementations of `matmul`, `add`, `gelu`, `softmax`, `layernorm`, `cross_entropy` over contiguous `float32` blobs (`BYTEA`). |
| **Autodiff Engine** | Reverse-mode differentiation recorded in a relational *tape* (`llm_tape`, `llm_tensor_rt`), supporting backpropagation of all GPT-2 ops. |
| **Optimizer** | AdamW with bias correction, decoupled weight decay, gradient clipping, and cosine learning-rate schedule. |
| **Checkpointing** | Import/export weights as `.npz` or `.safetensors` archives. Every snapshot is versioned in `llm_checkpoint`. |
| **Tokenizer** | Native Byte-Pair Encoding (BPE) tokenizer/decoder built from `vocab.json` + `merges.txt`. |
| **Sampling Engine** | Temperature, top-k, and top-p (nucleus) sampling for autoregressive generation. |
| **Training Loop** | SQL functions (`llm_train`, `llm_train_step`, `llm_loss`) orchestrate forward, backward, optimizer updates, and logging. |
| **Inference** | `llm_generate(prompt)` runs encoding → forward → sampling → decoding, returning coherent text completions. |

---

## Key Tables

| Table | Purpose |
|--------|----------|
| `llm_model_config` | Registered model dimensions (layers, heads, embedding size, positions, vocab). |
| `llm_param` | Model parameters, gradients, optimizer state. |
| `llm_dataset` | Tokenized training sequences. |
| `llm_tape` / `llm_tensor_rt` | Computational graph and runtime tensors for autograd. |
| `llm_autograd_mode` | Single-row toggle that signals when forward passes should record autograd tape entries. |
| `llm_checkpoint` | Versioned checkpoint metadata and file paths. |
| `llm_bpe_vocab` / `llm_bpe_merges` | GPT-2 tokenizer vocabulary and merge ranks. |
| `llm_train_log` | Per-step learning rate and loss history. |

---

## Roadmap

See [docs/roadmap.md](docs/roadmap.md) for the upcoming feature roadmap, including GPT-3 style architecture support, mixed-precision execution, and hardware acceleration milestones.

---

## Autograd Workflow

End-to-end training relies on a thin runtime that records every forward op in SQL
so that gradients can be replayed later. The key moving pieces are:

1. **Parameter materialization.** `llm_materialize_params` copies each row in
   `llm_param` into the temporary `llm_tensor` cache and creates a matching row
   in `llm_tensor_rt`. During that copy, the helper `pg_llm_autograd_map_param`
   (or the equivalent SQL `INSERT` inside the function) must be invoked so the runtime
   tensor id is associated with the original `(model, name, token_id)` tuple. Any
   new C routine that constructs parameter views needs to perform the same mapping,
   or gradients will not flow back into `llm_param`.
   (See `sql/pg_llm--0.1.0.sql`, lines 403-438, and `src/pg_llm_autograd.c`, lines 216-246.)

2. **Forward tape recording.** Every C kernel checks `pg_llm_autograd_enabled()`;
   when the flag is set, the inputs and outputs are registered with
   `pg_llm_autograd_track_tensor` and the op is appended to `llm_tape` with any
   metadata (shape, constants, etc.). This produces an ordered tape of all ops in
   the forward pass. (See `src/pg_llm.c`, lines 19-210.)

3. **Reverse traversal.** `llm_backprop` walks the tape from the newest node back
   to the seed, dispatching gradients based on the recorded `name` field and
   writing results into `llm_tensor_rt.grad`. Once complete, `llm_accumulate_grads`
   copies those buffers back into `llm_param.grad` using the mapping created in
   step 1. (See `sql/llm_backprop.sql`, lines 1-78, and `sql/pg_llm--0.1.0.sql`, lines 439-456.)

4. **Tied embeddings.** GPT-2 reuses the token embedding (`wte`) for the final
   logits projection. After flattening the embedding table into a single matrix
   for `pg_llm_matmul`, ensure that buffer is still mapped to the original
   embedding rows (via `pg_llm_autograd_map_param`) so the logits gradient is
   accumulated back into `wte` rather than a detached copy.
   (See `sql/pg_llm--0.1.0.sql`, lines 173-205, and `src/pg_llm_autograd.c`, lines 216-246.)
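
The record-then-replay idea in steps 2 and 3 can be sketched outside the database. The toy below is not the extension's code: it uses scalar values, an in-memory list standing in for `llm_tape`, and hypothetical helper names (`new_tensor`, `record`, `backprop`) chosen only for illustration.

```python
# Toy reverse-mode tape: each entry plays the role of one row in a tape table.
values = {}   # tensor id -> value (stand-in for runtime tensors)
grads = {}    # tensor id -> accumulated gradient
tape = []     # ordered list of recorded ops

def new_tensor(value):
    """Register a leaf tensor and return its id."""
    tid = len(values)
    values[tid] = value
    grads[tid] = 0.0
    return tid

def record(name, inputs, value):
    """Store the op result and append the op to the tape."""
    out = new_tensor(value)
    tape.append({"name": name, "inputs": inputs, "output": out})
    return out

def add(a, b):
    return record("add", (a, b), values[a] + values[b])

def mul(a, b):
    return record("mul", (a, b), values[a] * values[b])

def backprop(seed):
    """Walk the tape newest-first, dispatching on the recorded op name."""
    grads[seed] = 1.0
    for node in reversed(tape):
        a, b = node["inputs"]
        g = grads[node["output"]]
        if node["name"] == "add":
            grads[a] += g
            grads[b] += g
        elif node["name"] == "mul":
            grads[a] += g * values[b]
            grads[b] += g * values[a]

# y = w * x + b
w, x, b = new_tensor(3.0), new_tensor(2.0), new_tensor(1.0)
y = add(mul(w, x), b)
backprop(y)
```

Gradients are accumulated with `+=`, so a tensor that feeds several ops (as a tied embedding does) receives the sum of all contributions.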

---

## SQL API Reference

### Model Initialization

```sql
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');
```

Imports all pretrained GPT-2 weights into the `llm_param` table.

`llm_model_config` tracks the expected architecture dimensions for each model
and is consulted during import; `gpt2-small` is pre-registered, but custom
models should insert their configuration before calling `pg_llm_import_npz`.

### Forward Pass and Inference

```sql
-- Generate text directly in SQL
SELECT llm_generate('Once upon a time', 80, 0.9, 40, 0.92);

-- Stream tokens as they are produced (step, token_id, token, text, is_complete)
SELECT * FROM llm_generate_stream('Once upon a time', 40, 0.8, 40, 0.95);
```
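
The trailing arguments in the calls above look like a token budget followed by temperature, top-k, and top-p values; that reading is an assumption based on the Sampling Engine description, not a documented signature. Under that assumption, the selection of a single token could be sketched as follows. The function name `sample_token` is hypothetical.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits; assumes temperature > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                              # subtract max for stability
    weights = [math.exp(z - peak) for z in scaled]
    total = sum(weights)
    # Sort candidates by probability, highest first.
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        probs = probs[:top_k]                       # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for p, i in probs:                              # nucleus: smallest prefix >= top_p
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    ids = [i for _, i in kept]
    # random.choices renormalises the surviving weights internally.
    return rng.choices(ids, weights=[p for p, _ in kept], k=1)[0]
```

Passing `rng=random.Random(0)` makes a run repeatable, which is useful when comparing outputs between clients.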

### Training

```sql
-- Train for 10,000 steps on tokenized text dataset
SELECT llm_train(
  'gpt2-small',
  10000,
  grad_workers => 4,
  prune_workers => 4
);
```

Every step performs:

1. Forward pass → loss (`llm_loss`)

2. Reverse pass (`llm_backprop`)

3. Gradient accumulation

4. AdamW parameter updates

5. Logging to `llm_train_log`
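
The Architecture Summary notes that the optimizer step is paired with gradient clipping and a cosine learning-rate schedule. A minimal sketch of both follows; the linear warmup, the `min_lr` floor, and the function names are assumptions for illustration, since the exact schedule used by `llm_train` is not spelled out here.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Learning rate for a 0-based step: optional linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```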

`llm_train` automatically reads the layer count, attention heads, hidden size,
and vocabulary size from `llm_model_config`. To override them for custom
experiments, pass explicit values for `n_layer`, `n_head`, `D`, or `vocab`
when invoking the function.

The training helpers expose knobs for multi-core cleanup work:

- `grad_workers` sets the desired parallel worker count for `llm_accumulate_grads`,
  allowing gradient materialisation from `llm_tensor_rt` into `llm_param` to leverage
  PostgreSQL's parallel query engine.

- `prune_workers` applies the same hinting to `llm_prune_autograd_state`, which clears
  the autograd tape and runtime tensors between steps. Autograd tape pruning is safe
  to parallelise because every runtime tensor row is independent, so this option simply
  tunes planner settings before issuing the deletes.

Both parameters default to `1` (no parallel workers) to preserve existing behaviour.

### Checkpointing

```sql
-- Save a new checkpoint
SELECT llm_checkpoint_save('gpt2-small','after warmup 2k');

-- Restore a checkpoint
SELECT llm_checkpoint_load('gpt2-small',1);
```

### Tokenizer Utilities

```sql
-- Load GPT-2 BPE vocab and merges
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- Encode and decode text
SELECT llm_encode('Hello world!','gpt2-small');
SELECT llm_decode(ARRAY[15496,2159,0],'gpt2-small');
```
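
To make the role of the merge ranks concrete, here is a small sketch of the greedy BPE merge loop applied to one word. It is a simplification: GPT-2's tokenizer also pre-splits text with a regular expression and works on byte-level symbols, both omitted here. The function name `bpe_merge` and the toy rank table are hypothetical.

```python
def bpe_merge(symbols, merge_ranks):
    """Greedily apply the lowest-ranked adjacent merge until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: merge_ranks.get(pair, float("inf")))
        if best not in merge_ranks:
            break                       # no known merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table; real ranks come from merges.txt.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
tokens = bpe_merge("lower", toy_ranks)
```

A lower rank means the pair was learned earlier and is merged first, which is why the rank order in `merges.txt` matters.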

### Utility Scripts

The repository includes Python helpers for preparing external assets before
calling the SQL functions above. All scripts live under `scripts/`.

| Script | Purpose |
|--------|---------|
| `convert_gpt2_checkpoint.py` | Download/convert a HuggingFace GPT-2 checkpoint into the gzip-based `.npz` container expected by `pg_llm_import_npz`. |
| `ingest_tokenizer.py` | Load `vocab.json` and `merges.txt` tokenizer assets into `llm_bpe_vocab`/`llm_bpe_merges` using a PostgreSQL connection. |
| `prepare_dataset.py` | Tokenize raw text files with the GPT-2 tokenizer and populate `llm_dataset` with fixed-length `(tokens, target)` arrays. |

Install the optional Python dependencies with:

```bash
# Quote the extra so shells such as zsh do not treat the brackets as a glob
pip install transformers torch "psycopg[binary]"
```

Examples:

```bash
# 1. Convert HuggingFace weights to /mnt/models/gpt2-small.npz
python scripts/convert_gpt2_checkpoint.py --source gpt2 --output /mnt/models/gpt2-small.npz

# 2. Load tokenizer assets into PostgreSQL
python scripts/ingest_tokenizer.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --model gpt2-small \
  --vocab /mnt/gpt2/vocab.json \
  --merges /mnt/gpt2/merges.txt --truncate

# 3. Tokenize a corpus and fill llm_dataset
python scripts/prepare_dataset.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --tokenizer gpt2 \
  --input /mnt/corpus/*.txt \
  --block-size 1024 --truncate
```

An end-to-end walkthrough that stitches the helper scripts together is available
in [docs/python_workflow.md](docs/python_workflow.md), and a fully annotated
Jupyter notebook showing the SQL fine-tuning loop from data ingestion through
generation lives at [docs/fine_tuning_workflow.ipynb](docs/fine_tuning_workflow.ipynb).

---

## Mathematical Fidelity

All core operations follow the official GPT-2 equations:

**Attention**

\[
\mathrm{Attn}(x) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V
\]

with causal masking and learned positional embeddings.

**Feed-Forward**

\[
\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2
\]
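
The feed-forward block can be sketched for a single input vector as below. The reference GPT-2 code uses the tanh approximation of GELU; whether the C kernel here uses that form or the exact erf form is not stated in this README, so treat the choice in `ffn` as an assumption. All function names are illustrative.

```python
import math

def gelu_tanh(x):
    """Tanh approximation of GELU, as used by the reference GPT-2 code."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x):
    """Exact GELU via the Gaussian error function, for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, w, b):
    """Compute x @ w + b for a vector x and a weight matrix w (rows = inputs)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2 for one input vector."""
    hidden = [gelu_tanh(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)
```

The two GELU variants agree to roughly three decimal places near the origin, which matters when checking numerical parity against another implementation.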

**LayerNorm**

\[
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\gamma + \beta
\]
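
A direct transcription of the LayerNorm formula for one vector is shown below. It uses the biased variance (dividing by the vector length) and `eps=1e-5`; the epsilon value matches the usual GPT-2 configuration but is an assumption as far as this extension is concerned.

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    """Normalise x to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

y = layernorm([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
```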

**Loss**

\[
L = -\log \frac{e^{z_t}}{\sum_j e^{z_j}}
\]
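
Evaluating this loss literally can overflow when logits are large, so implementations usually rewrite it as $L = \log\sum_j e^{z_j} - z_t$ and subtract the maximum logit before exponentiating. The sketch below shows that standard log-sum-exp form; it is an illustration, not a description of the extension's `cross_entropy` kernel.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]
```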

**Optimizer (AdamW)**

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \eta \left(\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\theta_{t-1}\right)
\end{aligned}
\]
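
The update equations translate almost line for line into code. The sketch below applies one AdamW step to flat Python lists; the default hyperparameters are common choices picked for illustration, not values taken from this repository, and `adamw_step` is a hypothetical name.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to flat parameter lists; t starts at 1."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second moment
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        # Decoupled weight decay: lambda * theta is added outside the ratio.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.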

---

## Example: End-to-End Flow

```sql
-- 1. Load model + tokenizer
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz','gpt2-small');
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- 2. Encode text
SELECT llm_encode('The database that dreamed of language.','gpt2-small');

-- 3. Generate continuation
SELECT llm_generate('The database that dreamed of language', 40, 0.8, 40, 0.95);

-- 4. Train or fine-tune
SELECT llm_train('gpt2-small', 5000);

-- 5. Save checkpoint
SELECT llm_checkpoint_save('gpt2-small','finetuned on corpus X');
```

---

## Python Client Utilities

Client applications can connect to PostgreSQL using `psycopg` and drive the
text-generation workflow directly from Python. The `pg_llm_client`
package offers a high-level helper:

```python
import psycopg
from pg_llm_client import PGLLMClient

with psycopg.connect("postgresql://postgres@localhost:5432/postgres") as conn:
    client = PGLLMClient(conn)

    # Single completion with tuned sampling parameters
    print(client.generate("The database that dreamed of language", temperature=0.7))

    # Stream tokens as they arrive
    for event in client.stream("Streaming from SQL", max_tokens=8):
        print(event.text)

    # Retrieve the top beam search candidates
    beams = client.beam_search("Once upon a", beam_width=3, max_tokens=5)
    for beam in beams:
        print(beam.score, beam.text)
```

The helper wraps the SQL API so sampling temperature, beam width, and other
parameters can be adjusted per request without hand-writing SQL in every
client.

---

## Performance Notes

- All tensors are stored as raw `BYTEA` blobs and processed in-memory.

- Core kernels (`pg_llm_matmul`, attention) use a tiled AVX2-aware micro-kernel that falls back to scalar math when SIMD is unavailable, delivering BLAS-class throughput without external dependencies.

- Attention is evaluated in configurable row chunks (default 64 tokens) so that context matrices never exceed a manageable working set, enabling GPT-2 scale sequence lengths inside Postgres.

- For large models, raise `work_mem`/`maintenance_work_mem` and consider chunking your training data via windowed queries so each step fits inside the executor's memory context.

- Store activations and optimizer scratch data in `UNLOGGED` tables (e.g., `CREATE UNLOGGED TABLE llm_activations (...)`) to avoid WAL amplification when materializing large tensors.

- Autograd tape pruning and gradient accumulation can be parallelized safely within a transaction.
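
The loop structure behind a tiled matrix multiply can be shown without SIMD. The sketch below is a scalar illustration of blocking only; it says nothing about the actual AVX2 micro-kernel, and the tile size and the name `matmul_tiled` are arbitrary choices for the example.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply an n x k matrix by a k x m matrix, one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work on one tile so its operands stay in cache.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c

product = matmul_tiled([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                       [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]])
```

The innermost loop runs along a row of `b` and a row of `c`, which is the contiguous direction for row-major storage and the part a SIMD kernel would vectorise.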

---

## Why Do This?

- **Proof of Concept:** show that gradient-based learning can be expressed purely as relational algebra and transaction semantics.

- **Determinism:** every computation is replayable and version-controlled.

- **Integration:** unifies data, model, and training loop under a single ACID engine.

- **Pedagogy:** transparent view into transformer internals, queryable step-by-step.