https://github.com/hinanohart/circuitbench

Integrated mechanistic interpretability + sparse autoencoder framework for Hybrid SSM-Attention models (Mamba-2, Hymba, RWKV-7). v0.1.2 alpha: real forward-pass intervention + mean-ablation patching shipped, CPU smoke; GPU/real adapters in v0.2.



Host: GitHub
URL: https://github.com/hinanohart/circuitbench
Owner: hinanohart
License: mit
Created: 2026-05-22T21:53:47.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-06-10T13:33:25.000Z (14 days ago)
Last Synced: 2026-06-10T14:08:36.162Z (14 days ago)
Topics: alignment, hymba, interpretability, mamba, mamba-2, mechanistic-interpretability, pytorch, rwkv, sae, sparse-autoencoder, ssm, state-space-model, transformer-alternatives
Language: Python
Homepage: https://github.com/hinanohart/circuitbench/releases/latest
Size: 105 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Notice: NOTICE


# circuitbench

[![CI](https://github.com/hinanohart/circuitbench/actions/workflows/ci.yml/badge.svg)](https://github.com/hinanohart/circuitbench/actions/workflows/ci.yml)

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)

**Mechanistic interpretability + sparse autoencoder framework for Hybrid SSM-Attention models**, with first-class support for pure SSMs.

Where TransformerLens / SAELens dominate Transformer interpretability, circuitbench fills the gap for **post-Transformer architectures**: Mamba-2, Hymba, Jamba, Falcon-H1, RWKV-7.

> **v0.1.x scope.** v0.1.x ships the **API surface + CPU `MockSSMAdapter`** so the harness is end-to-end runnable without GPUs or model downloads. Real model weights, JumpReLU SAEs, and step-wise `h_t` patching land in **v0.2**. See [Status](#status) for the precise per-component split.

---

## What is this?

circuitbench is a research harness for understanding *how* state-space models (SSMs) compute. It provides four integrated operations over a common hook-point abstraction that maps onto Mamba-2's internal tensor sites:

- **`load_model`** — adapter registry; v0.1.x ships `MockSSMAdapter` (CPU); real weights in v0.2

- **`train_sae`** — TopK sparse autoencoders trained on SSM-specific hook points

- **`extract_circuit`** — coarse layer-level mean-ablation activation patching

- **`steer`** — additive feature-direction intervention during the forward pass

The same API surface works identically for the CPU mock (available now) and for real model weights (v0.2).
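To make the `train_sae` bullet concrete, here is a minimal NumPy sketch of a TopK sparse autoencoder forward pass with unit-norm decoder rows, matching the defaults listed under Status (TopK, 8× expansion, decoder unit-norm). It is an illustration under stated assumptions, not circuitbench's implementation: the parameter names (`W_enc`, `W_dec`, `b_enc`, `b_dec`), the helper functions, and the toy sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in = 16                   # width of the hooked activation (toy value, assumed)
expansion, k = 8, 4         # README defaults are expansion=8, k=32; k is shrunk for the toy width
d_sae = d_in * expansion    # dictionary size

# Hypothetical parameters; a real SAE would learn these from captured activations.
W_enc = rng.normal(size=(d_in, d_sae)) / np.sqrt(d_in)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_in))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)   # each decoder row has unit norm
b_dec = np.zeros(d_in)


def topk_encode(x: np.ndarray) -> np.ndarray:
    """Keep the k largest pre-activations per row and zero out the rest."""
    pre = (x - b_dec) @ W_enc + b_enc
    # argpartition leaves the indices of the k largest entries in the last k slots
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(codes, 0.0)   # ReLU on the surviving entries


def decode(codes: np.ndarray) -> np.ndarray:
    """Reconstruct the activation as a sparse combination of decoder rows."""
    return codes @ W_dec + b_dec


x = rng.normal(size=(5, d_in))      # five fake activation vectors
codes = topk_encode(x)
x_hat = decode(codes)
print(codes.shape, np.count_nonzero(codes, axis=-1))
```

Because at most `k` features are active per input, each reconstruction is a combination of at most `k` decoder rows, which is what makes a single `feature_id` usable as a steering direction.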

---

## Why circuitbench

SSMs and hybrid SSM-attention models have grown into a serious alternative to pure Transformers, but the mechanistic interpretability tooling has not caught up. Existing libraries either:

- Hard-code Transformer-only assumptions (residual streams indexed by layer × position), or

- Provide raw hooks without SAE training, circuit discovery, or steering glue.

circuitbench provides one integrated harness for all four: model loading with hooks, SAE training, circuit extraction, and steering.

---

## Install

v0.1.x is alpha and **not yet on PyPI** (publication is planned for v0.2 via a trusted publisher). Install from source:

```bash
git clone https://github.com/hinanohart/circuitbench.git
cd circuitbench
pip install -e .                       # core only (torch + numpy + einops + jaxtyping + pydantic)
pip install -e ".[ssm,hf]"             # placeholders for v0.2: mamba-ssm + HF transformers (GPU)
pip install -e ".[sae]"                # placeholder for v0.2: SAELens interop
pip install -e ".[dev]"                # development (pytest, ruff, mypy)
```

`[ssm]` / `[hf]` / `[sae]` installs do **not yet** unlock real backends in v0.1.x — they are reserved so the install path stays stable across the v0.1 → v0.2 transition.

---

## Quick start

```python
from circuitbench import load_model, train_sae, extract_circuit, steer

# v0.1.x ships a CPU-only MockSSMAdapter; real Mamba-2 / Hymba weights arrive
# in v0.2 (need `mamba-ssm` + GPU). The API surface is identical either way.
model = load_model("mock://mamba2-tiny", hook_point="out_proj_in")

sae = train_sae(model, layer=1, k=32, expansion=8, tokens=2048, batch_size=64)
circuit = extract_circuit(model, prompt="Paris is the capital of", target="France")
out = steer(model, prompt="Hello", feature_id=42, strength=2.0, sae=sae, layer=1)

print(circuit.top_layers(n=3))           # [(layer, ablation effect), ...]
print(out.delta_norm)                    # L2 shift in final output under steering
```

See [`examples/`](examples/):

- `01_load.py`, `02_train_sae.py`, `03_steer.py` — runnable on CPU in seconds

- `titans_hook.py` — v0.2 contrib stub (prints a marker; raises `NotImplementedError` when called)
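The `steer` call in the quick start is described as an additive feature-direction intervention, with `delta_norm` reported as the L2 shift in the final output. The NumPy sketch below shows one way to read that: add `strength` times an SAE decoder row to the hooked activation, run the rest of the forward pass, and compare against the clean run. Everything here (`forward_from_hook`, `steer_sketch`, the random weights, the toy sizes) is a hypothetical stand-in, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_inner, d_model, d_sae = 8, 4, 64   # toy sizes (assumed)

W_out = rng.normal(size=(d_inner, d_model))     # stand-in for out_proj
W_dec = rng.normal(size=(d_sae, d_inner))       # stand-in for a trained SAE decoder
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)


def forward_from_hook(act: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Everything downstream of the hook: out_proj plus the skip connection."""
    return act @ W_out + residual


def steer_sketch(act, residual, feature_id, strength):
    """Add strength * decoder direction at the hook and measure the output shift."""
    direction = W_dec[feature_id]                # shape (d_inner,)
    steered = act + strength * direction         # broadcasts over batch and position
    clean = forward_from_hook(act, residual)
    out = forward_from_hook(steered, residual)
    return out, float(np.linalg.norm(out - clean))


act = rng.normal(size=(1, 3, d_inner))           # (B, L, d_inner) activation at the hook
residual = rng.normal(size=(1, 3, d_model))      # (B, L, d_model) block input
out, delta_norm = steer_sketch(act, residual, feature_id=42, strength=2.0)
_, zero_delta = steer_sketch(act, residual, feature_id=42, strength=0.0)
print(delta_norm, zero_delta)
```

In this toy setting everything downstream of the hook is linear, so the shift scales linearly with `strength`; in a multi-layer model the later layers are nonlinear and the relationship need not hold.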

---

## How it works

### Hook points (SSM-specific)

circuitbench defines five hook sites that map onto Mamba-2's internal computation path. The data flow inside each SSM block is:

```
x → x_proj [H2] → split(u, z, s)
                   └── u → conv1d ──→ c [H5]
                                      └── ssm(c, s) [H4: state h] → ssm_y [H3]
                                                                     └── gate(z)
                                                                         └── post_gate [H1] → out_proj → +x → output
```

| ID | Location | Shape | Capture | Additive Intervention | Substitution |
|----|----------|-------|---------|-----------------------|--------------|
| H1 | `out_proj_in` (post-gate, pre-projection) | `(B, L, d_inner)` | ✅ | ✅ | ✅ |
| H2 | `x_proj` (gate/input/dt projection) | `(B, L, 2*d_inner + d_state)` | ✅ | ✅ | ✅ |
| H3 | `ssm_y` (SSM output, pre-gate) | `(B, L, d_inner)` | ✅ | ✅ | ✅ |
| H4 | `hidden_state_h` (the SSM state itself) | `(B, L, D, N)` | ✅ | ✅ | v0.2 |
| H5 | `conv1d` (short-conv branch output) | `(B, L, d_inner)` | ✅ | ✅ | ✅ |

Default for SAE training: **H1** (post-gate, pre-projection) — analogous to a Transformer's residual stream input.

H1 and H3 are **distinct tensors**: the gate `y * sigmoid(z)` sits between them.
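The data flow and shapes above can be traced with a toy NumPy block that records a tensor at each of the five hook sites. This is a sketch under loud assumptions: the recurrence is a simplified linear scan standing in for the real selective scan, `D = d_inner` and `N = d_state` are assumed for the H4 shape, and the weights, kernel width, decay constant, and function names are all hypothetical. It is not the `MockSSMAdapter`.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, d_model, d_inner, d_state = 2, 6, 4, 8, 3   # toy sizes (assumed)

W_x = rng.normal(size=(d_model, 2 * d_inner + d_state)) * 0.3
W_out = rng.normal(size=(d_inner, d_model)) * 0.3
conv_kernel = rng.normal(size=(3, d_inner)) * 0.3  # depthwise kernel, width 3 (assumed)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def toy_block(x):
    """Run one toy SSM block and capture the tensor at every hook site."""
    batch, length, _ = x.shape
    cache = {}

    proj = x @ W_x
    cache["x_proj"] = proj                                   # H2
    u, z, s = np.split(proj, [d_inner, 2 * d_inner], axis=-1)

    # Causal depthwise conv: position t only sees u[t-2], u[t-1], u[t].
    width = conv_kernel.shape[0]
    padded = np.pad(u, ((0, 0), (width - 1, 0), (0, 0)))
    c = sum(padded[:, i:i + length] * conv_kernel[i] for i in range(width))
    cache["conv1d"] = c                                      # H5

    # Toy linear recurrence over the sequence; the state has shape (B, D, N).
    h = np.zeros((batch, d_inner, d_state))
    states, ys = [], []
    for t in range(length):
        h = 0.9 * h + c[:, t, :, None] * s[:, t, None, :]
        states.append(h)
        ys.append(np.einsum("bdn,bn->bd", h, s[:, t]))
    cache["hidden_state_h"] = np.stack(states, axis=1)       # H4: (B, L, D, N)
    y = np.stack(ys, axis=1)
    cache["ssm_y"] = y                                       # H3

    post_gate = y * sigmoid(z)                               # the gate sits between H3 and H1
    cache["out_proj_in"] = post_gate                         # H1
    return post_gate @ W_out + x, cache


x = rng.normal(size=(B, L, d_model))
out, cache = toy_block(x)
for name, tensor in cache.items():
    print(f"{name:15s} {tensor.shape}")
```

Comparing `cache["ssm_y"]` with `cache["out_proj_in"]` shows why H1 and H3 are separate sites: the two tensors share a shape but differ by the elementwise gate.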

### Circuit extraction (v0.1.x)

For each candidate layer `L`, circuitbench:

1. Runs a clean forward pass and captures the activation at `(L, hook_point)`.

2. Replaces that activation with its sequence mean (mean-ablation) and re-runs the forward.

3. Records `‖clean_output − ablated_output‖₂` as the layer's effect score.

The layer whose ablation produces the largest shift is ranked as the one the output depends on most for that prompt.

Step-wise `h_t` patching and target-logit projection are planned for v0.2.
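The three steps above can be sketched on a toy residual stack in NumPy. This is an illustration only: the model, the `forward`, `layer_effects`, and `top_layers` helpers, and the sizes are hypothetical, and each layer's branch output stands in for the activation at `(L, hook_point)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, seq_len, d = 4, 5, 8    # toy sizes (assumed)
weights = [rng.normal(size=(d, d)) * 0.5 for _ in range(n_layers)]


def forward(x, ablate_layer=None):
    """Toy residual stack; each layer's branch output plays the hooked activation."""
    for i, W in enumerate(weights):
        act = np.tanh(x @ W)
        if i == ablate_layer:
            # Mean-ablation: every position receives the sequence-mean activation.
            # Earlier layers are untouched, so this equals the clean activation's mean.
            act = np.broadcast_to(act.mean(axis=0, keepdims=True), act.shape)
        x = x + act
    return x


def layer_effects(x):
    """L2 distance between the clean output and each single-layer ablated output."""
    clean = forward(x)
    return [
        (i, float(np.linalg.norm(clean - forward(x, ablate_layer=i))))
        for i in range(n_layers)
    ]


def top_layers(effects, n=3):
    """Rank layers by effect score, largest first."""
    return sorted(effects, key=lambda item: item[1], reverse=True)[:n]


x = rng.normal(size=(seq_len, d))
effects = layer_effects(x)
ranking = top_layers(effects, n=3)
print(ranking)

# Sanity check: a sequence with identical positions already equals its own mean.
x_const = np.tile(x[:1], (seq_len, 1))
const_effects = layer_effects(x_const)
```

The score measures how much position-specific information at a layer matters for the output. A layer whose activation is nearly constant across the sequence scores close to zero even if it carries a large signal, which is one reason the README calls this patching coarse.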

### Architecture



  



---

## Differentiation (design goals — implementation status)

| Axis | Status |
|------|--------|
| **Hybrid head separation SAE** (Hymba / Jamba attention vs SSM heads trained separately) | design goal, v0.2 |
| **`ssm_state` direct SAE** (SAE over `h_t ∈ (B, L, D, N)`) | capture shipped (mock); full SAE training v0.2 |
| **State-propagation circuit** (step-wise patching for recurrent models) | coarse layer-level mean-ablation shipped; step-wise `h_t` v0.2 |
| **RWKV-7 first-class** (loader + hook points for RWKV-7 Goose) | design goal, v0.2 |

---

## Status

| Component | v0.1.x (shipped) | v0.2 (planned) |
|-----------|------------------|----------------|
| `load_model` (registry + `MockSSMAdapter` CPU backend) | ✅ shipped | + real Mamba-2 / Hymba / Jamba / Falcon-H1 / RWKV-7 weights |
| `train_sae` (TopK, k=32, 8× expansion, decoder unit-norm) | ✅ shipped | + JumpReLU, dead-feature resample (200k step) |
| `extract_circuit` (coarse layer-level **mean-ablation** patching) | ✅ shipped | + step-wise `h_t` patching, target-logit projection, hybrid head separation |
| `steer` (additive feature intervention **during** forward) | ✅ shipped | + composable interventions, beam search |
| HF Hub SAE distribution (SAELens-compatible) | planned | shipped |
| Multi-Agent SAE | namespace reserved | shipped |
| PyPI publish | install from source | shipped (trusted publisher) |
| arXiv preprint (v0.1 harness paper) | deferred | shipped |

---

## Acknowledgments

**Inspired by** (no runtime dependency in v0.1.x — these projects are *not* imported; `[sae]` extra reserves SAELens interop for v0.2):

- [SAELens](https://github.com/jbloomAus/SAELens) — production SAE library; circuitbench's v0.2 will export SAEs in a SAELens-compatible format

- [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens) — hook-based interpretability primitives

- [MambaLens](https://github.com/Phylliida/MambaLens) — early Mamba interpretability work

- [mamba-ssm](https://github.com/state-spaces/mamba) — official Mamba/Mamba-2 reference implementation

---

## Related projects

Part of [hinanohart](https://github.com/hinanohart)'s open-source portfolio:

- [transduce](https://github.com/hinanohart/transduce) — composable transducer streams

- [exitkit](https://github.com/hinanohart/exitkit) — Nozick closest-continuer model identity over PAM snapshots

- [subjunctor](https://github.com/hinanohart/subjunctor) — Nozick-grounded LLM agent gate

---

## License

MIT — see [LICENSE](LICENSE).
