{"id":51018661,"url":"https://github.com/ashlrai/mechanistic-interpretability","last_synced_at":"2026-06-21T14:01:33.194Z","repository":{"id":360826007,"uuid":"1246294884","full_name":"ashlrai/mechanistic-interpretability","owner":"ashlrai","description":"Local agent-driven mechanistic interpretability research platform for Apple Silicon","archived":false,"fork":false,"pushed_at":"2026-05-28T02:30:09.000Z","size":3600,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T03:23:33.948Z","etag":null,"topics":["abliteration","acdc","activation-patching","ai-safety","apple-silicon","interpretability","mech-interp","mechanistic-interpretability","sparse-autoencoders","transformer-lens"],"latest_commit_sha":null,"homepage":"https://ashlrai.github.io/mechanistic-interpretability","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashlrai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-22T04:00:09.000Z","updated_at":"2026-05-28T02:29:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ashlrai/mechanistic-interpretability","commit_stats":null,"previous_names":["ashlrai/mechanistic-interpretability"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ashlrai/mechanistic-interpretability","repository_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashlrai%2Fmechanistic-interpretability","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashlrai%2Fmechanistic-interpretability/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashlrai%2Fmechanistic-interpretability/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashlrai%2Fmechanistic-interpretability/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashlrai","download_url":"https://codeload.github.com/ashlrai/mechanistic-interpretability/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashlrai%2Fmechanistic-interpretability/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34610832,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abliteration","acdc","activation-patching","ai-safety","apple-silicon","interpretability","mech-interp","mechanistic-interpretability","sparse-autoencoders","transformer-lens"],"created_at":"2026-06-21T14:01:30.948Z","updated_at":"2026-06-21T14:01:33.188Z","avatar_url":"https://github.com/ashlrai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
Mechanistic Interpretability Platform\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![CI](https://github.com/ashlrai/mechanistic-interpretability/actions/workflows/ci.yml/badge.svg)](https://github.com/ashlrai/mechanistic-interpretability/actions/workflows/ci.yml)\n[![Docs](https://img.shields.io/badge/docs-ashlrai.github.io-blue)](https://ashlrai.github.io/mechanistic-interpretability)\n[![Tests](https://img.shields.io/badge/tests-685%20passing-brightgreen)](https://github.com/ashlrai/mechanistic-interpretability/actions)\n[![mypy: strict](https://img.shields.io/badge/mypy-strict-blue)](https://mypy-lang.org/)\n[![ruff: clean](https://img.shields.io/badge/ruff-clean-brightgreen)](https://docs.astral.sh/ruff/)\n[![Version](https://img.shields.io/badge/version-v0.1.0--preview-orange)](CITATION.cff)\n[![Python](https://img.shields.io/badge/python-3.12%2B-blue)](pyproject.toml)\n\nLocal mechanistic interpretability research at the speed of curiosity — 15 experiment families, closed-loop agentic followup, and a reproducibility receipt on every run.\n\nReproduce canonical circuits, audit refusal on any instruct model, train SAEs, and run multi-model sweeps — all on a MacBook Pro, with SQLite-backed run history and a 19-second quickstart.\n\n---\n\n## In 19 seconds\n\n```bash\nuv sync --group dev --extra interp\nmech demo\n```\n\n```\nmech demo — running 3 experiments on gpt2-small …\nExperiments complete in 19.2s\n\n╭──────────────────── mech demo — gpt2-small factual recall ────────────────────╮\n│  Experiment               Finding                                    Value     │\n│  Direct Logit Attribution Top writing component: L0_mlp            +2.841     │\n│  Logit Lens               rank drops over 12 layers (never top-5)  rank 37    │\n│  Circuit Patching         Top causal site: L8·resid_pre            93% recov. 
│\n╰───────────────────────────────────────────────────────────────────────────────╯\n```\n\nDLA decomposes every component's contribution in one forward pass. Logit Lens tracks\nhow the model's best guess evolves layer by layer. Circuit Patching causally verifies\nthe top DLA site. All results are deterministic (seed=42).\n\n---\n\n## Headline Findings\n\n### SAE Features Are Not Reproducible Across Seeds\n\nTraining five identical SAEs on gpt2-small (128 features, k=8, same corpus) and\naligning them with the Hungarian algorithm reveals:\n\n| Condition | Layer | Median cosine | Stable @ ≥ 0.9 |\n|---|---|---|---|\n| Full matrix | 0 | 0.095 | 0.16% (2/1280) |\n| Live features only | 0 | **0.500** | 0.48% |\n| Live features only | 6 | 0.323 | 0.00% |\n| 512 features, live | 6 | 0.257 | 0.00% |\n\n**No condition crosses the 0.9 \"same feature\" threshold.** Fixing dead-feature\ninflation raises the layer-0 median to 0.50 — meaningful overlap, not reproducibility.\nMid-network representations (layer 6) are *less* stable despite being more structured.\nPublished feature descriptions describe one training run, not the model.\n\n→ [`docs/investigations/sae_replication_crisis.md`](docs/investigations/sae_replication_crisis.md)\n\n---\n\n### The Qwen Abliteration Recipe Fails\n\nFour-stage mechanistic audit of `Qwen2.5-1.5B-Instruct` refusal:\n\n| Stage | Key number |\n|---|---|\n| Direction extraction quality (layer 10) | **4.105** — clean linear separation |\n| CAA steering: best behavioral shift | +0.33 (1/3 prompts, only at coeff −3.0) |\n| Circuit patching: top site recovery | 1.037 (`blocks.11.hook_resid_post`) |\n| Attention heads at same layers | 0.02–0.13 (near zero) |\n| Causal scrubbing faithfulness | **0.041** — hypothesis formally rejected |\n\nRefusal is linearly separable in the residual stream at layers 10–12, but the\nattention heads at those layers write almost none of it. 
The standard Arditi/RepE\nabliteration recipe — find the direction, ablate the attention head weights that\nwrite it — produces faithfulness 0.04 under formal scrubbing. The recipe's\nmechanistic assumption does not transfer to this checkpoint.\n\n→ [`docs/investigations/refusal_audit.md`](docs/investigations/refusal_audit.md)\n\n---\n\n### GPT-2 Factual Recall Localized to a 4-Site Circuit\n\nSix experiment families chained on gpt2-small factual recall:\n\n| Evidence | Finding |\n|---|---|\n| Logit Lens | Correct token rank: 375 at L8 → **12.8 at L9** (phase transition) |\n| DLA | L9.MLP writes **+9.29**; L8.MLP suppresses −3.15 |\n| Attribution patching | `blocks.11.hook_resid_pre` top site, \\|score\\| 5.04 |\n| Circuit patching | Recovery fraction = 1.0 at resid_pre L8–L11 |\n| SAE at L9 | Geographic cluster: features 194, 212, 43 activate \u003e 1080 on Paris docs |\n| Causal scrubbing | 4-site circuit faithfulness **0.720** |\n\nThe 4-site hypothesis (L8.mlp_out + L9.resid_pre + L9.attn.z + L10.resid_pre)\ncaptures 72% of model behavior. The residual 28% lives in earlier attention heads\n(L5–L7) not yet patched.\n\n→ [`docs/investigations/gpt2_factual_recall.md`](docs/investigations/gpt2_factual_recall.md)\n\n---\n\n## What You Can Do Today\n\n| Task | Command |\n|---|---|\n| Reproduce IOI name-mover heads (Wang et al. 
2022) | `mech run --name acdc-edge-ioi-gpt2-small` |\n| Audit refusal on any instruct model | `mech audit-refusal` |\n| Browse pretrained SAEs | `mech list-saes` → `mech download-sae` → `mech analyze-sae` |\n| Apply steering vectors side-by-side | `mech apply-steering --vector \u003cname\u003e` |\n| Train your own SAE | `mech run --name polysemanticity-sae-smoke` |\n| Use any HuggingFace model | `backend: huggingface` in experiment YAML |\n| Multi-model sweep | `mech sweep --axis \"parameters.model=...\"` |\n| Interactive 4-panel UI | `mech gradio` |\n| Closed-loop agentic research | `mech iterate-from-run --family polysemanticity_sae --artifact-dir \u003crun\u003e --max-depth 2` |\n\nDiscover all 37 commands: `mech help`\n\n---\n\n## Screenshots\n\n\u003c!-- TODO: capture Gradio demo screenshot:\n     mech gradio --port 7860 \u0026\n     open http://localhost:7860\n     # type prompt, run analysis, screenshot the 4-panel layout\n     # save to docs/images/gradio_demo.png (1280x720) --\u003e\n\n![Gradio UI](docs/images/gradio_demo.png)\n\n\u003c!-- TODO: capture Cockpit dashboard:\n     mech cockpit\n     open http://localhost:8000\n     # navigate to a completed SAE or circuit run\n     # save to docs/images/cockpit.png (1280x720) --\u003e\n\n---\n\n## Install\n\n```bash\n# Base dev environment\nuv sync --group dev\n\n# Add TransformerLens + SAE backends\nuv sync --group dev --extra interp\n\n# Apple Silicon MLX support\nuv sync --group dev --extra interp --extra apple\n\ncp .env.example .env   # optional shell-level defaults\n```\n\nRun the local check gate:\n\n```bash\nbash scripts/check.sh\n```\n\nFull walkthrough: [`notebooks/05_research_walkthrough.ipynb`](notebooks/05_research_walkthrough.ipynb)\n\n---\n\n## Architecture\n\nTwo model access tiers:\n\n1. **Instrumented backends** — TransformerLens (first-class), nnsight, MLX-native.\n   Expose activations, hooks, and interventions required for mech-interp.\n2. 
**Generation providers** — Ollama (`http://localhost:11434`), LM Studio\n   (`http://localhost:1234/v1`). OpenAI-compatible black-box generation for\n   baselines and dataset construction only.\n\nCore packages:\n\n| Package | Role |\n|---|---|\n| `mech_interp.backends` | Instrumented model adapters |\n| `mech_interp.experiments` | Spec registry, 15 experiment families |\n| `mech_interp.storage` | SQLite run metadata + artifact locations |\n| `mech_interp.orchestration` | Local batch planning, resource policy |\n| `mech_interp.datasets` | Prompt loaders, reproducibility hashes |\n| `mech_interp.providers` | Black-box generation adapters |\n| `mech_interp.config` | YAML configuration loading |\n\n```text\nYAML spec → registry → runner → seed + env fingerprint → family → SQLite + artifacts\n```\n\nEnvironment fingerprints (`torch` version, `uv.lock` SHA, seed, model name) are\nwritten before execution — every result carries a reproducibility receipt.\n\n```text\n.\n├── configs/          # Backend/model/experiment settings\n├── experiments/      # Runnable experiment spec files\n├── artifacts/        # Run metadata, tensors, reports, logs\n├── data/             # Prompt corpora and datasets\n├── notebooks/        # Exploratory analysis\n├── scripts/          # check.sh, smoke.sh, helpers\n└── src/mech_interp/  # Python package\n```\n\n---\n\n## Acknowledgments and Citation\n\nThis platform implements and extends:\n\n- Wang et al. (2022) — [Interpretability in the Wild: IOI](https://arxiv.org/abs/2211.00593) · `acdc_edge`, `acdc_lite`\n- Bricken et al. (2023) — [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemanticity/) · `polysemanticity_sae`\n- Conmy et al. (2023) — [Automated Circuit Discovery](https://arxiv.org/abs/2304.14997) · `acdc_lite`, `acdc_edge`\n- Arditi et al. (2024) — [Refusal in LLMs](https://arxiv.org/abs/2406.11717) · `refusal_direction`, `mech audit-refusal`\n- Gao et al. 
(2024) — [Scaling and Evaluating SAEs](https://arxiv.org/abs/2408.05147) · Top-K SAE implementation\n\nTo cite this platform, see [`CITATION.cff`](CITATION.cff).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashlrai%2Fmechanistic-interpretability","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashlrai%2Fmechanistic-interpretability","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashlrai%2Fmechanistic-interpretability/lists"}