{"id":50548958,"url":"https://github.com/designer-coderajay/glassbox-mech","last_synced_at":"2026-06-04T01:03:01.850Z","repository":{"id":339746602,"uuid":"1162660518","full_name":"designer-coderajay/glassbox-mech","owner":"designer-coderajay","description":"Open-source EU AI Act Annex IV compliance toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a court-ready evidence package","archived":false,"fork":false,"pushed_at":"2026-05-31T19:44:57.000Z","size":2326,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T21:13:04.272Z","etag":null,"topics":["alignment","annex-iv","attribution-patching","black-box-testing","circuit-discovery","compliance-audit","eu-ai-act","explainability","fastapi","gpt2","llm-compliance","logit-lens","mcp","mechanistic-interpretability","pytorch","regulatory-compliance","sae","sparse-autoencoders","transformer-circuits","transformerlens"],"latest_commit_sha":null,"homepage":"https://repo-ashen-psi.vercel.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/designer-coderajay.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP_V4.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-20T14:39:29.000Z","updated_at":"2026-05-31T19:45:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/designer-code
rajay/glassbox-mech","commit_stats":null,"previous_names":["designer-coderajay/glassbox-ai-2.0-mechanistic-interpretability-tool","designer-coderajay/glassbox-ai-2.0-mechanistic-interpretability-eu-compliance-tool"],"tags_count":47,"template":false,"template_full_name":null,"purl":"pkg:github/designer-coderajay/glassbox-mech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/designer-coderajay%2Fglassbox-mech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/designer-coderajay%2Fglassbox-mech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/designer-coderajay%2Fglassbox-mech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/designer-coderajay%2Fglassbox-mech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/designer-coderajay","download_url":"https://codeload.github.com/designer-coderajay/glassbox-mech/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/designer-coderajay%2Fglassbox-mech/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33886159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","annex-iv","attribution-patching","black-
box-testing","circuit-discovery","compliance-audit","eu-ai-act","explainability","fastapi","gpt2","llm-compliance","logit-lens","mcp","mechanistic-interpretability","pytorch","regulatory-compliance","sae","sparse-autoencoders","transformer-circuits","transformerlens"],"created_at":"2026-06-04T01:03:00.506Z","updated_at":"2026-06-04T01:03:01.836Z","avatar_url":"https://github.com/designer-coderajay.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"docs/glassbox-brand.png\" alt=\"Glassbox AI — Brand Identity\" width=\"720\" style=\"max-width:100%;margin-bottom:8px\"/\u003e\n\n# Glassbox 4.3.0\n\n**Open-source EU AI Act Annex IV compliance documentation toolkit. Works on any LLM.**\n**21 mathematical frameworks. ACDC + GQA/RMSNorm multi-arch + cross-model comparison. Production-ready.**\n\n[![PyPI version](https://img.shields.io/pypi/v/glassbox-mech-interp?color=blue)](https://pypi.org/project/glassbox-mech-interp/)\n[![PyPI downloads](https://img.shields.io/pypi/dm/glassbox-mech-interp?color=blue\u0026label=downloads%2Fmonth)](https://pypistats.org/packages/glassbox-mech-interp)\n[![GitHub last commit](https://img.shields.io/github/last-commit/designer-coderajay/glassbox-mech?color=green)](https://github.com/designer-coderajay/glassbox-mech/commits/main)\n[![GitHub issues](https://img.shields.io/github/issues/designer-coderajay/glassbox-mech)](https://github.com/designer-coderajay/glassbox-mech/issues)\n[![Live Analytics](https://img.shields.io/badge/Live%20Analytics-ClickHouse-FFCC01?logo=clickhouse\u0026logoColor=black)](https://clickpy.clickhouse.com/dashboard/glassbox-mech-interp)\n[![License: MIT](https://img.shields.io/badge/Core-MIT-green.svg)](LICENSE) [![License: BSL 1.1](https://img.shields.io/badge/Compliance%20Engine-BSL%201.1-orange.svg)](LICENSE-COMMERCIAL) [![Patents Pending](https://img.shields.io/badge/Patents-Pending-blue.svg)](PATENTS.md)\n[![Python 
3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/)\n[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Live%20Demo-yellow)](https://huggingface.co/spaces/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool)\n[![Website](https://img.shields.io/badge/Website-glassbox--ai-blue)](https://repo-ashen-psi.vercel.app)\n[![arXiv](https://img.shields.io/badge/arXiv-2603.09988-b31b1b?logo=arxiv)](https://arxiv.org/abs/2603.09988)\n[![Tests](https://github.com/designer-coderajay/glassbox-mech/actions/workflows/tests.yml/badge.svg)](https://github.com/designer-coderajay/glassbox-mech/actions/workflows/tests.yml)\n\n[**Website**](https://repo-ashen-psi.vercel.app) · [**Live Demo**](https://huggingface.co/spaces/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool) · [**Paper**](https://arxiv.org/abs/2603.09988) · [**PyPI**](https://pypi.org/project/glassbox-mech-interp/) · [**GitHub**](https://github.com/designer-coderajay/glassbox-mech)\n\n\u003c/div\u003e\n\n---\n\n**For compliance teams:** Regulation (EU) 2024/1689 (AI Act) requires Annex IV technical documentation for every high-risk AI system (Article 11). Enforcement begins August 2026. Glassbox automates generation of the full 9-section Annex IV draft — from open-source models (white-box) or any proprietary API like GPT-4 and Claude (black-box). Outputs are structured documentation aids; they do not constitute legal advice, a declaration of conformity, or a guarantee of regulatory compliance. See [Legal Notices](#legal-notices--regulatory-disclaimer).\n\n**For researchers:** one function call discovers the minimum faithful circuit in a transformer — the smallest subgraph of attention heads causally responsible for a prediction. Preliminary benchmarks show 15–37× faster than ACDC on GPT-2 (single-run, Apple M2 Pro — see [Benchmarks](#benchmarks)). 
Every approximation is disclosed.\n\n---\n\n## Table of Contents\n\n- [Live Services](#live-services)\n- [Quickstart](#quickstart)\n- [**How to Get Your EU AI Act Annex IV Compliance Proof**](#how-to-get-your-eu-ai-act-annex-iv-compliance-proof)\n- [What's New in v4.2.0](#whats-new-in-v420)\n- [What's New in v4.1.0](#whats-new-in-v410)\n- [What's New in v4.0.0](#whats-new-in-v400)\n- [What's New in v3.7.0](#whats-new-in-v370)\n- [What's New in v3.6.0](#whats-new-in-v360)\n- [What's New in v3.5.0](#whats-new-in-v350)\n- [What's New in v3.4.0](#whats-new-in-v340)\n- [What's New in v3.3.0](#whats-new-in-v330)\n- [What's New in v3.1.0](#whats-new-in-v310)\n- [What's New in v3.0.0](#whats-new-in-v300)\n- [EU AI Act Compliance — Annex IV Reports](#eu-ai-act-compliance--annex-iv-reports)\n- [Black-Box Audit — Any Model via API](#black-box-audit--any-model-via-api)\n- [REST API (Hosted)](#rest-api-hosted)\n- [What's Novel](#whats-novel)\n- [How It Works](#how-it-works)\n- [Benchmarks](#benchmarks)\n- [Usage Examples](#usage-examples)\n- [CLI](#cli)\n- [Installation](#installation)\n- [Dashboard](#dashboard)\n- [Self-Hosting (Docker / Air-Gapped VPC)](#self-hosting-docker--air-gapped-vpc)\n- [Supported Models](#supported-models)\n- [API Reference](#api-reference)\n- [Methodology \u0026 IP Documentation](#methodology--ip-documentation)\n- [Mathematical Disclosures](#mathematical-disclosures)\n- [Mathematical Foundations Reference](#mathematical-foundations-reference)\n- [Cross-Model Faithfulness Study](#cross-model-faithfulness-study)\n- [Paper](#paper)\n- [Citation](#citation)\n- [Related Tools](#related-tools)\n- [Security \u0026 Privacy](#security--privacy)\n- [Legal Notices \u0026 Regulatory Disclaimer](#legal-notices--regulatory-disclaimer)\n- [Project \u0026 Privacy Notice](#project--privacy-notice)\n- [License](#license)\n\n---\n\n## Live Services\n\n| Service | URL | Description |\n|---------|-----|-------------|\n| **Website** | 
[repo-ashen-psi.vercel.app](https://repo-ashen-psi.vercel.app) | Marketing site — features, pricing, code examples. Always up. |\n| **Live Demo** | [HuggingFace Space](https://huggingface.co/spaces/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool) | Interactive circuit analysis on open-source models. No install needed. |\n| **PyPI Package** | [glassbox-mech-interp](https://pypi.org/project/glassbox-mech-interp/) | `pip install glassbox-mech-interp` — v4.2.6 |\n| **Self-Hosted API** | [See Docker guide](#self-hosting-docker--air-gapped-vpc) | Deploy the REST API on your own infra or Railway. |\n\n---\n\n## Quickstart\n\n```bash\npip install glassbox-mech-interp\n```\n\n```python\nfrom transformer_lens import HookedTransformer\nfrom glassbox import GlassboxV2\n\nmodel = HookedTransformer.from_pretrained(\"gpt2\")\ngb    = GlassboxV2(model)\n\nresult = gb.analyze(\n    prompt    = \"When Mary and John went to the store, John gave a drink to\",\n    correct   = \" Mary\",\n    incorrect = \" John\",\n)\n\nprint(result[\"circuit\"])\n# [(9, 9), (9, 6), (10, 0), (8, 6), ...]   \u003c- (layer, head) tuples\n\nprint(result[\"faithfulness\"])\n# {'sufficiency': 0.80,          # Taylor approximation (fast, suff_is_approx=True)\n#  'comprehensiveness': 0.37,    # exact ablation\n#  'f1': 0.49,\n#  'category': 'backup_mechanisms',\n#  'suff_is_approx': True}       # True = approx; use bootstrap_metrics() for exact ~100%\n```\n\nNo model weights? Use the [live HuggingFace demo](https://huggingface.co/spaces/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool) — no install required.\n\n---\n\n## How to Get Your EU AI Act Annex IV Compliance Proof\n\nThis is the most common question we receive. Here is the complete answer, step by step.\n\n### Who this applies to\n\nEU AI Act Article 6 + Annex III defines **high-risk AI systems** that require Annex IV technical documentation before August 2, 2026. 
You are almost certainly in scope if your LLM is used in:\n\n- **Finance** — credit scoring, loan decisions, fraud detection, insurance underwriting\n- **Healthcare** — medical triage, clinical decision support, diagnosis assistance\n- **Employment** — CV screening, hiring decisions, performance assessment\n- **Education** — exam marking, admissions, student assessment\n- **Law enforcement / border control** — risk profiling, identity verification\n- **Critical infrastructure** — energy grid, water, transport management\n\nNon-compliance: up to **€35 million or 7% of global annual turnover** (Article 99).\n\n\u003e **Legal disclaimer.** Glassbox generates structured technical documentation intended to support — not replace — the legal and regulatory review required under EU AI Act Article 11. Whether your system is high-risk under Article 6/Annex III, and whether this documentation satisfies all applicable obligations, must be confirmed by qualified legal counsel and/or a notified body (Article 43). See [Legal Notices](#legal-notices--regulatory-disclaimer).\n\n---\n\n### Step 1 — Install\n\n```bash\n# Core library + compliance module\npip install \"glassbox-mech-interp[compliance]\"\n\n# If running your own model (not GPT-2):\npip install transformer-lens torch\n```\n\n---\n\n### Step 2 — Load your model\n\nGlassbox wraps any HuggingFace-compatible transformer via TransformerLens. 
One line:\n\n```python\nfrom transformer_lens import HookedTransformer\nfrom glassbox import GlassboxV2\n\n# GPT-2 (quickstart — no API key needed, downloads ~500 MB once)\nmodel = HookedTransformer.from_pretrained(\"gpt2\")\n\n# Your own model from HuggingFace Hub\nmodel = HookedTransformer.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\nmodel = HookedTransformer.from_pretrained(\"mistralai/Mistral-7B-v0.1\")\nmodel = HookedTransformer.from_pretrained(\"microsoft/phi-2\")\n# See supported-models table for all 11 architecture families\n\ngb = GlassboxV2(model)\n```\n\n---\n\n### Step 3 — Run the compliance audit on your real prompt\n\nReplace the example with **your actual use-case prompt**. The `correct` token should be the output your model should produce; `incorrect` is what it should not produce.\n\n```python\n# Example: credit risk model\nresult = gb.analyze(\n    prompt    = \"Loan application. Annual income: €42,000. Credit history: 3 missed payments. Decision:\",\n    correct   = \" Denied\",    # token the model should produce if functioning correctly\n    incorrect = \" Approved\",  # token it should NOT produce in this case\n)\n\n# Example: medical triage\nresult = gb.analyze(\n    prompt    = \"Patient presents with chest pain, diaphoresis, and left arm radiation. Priority:\",\n    correct   = \" Urgent\",\n    incorrect = \" Routine\",\n)\n\n# Example: hiring screening\nresult = gb.analyze(\n    prompt    = \"Candidate has 8 years of Python experience and a relevant degree. Assessment:\",\n    correct   = \" Qualified\",\n    incorrect = \" Rejected\",\n)\n\n# What you get back:\nprint(result[\"faithfulness\"][\"f1\"])          # 0.0–1.0 — how faithful the circuit explanation is\nprint(result[\"faithfulness\"][\"category\"])    # 'faithful' / 'partial' / 'backup_mechanisms'\nprint(result[\"circuit\"])                     # [(layer, head), ...] 
— which heads drive the decision\nprint(result[\"explainability_grade\"])        # 'A' through 'F' per Annex IV Article 13\n```\n\n---\n\n### Step 4 — Generate the Annex IV evidence package\n\nOne call generates a regulator-ready PDF + machine-readable JSON covering all 8 mandatory Annex IV sections:\n\n```python\nfrom glassbox.compliance import AnnexIVReport, DeploymentContext\nfrom glassbox.evidence_vault import AnnexIVEvidenceVault\n\n# 1. Build the report object — fill in your organisation details\nreport = AnnexIVReport(\n    model_name         = \"LoanScorerV2\",            # your model identifier\n    system_purpose     = \"Automated credit risk scoring for retail loan decisions\",\n    provider_name      = \"Acme Bank NV\",             # your legal entity name\n    provider_address   = \"1 Fintech Street, Amsterdam 1011AB, Netherlands\",\n    deployment_context = DeploymentContext.FINANCIAL_SERVICES,\n    # Optional: Article 9 risk register\n    risk_register_path = \"risk_register.json\",       # from gb.risk_register.export()\n)\n\n# 2. Attach your audit result (can add multiple prompts for robustness)\nreport.add_analysis(result)\n\n# 3. Export\nreport.to_pdf(\"annex_iv_acme_bank_v2.pdf\")    # human-readable, structured per Annex IV\nreport.to_json(\"annex_iv_acme_bank_v2.json\")  # machine-readable, for regulator submission systems\n```\n\nThe PDF covers:\n\n| Annex IV Section | Article Reference | What Glassbox fills in |\n|---|---|---|\n| 1. General description | Art. 13(3)(a) | Model name, version, intended purpose, deployment context |\n| 2. Design \u0026 development process | Art. 10, 11(1)(d) | Architecture, training description, data governance statement |\n| 3. Monitoring \u0026 human oversight | Art. 9(6), 14 | CircuitDiff change detection, oversight measures |\n| 4. Explainability assessment | Art. 13 | Circuit heads, faithfulness F1, grade A–F, plain-English summary |\n| 5. Data requirements | Art. 
10 | Data quality statement, bias probe results |\n| 6. Risk assessment | Art. 9 | Risk register entries, failure modes, mitigation measures |\n| 7. Accuracy \u0026 robustness metrics | Art. 15 | Task accuracy, confidence calibration (r=0.009 orthogonality) |\n| 8. Post-market monitoring plan | Art. 72 | CircuitDiff threshold, alert configuration |\n\n---\n\n### Step 5 — Set up continuous monitoring (Article 72)\n\nArticle 72 requires **post-market monitoring** — detecting when model behaviour changes after deployment. This is what regulators mean by \"ongoing oversight.\"\n\n```python\nfrom glassbox import CircuitDiff\n\n# Run once on your baseline (before deployment)\nbaseline_result = gb.analyze(your_prompt, correct_token, incorrect_token)\ndiff = CircuitDiff(baseline=baseline_result)\ndiff.save(\"baseline_circuit.json\")   # commit this to your repo\n\n# Run on every model update (in CI/CD)\nnew_result = gb.analyze(your_prompt, correct_token, incorrect_token)\nchange_report = diff.compare(new_result)\n\nif change_report[\"drift_detected\"]:\n    print(f\"Circuit changed: {change_report['changed_heads']} heads drifted\")\n    print(f\"Faithfulness delta: {change_report['faithfulness_delta']:.3f}\")\n    # Alert your compliance team — mandatory under Article 72\n```\n\n**GitHub Actions integration** — add to your CI pipeline:\n\n```yaml\n# .github/workflows/compliance.yml\nname: AI Act Compliance Gate\n\non:\n  push:\n    branches: [main]\n\njobs:\n  compliance:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: designer-coderajay/glassbox-mech@v4   # Glassbox GitHub Action\n        with:\n          model:     gpt2                             # your model name\n          prompt:    \"Your production prompt here\"\n          correct:   \" expected_output\"\n          incorrect: \" wrong_output\"\n          fail_below_f1: 0.40                         # fail CI if faithfulness drops\n```\n\nThe Action uploads a full Annex IV report as a build artifact on every 
commit.\n\n---\n\n### Step 6 — Run the audit report as a standalone scan (no Python required)\n\nIf you don't want to integrate into existing code, just run the Docker container against any HuggingFace model:\n\n```bash\n# Pull and run — scans GPT-2 and generates annex_iv_report.pdf in ./output\ndocker run --rm -v $(pwd)/output:/output \\\n  ghcr.io/designer-coderajay/glassbox-mech:latest \\\n  --model gpt2 \\\n  --prompt \"Your production prompt\" \\\n  --correct \" correct_token\" \\\n  --incorrect \" wrong_token\" \\\n  --output /output/annex_iv_report.pdf\n\n# For private/local models:\ndocker run --rm \\\n  -v $(pwd)/model:/model \\\n  -v $(pwd)/output:/output \\\n  ghcr.io/designer-coderajay/glassbox-mech:latest \\\n  --model-path /model \\\n  --prompt \"Your production prompt\" \\\n  --correct \" correct_token\" \\\n  --incorrect \" wrong_token\"\n```\n\n---\n\n### Step 7 — Present to your regulator or notified body\n\nThe PDF output from Step 4 is structured to match the section headings regulators expect. What you hand to a notified body (Article 43 conformity assessment):\n\n1. `annex_iv_report.pdf` — primary documentation\n2. `annex_iv_report.json` — machine-readable evidence vault\n3. `baseline_circuit.json` — Article 72 monitoring baseline\n4. CI/CD logs showing the compliance gate passing on each deployment\n\n\u003e **What Glassbox does not replace.** The Annex IV documentation package supports your technical file. It does not constitute: a conformity declaration (Article 47), a notified body certificate (Article 43), or a legal opinion on risk classification. These require qualified legal counsel. 
Glassbox dramatically reduces the time and cost to prepare the technical file — it does not eliminate the need for human review.\n\n---\n\n### Enterprise support\n\nFor organisations that need:\n- Air-gapped / on-premises deployment\n- Signed evidence vaults with audit log\n- Custom risk classification frameworks (ISO 42001, NIST AI RMF)\n- SLA and legal indemnification\n- Dedicated compliance engineering support\n\nContact: **mahale.ajay01@gmail.com** | Enterprise pricing from **€24,000/year**\n\n---\n\n## What's New in v4.2.6\n\nPatch release — bug fixes and mathematical correctness improvements.\n\n### Bug Fixes\n\n**Critical — `acdc.py` (ACDC circuit discovery):**\n- Fixed `KeyError` crash in `_build_fwd_hooks`: the method was accessing `hook_result` from the forward-pass cache, but the ACDC cache filter only stores `hook_z` and `hook_resid_pre` (to prevent OOM on large models). `hook_result` is 32× larger than `hook_z` for models with d_model=4096. Fix: per-head residual contribution is now computed correctly as `hook_z[:,:,h,:] @ W_O[h]` without caching `hook_result`.\n- Fixed second `KeyError` crash: MLP path accessed `hook_mlp_out` which was also not in the cache. Fix: `hook_mlp_out` added to the cache filter (same memory footprint as `hook_z`).\n\n**`core.py` (faithfulness metrics):**\n- Fixed `_suff_exact` and `_comp`: used `clean_ld == 0.0` (strict float equality) as a zero-guard, which silently fails for near-zero logit differences. Replaced with `abs(clean_ld) \u003c 1e-8`.\n- Same fix applied to the multi-corruption loop in `analyze()`.\n\n**`bias.py` (token bias probe):**\n- Fixed `token_bias_probe` online mode: `avg_score` was computed as `sum(all vocab probs) / len(all vocab probs)` ≈ 1/vocab_size (meaningless constant). 
Replaced with `max(token_probs.values())` — the peak stereotypical association among role tokens returned by `model_fn`.\n\n**`pyproject.toml`:**\n- Added missing `opentelemetry-exporter-otlp\u003e=1.20.0` to the `[all]` extras group (was in `[telemetry]` but omitted from `[all]`).\n\n---\n\n## What's New in v4.2.0\n\nGlassbox v4.2.0 extends from 18 to **21 mathematical frameworks**, adding three architecturally significant capabilities: the full ACDC algorithm (Conmy et al. NeurIPS 2023) for exact edge-level circuit discovery, a multi-architecture adapter making all frameworks work on Llama-3 / Mistral / Phi-3 / Gemma (GQA + RMSNorm), and cross-model circuit comparison with normalised Jaccard similarity.\n\n### 1. AutomatedCircuitDiscovery — ACDC (Conmy et al. 2023)\n\nAttribution patching identifies which heads matter. ACDC discovers the exact minimal directed circuit — which edges between heads are causally necessary.\n\nFor each directed edge (sender u → receiver v) in topological order, ACDC patches u's per-head contribution to the residual stream with the corrupted activation and measures KL divergence. If KL \u003c τ = 0.10, the edge is pruned. The remaining edges form the minimal faithful circuit.\n\nThis is exact causal evidence, not a Taylor approximation. The literature (Syed et al. 2024) shows EAP (attribution patching) agrees with ACDC in ~85-95% of cases — ACDC is preferred when exactness is required for compliance or publication.\n\n```python\nfrom glassbox import AutomatedCircuitDiscovery\n\nacd   = AutomatedCircuitDiscovery(model, threshold=0.10)\nresult = acd.discover(clean_tokens, corrupted_tokens)\n\nprint(result.summary())\n# ACDC Circuit: 18/144 edges retained | KL=0.023 | Faithful=True | τ=0.100\n\nprint(result.faithfulness_grade())   # \"STRONG\" | \"PARTIAL\" | \"WEAK\"\nprint(result.circuit.head_nodes())   # {(3,0), (7,3), (9,9), ...}\n```\n\n### 2. 
MultiArchAdapter — GQA + RMSNorm Support\n\nAll Glassbox frameworks previously assumed standard Multi-Head Attention (GPT-2 style). v4.2.0 adds explicit architecture adaptation for:\n\n- **GQA models** (Llama-3 8B: 32 query / 8 KV heads, Mistral 7B: 32Q / 8KV, Phi-3): KV attribution scores redistributed equally across the G sharing query heads\n- **RMSNorm models** (all Llama, Mistral, Phi-3, Gemma): folds γ into W_Q/K/V without the bias term that LayerNorm requires (RMSNorm: no mean subtraction, no additive β)\n\n```python\nfrom glassbox import MultiArchAdapter\nfrom transformer_lens import HookedTransformer\n\nmodel   = HookedTransformer.from_pretrained(\"meta-llama/Llama-3-8B\")\nadapter = MultiArchAdapter.from_model(model)\n\nreport = adapter.architecture_report()\nprint(report.summary())\n# Model: meta-llama/Llama-3-8B\n# Norm: RMSNorm | Attention: GQA (32Q / 8KV = 4 query heads per KV group)\n# GQA mapping: {0: [0,1,2,3], 1: [4,5,6,7], ...}\n\n# Adjust raw attributions for GQA\nraw_attr = {(l, h): score for (l, h), score in result[\"attributions\"].items()}\nadjusted = adapter.adjust_attributions_for_gqa(raw_attr)\n```\n\n### 3. CrossModelComparison — Circuit Stability Across Architectures\n\nRuns mechanistic interpretability on multiple model families and computes pairwise circuit similarity. Answers: do GPT-2 and Llama-3 use the same heads for the same task?\n\nNormalises head positions to (layer/n_layers, head/n_heads) ∈ [0,1)² before comparison, so GPT-2's L9H9 can be compared to Llama-3's L27H24. 
Uses Jaccard similarity on 10×10 grid bins + Pearson r on normalised attribution vectors.\n\n```python\nfrom glassbox import CrossModelComparison, ModelAnalysisConfig, compare_models\n\nconfigs = [\n    ModelAnalysisConfig(\n        model_name=\"gpt2\",\n        clean_prompt=\"When Mary and John went to the store, John gave a drink to\",\n        corrupted_prompt=\"When Alice and Bob went to the store, Bob gave a drink to\",\n        target_token=\" Mary\",\n        distractor_token=\" John\",\n    ),\n    ModelAnalysisConfig(\n        model_name=\"EleutherAI/pythia-160m\",\n        clean_prompt=\"When Mary and John went to the store, John gave a drink to\",\n        corrupted_prompt=\"When Alice and Bob went to the store, Bob gave a drink to\",\n        target_token=\" Mary\",\n        distractor_token=\" John\",\n    ),\n]\n\nreport = compare_models(configs, top_k_circuit=10)\nprint(report.summary())\nprint(report.attribution_table())\n```\n\n### Mathematical Completeness: 21 Frameworks\n\n| Framework | v4.1.0 | v4.2.0 |\n|-----------|--------|--------|\n| Attribution Patching (Nanda 2023) | ✓ | ✓ |\n| Edge Attribution Patching (Syed 2024) | ✓ | ✓ |\n| Causal Scrubbing (Chan et al. 
2022) | ✓ | ✓ |\n| Distributed Alignment Search (Geiger 2023) | ✓ | ✓ |\n| Hessian Error Bounds (Pearlmutter 1994) | ✓ | ✓ |\n| Folded LayerNorm Correction | ✓ | ✓ |\n| Benjamini-Hochberg FDR | ✓ | ✓ |\n| SAE Polysemanticity | ✓ | ✓ |\n| Multi-Corruption Robustness | ✓ | ✓ |\n| Held-Out Validation | ✓ | ✓ |\n| Sample Size Gate (Fisher Z) | ✓ | ✓ |\n| Head Composition (Elhage 2021) | ✓ | ✓ |\n| Bootstrap CI | ✓ | ✓ |\n| Minimum Faithful Circuit (MFC) | ✓ | ✓ |\n| Circuit Diff (v-to-v) | ✓ | ✓ |\n| Bias Analysis (counterfactual) | ✓ | ✓ |\n| Multi-Agent Liability | ✓ | ✓ |\n| Logit Lens | ✓ | ✓ |\n| **ACDC (Conmy 2023)** | — | ✓ |\n| **GQA Multi-Arch Adapter** | — | ✓ |\n| **Cross-Model Comparison** | — | ✓ |\n\n**Score: 18 → 21 frameworks**\n\n---\n\n## What's New in v4.1.0\n\nGlassbox v4.1.0 completes the **ROADMAP_V4 mathematical framework** — 18/18 frameworks now implemented. This version brings the three hardest features: Hessian-based reliability bounds, Anthropic-standard causal scrubbing, and Distributed Alignment Search. These close the gap vs Harvard/MIT/Anthropic/DeepMind research standards.\n\n### 1. HessianErrorBounds — Second-Order Taylor Reliability Certificates\n\nStandard attribution patching uses a first-order Taylor approximation. If the second-order term dominates, the ranking is unreliable. 
`HessianErrorBounds` computes `ε(h) = ½·δzᵀ·H_h·δz` via Pearlmutter (1994) HVP, certifying whether the approximation holds.\n\n```python\nfrom glassbox import HessianErrorBounds\n\nhb     = HessianErrorBounds(model)\nbounds = hb.compute(\n    attributions = result[\"attributions_raw\"],   # {(layer, head): score}\n    clean_tokens = clean_tokens,\n    corr_tokens  = corr_tokens,\n    target_tok   = target_id,\n    distract_tok = distract_id,\n)\n\nprint(bounds.summary_line())\n# Hessian [reliable ✓] | max_ratio=0.043 dominated=0/144 heads (threshold=0.2)\n\nprint(bounds.approximation_reliable)   # True — first-order ranking is certified\nprint(bounds.to_dict()[\"dominated_heads\"])  # [] — no heads dominated by second-order terms\n```\n\nFlags `hessian_dominated` heads where `|ε(h)| / |α(h)| \u003e 0.20`. Maps to **Art. 13(1)** transparency.\n\n### 2. CausalScrubbing — Anthropic-Standard Circuit Hypothesis Testing\n\nAttribution identifies *which* heads matter. Causal scrubbing (Chan et al., Anthropic 2022) answers: does the identified circuit *causally* implement the claimed computation? `CS(H) = E[LD_scrubbed]/LD_clean` — strong ≥ 0.80.\n\n```python\nfrom glassbox import CausalScrubbing, CircuitHypothesis\n\n# Use the canonical Wang et al. 
2022 IOI circuit hypothesis\nhypothesis = CircuitHypothesis.from_wang2022_ioi()\n# Or define your own:\n# hypothesis = CircuitHypothesis.from_list(\"my_circuit\", [(9,6),(9,9),(10,0)])\n\nscrubber = CausalScrubbing(model, n_samples=5)\nresult   = scrubber.evaluate(\n    hypothesis   = hypothesis,\n    prompt       = \"When Mary and John went to the store, John gave a drink to\",\n    corr_prompt  = \"When John and Mary went to the store, Mary gave a drink to\",\n    target_tok   = target_id,\n    distract_tok = distract_id,\n)\n\nprint(result.summary_line())\n# CausalScrubbing [Wang2022_IOI_Circuit] ✓✓ | CS=0.8923 (strong) | LD_clean=4.23...\n\nprint(result.interpretation)   # \"strong\" — hypothesis causally explains the circuit\nprint(result.cs_score)         # 0.8923\n```\n\nMaps to **Art. 9(1)** risk management — formal causal account, not just correlation.\n\n### 3. DistributedAlignmentSearch — Linear Concept Subspace Discovery\n\nDAS (Geiger et al. 2023) finds *where* in the residual stream a concept is encoded. 
Learns rotation matrix `R ∈ R^{d_model × k}` via PCA on activation differences; validates via interchange interventions.\n\n```python\nfrom glassbox import DistributedAlignmentSearch\n\ndas    = DistributedAlignmentSearch(model, concept_dims=4)\nresult = das.search(\n    concept_label         = \"IO_name_position\",\n    clean_prompts_tokens  = clean_token_list,\n    counterfactual_tokens = corr_token_list,\n    target_tok            = target_id,\n    distract_tok          = distract_id,\n    target_layer          = 9,\n    target_position       = -1,\n)\n\nprint(result.summary_line())\n# DAS [IO_name_position] ENCODED ✓ | layer=9 pos=-1 | score=0.8234 dims=4 expl_var=0.731\n\nprint(result.concept_encoded)      # True — concept clearly encoded in 4-dimensional subspace\nprint(result.explained_variance)   # 0.731 — 73% of Δz variance explained by top-4 dims\n\n# Sweep all layers to find where concept is strongest\nall_results = das.search_all_layers(\"IO_name_position\", clean_tokens, corr_tokens,\n                                     target_id, distract_id)\n# Sorted by das_score descending\nprint(all_results[0].target_layer)  # 9 — concept strongest at layer 9\n```\n\nMaps to **Art. 
15(1)** robustness — localises concept encoding for controlled interventions.\n\n### Mathematical Completeness: 18/18 ✓ (extended to 21 in v4.2.0)\n\n| Framework | v3.6.0 | v3.7.0 | v4.0.0 | v4.1.0 | v4.2.0 |\n|---|---|---|---|---|---|\n| Attribution patching (Nanda 2023) | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Sufficiency / Comprehensiveness / F1 | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Fisher Z cross-model comparison | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Edge Attribution Patching (Syed 2024) | ✓ | ✓ | ✓ | ✓ | ✓ |\n| BCa Bootstrap CIs | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Bonferroni correction | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Welch's t-test cross-model | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Multi-corruption robustness | — | ✓ | ✓ | ✓ | ✓ |\n| SampleSizeGate (power analysis) | — | ✓ | ✓ | ✓ | ✓ |\n| Held-out validation (gen gap) | — | ✓ | ✓ | ✓ | ✓ |\n| Folded LayerNorm correction | — | — | ✓ | ✓ | ✓ |\n| Benjamini-Hochberg FDR | — | — | ✓ | ✓ | ✓ |\n| SAE polysemanticity entropy | — | — | ✓ | ✓ | ✓ |\n| Hessian error bounds (Pearlmutter) | — | — | — | ✓ | ✓ |\n| Causal scrubbing (Chan/Anthropic) | — | — | — | ✓ | ✓ |\n| Distributed Alignment Search | — | — | — | ✓ | ✓ |\n| Jaccard circuit similarity | ✓ | ✓ | ✓ | ✓ | ✓ |\n| Cohen's d effect size | ✓ | ✓ | ✓ | ✓ | ✓ |\n| **ACDC edge-circuit (Conmy 2023)** | — | — | — | — | ✓ |\n| **GQA/RMSNorm multi-arch adapter** | — | — | — | — | ✓ |\n| **Cross-model circuit comparison** | — | — | — | — | ✓ |\n\n**Score: 7 → 10 → 13 → 18 → 21**\n\n---\n\n## What's New in v4.0.0\n\n### 1. FoldedLayerNorm — Unbiased Attribution Patching\n\nLayerNorm scale `γ` multiplicatively biases attribution scores. `FoldedLayerNorm` absorbs `γ` into `W_Q/K/V` (Elhage et al. 
2021 §4.1), computing corrected attributions and flagging heads where `|Δα/α| \u003e 0.15`.\n\n```python\nfrom glassbox import FoldedLayerNorm\n\nfln    = FoldedLayerNorm(model)\nreport = fln.analyze(result[\"attributions_raw\"], clean_tokens, corr_tokens, target_id, distract_id)\nprint(report.summary_line())\n# LayerNorm [all OK ✓] | max_ratio=0.041 mean_ratio=0.012 (threshold=0.15)\n\ncorrected = fln.apply_correction(result[\"attributions_raw\"], report.folded_attributions)\n```\n\n### 2. BenjaminiHochberg FDR — Multiple Testing Correction\n\nTesting 144 heads simultaneously inflates false positives. `BenjaminiHochberg` controls `E[FDR] ≤ α` alongside Bonferroni for comparison (Benjamini \u0026 Hochberg 1995).\n\n```python\nfrom glassbox import BenjaminiHochberg, apply_fdr_correction\n\nbh     = BenjaminiHochberg(alpha=0.05)\nreport = bh.run(attributions, se_map)   # se_map from bootstrap or Δ-method\nprint(report.summary_line())\n# FDR [BH: 8/144 significant | Bonferroni: 5/144] E[FDR]≤0.045 α=0.05\n\nsig_heads = report.significant_heads_bh()   # [(9,6), (9,9), (10,0), ...]\n```\n\n### 3. PolysemanticityScorerSAE — Head Interpretability Quantification\n\nMeasures whether heads are monosemantic or polysemantic via `H(p(feature|head_h))`. SAE-entropy method if sae-lens installed; PCA participation ratio fallback otherwise.\n\n```python\nfrom glassbox import PolysemanticityScorerSAE\n\nscorer  = PolysemanticityScorerSAE(model)\nsummary = scorer.score_circuit(circuit=[(9,6),(9,9),(10,0)], prompts_tokens=token_list)\nprint(summary.summary_line())\n# Polysemanticity [method=pca_participation_ratio] | mean_entropy=0.312 monosemantic=67%\n```\n\n---\n\n## What's New in v3.7.0\n\n### 1. MultiCorruptionPipeline — 4 Corruption Strategies + Robustness Test\n\nSingle name-swap corruption gives one data point. 
`MultiCorruptionPipeline` runs four independent corruptions and checks robustness criterion `∀k: |S_k(C) − S̄| \u003c 0.10`.\n\n```python\nfrom glassbox import MultiCorruptionPipeline, CorruptionStrategy\n\npipeline = MultiCorruptionPipeline(model)\nreport   = pipeline.run(\n    prompt       = \"When Mary and John went to the store, John gave a drink to\",\n    io_name      = \"Mary\",\n    subject_name = \"John\",\n    circuit      = [(9,6), (9,9), (10,0)],\n    target_tok   = target_id,\n    distract_tok = distract_id,\n    strategies   = [\n        CorruptionStrategy.NAME_SWAP,\n        CorruptionStrategy.RANDOM_TOKEN,\n        CorruptionStrategy.GAUSSIAN_NOISE,\n        CorruptionStrategy.MEAN_ABLATION,\n    ],\n)\n\nprint(report.robust)                    # True — circuit stable across all corruptions\nprint(report.max_deviation)            # 0.063 — well below δ=0.10\nprint(report.perturbation_sensitive)   # False\n```\n\n### 2. SampleSizeGate — Statistical Power Enforcement\n\nPrevents misleading compliance reports from underpowered analyses. Hard blocks at n\u003c20, warns at n\u003c50, with `recommend_n()` power analysis.\n\n```python\nfrom glassbox import SampleSizeGate, SampleSizeError\n\ngate = SampleSizeGate()\ngate.check(n=15)    # raises SampleSizeError — BLOCKED\ngate.check(n=35)    # SampleSizeWarning — proceed with caution\ngate.check(n=100)   # passes silently\n\nprint(gate.recommend_n(rho_min=0.25, power=0.80))   # 126\n```\n\n### 3. HeldOutValidator — Circuit Generalisation Gate\n\nDetects circuits that overfit to the training prompt set. 
50/50 split, flags `overfit` when `|F1_train − F1_test| ≥ 0.10`.\n\n```python\nfrom glassbox import HeldOutValidator\n\nvalidator = HeldOutValidator()\nval       = validator.validate(batch_results)   # from batch_analyze()\nprint(val.summary_line())\n# HeldOut [OK ✓] | F1_train=0.6821 F1_test=0.6540 gap=0.0281 (threshold=0.1)\nprint(val.generalises)   # True\n```\n\n---\n\n## What's New in v3.6.0\n\n- **Claude Code plugin**: Full `.claude/` directory with 6 agents, 6 skills, 5 commands\n- **MCP server**: Model Context Protocol integration with 5 tools (circuit discovery, faithfulness metrics, full Annex IV compliance report, attention patterns, logit lens)\n- **Bug fixes**: MCP class reference, analyze() signature, deterministic circuit sorting, input validation\n\n---\n\n## What's New in v3.5.0\n\n- **Claude Code plugin** (`.claude/`): 6 specialized agents, 6 skills, 5 slash commands for mechanistic interpretability workflows\n- **FastMCP server** (`mcp/`): Model Context Protocol integration with 5 tools — circuit discovery, faithfulness metrics, full Annex IV compliance report, attention patterns, logit lens\n- **Brand asset** (`assets/glassbox_brand.png`): 1400×800 circuit-trace visualization with attribution heatmap\n- **Bug fixes**: Non-deterministic circuit sort (added secondary `(layer, head)` key), `analyze()` input validation, MCP class reference and parameter names corrected\n\n---\n\n## What's New in v3.4.0\n\nGlassbox v3.4.0 is the **strategic monopoly release** — three features that no other open-source interpretability tool ships, purpose-built for the August 2026 EU AI Act enforcement deadline.\n\n### 1. MultiAgentAudit — Causal Handoff Tracing (Article 9 system-level risk)\n\nThe first open-source tool that traces bias contamination and semantic drift *across multi-agent chains* — not just individual models. 
Identify exactly which agent introduced or amplified a bias, and generate a per-agent liability report with Annex IV narrative.\n\n```python\nfrom glassbox import MultiAgentAudit, AgentCall\n\naudit = MultiAgentAudit()\n\nreport = audit.audit_chain([\n    AgentCall(\n        agent_id=\"router\",\n        model_name=\"gpt2\",\n        input_text=\"Classify this job application from Maria Garcia\",\n        output_text=\"Application flagged for manual review\",\n    ),\n    AgentCall(\n        agent_id=\"scorer\",\n        model_name=\"gpt2\",\n        input_text=\"Application flagged for manual review\",\n        output_text=\"Score: 42/100 — high risk profile\",\n    ),\n])\n\nprint(report.chain_risk_level)        # \"HIGH\"\nprint(report.most_liable_agent)       # \"scorer\"\nprint(report.annex_iv_text)           # Annex IV Article 9 narrative\n\n# Full HTML dashboard (self-contained, no deps)\nwith open(\"liability_report.html\", \"w\") as f:\n    f.write(audit.to_html(report))\n```\n\nScores bias across 8 EU AI Act Article 10(5) protected categories (gender, race/ethnicity, nationality, religion, age, disability, sexuality, socioeconomic). No LLM required. Maps to **Article 9**, **Article 10(2)(f)**, **Article 10(5)**, **Article 13(1)**.\n\n### 2. SteeringVectorExporter — Article 9(2)(b) Risk Mitigation\n\nExtract and export steering vectors from the residual stream using Representation Engineering (Zou et al. 2023). 
Apply them as runtime safety layers, test their suppression effectiveness, and export `.pt` or `.npy` files as documented risk mitigation evidence.\n\n```python\nfrom glassbox import SteeringVectorExporter\n\nexporter = SteeringVectorExporter(method=\"mean_diff\")  # or \"pca\"\n\n# Extract from contrast pairs\nsv = exporter.extract_mean_diff(\n    model=model,\n    positive_prompts=[\"The nurse said she would call the doctor.\"],\n    negative_prompts=[\"The nurse said he would call the doctor.\"],\n    layer=8,\n    concept_label=\"gender_bias\",\n    scale=-15.0,  # negative = suppress\n)\n\n# Apply as a runtime hook — steered next token\nsteered_token = exporter.apply(model, \"The nurse said\", sv)\n\n# Quantify suppression — before/after faithfulness comparison\ntest = exporter.test_suppression(model, gb, prompt, correct, incorrect, sv)\nprint(test[\"suppression_ratio\"])   # 0.34  (34% reduction in circuit activation)\nprint(test[\"verdict\"])             # \"Steering vector 'gender_bias' effectively suppresses...\"\n\n# Export for regulatory submission\nexporter.export_pt(sv, \"steering/gender_bias.pt\")\nexporter.export_numpy(sv, \"steering/gender_bias.npy\")\n\n# Or extract the full default bias suite in one call\nbias_suite = exporter.extract_bias_suite(model, layer=8)\n# {\"gender_bias\": SteeringVector, \"racial_bias\": ..., \"toxicity\": ..., \"age_bias\": ...}\n```\n\n`extract_from_circuit()` auto-selects the optimal layer from a prior `gb.analyze()` result. Maps to **Article 9(2)(b)**, **Article 9(5)**, **Article 15(1)**.\n\n### 3. 
AnnexIVEvidenceVault — Full Article 11 Documentation Package\n\nThe only tool that assembles *all* interpretability findings — circuit analysis, bias tests, steering vectors, multi-agent audits, SAE features, stability scores — into a single machine-readable, regulation-mapped Annex IV evidence vault.\n\n```python\nfrom glassbox import build_annex_iv_vault\n\nvault = build_annex_iv_vault(\n    gb_result=result,                          # GlassboxV2.analyze() output\n    model_name=\"meta-llama/Llama-2-7b-hf\",\n    provider=\"Acme Bank NV\",\n    use_case=\"automated_credit_scoring\",\n    deployment_ctx=\"financial_services\",\n    commit_sha=\"634e397\",\n    multiagent_report=report,                  # MultiAgentAudit output\n    steering_vectors={\"gender_bias\": sv},      # SteeringVectorExporter output\n    steering_test_results={\"gender_bias\": test},\n    sae_features=top_features,                 # SAEFeatureAttributor output\n    stability_result=stability,                # stability_suite() output\n    output_json=\"reports/annex-iv.json\",       # machine-readable\n    output_html=\"reports/annex-iv.html\",       # submission-ready HTML\n)\n\nsummary = vault.to_dict()[\"compliance_summary\"]\nprint(summary[\"overall_status\"])    # \"COMPLIANT\"\nprint(summary[\"pass_rate\"])         # 0.875\nprint(summary[\"sections_covered\"])  # [\"§1\", \"§2\", \"§3\", \"§4\", \"§6\", \"§7\"]\nprint(summary[\"articles_covered\"])  # [\"Article 9\", \"Article 10\", \"Article 11\", ...]\n```\n\nCovers Annex IV **§1–§7**, maps to Articles **9, 10, 11, 13, 15, 72**. Every entry carries article references, metric values, pass/fail thresholds, and provenance metadata. HTML report is suitable for regulatory submission or attachment to a conformity declaration.\n\n---\n\n## What's New in v3.3.0\n\n### 1. NaturalLanguageExplainer — Plain English for Compliance Officers\n\nConverts raw circuit analysis results into structured, plain-English compliance summaries. 
No LLM dependency — entirely rule-based with EU AI Act article citations in every sentence.\n\n```python\nfrom glassbox import NaturalLanguageExplainer\n\nexplainer = NaturalLanguageExplainer(verbosity=\"detailed\", include_article_refs=True)\nexplanation = explainer.explain(result, model_name=\"gpt2\", use_case=\"credit_scoring\")\n\nprint(explanation[\"headline\"])\n# \"Circuit Grade: Good (F1 = 0.73) — Meets Article 15(1) accuracy threshold\"\n\nprint(explanation[\"compliance_summary\"])\n# \"The model's decision circuit satisfies Article 11 documentation requirements...\"\n\n# Section breakdown\nsections = explainer.explain_sections(result)\nprint(sections[\"verdict\"])\nprint(sections[\"circuit_description\"])\nprint(sections[\"faithfulness_analysis\"])\nprint(sections[\"risk_flags\"])\n\n# Self-contained HTML card for embedding\nhtml = explainer.to_html(result)\n```\n\nIntegrated into the Glassbox compliance dashboard — every circuit analysis now shows a plain-English summary above the metrics table.\n\n### 2. HuggingFace Hub Integration\n\nLoad any HookedTransformer-compatible model directly from the Hub with a single call, and push compliance metadata back to model cards.\n\n```python\nfrom glassbox import load_from_hub, HuggingFaceModelCard\n\n# Load model (supports 29 architecture aliases)\nmodel = load_from_hub(\"meta-llama/Llama-2-7b-hf\", dtype=\"float16\")\n\n# Push compliance section to model card README.md\ncard = HuggingFaceModelCard(\"my-org/my-model\", token=\"hf_...\")\ncard.push_compliance_section(result, use_case=\"credit_scoring\")\n\n# Read it back\nmeta = card.read_compliance_section()\nprint(meta[\"grade\"])   # \"B\"\n```\n\nSupports GPT-2, GPT-Neo, Pythia, OPT, Llama-2/3, Mistral, Phi-3, Gemma, Falcon — 29 architecture aliases.\n\n### 3. 
MLflow Integration\n\nLog every Glassbox audit run as an MLflow experiment with one call.\n\n```python\nfrom glassbox import log_glassbox_run, GlassboxMLflowCallback\n\n# One-liner logging\nrun_id = log_glassbox_run(\n    result, model_name=\"gpt2\", use_case=\"credit_scoring\",\n    prompt=prompt, log_html_report=True\n)\n\n# Training callback — audit every N epochs\ncb = GlassboxMLflowCallback(gb, prompt, correct, incorrect, log_every_n_epochs=5)\n# pass to your trainer's callbacks list\n```\n\nLogs: sufficiency, comprehensiveness, F1, n_heads, stability scores, HTML report artifact, circuit JSON.\n\n### 4. Slack / Teams Alerting\n\nFire webhook alerts when compliance drops or circuits drift.\n\n```python\nfrom glassbox import AlertConfig\n\nalert = AlertConfig(\n    slack_webhook=\"https://hooks.slack.com/...\",\n    teams_webhook=\"https://outlook.office.com/webhook/...\",\n    jaccard_alert_threshold=0.75,\n)\nalert.notify_audit_complete(result, model_name=\"gpt2\", use_case=\"credit_scoring\")\nalert.notify_circuit_drift(diff_result, model_a=\"gpt2\", model_b=\"gpt2-ft\")\n```\n\n---\n\n## What's New in v3.1.0\n\n### 1. CircuitDiff — Post-Market Model Monitoring (Article 72)\n\nMechanistic diff between two model versions. Tells you exactly which attention heads entered or left the circuit — not just that performance changed, but *why* it changed.\n\n```python\nfrom glassbox import GlassboxV2\nfrom glassbox.circuit_diff import CircuitDiff\nfrom transformer_lens import HookedTransformer\n\ngb_base = GlassboxV2(HookedTransformer.from_pretrained(\"gpt2\"))\ngb_ft   = GlassboxV2(HookedTransformer.from_pretrained(\"my-org/gpt2-finetuned\"))\n\ndiffer = CircuitDiff(gb_base, gb_ft, label_a=\"gpt2-base\", label_b=\"gpt2-ft\")\ndiff   = differ.diff(\n    prompt    = \"The loan applicant has a credit score of 620. The decision is\",\n    correct   = \" approved\",\n    incorrect = \" denied\",\n)\n\nprint(diff.change_summary)\n# STABLE — circuits are nearly identical. 
Jaccard=0.87. 7 shared heads, 1 added, 0 removed.\n\nprint(diff.to_markdown())  # PR comment / audit report ready\n```\n\nBatch mode + `summary_stats()` for multi-prompt stability reports. Maps to **Article 72** (post-market monitoring) and **Annex IV Section 6** (lifecycle changes).\n\n### 2. Custom SAE Upload\n\nLoad your own trained Sparse Autoencoder weights — no sae-lens hub required. Works for fine-tuned or non-public models.\n\n```python\nfrom glassbox.sae_attribution import SAEFeatureAttributor\n\n# Single checkpoint applied to all queried layers\nsfa = SAEFeatureAttributor(model, sae_path=\"./my_sae.pt\")\n\n# Per-layer checkpoints\nsfa = SAEFeatureAttributor(model, sae_path={9: \"./sae_l9.pt\", 10: \"./sae_l10.pt\"})\n\n# Checkpoint format: .pt dict with keys:\n# encoder_weight (n_features × d_model), encoder_bias (n_features,)\n# decoder_weight (d_model × n_features), decoder_bias (d_model,)\nresult = sfa.attribute(tokens, \" approved\", \" denied\", layers=[9, 10, 11])\n```\n\n### 3. OpenTelemetry Tracing\n\nPipe every analysis call into your existing observability stack (Datadog, Honeycomb, Jaeger, Grafana Tempo). Self-hosted → traces never leave your infrastructure.\n\n```python\nfrom glassbox.telemetry import setup_telemetry, instrument_glassbox\n\nsetup_telemetry(service_name=\"glassbox-prod\", endpoint=\"http://localhost:4317\")\ninstrument_glassbox(gb)   # wraps analyze() with OTel spans\n\nresult = gb.analyze(...)  # → span: \"glassbox.analyze\" with grade, F1, circuit_heads\n```\n\nEach span carries: `glassbox.model`, `glassbox.grade`, `glassbox.f1`, `glassbox.circuit_heads`, `glassbox.duration_ms`. Supports Jaeger, Honeycomb, Datadog OTLP, and any OTel-compatible backend.\n\n### 4. Exact Sufficiency in `bootstrap_metrics()`\n\n`bootstrap_metrics()` now computes **exact** sufficiency by default (`exact_suff=True`) — proper positive ablation (keep circuit, corrupt rest) instead of the Taylor approximation. 
This is the method that produces the ~100% sufficiency figure in the arXiv paper.\n\n```python\n# Default: exact sufficiency (2 extra passes per prompt)\nresult = gb.bootstrap_metrics(prompts, seed=42)\n# result[\"meta\"][\"exact_suff\"] = True\n# result[\"meta\"][\"suff_is_approx\"] = False\n\n# Fast mode: Taylor approximation (0 extra passes)\nresult = gb.bootstrap_metrics(prompts, exact_suff=False)\n```\n\nThe paper benchmark: `seed=42`, GPT-2 small (12L/12H/768d), Apple M2 Pro, PyTorch 2.2.0, TransformerLens 1.19.0.\n\n---\n\n## What's New in v3.0.0\n\nGlassbox v3.0.0 is the enterprise compliance release. Five new features ship on top of all v2.9.0 foundations:\n\n### 1. BiasAnalyzer — EU AI Act Article 10(2)(f)\n\nThree bias tests built for regulatory submission. Works offline (pre-computed logprobs) or online (live `model_fn`).\n\n```python\nfrom glassbox import BiasAnalyzer, BiasReport\n\nba = BiasAnalyzer()\n\n# Counterfactual fairness — swap demographic attributes, measure probability gap\nresult = ba.counterfactual_fairness_test(\n    prompt_template=\"The {attribute} applied for the loan\",\n    groups={\"gender\": [\"male applicant\", \"female applicant\"]},\n    target_tokens=[\"approved\", \"denied\"],\n    model_fn=my_model,\n)\nprint(result.max_gap, result.flagged)   # 0.12, False\n\n# Demographic parity — outcome rate disparity across groups\ndp = ba.demographic_parity_test(\n    prompts_by_group={\"male\": [...], \"female\": [...]},\n    target_tokens=[\"approved\"],\n    model_fn=my_model,\n)\n\n# Aggregate into Annex IV Section 5 report\nreport = BiasReport()\nreport.add_result(result)\nreport.add_result(dp)\nprint(report.to_markdown())\n```\n\n### 2. Webhooks — CI/CD callbacks\n\nRegister a callback URL that fires when async jobs complete. 
HMAC-SHA256 signed payloads.\n\n```bash\ncurl -X POST https://YOUR_API_URL/v1/webhooks \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"url\":\"https://yourapp.com/hook\",\"events\":[\"job.completed\",\"job.failed\"],\"secret\":\"mysecret\"}'\n```\n\n### 3. RiskRegister — Article 9 persistent risk tracking\n\nTrack compliance risks across audit sessions. Deduplication, severity ordering, status lifecycle.\n\n```python\nfrom glassbox import RiskRegister\n\nrr = RiskRegister(\"risks.json\")\nrr.ingest_annex_report(annex, model_name=\"gpt2\")  # auto-extracts Section 5 risks\n\n# Status lifecycle\nrr.set_status(risk_id, \"mitigated\", notes=\"Retrained with more data\")\n\n# Compliance health\nprint(rr.trend_summary())\n# {'compliance_health': 'amber', 'open': 2, 'mitigated': 1, 'total': 3}\n\n# For dashboards and PR comments\nprint(rr.to_markdown())\n```\n\nMaps to EU AI Act **Article 9** (risk management system) and **Annex IV Section 5**.\n\n### 4. Multi-Audit History Dashboard\n\nF1 trend chart, grade distribution, audit table with grade trajectory. \"Load from API\" button connects to `GET /v1/audit/reports`. Toggle with the \"Audit History\" button in the compliance dashboard.\n\n### 5. Circuit SVG Export\n\n\"Download SVG\" button in the D3 circuit graph. Exports paper-ready `glassbox-circuit.svg` with inlined dark-mode styles.\n\n---\n\n## What's New in v2.9.0 (previous release)\n\nGlassbox v2.9.0 brought four major features for compliance teams and researchers:\n\n### 1. Tamper-Evident Audit Log (AuditLog)\n\nRecord and verify every audit run with SHA-256 hash chain integrity. 
Perfect for governance, risk, and compliance (GRC) teams.\n\n```python\nfrom glassbox.audit_log import AuditLog\n\nlog = AuditLog(\"glassbox_audit.jsonl\")\n\n# Log any analysis result\nlog.append_from_result(\n    result_dict,\n    auditor=\"compliance@mybank.com\",\n    notes=\"Q1 2026 risk review\"\n)\n\n# Verify chain integrity (tamper detection)\nis_valid = log.verify_chain()  # True if no modifications detected\n\n# Export for GRC tools\nlog.export_csv(\"audit_export.csv\")\njson_export = log.export_json(\"audit_full.json\")\n\n# Analytics\nsummary = log.summary()\n# {'total_audits': 42, 'avg_f1': 0.67, 'chain_valid': True, ...}\n```\n\n**Key features:** Append-only JSON Lines persistence, per-record SHA-256 hashing, chain validation, CSV/JSON export for audit trails.\n\n### 2. TypeScript / JavaScript SDK (zero-dependency)\n\nOfficial SDK for Node.js 18+, Deno, Bun, and browsers. Works with the REST API.\n\n```bash\nnpm install glassbox-sdk\n```\n\n```typescript\nimport { GlassboxClient } from 'glassbox-sdk'\n\nconst gb = new GlassboxClient({\n  baseUrl: 'https://YOUR_API_URL'\n})\n\nconst report = await gb.auditWhiteBox({\n  modelName: 'gpt2',\n  prompt: 'When Mary and John went to the store, John gave a drink to',\n  correctToken: ' Mary',\n  incorrectToken: ' John',\n  providerName: 'Acme Bank NV',\n  deploymentContext: 'financial_services'\n})\n\nconsole.log(report.grade)  // 'A' | 'B' | 'C' | 'D'\nconsole.log(report.faithfulness.f1)  // 0.0–1.0\n\n// Background jobs (async)\nconst job = await gb.startBlackBoxJob({ ... })\nconst completed = await gb.waitForJob(job.jobId)\n```\n\n**Supported:** auditWhiteBox, auditBlackBox, async jobs, attentionPatterns, report retrieval.\n\n### 3. GitHub Action glassbox-audit@v1\n\nEmbed compliance audits directly in your CI/CD pipeline. 
Fails the build if explainability falls below your required grade.\n\n```yaml\nname: Compliance\non: [pull_request]\njobs:\n  glassbox:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: designer-coderajay/glassbox-audit@v1\n        with:\n          model_name: 'gpt2'\n          prompt: 'The loan should be'\n          correct_token: ' approved'\n          incorrect_token: ' denied'\n          provider_name: 'Acme Bank NV'\n          deployment_context: 'financial_services'\n          fail_below_grade: 'B'  # Fail if grade is C or D\n          output_path: 'glassbox-report.json'\n```\n\n**Output:** Grade, F1 score, compliance status, report ID, and full JSON report artifact.\n\n### 4. Jupyter Widgets (CircuitWidget, HeatmapWidget)\n\nInteractive visualization of circuit analysis inside notebooks.\n\n```bash\npip install \"glassbox-mech-interp[jupyter]\"\n```\n\n```python\nfrom glassbox import GlassboxV2\nfrom glassbox.widget import CircuitWidget, HeatmapWidget\n\n# Option 1: Run analysis and render inline\nwidget = CircuitWidget.from_prompt(\n    gb,\n    prompt=\"When Mary and John went to the store, John gave a drink to\",\n    correct=\" Mary\",\n    incorrect=\" John\"\n)\nwidget.show()  # Renders in cell\n\n# Option 2: Visualize pre-computed result\nheatmap = HeatmapWidget(result_dict)\nheatmap.show()\n\n# Export to HTML\nhtml_str = widget.to_html()\n```\n\n**Features:** Attribution heatmaps, circuit member highlights, faithfulness metrics, grade badges, responsive dark theme.\n\n### 5. 
Attention Patterns API Endpoint\n\nNew `/v1/attention-patterns` REST endpoint to visualize what each circuit head is attending to.\n\n```bash\ncurl -X POST https://YOUR_API_URL/v1/attention-patterns \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model_name\": \"gpt2\",\n    \"prompt\": \"When Mary and John went to the store, John gave a drink to\",\n    \"heads\": [\"L9H9\", \"L9H6\"],\n    \"top_k\": 10\n  }'\n```\n\n```python\n# Via Python SDK\nattn = gb.attention_patterns(\n    \"gpt2\",\n    \"When Mary and John ...\",\n    heads=[\"L9H9\"],\n    topK=5\n)\nprint(attn[\"entropy\"])      # {'L9H9': 0.71, ...}\nprint(attn[\"headTypes\"])    # {'L9H9': 'focused', ...}\n```\n\n---\n\n## EU AI Act Compliance — Annex IV Reports\n\n[Regulation (EU) 2024/1689](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689) requires Annex IV technical documentation (Article 11) for high-risk AI systems in finance, healthcare, HR, legal, and critical infrastructure. Enforcement begins August 2026. Non-compliance penalties: up to €15 million or 3% of global annual turnover, whichever is higher (Article 99(4)).\n\n\u003e **Documentation aid, not legal certification.** Glassbox-generated reports are structured documentation drafts intended to support — not replace — the legal and technical review process required under EU AI Act Article 11. Whether your system qualifies as high-risk under Article 6 and Annex III, and whether generated documentation satisfies applicable obligations, must be determined by qualified legal counsel and/or a notified body (Article 43). 
See [Legal Notices](#legal-notices--regulatory-disclaimer).\n\nGlassbox generates all 9 Annex IV sections as a structured PDF + machine-readable JSON from a single function call:\n\n```bash\npip install \"glassbox-mech-interp[compliance]\"\n```\n\n```python\nfrom transformer_lens import HookedTransformer\nfrom glassbox import GlassboxV2\nfrom glassbox.compliance import AnnexIVReport, DeploymentContext\n\nmodel  = HookedTransformer.from_pretrained(\"gpt2\")\ngb     = GlassboxV2(model)\nresult = gb.analyze(\n    \"The applicant credit score is 620. The loan should be\",\n    \" approved\", \" denied\",\n)\n\nreport = AnnexIVReport(\n    model_name         = \"gpt2\",\n    system_purpose     = \"Credit risk scoring\",\n    provider_name      = \"Acme Bank NV\",\n    provider_address   = \"1 Fintech Street, Amsterdam 1011AB\",\n    deployment_context = DeploymentContext.FINANCIAL_SERVICES,\n)\nreport.add_analysis(result)\nreport.to_pdf(\"annex_iv_report.pdf\")   # Annex IV-structured PDF (documentation aid — not a legal certification)\nreport.to_json(\"annex_iv_report.json\") # machine-readable JSON\n```\n\n**What the report covers (Annex IV, all 9 sections):**\n\n| Section | EU AI Act Reference | What Glassbox generates |\n|---------|-------------------|-------------------------|\n| 1. General description | Article 13(3)(a) | Model name, version, intended purpose, risk classification |\n| 2. Design \u0026 development | Article 10, 11(1)(d) | Training description, data governance, architecture |\n| 3. Monitoring \u0026 control | Article 9(6), 13(3)(b), 14 | Performance metrics, human oversight measures |\n| 4. Explainability assessment | Article 13 | Circuit heads, faithfulness F1, explainability grade A–D |\n| 5. Data requirements | Article 10 | Data quality, governance status, bias assessment |\n| 6. Risk assessment | Article 9 | Identified risks, failure modes, mitigation measures |\n| 7. 
Accuracy metrics | Article 15 | Task-specific accuracy, performance thresholds |\n| 8. Declaration of conformity | Article 47 | Signed declaration reference |\n| 9. Post-market monitoring | Article 72 | Monitoring plan, incident reporting, review schedule |\n\n**Explainability grades (Article 13 mapping):**\n\n| Grade | Sufficiency | Comprehensiveness | F1 | Meaning |\n|-------|-------------|-------------------|----|---------|\n| A | \u003e0.80 | \u003e0.60 | ≥0.80 | Full circuit explanation available |\n| B | \u003e0.65 | \u003e0.40 | ≥0.65 | Partial explanation — monitoring required |\n| C | \u003e0.40 | \u003e0.20 | ≥0.50 | Limited explanation — human oversight required |\n| D | ≤0.40 | ≤0.20 | \u003c0.50 | Insufficient — consider model change |\n\n\u003e **Grade scale note.** These thresholds are research-defined, based on the faithfulness F1 score from mechanistic interpretability literature (Conmy et al., 2023; Wang et al., 2022). They are **not** an officially validated regulatory scale under Regulation (EU) 2024/1689. No EU regulatory body has endorsed these specific thresholds. They are intended as internal documentation prioritisation aids, not as pass/fail compliance criteria. The grading scale and thresholds may be updated in future releases as interpretability research matures.\n\n---\n\n## Black-Box Audit — Any Model via API\n\nNo model weights needed. Works on GPT-4, Claude, Llama via any API endpoint. Uses counterfactual probing + sensitivity analysis + consistency testing to produce Article 13-relevant explainability metrics.\n\n\u003e **Black-box explainability note.** Black-box metrics (counterfactual probing, sensitivity analysis, consistency testing) are *behavioural proxies* — they measure the model's input-output behaviour, not its internal causal structure. They are fundamentally softer than white-box circuit analysis and will not achieve the same faithfulness scores. 
This is inherent to black-box analysis, not a limitation of Glassbox specifically: without weight access, structural causal attribution is not possible. Use white-box analysis for the highest-confidence explainability documentation; use black-box for models where weights are unavailable.\n\n```bash\npip install \"glassbox-mech-interp[compliance]\"\n```\n\n```python\nfrom glassbox.audit import BlackBoxAuditor, ModelProvider\nfrom glassbox.compliance import AnnexIVReport, DeploymentContext\n\nauditor = BlackBoxAuditor(\n    model_provider = ModelProvider.OPENAI,\n    model_name     = \"gpt-4\",\n    api_key        = \"sk-...\",    # stays on your machine if running locally\n)\n\nresult = auditor.audit(\n    decision_prompt    = \"The applicant has a credit score of 620. The loan should be\",\n    expected_positive  = \"approved\",\n    expected_negative  = \"denied\",\n    n_rephrases        = 5,\n    n_sensitivity_steps = 10,\n)\n\nreport = AnnexIVReport(\n    model_name=\"gpt-4\", system_purpose=\"Credit risk scoring\",\n    provider_name=\"Acme Bank NV\", provider_address=\"Amsterdam\",\n    deployment_context=DeploymentContext.FINANCIAL_SERVICES,\n)\nreport.add_analysis(result)   # BlackBoxResult is drop-in compatible\nreport.to_pdf(\"gpt4_annex_iv.pdf\")\n```\n\nSupported providers: OpenAI, Anthropic, Together AI, Groq, Azure OpenAI, any custom endpoint.\n\n---\n\n## REST API (Hosted)\n\nThe API is live at `https://YOUR_API_URL`. Interactive docs at [`/docs`](https://YOUR_API_URL/docs).\n\n**Black-box audit (any model via API):**\n\n```bash\ncurl -X POST https://YOUR_API_URL/v1/audit/black-box \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-Provider-Api-Key: sk-your-openai-key\" \\\n  -d '{\n    \"target_provider\":    \"openai\",\n    \"target_model\":       \"gpt-4\",\n    \"decision_prompt\":    \"The loan applicant has a credit score of 620. 
The application should be\",\n    \"expected_positive\":  \"approved\",\n    \"expected_negative\":  \"denied\",\n    \"provider_name\":      \"Acme Bank NV\",\n    \"provider_address\":   \"1 Fintech Street, Amsterdam 1011AB\",\n    \"system_purpose\":     \"Credit risk assessment\",\n    \"deployment_context\": \"financial_services\",\n    \"generate_pdf\":       true\n  }'\n```\n\n\u003e **Key security:** The API key is passed as a header (`X-Provider-Api-Key`), never in the request body. It is never logged, never stored, and never included in the compliance report. See [SECURITY.md](SECURITY.md) for full details. For production, [self-host](#self-hosting).\n\n**White-box analysis (open-source models):**\n\n```bash\ncurl -X POST https://YOUR_API_URL/v1/audit/analyze \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model_name\":       \"gpt2\",\n    \"prompt\":           \"When Mary and John went to the store, John gave a drink to\",\n    \"correct_token\":    \" Mary\",\n    \"incorrect_token\":  \" John\",\n    \"provider_name\":    \"Research Lab\",\n    \"provider_address\": \"1 University Ave\",\n    \"system_purpose\":   \"NLP research\",\n    \"generate_pdf\":     true\n  }'\n```\n\n**Retrieve a report:**\n\n```bash\ncurl https://YOUR_API_URL/v1/audit/report/{report_id}\ncurl https://YOUR_API_URL/v1/audit/pdf/{report_id}  # download PDF\n```\n\n---\n\n## What's Novel\n\nFeatures not available in any other single open-source toolkit (as of March 2026):\n\n| Feature | Glassbox | TransformerLens | Baukit | Pyvene |\n|---------|:--------:|:---------------:|:------:|:------:|\n| O(3) Attribution Patching | ✅ | ✅ (manual) | ✅ (manual) | ✅ (manual) |\n| Integrated Gradients (path-integral) | ✅ | ❌ | ❌ | ❌ |\n| Edge Attribution Patching (Syed et al. 
2024) | ✅ | ❌ | ❌ | ❌ |\n| Logit Lens + Per-head Direct Effects | ✅ | Partial | ❌ | ❌ |\n| Attribution Stability (Kendall τ-b) | ✅ | ❌ | ❌ | ❌ |\n| SAE Feature Attribution (sae-lens) | ✅ | ❌ | ❌ | ❌ |\n| QK / OV Composition Scores | ✅ | ❌ | ❌ | ❌ |\n| Token-level Saliency Maps | ✅ | ❌ | ❌ | ❌ |\n| Attention Pattern Analysis + Head Typing | ✅ | ❌ | ❌ | ❌ |\n| Bootstrap 95% CI on faithfulness | ✅ | ❌ | ❌ | ❌ |\n| Cross-model circuit alignment (FCAS) | ✅ | ❌ | ❌ | ❌ |\n| MLP attribution | ✅ | ❌ | ❌ | ❌ |\n| **EU AI Act Annex IV report (all 9 sections)** | ✅ | ❌ | ❌ | ❌ |\n| **Black-box audit — any API model** | ✅ | ❌ | ❌ | ❌ |\n| **REST API (FastAPI)** | ✅ | ❌ | ❌ | ❌ |\n| **Compliance officer web dashboard** | ✅ | ❌ | ❌ | ❌ |\n| One-call API | ✅ | ❌ | ❌ | ❌ |\n| Interactive dashboard (HF Spaces) | ✅ | ❌ | ❌ | ❌ |\n\n---\n\n## How It Works\n\n```\nClean prompt     →  model  →  logit(Mary)\nCorrupted prompt →  model  →  logit(John)\n\nAttribution Patching (Nanda et al. 2023):\n  attr(l, h) = ∇_{z_lh} LD · (z_clean_lh − z_corr_lh)\n\nEdge Attribution Patching (Syed et al. 2024):\n  EAP(u→v) = (∂LD/∂resid_pre_v) · Δh_u\n\nLogit Lens (nostalgebraist 2020):\n  LD_l = (W_U · LN(resid_post_l))_target − (W_U · LN(resid_post_l))_distractor\n\nSAE Feature Attribution (Bloom et al. 2024):\n  f_acts = ReLU(W_enc @ (resid − b_dec) + b_enc)\n  score(f) = f_acts[f] × (W_dec[f] @ unembed_dir)\n\nQK Composition (Elhage et al. 2021):\n  C_Q = ‖W_Q^{recv} · W_OV^{sender}‖_F / (‖W_Q^{recv}‖_F · ‖W_OV^{sender}‖_F)\n```\n\n**Faithfulness metrics** follow the ERASER framework (DeYoung et al. 2020):\n\n- **Sufficiency** — does the circuit alone recover the clean prediction?\n- **Comprehensiveness** — how much does ablating the circuit hurt?\n- **F1** — harmonic mean\n\n---\n\n## Benchmarks\n\n\u003e **Reproducible results.** All timings are wall-clock from `gb.analyze()` call to returned result dict. Model weights pre-loaded; load time excluded. 
Every approximation is disclosed via `suff_is_approx` flag. Full methodology and raw data in [`BENCHMARKS.md`](BENCHMARKS.md). Reproduce with `scripts/benchmark_v340.py`.\n\n### Core engine speed — GPT-2 vs ACDC\n\n| Model | Method | Passes | Time (M1 Pro) | Time (CPU 8-core) | Speedup vs ACDC |\n|-------|--------|--------|--------------|-------------------|----------------|\n| GPT-2 Small | `analyze()` Taylor approx | 3 | **1.8 s** | **4.2 s** | **~37×** |\n| GPT-2 Small | `bootstrap_metrics()` exact | 3+2·\\|C\\| | 8.4 s | 22.1 s | ~8× |\n| GPT-2 Medium | `analyze()` Taylor approx | 3 | **4.9 s** | **11.8 s** | **~24×** |\n| GPT-2 Large | `analyze()` Taylor approx | 3 | **14.3 s** | **34.1 s** | **~15×** |\n| Pythia-1.4B | `analyze()` Taylor approx | 3 | **8.3 s** | **19.6 s** | — |\n\nACDC baseline: official implementation (Conmy et al. 2023, NeurIPS) on NVIDIA A100.\n\n### IOI Faithfulness — GPT-2 family\n\n| Model | Suff. (approx) | Suff. (exact) | Comp. | F1 | Grade | Circuit (heads) |\n|-------|----------------|---------------|-------|----|-------|----------------|\n| GPT-2 Small | 80.0% | **~100%** | 37.2% | 48.8% | C | 26 |\n| GPT-2 Medium | 35.1% | ~61% | 23.7% | 27.9% | D | 31 |\n| GPT-2 Large | 18.2% | ~34% | 14.2% | 15.9% | D | 38 |\n\n### EU AI Act use case — Credit Scoring (Annex III representative task)\n\n`\"The loan applicant has a credit score of 620. 
The bank decision is\"` — correct: ` approved`\n\n| Model | Sufficiency | F1_faith | Grade | n_heads | Time (M1 Pro) |\n|-------|-------------|----------|-------|---------|--------------|\n| GPT-2 Small | 73% | 0.61 | **B** | 14 | 1.8 s |\n| GPT-2 Medium | 78% | 0.65 | **B** | 18 | 4.9 s |\n| GPT-Neo-125M | 69% | 0.57 | C | 11 | 2.3 s |\n| Pythia-160M | 71% | 0.59 | C | 13 | 2.1 s |\n\n### Multi-Agent Audit, Steering, and Vault\n\n| Component | Input | Time |\n|-----------|-------|------|\n| `MultiAgentAudit.audit_chain()` | 4-agent chain, 100 tokens/agent | **0.07 s** |\n| `SteeringVectorExporter.extract_mean_diff()` | 3 contrast pairs, 1 layer | **0.9 s** |\n| `SteeringVectorExporter.apply()` | 1 hook, greedy decode | **0.3 s** |\n| `build_annex_iv_vault()` | gb_result + all inputs | **\u003c 0.1 s** |\n\n### Cross-model Circuit Alignment (FCAS)\n\n| Model pair | FCAS | z-score |\n|-----------|------|---------|\n| GPT-2 Small ↔ GPT-2 Medium | 0.835 | 4.21 |\n| GPT-2 Small ↔ GPT-2 Large | 0.783 | 3.67 |\n| GPT-2 Medium ↔ GPT-2 Large | 0.833 | 4.18 |\n\n### Reproduce\n\n```bash\npython scripts/benchmark_v340.py --model gpt2 --task credit --seed 42\npython scripts/benchmark_v340.py --suite standard --output results/bench_v340.json\n```\n\nSee [`BENCHMARKS.md`](BENCHMARKS.md) for full methodology, hardware specs, and planned Llama-2-7B / Mistral-7B benchmarks (v4.1.0).\n\n---\n\n## Usage Examples\n\n### Core Circuit Analysis\n\n```python\n# Attribution patching — Taylor (fast) or Integrated Gradients (accurate)\ntokens_c    = model.to_tokens(\"When Mary and John went to the store, John gave a drink to\")\ntokens_corr = model.to_tokens(\"When John and Mary went to the store, Mary gave a drink to\")\nt_tok, d_tok = model.to_single_token(\" Mary\"), model.to_single_token(\" John\")\n\nattrs, clean_ld = gb.attribution_patching(tokens_c, tokens_corr, t_tok, d_tok)\n# Returns {(layer, head): score} dict + clean logit diff\n\nattrs_ig, _ = gb.attribution_patching(\n    
tokens_c, tokens_corr, t_tok, d_tok,\n    method=\"integrated_gradients\", n_steps=20,\n)\n# Exact path-integral attribution (Sundararajan et al. 2017)\n\nmlp_attrs = gb.mlp_attribution(tokens_c, tokens_corr, t_tok, d_tok)\n# Returns {layer: score} dict\n\ncircuit, attrs, clean_ld = gb.minimum_faithful_circuit(tokens_c, tokens_corr, t_tok, d_tok)\n```\n\n### Logit Lens + Direct Effects\n\n```python\nll = gb.logit_lens(tokens_c, \" Mary\", \" John\")\n\nprint(ll[\"logit_diffs\"])    # [0.12, 0.18, 0.34, ..., 3.21]\nprint(ll[\"logit_shifts\"])   # [0.06, 0.16, ...]\nprint(ll[\"head_direct_effects\"][9])  # n_heads direct effects at layer 9\n\nresult = gb.analyze(\n    \"When Mary and John went to the store, John gave a drink to\",\n    \" Mary\", \" John\", include_logit_lens=True,\n)\nprint(result[\"logit_lens\"][\"logit_diffs\"])\n```\n\n### Edge Attribution Patching (EAP)\n\n```python\n# Scores every directed edge (sender → receiver) — more informative than node AP (Syed et al. 2024)\neap = gb.edge_attribution_patching(tokens_c, tokens_corr, t_tok, d_tok, top_k=50)\n\nfor edge in eap[\"top_edges\"][:5]:\n    print(f\"{edge['sender']:15s} → {edge['receiver']:15s}  score={edge['score']:.4f}\")\n# attn_L09H09      → resid_pre_L10    score=0.3421\n```\n\n### Attribution Stability\n\n```python\nstability = gb.attribution_stability(tokens_c, t_tok, d_tok, n_corruptions=25, seed=42)\nprint(stability[\"rank_consistency\"])      # Kendall τ-b ∈ [-1, 1]\nprint(stability[\"top_stable_heads\"][:3])\n```\n\n### Token Attribution (Saliency Maps)\n\n```python\ntok_attr = gb.token_attribution(tokens_c, t_tok, d_tok)\nfor t in tok_attr[\"top_tokens\"]:\n    sign = \"+\" if t[\"attribution\"] \u003e 0 else \"-\"\n    print(f\"  [{sign}] {t['token_str']!r:15s}  |attr|={abs(t['attribution']):.4f}\")\n# [+] ' Mary'           |attr|=0.4231\n# [+] ' John'           |attr|=0.3187\n```\n\n### Attention Patterns + Head Typing\n\n```python\nattn = gb.attention_patterns(tokens_c, heads=[(9, 
9), (10, 0), (5, 5)])\nprint(attn[\"entropy\"])      # {'L09H09': 0.71, 'L10H00': 1.24, ...}\nprint(attn[\"head_types\"])   # {'L09H09': 'focused', 'L10H00': 'previous_token', ...}\nattn_auto = gb.attention_patterns(tokens_c, heads=None, top_k=10)\n```\n\n### SAE Feature Attribution\n\n\u003e Requires: `pip install sae-lens`\n\n```python\nfrom glassbox import SAEFeatureAttributor\n\nsfa    = SAEFeatureAttributor(model)\ntokens = model.to_tokens(\"When Mary and John went to the store, John gave a drink to\")\nfeats  = sfa.attribute(tokens, \" Mary\", \" John\", layers=[9, 10, 11])\n\nfor f in feats[\"top_features\"][:5]:\n    print(f\"  Layer {f['layer']}  Feature {f['feature_id']:5d}  LD={f['ld_contribution']:+.4f}\")\n    if f[\"neuronpedia_url\"]:\n        print(f\"    → {f['neuronpedia_url']}\")\n# Layer 9   Feature  4821  LD=+0.3124\n#   → https://www.neuronpedia.org/gpt2-small/9-res-jb/4821\n```\n\n### Head Composition Scores (Elhage et al. 2021)\n\n```python\nfrom glassbox import HeadCompositionAnalyzer\n\ncomp    = HeadCompositionAnalyzer(model)\nq_score = comp.q_composition_score(5, 5, 9, 9)\nprint(f\"Q-comp (5,5)→(9,9): {q_score:.4f}\")\n\ncircuit  = [(5, 5), (7, 3), (9, 9), (9, 6)]\nall_comp = comp.all_composition_scores(circuit, min_score=0.05)\nfor edge in all_comp[\"combined_edges\"][:5]:\n    print(f\"  {edge['sender']} → {edge['receiver']}  Q={edge['q']:.3f}  K={edge['k']:.3f}  V={edge['v']:.3f}\")\n```\n\n### Bootstrap Faithfulness CIs\n\n```python\nboot = gb.bootstrap_metrics(\n    prompts=[\n        (\"When Mary and John went to the store, John gave a drink to\", \" Mary\", \" John\"),\n        (\"When Alice and Bob entered the room, Bob handed the key to\", \" Alice\", \" Bob\"),\n        # recommended n \u003e= 20 for stable CIs\n    ],\n    n_boot=500,\n)\nprint(boot[\"sufficiency\"])\n# {\"mean\": 0.82, \"std\": 0.06, \"ci_lo\": 0.71, \"ci_hi\": 0.91, \"n\": 2}\n```\n\n### Cross-model Circuit Alignment (FCAS)\n\n```python\nmodel_sm = 
HookedTransformer.from_pretrained(\"gpt2\")\nmodel_md = HookedTransformer.from_pretrained(\"gpt2-medium\")\ngb_sm, gb_md = GlassboxV2(model_sm), GlassboxV2(model_md)\n\nr_sm = gb_sm.analyze(\"When Mary and John went to the store, John gave a drink to\", \" Mary\", \" John\")\nr_md = gb_md.analyze(\"When Mary and John went to the store, John gave a drink to\", \" Mary\", \" John\")\n\nfcas = gb_sm.functional_circuit_alignment(r_sm[\"top_heads\"], r_md[\"top_heads\"], top_k=5)\nprint(f\"FCAS: {fcas['fcas']:.3f}  (z={fcas['z_score']:.2f})\")\n# FCAS: 0.835  (z=4.21)   (GPT-2 Small vs GPT-2 Medium)\n```\n\n---\n\n## CLI\n\n```bash\npip install glassbox-mech-interp\n\nglassbox analyze \\\n  --prompt \"When Mary and John went to the store, John gave a gift to\" \\\n  --correct \" Mary\" \\\n  --incorrect \" John\" \\\n  --model gpt2\n\n# Output:\n#   Sufficiency      : 80.0%\n#   Comprehensiveness: 37.2%\n#   F1-score         : 48.8%\n#   Category         : backup_mechanisms\n#   Head         Attribution\n#   ------------ ------------\n#   L09H09           0.1742\n#   L09H06           0.1231\n```\n\n---\n\n## Installation\n\n### Core Install\n\n```bash\n# Minimal — circuit analysis only\npip install glassbox-mech-interp\n```\n\n### Optional Dependency Groups\n\n```bash\n# Jupyter widgets (CircuitWidget, HeatmapWidget)\npip install \"glassbox-mech-interp[jupyter]\"\n\n# EU AI Act compliance reports (AnnexIVReport, BlackBoxAuditor)\npip install \"glassbox-mech-interp[compliance]\"\n\n# SAE feature attribution (requires sae-lens)\npip install \"glassbox-mech-interp[sae]\"\n\n# REST API stack (FastAPI, ClickHouse, Docker)\npip install \"glassbox-mech-interp[api]\"\n\n# Full development install\ngit clone https://github.com/designer-coderajay/glassbox-mech\ncd glassbox-mech\npip install -e \".[dev]\"\n```\n\n### TypeScript / JavaScript SDK\n\n```bash\nnpm install glassbox-sdk    # Node.js, Deno, Bun\n# or \u003cscript 
src=\"https://cdn.jsdelivr.net/npm/glassbox-sdk/dist/glassbox.js\"\u003e\u003c/script\u003e  (browser)\n```\n\n**Requirements:** Python ≥ 3.8, PyTorch ≥ 2.0, TransformerLens ≥ 1.0\n\n---\n\n## Dashboard\n\nTwo dashboard options:\n\n**Option 1 — Live Demo (no install):** Visit the [HuggingFace Space](https://huggingface.co/spaces/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool) for interactive white-box circuit analysis on open-source models.\n\n**Option 2 — Research UI (Gradio, local):**\n\n```bash\npip install glassbox-mech-interp gradio matplotlib\ngit clone https://github.com/designer-coderajay/glassbox-mech\ncd glassbox-mech\npython dashboard/app.py\n# Opens Gradio at http://localhost:7860\n# Tabs: Circuit Analysis · Logit Lens · Attention Patterns\n```\n\n---\n\n## Self-Hosting (Docker / Air-Gapped VPC)\n\nRun the full Glassbox stack on your own infrastructure. 
**No data leaves your environment.** Designed for regulated industries (banking, healthcare, insurance) where outbound API calls are prohibited.\n\n### Quick start — single container\n\n```bash\ngit clone https://github.com/designer-coderajay/glassbox-mech\ncd glassbox-mech\n\n# API only\ndocker build --target api -t glassbox-api:4.1.0 .\ndocker run -p 8000:8000 glassbox-api:4.1.0\n# REST API:    http://localhost:8000\n# Swagger UI:  http://localhost:8000/docs\n# Health:      http://localhost:8000/health\n\n# Dashboard only\ndocker build --target dashboard -t glassbox-dashboard:4.1.0 .\ndocker run -p 7860:7860 glassbox-dashboard:4.1.0\n```\n\n### Production stack — API + Dashboard + Redis cache\n\n```bash\n# Copy and configure environment\ncp .env.example .env\n# Edit .env: set GLASSBOX_SECRET_KEY, optional SLACK_WEBHOOK_URL, etc.\n\n# Start API + Dashboard (no TLS)\ndocker compose up api dashboard\n\n# Full production stack with TLS and Redis\ndocker compose --profile production up\n```\n\n### Air-gapped / offline deployment\n\n```bash\n# On a machine with internet access — export the image\ndocker build --target api -t glassbox-api:4.1.0 .\ndocker save glassbox-api:4.1.0 | gzip \u003e glassbox-api-4.1.0.tar.gz\n\n# Transfer to air-gapped machine (USB, internal file share, etc.)\n# On the air-gapped machine:\ndocker load \u003c glassbox-api-4.1.0.tar.gz\n\n# Set offline mode — disables all HuggingFace Hub network calls\ndocker run -p 8000:8000 \\\n  -e HF_HUB_OFFLINE=1 \\\n  -v /path/to/model/cache:/app/.cache/huggingface \\\n  glassbox-api:4.1.0\n```\n\n### Environment variables\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `GLASSBOX_SECRET_KEY` | `change-me` | HMAC key for webhook signing |\n| `GLASSBOX_LOG_LEVEL` | `info` | `debug` / `info` / `warning` |\n| `GLASSBOX_MAX_WORKERS` | `2` | Uvicorn worker processes |\n| `HF_HUB_OFFLINE` | `0` | Set `1` for air-gapped deployment |\n| `MLFLOW_TRACKING_URI` | — | MLflow server for 
experiment logging |\n| `SLACK_WEBHOOK_URL` | — | Compliance alert webhook |\n| `TEAMS_WEBHOOK_URL` | — | Teams compliance alert webhook |\n| `MODEL_CACHE_PATH` | `./data/model_cache` | Host path for model weight volume |\n\nOne-click deploy to Railway (always-on, no sleep):\n\n[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/new/template?template=https://github.com/designer-coderajay/glassbox-mech)\n\n---\n\n## Supported Models\n\nGlassbox works with any model loaded via TransformerLens. Tested on:\n\n| Model family | Examples |\n|-------------|---------|\n| GPT-2 | `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl` |\n| GPT-Neo (EleutherAI) | `EleutherAI/gpt-neo-125m`, `EleutherAI/gpt-neo-1.3B` |\n| Pythia (EleutherAI) | `EleutherAI/pythia-70m`, `EleutherAI/pythia-160m`, `EleutherAI/pythia-410m` |\n| OPT (Meta) | `facebook/opt-125m`, `facebook/opt-1.3b` |\n\nSAE feature attribution currently supports GPT-2 small via Joseph Bloom's pretrained SAEs. Pythia SAEs are available via `sae-lens` — pass `sae_release` explicitly.\n\nBlack-box audit works on **any model with an OpenAI-compatible API**, including GPT-4, Claude, Gemini, Llama (via Together/Groq), and custom endpoints.\n\n---\n\n## API Reference\n\n### `GlassboxV2(model)`\n\n| Method | Complexity | Description |\n|--------|-----------|-------------|\n| `analyze(prompt, correct, incorrect, method, include_logit_lens)` | O(3+2p) | Full circuit analysis. Returns circuit, attributions, faithfulness. |\n| `attribution_patching(tokens_c, tokens_corr, t_tok, d_tok, method, n_steps)` | O(3) or O(2+n) | Per-head attribution. Taylor (fast) or IG (accurate). |\n| `mlp_attribution(tokens_c, tokens_corr, t_tok, d_tok)` | O(3) | Per-layer MLP contribution scores. |\n| `minimum_faithful_circuit(...)` | O(3+2p) | Greedy circuit pruning. p = pruning steps. |\n| `logit_lens(tokens, target, distractor)` | O(1) | Layer-by-layer LD + per-head direct effects. 
|\n| `edge_attribution_patching(...)` | O(3) | Edge-level EAP scores (Syed et al. 2024). |\n| `attribution_stability(tokens, target, distractor, n_corruptions)` | O(3K) | Per-head stability + Kendall τ-b rank consistency. Novel. |\n| `token_attribution(tokens, target, distractor)` | O(2) | Input-token saliency via gradient × embedding. |\n| `attention_patterns(tokens, heads, top_k)` | O(1) | Attention matrices + entropy + head type classification. |\n| `bootstrap_metrics(prompts, n_boot)` | O(3N) | 95% CI on faithfulness across N prompts. |\n| `functional_circuit_alignment(heads_a, heads_b, top_k)` | O(1) | FCAS between two circuits. Novel. |\n\n### `SAEFeatureAttributor(model)` — requires `sae-lens`\n\n| Method | Description |\n|--------|-------------|\n| `attribute(tokens, target, distractor, layers)` | SAE feature attribution at specified layers. |\n| `attribute_circuit_heads(circuit, tokens, target, distractor)` | Circuit-scoped SAE feature attribution. |\n\n### `HeadCompositionAnalyzer(model)`\n\n| Method | Description |\n|--------|-------------|\n| `q_composition_score(sl, sh, rl, rh)` | Q-composition between head (sl,sh) → (rl,rh). |\n| `k_composition_score(sl, sh, rl, rh)` | K-composition. |\n| `v_composition_score(sl, sh, rl, rh)` | V-composition. |\n| `all_composition_scores(circuit, min_score)` | Q + K + V scores in one call. |\n\n### `AnnexIVReport` — requires `[compliance]`\n\n| Method | Description |\n|--------|-------------|\n| `add_analysis(result, use_case)` | Add a GlassboxV2 or BlackBoxAuditor result. |\n| `to_json(path)` | Export as structured JSON (all 9 sections). |\n| `to_pdf(path)` | Export as signed PDF with EU AI Act article references. |\n\n### `BlackBoxAuditor` — requires `[compliance]`\n\n| Method | Description |\n|--------|-------------|\n| `audit(decision_prompt, expected_positive, expected_negative, ...)` | Full behavioural audit. Returns BlackBoxResult. 
|\n| `from_env(provider, model)` | Construct auditor from `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` env vars. |\n\n### `AuditLog` — append-only audit trail (v2.9.0+)\n\n| Method | Description |\n|--------|-------------|\n| `append(model_name, analysis_mode, prompt, ...)` | Append a single audit record with SHA-256 hash chain. |\n| `append_from_result(result, auditor, notes)` | Append from a GlassboxV2 or BlackBoxAuditor result. |\n| `verify_chain()` | Returns True if hash chain is intact (no tampering). |\n| `summary()` | Analytics dict: total_audits, grade_distribution, compliance_rate, avg_f1, chain_valid. |\n| `export_json(path)` | Export all records as JSON array with metadata. |\n| `export_csv(path)` | Export all records as CSV for GRC/Excel import. |\n| `by_model(name)`, `by_grade(grade)`, `non_compliant()` | Query methods. |\n\n### `MultiCorruptionPipeline(model)` — v3.7.0+\n\n| Method | Description |\n|--------|-------------|\n| `run(prompt, io_name, subject_name, circuit, target_tok, distract_tok, strategies)` | Run all 4 corruptions, return `RobustnessReport` with `robust` flag. |\n\n`CorruptionStrategy` enum: `NAME_SWAP`, `RANDOM_TOKEN`, `GAUSSIAN_NOISE`, `MEAN_ABLATION`\n\n### `SampleSizeGate()` + `HeldOutValidator()` — v3.7.0+\n\n| Class | Method | Description |\n|-------|--------|-------------|\n| `SampleSizeGate` | `check(n)` | Block n\u003c20, warn n\u003c50. Raises `SampleSizeError`. |\n| `SampleSizeGate` | `recommend_n(rho_min, alpha, power)` | Power-analysis minimum n via Fisher Z. |\n| `HeldOutValidator` | `validate(results)` | 50/50 split on `batch_analyze()` output. Returns `HeldOutValidationResult`. |\n\n### `FoldedLayerNorm(model)` — v4.0.0+\n\n| Method | Description |\n|--------|-------------|\n| `analyze(raw_attributions, clean_tokens, corr_tokens, target_tok, distract_tok)` | Returns `LayerNormBiasReport` with per-head `bias_ratio`, `biased_heads` set. 
|\n| `apply_correction(raw_attributions, folded_attrs)` | Returns corrected attribution dict. |\n\n### `BenjaminiHochberg(alpha)` — v4.0.0+\n\n| Method | Description |\n|--------|-------------|\n| `run(attributions, se_map)` | BH FDR with z-test p-values. Returns `FDRReport`. |\n| `run_bootstrap(attributions_per_sample, observed_attributions)` | Bootstrap-SE variant. |\n| `run_permutation(attributions_per_permutation, observed_attributions)` | Permutation-based p-values. |\n| `apply_fdr_correction(attributions, se_map, alpha)` | Convenience wrapper. |\n\n### `PolysemanticityScorerSAE(model)` — v4.0.0+\n\n| Method | Description |\n|--------|-------------|\n| `score_circuit(circuit, prompts_tokens)` | Returns `PolysemanticitySummary` with entropy per head. SAE or PCA fallback. |\n\n### `HessianErrorBounds(model)` — v4.1.0+\n\n| Method | Description |\n|--------|-------------|\n| `compute(attributions, clean_tokens, corr_tokens, target_tok, distract_tok)` | Returns `HessianBoundsReport`. Flags `hessian_dominated` heads where `\\|ε(h)/α(h)\\| \u003e 0.20`. |\n\n### `CausalScrubbing(model, n_samples)` + `CircuitHypothesis` — v4.1.0+\n\n| Class | Method | Description |\n|-------|--------|-------------|\n| `CircuitHypothesis` | `from_wang2022_ioi()` | Pre-built IOI circuit (13 heads with role labels). |\n| `CircuitHypothesis` | `from_list(name, heads, description, roles)` | Custom hypothesis. |\n| `CausalScrubbing` | `evaluate(hypothesis, prompt, corr_prompt, target_tok, distract_tok)` | CS(H) score + interpretation. |\n| `CausalScrubbing` | `evaluate_batch(hypothesis, prompts)` | Multi-prompt evaluation. |\n| `CausalScrubbing` | `mean_cs_score(results)` | Aggregate statistics. |\n\n### `DistributedAlignmentSearch(model, concept_dims)` — v4.1.0+\n\n| Method | Description |\n|--------|-------------|\n| `search(concept_label, clean_tokens, cf_tokens, target_tok, distract_tok, target_layer, target_position)` | PCA subspace + DAS score. Returns `DASResult`. 
|\n| `search_all_layers(concept_label, ...)` | Layer sweep, sorted by `das_score` descending. |\n\n---\n\n### `GlassboxClient` (TypeScript/JavaScript SDK) — v2.9.0+\n\n```typescript\ntype DeploymentContext = 'financial_services' | 'healthcare' | 'hr_employment' | 'legal' | 'critical_infrastructure' | 'education' | 'other_high_risk'\ntype ExplainabilityGrade = 'A' | 'B' | 'C' | 'D'\ntype ComplianceStatus = 'conditionally_compliant' | 'incomplete' | 'non_compliant'\n\nclass GlassboxClient {\n  // Audits\n  auditWhiteBox(req: WhiteBoxRequest): Promise\u003cAuditReport\u003e\n  auditBlackBox(req: BlackBoxRequest): Promise\u003cAuditReport\u003e\n  startBlackBoxJob(req: BlackBoxRequest): Promise\u003cAsyncJobResponse\u003e\n  waitForJob(jobId: string, intervalMs?, maxWaitMs?): Promise\u003cAsyncJobResponse\u003e\n  pollJob(jobId: string): Promise\u003cAsyncJobResponse\u003e\n\n  // Reports \u0026 data\n  getReport(reportId: string): Promise\u003cAuditReport\u003e\n  listReports(): Promise\u003c{ reports: unknown[], total: number }\u003e\n  pdfUrl(reportId: string): string\n\n  // Patterns\n  attentionPatterns(modelName: string, prompt: string, heads?: string[], topK?: number): Promise\u003cAttentionPatternsResponse\u003e\n\n  // Health\n  health(): Promise\u003c{ status: string, glassbox_version: string, timestamp: string }\u003e\n}\n```\n\n---\n\n## Methodology \u0026 IP Documentation\n\nThe core innovation in Glassbox is not the mechanistic interpretability math — that's academic. The core innovation is the **legal-technical translation layer**: the specific, proprietary mapping from mathematical circuit analysis results to EU AI Act provisions that makes Annex IV reports both mathematically rigorous and legally structured.\n\nFull documentation is in [`METHODOLOGY.md`](METHODOLOGY.md). Key claims:\n\n**Taylor-approximated circuit discovery in O(3) passes.** The standard approach (ACDC) requires O(E) passes where E is the number of edges in the computation graph. 
Glassbox uses a first-order Taylor approximation to reduce this to exactly 3 passes, enabling circuit discovery on consumer hardware without loss of Annex IV documentation value.\n\n**Faithfulness F1 as a compliance gate.** F1_faith = harmonic mean(sufficiency, comprehensiveness). Neither metric alone is sufficient: high sufficiency with low comprehensiveness signals backup mechanisms (unpredictable behaviour under distribution shift), and the harmonic mean is low whenever either metric is low. The 0.65 threshold is a research-defined value chosen with Article 15(1) in mind; the Article itself sets no numeric threshold.\n\n**Multi-agent contamination scoring.** `contamination(A→B) = |bias_tokens(B) ∩ bias_tokens(A)| / |bias_tokens(B)|`. This formalises a chain-of-causation argument for Article 9 system-level liability that no other tool implements.\n\n**Steering vector as Article 9(2)(b) evidence.** Representation Engineering vectors (Zou et al. 2023) are formalised as documented risk mitigation measures with provenance metadata and quantified suppression tests — converting an ad-hoc patch into an auditable compliance artifact.\n\n**Evidence Vault architecture.** Every interpretability finding maps to an Annex IV section (§1–§7) and specific Articles. This structure is the proprietary IP — not the underlying math.\n\nAll threshold values, grade mappings, section assignments, and article citations are original contributions of Ajay Pravin Mahale and are documented with timestamps in `METHODOLOGY.md`.\n\n---\n\n## Mathematical Disclosures\n\nGlassbox is explicit about approximations. Nothing is hidden.\n\n**Sufficiency (in `analyze()`)** is a first-order Taylor approximation:\n\n```\nSuff ≈ Σ_{h ∈ circuit} attr(h) / LD_clean\n```\n\nThis is accurate when individual head contributions are small relative to LD_clean and head interactions are approximately linear. 
For exact causal sufficiency, use `bootstrap_metrics()` or the MFC ablation method.\n\n**Per-head direct effects** (in `logit_lens()`) apply the unembed direction without the final LayerNorm scale, which is nonlinear and cannot be decomposed per-head. Relative rankings are preserved; absolute values are indicative only.\n\n**SAE feature attribution** in `attribute_circuit_heads()` applies the SAE to isolated head outputs rather than the full residual stream. See docstring for exact assumptions.\n\n**Edge Attribution Patching scores** (in `edge_attribution_patching()`) are likewise a first-order, gradient-based approximation of edge-level activation patching (Syed et al. 2024), not exact causal effects.\n\nAll other metrics (Comprehensiveness, Composition scores, Bootstrap CIs) are exact or asymptotically exact.\n\n---\n\n## Mathematical Foundations Reference\n\nEvery formula used in Glassbox — attribution patching, faithfulness metrics, Fisher Z\ntransformations, Bonferroni correction, power analysis, and EU AI Act regulatory mapping\n— is formally derived and cited in **[`MATH_FOUNDATIONS.md`](MATH_FOUNDATIONS.md)**.\n\nThis 16-section document is the single source of truth for all mathematical operations\nin the library. Key equations include:\n\n**Attribution patching** (first-order Taylor approximation, 3 forward passes):\n```\nα(h) ≈ (∂LD / ∂z_h)|_{z_h = z_h^clean}  ·  (z_h^clean − z_h^corrupt)\n```\n\n**Faithfulness F1** (harmonic mean of sufficiency and comprehensiveness):\n```\nF1_faith = 2 · S(C) · Comp(C) / (S(C) + Comp(C))\n```\n\n**Confidence–faithfulness correlation test** (Fisher Z transform):\n```\nz = atanh(r),   SE = 1/√(n−3),   Z = z/SE  ~  N(0,1)  under H₀: ρ = 0\n```\n\nReference values from Mahale (2026) / arXiv:2603.09988:\n`r = 0.009`, `S = 1.00`, `Comp = 0.22`, `F1 = 0.64` (full 26-head Wang et al. 
IOI circuit).\nCircuit coverage: **61.4%** of logit difference explained by 6 identified heads.\nFull results documented in [`BENCHMARKS.md § 8`](./BENCHMARKS.md#8-peer-reviewed-results--arxiv26039988).\n\n---\n\n## Cross-Model Faithfulness Study\n\nGlassbox includes a multi-LLM experiment harness that tests whether confidence–faithfulness\nindependence holds across four models spanning three families (GPT-2, Pythia, Llama-2).\n\n**Models:** GPT-2-small (117M), GPT-2-XL (1.5B), Pythia-1.4B, Llama-2-7B\n\n**Task:** Indirect Object Identification (IOI) — 100 prompts per model, 20 name pairs × 5 sentence frames.\n\n**Statistical tests:**\n- Per-model Fisher Z test of H₀: ρ = 0 (two-sided, α = 0.05)\n- Cross-model Welch's t-test on F1 with Bonferroni correction over the 6 pairwise model comparisons (α_adj = 0.05/6 ≈ 0.0083)\n- Pairwise Jaccard circuit similarity (normalised head positions, ε = 0.05)\n- BCa bootstrap CIs (B = 2,000 resamples) on all faithfulness metrics\n\n**Run the experiment:**\n```bash\n# Dry-run (no model loading, synthetic data, validates pipeline):\npython experiments/cross_model_study.py --dry-run\n\n# Full run (requires GPU with ≥16 GB VRAM for Llama-2-7B):\npython experiments/cross_model_study.py --n-prompts 100 --device cuda --output-dir results/\n\n# Single model:\npython experiments/cross_model_study.py --models gpt2-small --n-prompts 100 --dry-run\n```\n\n**Dry-run results** (synthetic data, reproducible via fixed seeds):\n\n| Model | r | p-value | F1 | H₀ |\n|-------|---|---------|----|----|\n| GPT-2-small | 0.069 | 0.496 | 0.624 | not rejected |\n| GPT-2-XL | −0.032 | 0.751 | 0.651 | not rejected |\n| Pythia-1.4B | −0.054 | 0.596 | 0.593 | not rejected |\n| Llama-2-7B | 0.096 | 0.342 | 0.718 | not rejected |\n\nPaper outline: [`experiments/PAPER_OUTLINE.md`](experiments/PAPER_OUTLINE.md)\nFull mathematical details: [`MATH_FOUNDATIONS.md`](MATH_FOUNDATIONS.md)\n\n---\n\n## Paper\n\n**[Glassbox: A Causal Mechanistic Interpretability Toolkit with Circuit Alignment 
Scoring](https://arxiv.org/abs/2603.09988)**\n\nIntroduces the **Functional Circuit Alignment Score (FCAS)**, automated Minimum Faithful Circuit (MFC) discovery, and bootstrap CIs on circuit faithfulness. Submitted to ICML 2026 Mechanistic Interpretability Workshop (deadline April 24, 2026).\n\n---\n\n## Citation\n\nIf you use Glassbox 2.0 in your research, please cite:\n\n```bibtex\n@software{mahale2026glassbox,\n  author    = {Mahale, Ajay Pravin},\n  title     = {Glassbox: A Causal Mechanistic Interpretability Toolkit with Circuit Alignment Scoring},\n  year      = {2026},\n  publisher = {GitHub},\n  url       = {https://github.com/designer-coderajay/glassbox-mech},\n  note      = {arXiv:2603.09988}\n}\n```\n\n**Core references this work builds on:**\n\n- Wang et al. (2022). [Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.](https://arxiv.org/abs/2211.00593)\n- Nanda et al. (2023). [Attribution Patching: Activation Patching at Industrial Scale.](https://www.neelnanda.io/mechanistic-interpretability/attribution-patching)\n- Syed et al. (2024). [Attribution Patching Outperforms Automated Circuit Discovery.](https://arxiv.org/abs/2310.10348) ACL BlackboxNLP.\n- Elhage et al. (2021). [A Mathematical Framework for Transformer Circuits.](https://transformer-circuits.pub/2021/framework/index.html)\n- nostalgebraist (2020). [Interpreting GPT: the Logit Lens.](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru)\n- Bloom et al. (2024). [Open Source Sparse Autoencoders for GPT-2 Small.](https://www.neuronpedia.org/gpt2-small)\n- Olsson et al. (2022). [In-context Learning and Induction Heads.](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)\n- Sundararajan et al. (2017). [Axiomatic Attribution for Deep Networks.](https://arxiv.org/abs/1703.01365) ICML.\n- Conmy et al. (2023). 
[Towards Automated Circuit Discovery for Mechanistic Interpretability.](https://arxiv.org/abs/2304.14997) NeurIPS.\n- DeYoung et al. (2020). [ERASER: A Benchmark to Evaluate Rationalized NLP Models.](https://arxiv.org/abs/1911.03429) ACL.\n- Regulation (EU) 2024/1689 of the European Parliament and of the Council (AI Act). [EUR-Lex CELEX:32024R1689](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689).\n\n---\n\n## Related Tools\n\n- [TransformerLens](https://github.com/neelnanda-io/TransformerLens) — mechanistic interpretability library Glassbox is built on\n- [sae-lens](https://github.com/jbloomAus/SAELens) — pretrained Sparse Autoencoders (required for SAE feature attribution)\n- [ACDC](https://github.com/ArthurConmy/Automatic-Circuit-DisCovery) — automated circuit discovery (Conmy et al. 2023). Timing baseline; preliminary benchmarks show Glassbox is 15–37× faster on GPT-2.\n- [Neuronpedia](https://www.neuronpedia.org/) — SAE feature browser (linked from SAE attribution output)\n\n---\n\n## Security \u0026 Privacy\n\nSee [SECURITY.md](SECURITY.md) for full details on API key handling, self-hosting recommendation, and GDPR/German law compliance notes.\n\n**TL;DR:** API keys go in the `X-Provider-Api-Key` header — never in the request body. A logging filter scrubs any accidental key leakage. Keys are never stored. For production compliance audits, run Glassbox locally or on your own infrastructure.\n\n---\n\n## Legal Notices \u0026 Regulatory Disclaimer\n\n\u003e **PLEASE READ THIS SECTION CAREFULLY BEFORE USING GLASSBOX FOR REGULATORY OR COMPLIANCE PURPOSES.**\n\n### 1. Nature of the Software — Documentation Aid Only\n\nGlassbox is a software toolkit that automates the *drafting* of technical documentation structured in accordance with Annex IV of Regulation (EU) 2024/1689 (\"EU AI Act\"). 
It is provided strictly as a **documentation aid and research instrument**, not as a legal, regulatory, or compliance service.\n\n**Use of Glassbox does not:**\n- constitute legal advice or a legal opinion of any kind;\n- establish an attorney-client, auditor-client, or any other professional relationship;\n- guarantee, certify, or represent that your AI system is compliant with the EU AI Act, GDPR, or any other applicable law or regulation;\n- replace the obligation to obtain a conformity assessment from a notified body where required under EU AI Act Article 43;\n- constitute or substitute for a Declaration of Conformity under EU AI Act Article 47;\n- determine whether your AI system qualifies as \"high-risk\" under EU AI Act Article 6 and Annex III — that is a legal determination requiring qualified counsel.\n\n### 2. Regulatory Guidance — Key References\n\nAll regulation references in this codebase and documentation cite the following instruments. Citations are provided for informational accuracy only:\n\n| Instrument | Reference | Scope |\n|------------|-----------|-------|\n| EU AI Act | Regulation (EU) 2024/1689 | Risk management (Art. 9), Technical documentation (Art. 11, Annex IV), Transparency (Art. 13), Data governance (Art. 10), Accuracy \u0026 robustness (Art. 15), Post-market monitoring (Art. 72), Conformity assessment (Art. 43), Declaration of conformity (Art. 47), Penalties (Art. 99) |\n| GDPR | Regulation (EU) 2016/679 | Personal data processed through or about the AI system |\n| EU AI Act Implementing Acts | To be adopted by European Commission | Technical harmonised standards (Art. 40), common specifications (Art. 41) — **not yet finalised as of March 2026** |\n\n\u003e **Important:** The EU AI Act entered into force 1 August 2024. Most obligations for high-risk AI providers apply from **2 August 2026**. Implementing acts, harmonised standards, and guidance from the European AI Office are still being developed. 
The regulatory landscape will evolve before enforcement. Regulatory interpretations in Glassbox's output reflect publicly available text as of the tool's release date and may not reflect subsequent guidance. Always consult the [EU AI Act official text](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689) and current European AI Office guidance.\n\n### 3. No Warranty\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, REGULATORY ADEQUACY, OR NON-INFRINGEMENT. THE AUTHORS AND CONTRIBUTORS MAKE NO REPRESENTATION THAT USE OF THIS SOFTWARE WILL SATISFY ANY OBLIGATION UNDER ANY LAW OR REGULATION, INCLUDING THE EU AI ACT OR GDPR.\n\n### 4. Limitation of Liability\n\nTO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL THE AUTHORS, CONTRIBUTORS, OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, REGULATORY SANCTIONS, FINES, PENALTIES, REPUTATIONAL HARM, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR RELIANCE ON THE SOFTWARE'S OUTPUTS FOR REGULATORY COMPLIANCE PURPOSES.\n\nThis limitation applies regardless of whether the authors have been advised of the possibility of such damages, and applies to the fullest extent permitted by Regulation (EU) 2024/1689 and applicable national law.\n\n### 5. 
Your Obligations as Deployer / Provider\n\nIf you are subject to the EU AI Act as a **provider** (Article 2(1)(a)), **deployer** (Article 2(1)(b)), or **importer/distributor** (Article 2(1)(d)), you are responsible for:\n\n- Independently determining whether your system is high-risk under Article 6 and Annex III;\n- Conducting a conformity assessment as required by Article 43 (self-assessment or notified body, depending on Annex III category);\n- Completing and signing a Declaration of Conformity under Article 47;\n- Registering your system in the EU database under Articles 49 and 71;\n- Maintaining technical documentation under Article 11 and Annex IV (Glassbox outputs are a *starting point* for this documentation, not a finished regulatory submission);\n- Implementing a post-market monitoring plan under Article 72;\n- Appointing an authorised representative established in the Union if you are a non-EU provider (Article 22).\n\nGlassbox automates the *drafting* of Annex IV section content. All outputs must be reviewed, validated, completed, and signed by responsible persons within your organisation before regulatory use.\n\n### 6. Explainability Grades — Informational Only\n\nThe A–D explainability grades produced by Glassbox are derived from mechanistic interpretability metrics (faithfulness F1 score) defined in the [accompanying research paper](https://arxiv.org/abs/2603.09988). These grades:\n\n- are **not** official EU AI Act classifications, nor do they map to any officially defined grading scale in Regulation (EU) 2024/1689;\n- represent internal research-defined thresholds intended to aid documentation and prioritisation;\n- do **not** determine whether your AI system meets the \"appropriate level of accuracy, robustness and cybersecurity\" required under Article 15;\n- are based on a single test prompt; real-world compliance assessment requires comprehensive evaluation across representative inputs.\n\n### 7. 
Bias Analysis — Article 10(2)(f) Guidance\n\nThe `BiasAnalyzer` module is designed to support documentation of data governance practices relevant to EU AI Act Article 10(2)(f) (examination for possible biases). Its outputs:\n\n- are intended to surface potential bias signals, not to certify absence of discrimination or bias;\n- do **not** constitute an equality impact assessment, human rights due diligence report, or any assessment required under national anti-discrimination law (e.g., General Equal Treatment Act (AGG) in Germany, Equality Act 2010 in the UK);\n- should be complemented by domain-expert review and, where the AI system makes decisions affecting natural persons, a Data Protection Impact Assessment (DPIA) under GDPR Article 35.\n\n### 8. Jurisdiction and Governing Law\n\nThis project is developed under the laws of the Federal Republic of Germany. The EU AI Act and GDPR are directly applicable EU regulations. Nothing in this notice limits the application of mandatory consumer protection or regulatory law. If a provision of this notice is unenforceable in your jurisdiction, the remaining provisions continue in full force.\n\n### 9. Contact for Legal Inquiries\n\nFor questions regarding the legal scope of Glassbox, please contact: [mahale.ajay01@gmail.com](mailto:mahale.ajay01@gmail.com)\n\nFor security vulnerabilities, see [SECURITY.md](SECURITY.md).\n\n---\n\n## Project \u0026 Privacy Notice\n\n**Academic research project.** Glassbox AI is an open-source MSc research project developed by Ajay Pravin Mahale as part of postgraduate studies in Germany. It is not a commercial product, not operated by a registered company, and not offered as a professional compliance or legal service. 
There is no registered business entity behind this project.\n\n**Privacy / GDPR (Regulation (EU) 2016/679).**\n\n- **No personal data is intentionally collected, stored, or processed.** Prompt text submitted via the Hugging Face Space is processed in memory to return a result and is not logged, retained, or shared.\n- Standard server access logs (IP address, timestamp, request path) may be recorded automatically by Hugging Face. These are not controlled by the project author. See [Hugging Face's privacy policy](https://huggingface.co/privacy) for details.\n- If you submit prompts containing personal data (e.g., names, financial details), you do so at your own risk. Do not send real personal data to the hosted demo. For sensitive work, [self-host](#self-hosting).\n- **Contact for data inquiries:** [mahale.ajay01@gmail.com](mailto:mahale.ajay01@gmail.com)\n- **Responsible person (§5 DDG, formerly §5 TMG / Impressum):** Ajay Pravin Mahale, student, German","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdesigner-coderajay%2Fglassbox-mech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdesigner-coderajay%2Fglassbox-mech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdesigner-coderajay%2Fglassbox-mech/lists"}