https://github.com/tegridydev/mechamap
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
- Host: GitHub
- URL: https://github.com/tegridydev/mechamap
- Owner: tegridydev
- License: MIT
- Created: 2024-12-27T03:10:33.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-04-25T01:00:13.000Z (29 days ago)
- Last Synced: 2025-05-07T04:41:37.841Z (17 days ago)
- Topics: explanability, llm-research, mechanistic-interpretability, open-source, python, research-tool, transformers
- Language: Python
- Homepage:
- Size: 27.3 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MechaMap - Tool for Mechanistic Interpretability (MI) Research
**MechaMap** is a scanner and analysis framework designed to help researchers figure out **"wtfisgoingon"** inside transformer-based language models. It quickly surfaces which neurons (across all layers) respond strongly to different semantic categories, such as *location*, *food*, *numeric*, or *animal*, based on a single forward pass over a master text.
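To make the idea concrete, here is a minimal sketch of that kind of single-pass scan using Hugging Face `transformers` and a plain PyTorch forward hook. It illustrates the general technique only; it is not MechaMap's actual code, and the master text, category tokens, and layer choice are placeholders.

```python
# Minimal sketch of the general idea (not MechaMap's actual implementation):
# hook one MLP output in GPT-2 and average its activations over the tokens
# of each semantic category. Text, categories, and layer are placeholders.
import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

master_text = "Paris and London are cities. Pizza and sushi are foods."
category_tokens = {
    "location": {"Paris", "London"},
    "food": {"Pizza", "pizza", "sushi"},
}

captured = {}
def mlp_hook(module, inputs, output):
    captured["acts"] = output.detach()        # shape: (batch, seq, hidden)

layer = 5                                     # arbitrary layer for this sketch
handle = model.h[layer].mlp.register_forward_hook(mlp_hook)

enc = tokenizer(master_text, return_tensors="pt")
with torch.no_grad():
    model(**enc)
handle.remove()

acts = captured["acts"][0]                    # (seq, hidden)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

for category, words in category_tokens.items():
    # Only exact single-token matches are counted here ("Ġ" marks a leading space)
    idx = [i for i, t in enumerate(tokens) if t.lstrip("Ġ") in words]
    if idx:
        mean_act = acts[idx].mean(dim=0)      # per-neuron average for this category
        top = torch.topk(mean_act, k=5)
        print(category, "strongest neurons:", top.indices.tolist())
```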
## What is Mechanistic Interpretability?
**Mechanistic Interpretability (MI)** is the discipline of **opening the black box** of large language models (and other neural networks) to understand the **underlying circuits**, **features** and/or **mechanisms** that give rise to specific behaviors. Instead of treating a model as a monolithic function, we:
1. **Trace** how input tokens propagate through attention heads, MLP layers, or neurons,
2. **Identify** localized “circuit motifs” or subsets of weights that implement certain tasks,
3. **Explain** how the distribution of learned parameters leads to emergent capabilities,
4. **Develop** methods to systematically break down or “edit” these circuits to confirm we understand the causal structure.

Mechanistic Interpretability aspires to yield **human-understandable explanations** of how advanced models represent and manipulate concepts like “zero,” “red,” “lion,” or “London.” By doing so, we gain:
- **Trust & Reliability**: More confidence in model outputs if we know the circuits behind them.
- **Safety & Alignment**: Early detection of harmful or unintended sub-circuits.
- **Debugging**: Efficient fixes or interventions if a model shows undesired behaviors.

> **Reference & Kudos**: This project owes a great deal to the insights from [Neel Nanda’s Mechanistic Interpretability Glossary](https://www.neelnanda.io/mechanistic-interpretability/glossary).
> Neel’s research and writing have significantly advanced the understanding of circuits and interpretability in large language models.

## Goals of MechaMap
- **Rapid Discovery**: Provide a **one-and-done** pass that highlights potentially interesting neurons—particularly those that strongly respond to high-level semantic categories (like “vehicle,” “food,” “numeric,” etc.).
- **Foundational Baseline**: Act as a **launchpad** for deeper Mechanistic Interpretability experiments. Once MechaMap flags certain “candidate neurons,” researchers can do single-neuron hooking or more advanced circuit analysis.
- **Usability**: Keep the scanning code straightforward, and produce easy-to-parse CSV/JSON files that can be quickly ingested into more advanced interpretability pipelines.
- **Transparency**: Centralize all category tokens, scanning thresholds, and master text within a single config. This fosters reproducibility and allows for quick expansions (adding more categories or new domain tokens); a hypothetical sketch of such a config follows this list.
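As an illustration, a centralized scan config along these lines might look like the following. The keys, values, and threshold semantics are hypothetical placeholders, not MechaMap's documented format.

```python
# Hypothetical illustration of a centralized scan config; MechaMap's actual
# config keys and format may differ. Everything here is a placeholder.
SCAN_CONFIG = {
    "model_name": "gpt2",
    "master_text": "Paris London pizza sushi seven 1999 lion sparrow red blue",
    "categories": {
        "location": ["Paris", "London"],
        "food": ["pizza", "sushi"],
        "numeric": ["seven", "1999"],
        "animal": ["lion", "sparrow"],
        "color": ["red", "blue"],
    },
    "activation_threshold": 0.5,    # flag neurons whose category-mean activation exceeds this
    "max_neurons_per_layer": None,  # set an int to partially scan very large models
    "output_csv": "mechamap_scan.csv",
}
```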
## Key Features

1. **Single-Pass “Master Text”**
   A single text that includes sample tokens from each category. MechaMap runs a single forward pass (hooked at MLP outputs) and computes each neuron's **average activation** on the tokens of each category.

2. **Customizable Categories**
   Default categories include *date*, *location*, *animal*, *food*, *numeric*, *language*, *vehicle*, and *color*, but you can easily add *sports*, *finance*, or any domain tokens you care about.

3. **Partial Scanning for Large Models**
   If a model is huge, you can limit the scan to the **first N neurons** per layer. That speeds up scanning while preserving the same analysis code.

4. **Top Tokens**
   For each neuron, MechaMap also saves a short list of the **top-activating tokens** from the text. This can reveal surprising structural triggers (like punctuation or stopwords).

5. **Interpretation Script**
   A separate `interpret_map.py` tool helps you parse the CSV, show pivot tables, detect multi-category overlaps, or compute correlations among categories; a rough sketch of that kind of analysis is shown below.
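For instance, a post-scan analysis in pandas might look roughly like this. The file name and column layout (one mean-activation column per category, plus `layer` and `neuron`) are assumptions for illustration; check `interpret_map.py` for the actual schema.

```python
# Rough sketch of post-scan analysis with pandas. The file name and column
# names ("layer", "neuron", one mean-activation column per category) are
# assumptions for illustration; see interpret_map.py for the real schema.
import pandas as pd

df = pd.read_csv("mechamap_scan.csv")

category_cols = ["location", "food", "numeric", "animal"]

# Pivot: mean category activation per layer, to see where signals concentrate
print(df.pivot_table(index="layer", values=category_cols, aggfunc="mean"))

# Correlations among categories: large off-diagonal values hint at
# "multi-domain" neurons that respond to several categories at once
print(df[category_cols].corr())

# Candidate multi-category neurons: strong on two or more categories
cutoff = df[category_cols].stack().quantile(0.95)
multi = df[(df[category_cols] > cutoff).sum(axis=1) >= 2]
print(multi[["layer", "neuron"] + category_cols])
```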
## Why Use MechaMap for Mechanistic Interpretability?

- **Discovery Layer**
  Instead of manually investigating thousands of neurons, MechaMap quickly flags where the biggest domain-sensitive signals might be happening, particularly in later layers or in “multi-domain” neurons that consistently appear across categories.

- **Hypothesis Generation**
  Once you find a neuron that strongly activates for *food* tokens, you can design follow-up tests (e.g., hooking that neuron in isolation, adding or removing certain tokens) to confirm if it truly encodes “foodness.”

- **Comparative Studies**
  Scan different models (e.g., `gpt2`, `EleutherAI/pythia-70m`) with the same master text and see whether they converge on similarly specialized neurons or use distinct “circuits”; a rough sketch of this kind of comparison appears after this list.

- **Extendable**
MechaMap is **config-based**: just add tokens to a category, or define new categories, and re-run. You can also adapt the master text to reflect your specific research interests (e.g., adding legal terms, chemistry tokens, or coding keywords).
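As a sketch of the comparative workflow, one could line up per-model scan outputs for a single category. The file names and columns below are assumptions, not MechaMap's documented output format.

```python
# Hypothetical sketch of a comparative study: line up per-model scan outputs
# for one category. File names and column names are assumptions, not
# MechaMap's documented output format.
import pandas as pd

models = ["gpt2", "EleutherAI/pythia-70m"]
per_model = []
for name in models:
    df = pd.read_csv(f"scan_{name.replace('/', '_')}.csv")   # one CSV per scanned model
    per_model.append(df.groupby("layer")["food"].mean().rename(name))

# Side-by-side view of the mean 'food' activation per layer in each model,
# useful for checking whether both models specialize in similar layers
print(pd.concat(per_model, axis=1))
```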