# Awesome Interpretability in Large Language Models
The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository collects relevant resources to help beginners get started quickly in this area and to help researchers keep up with the latest research progress.

This is an actively maintained repository; feel free to open a new issue if I have missed any relevant resources. If you have any questions or suggestions, please contact me via email: `[email protected]`.

---

Table of Contents
- [Awesome Interpretability Libraries](#awesome-interpretability-libraries)
- [Awesome Interpretability Blogs & Videos](#awesome-interpretability-blogs--videos)
- [Awesome Interpretability Tutorials](#awesome-interpretability-tutorials)
- [Awesome Interpretability Forums & Workshops](#awesome-interpretability-forums--workshops)
- [Awesome Interpretability Tools](#awesome-interpretability-tools)
- [Awesome Interpretability Programs](#awesome-interpretability-programs)
- [Awesome Interpretability Papers](#awesome-interpretability-papers)
  - [Survey Papers](#survey-papers)
  - [Position Papers](#position-papers)
  - [Interpretable Analysis of LLMs](#interpretable-analysis-of-llms)
  - [SAE, Dictionary Learning and Superposition](#sae-dictionary-learning-and-superposition)
  - [Interpretability in Vision LLMs](#interpretability-in-vision-llms)
  - [Benchmarking Interpretability](#benchmarking-interpretability)
  - [Enhancing Interpretability](#enhancing-interpretability)
  - [Others](#others)
- [Other Awesome Interpretability Resources](#other-awesome-interpretability-resources)

---

# Awesome Interpretability Libraries
- ![GitHub Repo stars](https://img.shields.io/github/stars/TransformerLensOrg/TransformerLens) [**TransformerLens**](https://github.com/TransformerLensOrg/TransformerLens): A Library for Mechanistic Interpretability of Generative Language Models (see the sketch after this list). ([Doc](https://transformerlensorg.github.io/TransformerLens/), [Tutorial](https://arena3-chapter1-transformer-interp.streamlit.app/[1.2]_Intro_to_Mech_Interp), [Demo](https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Main_Demo.ipynb))
- ![GitHub Repo stars](https://img.shields.io/github/stars/ndif-team/nnsight) [**nnsight**](https://github.com/ndif-team/nnsight): enables interpreting and manipulating the internals of deep learning models. ([Doc](https://nnsight.net/documentation/), [Tutorial](https://nnsight.net/tutorials/), [Paper](https://arxiv.org/abs/2407.14561))
- ![GitHub Repo stars](https://img.shields.io/github/stars/jbloomAus/SAELens) [**SAE Lens**](https://github.com/jbloomAus/SAELens): train and analyse sparse autoencoders (SAEs). ([Doc](https://jbloomaus.github.io/SAELens/), [Tutorial](https://github.com/jbloomAus/SAELens/tree/main/tutorials), [Blog](https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream))
- ![GitHub Repo stars](https://img.shields.io/github/stars/EleutherAI/sae) [**EleutherAI: sae**](https://github.com/EleutherAI/sae): train SAEs on very large models, based on the method and released code of the [OpenAI SAE paper](https://arxiv.org/abs/2406.04093v1).
- ![GitHub Repo stars](https://img.shields.io/github/stars/ArthurConmy/Automatic-Circuit-Discovery) [**Automatic Circuit DisCovery**](https://github.com/ArthurConmy/Automatic-Circuit-Discovery): automatically discovers circuits for mechanistic interpretability. ([Paper](https://arxiv.org/pdf/2304.14997), [Demo](https://colab.research.google.com/github/ArthurConmy/Automatic-Circuit-Discovery/blob/main/notebooks/colabs/ACDC_Main_Demo.ipynb))
- ![GitHub Repo stars](https://img.shields.io/github/stars/stanfordnlp/pyvene) [**Pyvene**](https://github.com/stanfordnlp/pyvene): A Library for Understanding and Improving PyTorch Models via Interventions. ([Paper](https://arxiv.org/pdf/2403.07809), [Demo](https://colab.research.google.com/github/stanfordnlp/pyvene/blob/main/pyvene_101.ipynb))
- ![GitHub Repo stars](https://img.shields.io/github/stars/stanfordnlp/pyreft) [**pyreft**](https://github.com/stanfordnlp/pyreft): A Powerful, Efficient and Interpretable fine-tuning method. ([Paper](https://arxiv.org/pdf/2404.03592), [Demo](https://colab.research.google.com/github/stanfordnlp/pyreft/blob/main/main_demo.ipynb))
- ![GitHub Repo stars](https://img.shields.io/github/stars/vgel/repeng) [**repeng**](https://github.com/vgel/repeng): A Python library for generating control vectors with representation engineering. ([Paper](https://arxiv.org/pdf/2310.01405), [Blog](https://vgel.me/posts/representation-engineering/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/google-deepmind/penzai) [**Penzai**](https://github.com/google-deepmind/penzai): a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. ([Paper](https://openreview.net/attachment?id=KVSgEXrMDH&name=pdf), [Doc](https://penzai.readthedocs.io/en/stable/), [Tutorial](https://penzai.readthedocs.io/en/stable/notebooks/how_to_think_in_penzai.html))
- ![GitHub Repo stars](https://img.shields.io/github/stars/rachtibat/LRP-eXplains-Transformers) [**LXT: LRP eXplains Transformers**](https://github.com/rachtibat/LRP-eXplains-Transformers): Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). ([Paper](https://arxiv.org/pdf/2402.05602), [Doc](https://lxt.readthedocs.io/en/latest/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/AlignmentResearch/tuned-lens) [**Tuned Lens**](https://github.com/AlignmentResearch/tuned-lens): Tools for understanding how transformer predictions are built layer-by-layer. ([Paper](https://arxiv.org/pdf/2303.08112), [Doc](https://tuned-lens.readthedocs.io/en/latest/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/inseq-team/inseq) [**Inseq**](https://github.com/inseq-team/inseq): Pytorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. ([Paper](http://arxiv.org/abs/2302.13942), [Doc](https://inseq.org/en/latest/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/shap/shap) [**shap**](https://github.com/shap/shap): Python library for computing SHAP feature / token importance for any black-box model. Works with Hugging Face, PyTorch, and TensorFlow models, including LLMs. ([Paper](https://papers.nips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html), [Doc](https://shap.readthedocs.io/en/latest/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/pytorch/captum) [**captum**](https://captum.ai/): Model interpretability and understanding library for PyTorch ([Paper](https://arxiv.org/abs/2009.07896), [Doc](https://captum.ai/docs/introduction))
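
Most of these libraries expose model internals through a hooked forward pass. As a quick orientation, here is a minimal sketch of caching and inspecting GPT-2 activations with TransformerLens; it assumes `transformer_lens` is installed and the GPT-2 weights can be downloaded (the prompt is an arbitrary example):

```python
# Minimal sketch: cache and inspect GPT-2 activations with TransformerLens.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small, with hooks attached

prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)  # forward pass that records every activation

# Attention pattern of layer 0: shape [batch, head, query_pos, key_pos]
print(cache["pattern", 0].shape)

# Greedy next-token prediction from the final position
next_id = logits[0, -1].argmax().item()
print(model.to_string(next_id))
```

The same cache-then-inspect loop underlies most of the tutorials linked above; SAE Lens, nnsight, and pyvene differ mainly in how the intervention points are declared.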

# Awesome Interpretability Blogs & Videos
- [A Barebones Guide to Mechanistic Interpretability Prerequisites](https://www.neelnanda.io/mechanistic-interpretability/prereqs)
- [Concrete Steps to Get Started in Transformer Mechanistic Interpretability](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
- [An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers](https://www.neelnanda.io/mechanistic-interpretability/favourite-papers)
- [An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2](https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite-1)
- [200 Concrete Open Problems in Mechanistic Interpretability](https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability)
- [3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning](https://youtu.be/wjZofJX0v4M?si=ZzfZh0kYLZMV8I-8)
- [3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning](https://youtu.be/eMlx5fFNoYc?si=6sEeo0CnCOnFWU0g)
- [3Blue1Brown: How might LLMs store facts | Chapter 7, Deep Learning](https://youtu.be/9-Jl0dxWQs8?si=xuf9XIV7AieZDOYA)

# Awesome Interpretability Tutorials
- ![GitHub Repo stars](https://img.shields.io/github/stars/callummcdougall/ARENA_3.0) [ARENA 3.0](https://github.com/callummcdougall/ARENA_3.0): hands-on exercises for learning mechanistic interpretability with TransformerLens.
- ![GitHub Repo stars](https://img.shields.io/github/stars/interpretingdl/eacl2024_transformer_interpretability_tutorial) [EACL24: Transformer-specific Interpretability](https://projects.illc.uva.nl/indeep/tutorial/) ([Github](https://github.com/interpretingdl/eacl2024_transformer_interpretability_tutorial))
- [ICML24: Physics of Language Models](https://physics.allen-zhu.com/home) ([Youtube](https://youtu.be/yBL7J0kgldU?si=KP0mlA7Oy0of2tUj))
- [NAACL24: Explanations in the Era of Large Language Models](https://explanation-llm.github.io/)

# Awesome Interpretability Forums & Workshops
- [AI Alignment Forum](https://www.alignmentforum.org/)
- [LessWrong](https://www.lesswrong.com/)
- [Mechanistic Interpretability Workshop 2024 ICML](https://icml2024mi.pages.dev/) ([Accepted papers](https://openreview.net/group?id=ICML.cc/2024/Workshop/MI#tab-accept-oral))
- [Attributing Model Behavior at Scale Workshop 2023 NeurIPS](https://attrib-workshop.cc/) ([Accepted papers](https://openreview.net/group?id=NeurIPS.cc/2023/Workshop/ATTRIB&referrer=%5BHomepage%5D(%2F)#tab-accept-oral))
- [BlackboxNLP 2023 EMNLP](https://blackboxnlp.github.io/2023/) ([Accepted papers](https://aclanthology.org/events/emnlp-2023/#2023blackboxnlp-1))

# Awesome Interpretability Tools
- ![GitHub Repo stars](https://img.shields.io/github/stars/openai/transformer-debugger) [Transformer Debugger](https://github.com/openai/transformer-debugger): investigate specific behaviors of small LLMs
- ![GitHub Repo stars](https://img.shields.io/github/stars/facebookresearch/llm-transparency-tool) [LLM Transparency Tool](https://github.com/facebookresearch/llm-transparency-tool) ([Demo](https://huggingface.co/spaces/facebook/llm-transparency-tool-demo))
- ![GitHub Repo stars](https://img.shields.io/github/stars/callummcdougall/sae_vis) [sae_vis](https://github.com/callummcdougall/sae_vis): a tool to replicate Anthropic's sparse autoencoder visualisations ([Demo](https://colab.research.google.com/drive/1oqDS35zibmL1IUQrk_OSTxdhcGrSS6yO?usp=drive_link))
- [**Neuronpedia**](https://www.neuronpedia.org/): an open platform for interpretability research. ([Doc](https://docs.neuronpedia.org/))
- ![GitHub Repo stars](https://img.shields.io/github/stars/FlorianDietz/comgra) [Comgra](https://github.com/FlorianDietz/comgra): A tool to analyze and debug neural networks in pytorch. Use a GUI to traverse the computation graph and view the data from many different angles at the click of a button. ([Paper](https://openreview.net/attachment?id=TcMmriVrgs&name=pdf))

# Awesome Interpretability Programs
- [**ML Alignment & Theory Scholars (MATS)**](https://www.matsprogram.org/): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.

# Awesome Interpretability Papers

## Survey Papers

| Title | Venue | Date | Code |
|:--------|:--------:|:--------:|:--------:|
|[**Attention Heads of Large Language Models: A Survey**](https://arxiv.org/pdf/2409.03752)| arXiv | 2024-09-06 | [Github](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads) |
|[**Internal Consistency and Self-Feedback in Large Language Models: A Survey**](https://arxiv.org/pdf/2407.14507)| arXiv | 2024-07-22 | [Github](https://github.com/IAAR-Shanghai/ICSFSurvey) [Paper List](https://www.yuque.com/zhiyu-n2wnm/ugzwgf/gmqfkfigd6xw26eg) |
| [**Relational Composition in Neural Networks: A Survey and Call to Action**](https://openreview.net/attachment?id=zzCEiUIPk9&name=pdf) | MechInterp@ICML | 2024-07-15 | - |
|[**From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP**](https://arxiv.org/pdf/2406.12618)| arXiv | 2024-06-18 | - |
|[**A Primer on the Inner Workings of Transformer-based Language Models**](https://arxiv.org/pdf/2405.00208)| arXiv | 2024-05-02 | - |
|[**Mechanistic Interpretability for AI Safety -- A Review**](https://arxiv.org/pdf/2404.14082)| arXiv | 2024-04-22 | - |
|[**From Understanding to Utilization: A Survey on Explainability for Large Language Models**](https://arxiv.org/pdf/2401.12874v2)| arXiv | 2024-02-22 | - |
|[**Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks**](https://arxiv.org/pdf/2207.13243)| arXiv | 2023-08-18 | - |

## Position Papers

| Title | Venue | Date | Code |
|:--------|:--------:|:--------:|:--------:|
|[**Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience**](https://openreview.net/pdf?id=66KmnMhGU5)| ICML | 2024-06-25 | - |
|[**Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience**](https://arxiv.org/pdf/2406.01352v1)| ICML | 2024-06-03 | - |
|[**Interpretability Needs a New Paradigm**](https://arxiv.org/pdf/2405.05386v1)| arXiv | 2024-05-08 | - |
|[**Position Paper: Toward New Frameworks for Studying Model Representations**](https://arxiv.org/pdf/2402.03855v1)| arXiv | 2024-02-06 | - |
|[**Rethinking Interpretability in the Era of Large Language Models**](https://arxiv.org/pdf/2402.01761v1)| arXiv | 2024-01-30 | - |

## Interpretable Analysis of LLMs
| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models**](https://arxiv.org/abs/2408.02416) | - | 2024-08-05 | [Github](https://github.com/liangzid/PromptExtractionEval) | - |
| [**Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically**](https://openreview.net/attachment?id=YwLgSimUIT&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/JasonGross/guarantees-based-mechanistic-interpretability) [**Compact Proofs of Model Performance via Mechanistic Interpretability**](https://openreview.net/attachment?id=4B5Ovl9MLE&name=pdf) | MechInterp@ICML | 2024-07-15 | [Github](https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/) | - |
| [**Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks**](https://openreview.net/attachment?id=gz0r3w71zQ&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching**](https://openreview.net/attachment?id=0ku2hIm4BS&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models**](https://openreview.net/attachment?id=DRrzq93Y5Y&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**What Makes and Breaks Safety Fine-tuning? Mechanistic Study**](https://openreview.net/attachment?id=BS2CbUkJpy&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Using Degeneracy in the Loss Landscape for Mechanistic Interpretability**](https://openreview.net/attachment?id=D8MDzUVlWA&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Loss in the Crowd: Hidden Breakthroughs in Language Model Training**](https://openreview.net/attachment?id=Os3z6Oczuu&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Robust Knowledge Unlearning via Mechanistic Localizations**](https://openreview.net/attachment?id=06pNzrEjnH&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Language Models Linearly Represent Sentiment**](https://openreview.net/attachment?id=Xsf6dOOMMc&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/hannamw/eap-ig-faithfulness) [**Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms**](https://openreview.net/attachment?id=grXgesr5dT&name=pdf) | MechInterp@ICML | 2024-07-15 | [Github](https://github.com/hannamw/eap-ig-faithfulness) | - |
| [**Learning and Unlearning of Fabricated Knowledge in Language Models**](https://openreview.net/attachment?id=R5Q5lANcjY&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Faithful and Fast Influence Function via Advanced Sampling**](https://openreview.net/attachment?id=TTVPbaxXjR&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Hypothesis Testing the Circuit Hypothesis in LLMs**](https://openreview.net/attachment?id=ibSNv9cldu&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/KihoPark/LLM_Categorical_Hierarchical_Representations) [**The Geometry of Categorical and Hierarchical Concepts in Large Language Models**](https://openreview.net/attachment?id=KXuYjuBzKo&name=pdf) | MechInterp@ICML | 2024-07-15 | [Github](https://github.com/KihoPark/LLM_Categorical_Hierarchical_Representations) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/huangxt39/InversionView) [**InversionView: A General-Purpose Method for Reading Information from Neural Activations**](https://openreview.net/attachment?id=P7MW0FahEq&name=pdf) | MechInterp@ICML | 2024-07-15 | [Github](https://github.com/huangxt39/InversionView) | - |
| [**Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks**](https://openreview.net/attachment?id=pJs3ZiKBM5&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning**](https://arxiv.org/pdf/2407.03779) | arXiv | 2024-07-04 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/Betswish/MIRAGE) [**Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation**](https://arxiv.org/pdf/2406.13663) | arXiv | 2024-07-01 | [Github](https://github.com/Betswish/MIRAGE) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/eliahuhorwitz/Spectral-DeTuning) [**Recovering the Pre-Fine-Tuning Weights of Generative Models**](https://arxiv.org/pdf/2402.10208) | ICML | 2024-07-01 | [Github](https://github.com/eliahuhorwitz/Spectral-DeTuning) | [Blog](https://vision.huji.ac.il/spectral_detuning/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/sfeucht/footprints) [**Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs**](https://arxiv.org/pdf/2406.20086) | arXiv | 2024-06-28 | [Github](https://github.com/sfeucht/footprints) | [Blog](https://footprints.baulab.info/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/jacobdunefsky/ObservablePropagation) [**Observable Propagation: Uncovering Feature Vectors in Transformers**](https://openreview.net/pdf?id=ETNx4SekbY) | ICML | 2024-06-25 | [Github](https://github.com/jacobdunefsky/ObservablePropagation) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/DanielSc4/Dynamic-Activation-Composition) [**Multi-property Steering of Large Language Models with Dynamic Activation Composition**](https://arxiv.org/pdf/2406.17563) | arXiv | 2024-06-25 | [Github](https://github.com/DanielSc4/Dynamic-Activation-Composition) | - |
| [**What Do the Circuits Mean? A Knowledge Edit View**](https://arxiv.org/pdf/2406.17241) | arXiv | 2024-06-25 | - | - |
| [**Confidence Regulation Neurons in Language Models**](https://arxiv.org/pdf/2406.16254) | arXiv | 2024-06-24 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/JasonGross/guarantees-based-mechanistic-interpretability) [**Compact Proofs of Model Performance via Mechanistic Interpretability**](https://arxiv.org/pdf/2406.11779) | arXiv | 2024-06-24 | [Github](https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/BatsResearch/cross-lingual-detox) [**Preference Tuning For Toxicity Mitigation Generalizes Across Languages**](https://arxiv.org/pdf/2406.16235) | arXiv | 2024-06-23 | [Github](https://github.com/BatsResearch/cross-lingual-detox) | - |
| [**Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models**](https://arxiv.org/pdf/2406.16033) | arXiv | 2024-06-23 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/dhgottesman/keen_estimating_knowledge_in_llms) [**Estimating Knowledge in Large Language Models Without Generating a Single Token**](https://arxiv.org/pdf/2406.12673) | arXiv | 2024-06-18 | [Github](https://github.com/dhgottesman/keen_estimating_knowledge_in_llms) | - |
| [**Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations**](https://arxiv.org/pdf/2403.18167v2) | arXiv | 2024-06-17 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/jacobdunefsky/transcoder_circuits) [**Transcoders Find Interpretable LLM Feature Circuits**](https://arxiv.org/pdf/2406.11944) | MechInterp@ICML | 2024-06-17 | [Github](https://github.com/jacobdunefsky/transcoder_circuits) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/JasonForJoy/Model-Editing-Hurt) [**Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue**](https://arxiv.org/pdf/2401.04700) | arXiv | 2024-06-16 | [Github](https://github.com/JasonForJoy/Model-Editing-Hurt) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/kdu4108/measureLM) [**Context versus Prior Knowledge in Language Models**](https://arxiv.org/pdf/2404.04633) | ACL | 2024-06-16 | [Github](https://github.com/kdu4108/measureLM) | - |
| [**Talking Heads: Understanding Inter-layer Communication in Transformer Language Models**](https://arxiv.org/pdf/2406.09519) | arXiv | 2024-06-13 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/FarnoushRJ/MambaLRP) [**MambaLRP: Explaining Selective State Space Sequence Models**](https://arxiv.org/pdf/2406.07592) | arXiv | 2024-06-11 | [Github](https://github.com/FarnoushRJ/MambaLRP) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/PAIR-code/interpretability) [**Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models**](https://arxiv.org/pdf/2401.06102) | ICML | 2024-06-06 | [Github](https://github.com/PAIR-code/interpretability/tree/master/patchscopes/code) | [Blog](https://pair-code.github.io/interpretability/patchscopes/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/francescortu/comp-mech) [**Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals**](https://arxiv.org/pdf/2402.11655) | ACL | 2024-06-06 | [Github](https://github.com/francescortu/comp-mech) | - |
| [**Learned feature representations are biased by complexity, learning order, position, and more**](https://arxiv.org/pdf/2405.05847) | arXiv | 2024-06-06 | [Demo](https://gist.github.com/lampinen-dm/b6541019ef4cf2988669ab44aa82460b) | - |
| [**Iteration Head: A Mechanistic Study of Chain-of-Thought**](https://arxiv.org/pdf/2406.02128v1) | arXiv | 2024-06-05 | - | - |
| [**Activation Addition: Steering Language Models Without Optimization**](https://arxiv.org/pdf/2308.10248) | arXiv | 2024-06-04 | [Code](https://zenodo.org/records/8215277) | - |
| [**Interpretability Illusions in the Generalization of Simplified Models**](https://arxiv.org/pdf/2312.03656) | arXiv | 2024-06-04 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/k-amara/syntax-shap) [**SyntaxShap: Syntax-aware Explainability Method for Text Generation**](https://arxiv.org/pdf/2402.09259) | arXiv | 2024-06-03 | [Github](https://github.com/k-amara/syntax-shap) | [Blog](https://syntaxshap.ivia.ch/) |
| [**Calibrating Reasoning in Language Models with Internal Consistency**](https://arxiv.org/pdf/2405.18711) | arXiv | 2024-05-29 | - | - |
| [**Black-Box Access is Insufficient for Rigorous AI Audits**](https://arxiv.org/pdf/2401.14446) | FAccT | 2024-05-29 | - | - |
| [**Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting**](https://arxiv.org/pdf/2406.00053) | arXiv | 2024-05-28 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/samuelperezdi/nuclr-icml) [**From Neurons to Neutrons: A Case Study in Interpretability**](https://arxiv.org/pdf/2405.17425) | ICML | 2024-05-27 | [Github](https://github.com/samuelperezdi/nuclr-icml) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/OSU-NLP-Group/GrokkedTransformer) [**Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization**](https://arxiv.org/pdf/2405.15071) | MechInterp@ICML | 2024-05-27 | [Github](https://github.com/OSU-NLP-Group/GrokkedTransformer) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/starship006/backup_research) [**Explorations of Self-Repair in Language Models**](https://arxiv.org/pdf/2402.15390v2) | ICML | 2024-05-26 | [Github](https://github.com/starship006/backup_research) | - |
| [**Emergence of a High-Dimensional Abstraction Phase in Language Transformers**](https://arxiv.org/pdf/2405.15471) | arXiv | 2024-05-24 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ruizheliUOA/Anchored_Bias_GPT2) [**Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions**](https://arxiv.org/pdf/2405.03205v2) | arXiv | 2024-05-23 | [Github](https://github.com/ruizheliUOA/Anchored_Bias_GPT2) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/JoshEngels/MultiDimensionalFeatures) [**Not All Language Model Features Are Linear**](https://arxiv.org/pdf/2405.14860) | arXiv | 2024-05-23 | [Github](https://github.com/JoshEngels/MultiDimensionalFeatures) | - |
| [**Using Degeneracy in the Loss Landscape for Mechanistic Interpretability**](https://arxiv.org/pdf/2405.10927) | arXiv | 2024-05-20 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/AIRI-Institute/LLM-Microscope) [**Your Transformer is Secretly Linear**](https://arxiv.org/pdf/2405.12250) | arXiv | 2024-05-19 | [Github](https://github.com/AIRI-Institute/LLM-Microscope) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/AndreasMadsen/llm-introspection) [**Are self-explanations from Large Language Models faithful?**](https://arxiv.org/pdf/2401.07927) | ACL | 2024-05-16 | [Github](https://github.com/AndreasMadsen/llm-introspection) | - |
| [**Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models**](https://arxiv.org/pdf/2402.04614) | arXiv | 2024-05-14 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/nrimsky/CAA) [**Steering Llama 2 via Contrastive Activation Addition**](https://arxiv.org/pdf/2312.06681) | arXiv | 2024-05-07 | [Github](https://github.com/nrimsky/CAA) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/jgcarrasco/acronyms_paper) [**How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability**](https://arxiv.org/pdf/2405.04156) | AISTATS | 2024-05-07 | [Github](https://github.com/jgcarrasco/acronyms_paper) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/joykirat18/How-To-Think-Step-by-Step) [**How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning**](https://arxiv.org/pdf/2402.18312) | arXiv | 2024-05-06 | [Github](https://github.com/joykirat18/How-To-Think-Step-by-Step) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/jmerullo/circuit_reuse) [**Circuit Component Reuse Across Tasks in Transformer Language Models**](https://arxiv.org/pdf/2310.08744) | ICLR | 2024-05-06 | [Github](https://github.com/jmerullo/circuit_reuse) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/DFKI-NLP/LLMCheckup) [**LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations**](https://arxiv.org/pdf/2401.12576) | HCI+NLP@NAACL | 2024-04-24 | [Github](https://github.com/DFKI-NLP/LLMCheckup) | - |
| [**How to use and interpret activation patching**](https://arxiv.org/pdf/2404.15255v1) | arXiv | 2024-04-23 | - | - |
| [**Understanding Addition in Transformers**](https://arxiv.org/pdf/2310.13121v9) | arXiv | 2024-04-23 | - | - |
| [**Towards Uncovering How Large Language Model Works: An Explainability Perspective**](https://arxiv.org/pdf/2402.10688) | arXiv | 2024-04-15 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/aadityasingh/icl-dynamics) [**What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation**](https://arxiv.org/pdf/2404.07129) | ICML | 2024-04-10 | [Github](https://github.com/aadityasingh/icl-dynamics) | - |
| [**Does Transformer Interpretability Transfer to RNNs?**](https://arxiv.org/pdf/2404.05971v1) | arXiv | 2024-04-09 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/arnab-api/romba) [**Locating and Editing Factual Associations in Mamba**](https://arxiv.org/pdf/2404.03646) | arXiv | 2024-04-04 | [Github](https://github.com/arnab-api/romba) | [Demo](https://github.com/arnab-api/romba/tree/master/demo) |
| [**Eliciting Latent Knowledge from Quirky Language Models**](https://arxiv.org/pdf/2312.01037) | ME-FoMo@ICLR | 2024-04-03 | - | - |
| [**Do language models plan ahead for future tokens?**](https://arxiv.org/pdf/2404.00859) | arXiv | 2024-04-01 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/saprmarks/feature-circuits) [**Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models**](https://arxiv.org/pdf/2403.19647v2) | arXiv | 2024-03-31 | [Github](https://github.com/saprmarks/feature-circuits) | [Demo](https://feature-circuits.xyz/) |
| [**Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms**](https://arxiv.org/pdf/2403.17806) | arXiv | 2024-03-26 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/frankniujc/kn_thesis) [**What does the Knowledge Neuron Thesis Have to do with Knowledge?**](https://openreview.net/pdf?id=2HJRwwbV3G) | ICLR | 2024-03-16 | [Github](https://github.com/frankniujc/kn_thesis) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/wesg52/world-models) [**Language Models Represent Space and Time**](https://arxiv.org/pdf/2310.02207) | ICLR | 2024-03-04 | [Github](https://github.com/wesg52/world-models) | - |
| [**AtP\*: An efficient and scalable method for localizing LLM behaviour to components**](https://arxiv.org/pdf/2403.00745) | arXiv | 2024-03-01 | - | - |
| [**A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task**](https://arxiv.org/pdf/2402.11917) | arXiv | 2024-02-28 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ericwtodd/function_vectors) [**Function Vectors in Large Language Models**](https://arxiv.org/pdf/2310.15213) | ICLR | 2024-02-25 | [Github](https://github.com/ericwtodd/function_vectors) | [Blog](https://functions.baulab.info/) |
| [**A Language Model's Guide Through Latent Space**](https://arxiv.org/pdf/2402.14433) | arXiv | 2024-02-22 | - | - |
| [**Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model**](https://arxiv.org/pdf/2311.04131v3) | arXiv | 2024-02-22 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/Nix07/finetuning) [**Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking**](https://arxiv.org/pdf/2402.14811) | ICLR | 2024-02-22 | [Github](https://github.com/Nix07/finetuning) | [Blog](https://finetuning.baulab.info/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/abhika-m/FAVA) [**Fine-grained Hallucination Detection and Editing for Language Models**](https://arxiv.org/pdf/2401.06855) | arXiv | 2024-02-21 | [Github](https://github.com/abhika-m/FAVA) | [Blog](https://fine-grained-hallucination.github.io/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/AnasHimmi/Hallucination-Detection-Score-Aggregation) [**Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation**](https://arxiv.org/pdf/2402.13331) | arXiv | 2024-02-20 | [Github](https://github.com/AnasHimmi/Hallucination-Detection-Score-Aggregation) | - |
| [**Identifying Semantic Induction Heads to Understand In-Context Learning**](https://arxiv.org/pdf/2402.13055) | arXiv | 2024-02-20 | - | - |
| [**Backward Lens: Projecting Language Model Gradients into the Vocabulary Space**](https://arxiv.org/pdf/2402.12865) | arXiv | 2024-02-20 | - | - |
| [**Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models**](https://arxiv.org/pdf/2402.07543) | ACML | 2024-02-12 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/john-hewitt/model-editing-canonical-examples) [**Model Editing with Canonical Examples**](https://arxiv.org/pdf/2402.06155) | arXiv | 2024-02-09 | [Github](https://github.com/john-hewitt/model-editing-canonical-examples) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ejmichaud/neural-verification) [**Opening the AI black box: program synthesis via mechanistic interpretability**](https://arxiv.org/pdf/2402.05110) | arXiv | 2024-02-07 | [Github](https://github.com/ejmichaud/neural-verification) | - |
| [**INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection**](https://arxiv.org/pdf/2402.03744) | ICLR | 2024-02-06 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/berlino/seq_icl) [**In-Context Language Learning: Architectures and Algorithms**](https://arxiv.org/pdf/2401.12973) | arXiv | 2024-01-30 | [Github](https://github.com/berlino/seq_icl) | - |
| [**Gradient-Based Language Model Red Teaming**](https://arxiv.org/pdf/2401.16656) | EACL | 2024-01-30 | [Github](https://github.com/google-research/google-research/tree/master/gbrt) | - |
| [**The Calibration Gap between Model and Human Confidence in Large Language Models**](https://arxiv.org/pdf/2401.13835) | arXiv | 2024-01-24 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/wesg52/universal-neurons) [**Universal Neurons in GPT2 Language Models**](https://arxiv.org/pdf/2401.12181) | arXiv | 2024-01-22 | [Github](https://github.com/wesg52/universal-neurons) | - |
| [**The mechanistic basis of data dependence and abrupt learning in an in-context classification task**](https://openreview.net/pdf?id=aN4Jf6Cx69) | ICLR | 2024-01-16 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/dannyallover/overthinking_the_truth) [**Overthinking the Truth: Understanding how Language Models Process False Demonstrations**](https://openreview.net/pdf?id=Tigr1kMDZy) | ICLR | 2024-01-16 | [Github](https://github.com/dannyallover/overthinking_the_truth) | - |
| [**Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks**](https://openreview.net/pdf?id=A0HKeKl4Nl) | ICLR | 2024-01-16 | - | - |
| [**Feature emergence via margin maximization: case studies in algebraic tasks**](https://openreview.net/pdf?id=i9wDX850jR) | ICLR | 2024-01-16 | - | - |
| [**Successor Heads: Recurring, Interpretable Attention Heads In The Wild**](https://openreview.net/pdf?id=kvcbV8KQsi) | ICLR | 2024-01-16 | - | - |
| [**Towards Best Practices of Activation Patching in Language Models: Metrics and Methods**](https://openreview.net/pdf?id=Hf17y6u9BC) | ICLR | 2024-01-16 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ajyl/dpo_toxic) [**A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity**](https://arxiv.org/pdf/2401.01967) | ICML | 2024-01-03 | [Github](https://github.com/ajyl/dpo_toxic) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ed1d1a8d/prompt-injection-interp) [**Forbidden Facts: An Investigation of Competing Objectives in Llama-2**](https://arxiv.org/pdf/2312.08793) | ATTRIB@NeurIPS | 2023-12-31 | [Github](https://github.com/ed1d1a8d/prompt-injection-interp) | [Blog](https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/saprmarks/geometry-of-truth) [**The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets**](https://arxiv.org/pdf/2310.06824) | arXiv | 2023-12-08 | [Github](https://github.com/saprmarks/geometry-of-truth) | [Blog](https://saprmarks.github.io/geometry-of-truth/dataexplorer/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/amakelov/activation-patching-illusion) [**Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching**](https://arxiv.org/pdf/2311.17030) | ATTRIB@NeurIPS | 2023-12-06 | [Github](https://github.com/amakelov/activation-patching-illusion) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/understanding-search/structured-representations-maze-transformers) [**Structured World Representations in Maze-Solving Transformers**](https://arxiv.org/pdf/2312.02566) | UniReps@NeurIPS | 2023-12-05 | [Github](https://github.com/understanding-search/structured-representations-maze-transformers) | - |
| [**Generating Interpretable Networks using Hypernetworks**](https://arxiv.org/pdf/2312.03051) | arXiv | 2023-12-05 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/fjzzq2002/pizza) [**The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks**](https://arxiv.org/pdf/2306.17844) | NeurIPS | 2023-11-21 | [Github](https://github.com/fjzzq2002/pizza) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/Aaquib111/edge-attribution-patching) [**Attribution Patching Outperforms Automated Circuit Discovery**](https://arxiv.org/pdf/2310.10348) | ATTRIB@NeurIPS | 2023-11-20 | [Github](https://github.com/Aaquib111/edge-attribution-patching) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/google-deepmind/tracr) [**Tracr: Compiled Transformers as a Laboratory for Interpretability**](https://arxiv.org/pdf/2301.05062) | NeurIPS | 2023-11-03 | [Github](https://github.com/google-deepmind/tracr) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/hannamw/gpt2-greater-than) [**How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model**](https://arxiv.org/pdf/2305.00586) | NeurIPS | 2023-11-02 | [Github](https://github.com/hannamw/gpt2-greater-than) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/princeton-nlp/TransformerPrograms) [**Learning Transformer Programs**](https://arxiv.org/pdf/2306.01128) | NeurIPS | 2023-10-31 | [Github](https://github.com/princeton-nlp/TransformerPrograms) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ArthurConmy/Automatic-Circuit-Discovery) [**Towards Automated Circuit Discovery for Mechanistic Interpretability**](https://arxiv.org/pdf/2304.14997) | NeurIPS | 2023-10-28 | [Github](https://github.com/ArthurConmy/Automatic-Circuit-Discovery) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/yifan-h/MechanisticProbe) [**Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models**](https://arxiv.org/pdf/2310.14491) | EMNLP | 2023-10-23 | [Github](https://github.com/yifan-h/MechanisticProbe) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/likenneth/honest_llama) [**Inference-Time Intervention: Eliciting Truthful Answers from a Language Model**](https://arxiv.org/pdf/2306.03341) | NeurIPS | 2023-10-20 | [Github](https://github.com/likenneth/honest_llama) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/mechanistic-interpretability-grokking/progress-measures-paper) [**Progress measures for grokking via mechanistic interpretability**](https://arxiv.org/pdf/2301.05217) | ICLR | 2023-10-19 | [Github](https://github.com/mechanistic-interpretability-grokking/progress-measures-paper) | [Blog](https://www.neelnanda.io/grokking-paper) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/callummcdougall/SERI-MATS-2023-Streamlit-pages) [**Copy Suppression: Comprehensively Understanding an Attention Head**](https://arxiv.org/pdf/2310.04625) | arXiv | 2023-10-06 | [Github](https://github.com/callummcdougall/SERI-MATS-2023-Streamlit-pages) | [Blog & Demo](https://copy-suppression.streamlit.app/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/google/belief-localization) [**Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models**](https://openreview.net/pdf?id=EldbUlZtbd) | NeurIPS | 2023-09-21 | [Github](https://github.com/google/belief-localization) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/frankaging/align-transformers) [**Interpretability at Scale: Identifying Causal Mechanisms in Alpaca**](https://openreview.net/pdf?id=nRfClnMhVX) | NeurIPS | 2023-09-21 | [Github](https://github.com/frankaging/align-transformers) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ajyl/mech_int_othelloGPT) [**Emergent Linear Representations in World Models of Self-Supervised Sequence Models**](https://arxiv.org/pdf/2309.00941) | BlackboxNLP@EMNLP | 2023-09-07 | [Github](https://github.com/ajyl/mech_int_othelloGPT) | [Blog](https://www.alignmentforum.org/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/wesg52/sparse-probing-paper) [**Finding Neurons in a Haystack: Case Studies with Sparse Probing**](https://arxiv.org/pdf/2305.01610) | arXiv | 2023-06-02 | [Github](https://github.com/wesg52/sparse-probing-paper) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/yangalan123/Amortized-Interpretability) [**Efficient Shapley Values Estimation by Amortization for Text Classification**](https://arxiv.org/pdf/2305.19998) | ACL | 2023-05-31 | [Github](https://github.com/yangalan123/Amortized-Interpretability) | [Video](https://youtu.be/CsCRU8Hzpms?si=LMY4_TvEzi5D88OR) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/bilal-chughtai/rep-theory-mech-interp) [**A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations**](https://arxiv.org/pdf/2302.03025) | ICML | 2023-05-24 | [Github](https://github.com/bilal-chughtai/rep-theory-mech-interp) | - |
| [**Localizing Model Behavior with Path Patching**](https://arxiv.org/pdf/2304.05969) | arXiv | 2023-05-16 | - | - |
| [**Language models can explain neurons in language models**](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html) | OpenAI | 2023-05-09 | - | - |
| [**N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models**](https://arxiv.org/pdf/2304.12918) | ICLR Workshop | 2023-04-22 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/redwoodresearch/Easy-Transformer) [**Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small**](https://arxiv.org/pdf/2211.00593) | ICLR | 2023-01-20 | [Github](https://github.com/redwoodresearch/Easy-Transformer) | - |
| [**Interpreting Neural Networks through the Polytope Lens**](https://arxiv.org/pdf/2211.12312) | arXiv | 2022-11-22 | - | - |
| [**Scaling Laws and Interpretability of Learning from Repeated Data**](https://arxiv.org/pdf/2205.10487) | arXiv | 2022-05-21 | - | - |
| [**In-context Learning and Induction Heads**](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) | Anthropic | 2022-03-08 | - | - |
| [**A Mathematical Framework for Transformer Circuits**](https://transformer-circuits.pub/2021/framework/index.html) | Anthropic | 2021-12-22 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/tech-srl/RASP) [**Thinking Like Transformers**](https://arxiv.org/pdf/2106.06981) | ICML | 2021-07-19 | [Github](https://github.com/tech-srl/RASP) | [Mini Tutorial](https://docs.google.com/presentation/d/1oIPHP_7qjsrnrDb3kdZIUZt-wQofkiQl/edit?usp=sharing&ouid=111912319459945992784&rtpof=true&sd=true) |
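
Many of the entries above (the activation-patching best-practices papers, the IOI circuit, attribution patching) revolve around one primitive: re-running the model on a corrupted input while splicing in activations cached from a clean run. Below is a minimal sketch of that primitive using TransformerLens hooks; the prompts, layer, and patched site are illustrative choices, not taken from any specific paper:

```python
# Minimal activation-patching sketch (illustrative): patch one residual-stream
# site from a clean run into a corrupted run and read off the logit effect.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean = "When Mary and John went to the store, John gave a drink to"
corrupt = "When Mary and John went to the store, Mary gave a drink to"
_, clean_cache = model.run_with_cache(clean)

layer, pos = 9, -1  # hypothetical site: residual stream at layer 9, final token
hook_name = utils.get_act_name("resid_pre", layer)

def patch_resid(resid, hook):
    # Overwrite the corrupted-run activation at `pos` with the clean one.
    resid[:, pos, :] = clean_cache[hook_name][:, pos, :]
    return resid

patched = model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, patch_resid)])
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print("patched logit diff:", (patched[0, -1, mary] - patched[0, -1, john]).item())
```

Sweeping `layer` and `pos` over the whole grid and plotting the resulting logit differences recovers the standard patching heatmaps used throughout these papers.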

## SAE, Dictionary Learning and Superposition

| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task**](https://openreview.net/attachment?id=JdrVuEQih5&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models**](https://openreview.net/attachment?id=qzsDKwGJyB&name=pdf) | MechInterp@ICML | 2024-07-15 | - | - |
| [**Interpreting Attention Layer Outputs with Sparse Autoencoders**](https://arxiv.org/pdf/2406.17759v1) | MechInterp@ICML | 2024-06-25 | - | [Demo](https://robertzk.github.io/circuit-explorer) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ApolloResearch/e2e_sae) [**Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning**](https://arxiv.org/pdf/2405.12241) | MechInterp@ICML | 2024-05-24 | [Github](https://github.com/ApolloResearch/e2e_sae) | - |
| [**Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis**](https://arxiv.org/pdf/2405.14277) | arXiv | 2024-05-23 | - | - |
| [**Automatically Identifying Local and Global Circuits with Linear Computation Graphs**](https://arxiv.org/pdf/2405.13868v1) | arXiv | 2024-05-22 | - | - |
| [**Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet**](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) | Anthropic | 2024-05-21 | - | [Demo](https://transformer-circuits.pub/2024/scaling-monosemanticity/umap.html?targetId=34m_31164353) |
| [**Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models**](https://arxiv.org/pdf/2405.12522) | arXiv | 2024-05-21 | - | - |
| [**Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control**](https://arxiv.org/pdf/2405.08366) | arXiv | 2024-05-20 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/ApolloResearch/rib) [**The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks**](https://arxiv.org/pdf/2405.10928) | arXiv | 2024-05-20 | [Github](https://github.com/ApolloResearch/rib) | - |
| [**Improving Dictionary Learning with Gated Sparse Autoencoders**](https://arxiv.org/pdf/2404.16014v2) | arXiv | 2024-04-30 | - | - |
| [**Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers**](https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/towards-multimodal-interpretability-learning-sparse-2) | LessWrong | 2024-04-29 | - | [Demo](https://sae-explorer.streamlit.app/) |
| [**Activation Steering with SAEs**](https://www.lesswrong.com/posts/C5KAZQib3bzzpeyrg/full-post-progress-update-1-from-the-gdm-mech-interp-team#Activation_Steering_with_SAEs) | LessWrong | 2024-04-19 | - | - |
| [**SAE reconstruction errors are (empirically) pathological**](https://www.lesswrong.com/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological) | LessWrong | 2024-03-29 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/evanhanders/superposition-geometry-toys) [**Sparse autoencoders find composed features in small toy models**](https://www.lesswrong.com/posts/a5wwqza2cY3W7L9cj/sparse-autoencoders-find-composed-features-in-small-toy) | LessWrong | 2024-03-14 | [Github](https://github.com/evanhanders/superposition-geometry-toys) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/RobertHuben/othellogpt_sparse_autoencoders) [**Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT**](https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-report-sparse-autoencoders-find-only-9-180-board) | LessWrong | 2024-03-05 | [Github](https://github.com/RobertHuben/othellogpt_sparse_autoencoders) | - |
| [**Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT**](https://arxiv.org/pdf/2402.12201) | arXiv | 2024-02-19 | - | - |
| [**Do sparse autoencoders find "true features"?**](https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features) | LessWrong | 2024-02-12 | - | - |
| [**Toward A Mathematical Framework for Computation in Superposition**](https://www.lesswrong.com/posts/2roZtSr5TGmLjXMnT/toward-a-mathematical-framework-for-computation-in) | LessWrong | 2024-01-18 | - | - |
| [**Sparse Autoencoders Work on Attention Layer Outputs**](https://www.lesswrong.com/posts/DtdzGwFh9dCfsekZZ/sparse-autoencoders-work-on-attention-layer-outputs) | LessWrong | 2024-01-16 | - | [Demo](https://colab.research.google.com/drive/10zBOdozYR2Aq2yV9xKs-csBH2olaFnsq?usp=sharing) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/HoagyC/sparse_coding) [**Sparse Autoencoders Find Highly Interpretable Features in Language Models**](https://openreview.net/pdf?id=F76bwRSLeK) | ICLR | 2024-01-16 | [Github](https://github.com/HoagyC/sparse_coding) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/taufeeque9/codebook-features) [**Codebook Features: Sparse and Discrete Interpretability for Neural Networks**](https://arxiv.org/pdf/2310.17230) | arXiv | 2023-10-26 | [Github](https://github.com/taufeeque9/codebook-features) | [Demo](https://colab.research.google.com/github/taufeeque9/codebook-features/blob/main/tutorials/code_intervention.ipynb) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/neelnanda-io/1L-Sparse-Autoencoder) [**Towards Monosemanticity: Decomposing Language Models With Dictionary Learning**](https://transformer-circuits.pub/2023/monosemantic-features/index.html) | Anthropic | 2023-10-04 | [Github](https://github.com/neelnanda-io/1L-Sparse-Autoencoder) | [Demo-1](https://transformer-circuits.pub/2023/monosemantic-features/vis/index.html), [Demo-2](https://transformer-circuits.pub/2023/monosemantic-features/vis/a1.html), [Tutorial](https://colab.research.google.com/drive/1u8larhpxy8w4mMsJiSBddNOzFGj7_RTn?usp=sharing) |
| [**Polysemanticity and Capacity in Neural Networks**](https://arxiv.org/pdf/2210.01892) | arXiv | 2023-07-12 | - | - |
| [**Distributed Representations: Composition & Superposition**](https://transformer-circuits.pub/2023/superposition-composition/index.html) | Anthropic | 2023-05-04 | - | - |
| [**Superposition, Memorization, and Double Descent**](https://transformer-circuits.pub/2023/toy-double-descent/index.html) | Anthropic | 2023-01-05 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/adamjermyn/toy_model_interpretability) [**Engineering Monosemanticity in Toy Models**](https://arxiv.org/pdf/2211.09169) | arXiv | 2022-11-16 | [Github](https://github.com/adamjermyn/toy_model_interpretability) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/anthropics/toy-models-of-superposition) [**Toy Models of Superposition**](https://transformer-circuits.pub/2022/toy_model/index.html) | Anthropic | 2022-09-14 | [Github](https://github.com/anthropics/toy-models-of-superposition) | [Demo](https://colab.research.google.com/github/anthropics/toy-models-of-superposition/blob/main/toy_models.ipynb) |
| [**Softmax Linear Units**](https://transformer-circuits.pub/2022/solu/index.html) | Anthropic | 2022-06-27 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/zeyuyun1/TransformerVis) [**Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors**](https://arxiv.org/pdf/2103.15949) | DeeLIO@NAACL | 2021-03-29 | [Github](https://github.com/zeyuyun1/TransformerVis) | - |
| [**Zoom In: An Introduction to Circuits**](https://distill.pub/2020/circuits/zoom-in/) | Distill | 2020-03-10 | - | - |
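
Almost everything in this table builds on the same core object: an overcomplete autoencoder trained on model activations with an L1 sparsity penalty, so that individual dictionary directions come to represent interpretable features. A toy PyTorch sketch of that objective follows; the sizes and coefficients are illustrative assumptions, not values from any paper above:

```python
# Toy sparse autoencoder in the spirit of the dictionary-learning papers above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
        self.dec = nn.Linear(d_dict, d_model)  # columns of dec.weight are feature directions

    def forward(self, x):
        f = torch.relu(self.enc(x))            # sparse, non-negative feature activations
        return self.dec(f), f

d_model, d_dict, l1_coeff = 512, 4096, 1e-3    # 8x overcomplete dictionary (illustrative)
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)                # stand-in for cached residual-stream activations
recon, f = sae(acts)
loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
loss.backward()
opt.step()
```

Variants in the table swap out this loss (gated SAEs, end-to-end training) or the training site (residual stream, attention outputs), but the encode-sparsify-decode skeleton stays the same.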

## Interpretability in Vision LLMs

| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**Dissecting Query-Key Interaction in Vision Transformers**](https://openreview.net/attachment?id=CsF3PwBN6N&name=pdf) | MechInterp@ICML | 2024-06-25 | - | - |
| [**Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models**](https://openreview.net/attachment?id=50SMcZ8QQf&name=pdf) | MechInterp@ICML | 2024-06-25 | - | - |
| [**Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP**](https://openreview.net/attachment?id=DwhvppIZsD&name=pdf) | MechInterp@ICML | 2024-06-25 | - | - |
| [**The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision**](https://openreview.net/attachment?id=IGnoozsfj1&name=pdf) | MechInterp@ICML | 2024-06-25 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/google-research/fooling-feature-visualizations) [**Don’t trust your eyes: on the (un)reliability of feature visualizations**](https://openreview.net/pdf?id=s0Jvdolv2I) | ICML | 2024-06-25 | [Github](https://github.com/google-research/fooling-feature-visualizations/) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/wrudman/NOTICE) [**What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation**](https://arxiv.org/pdf/2406.16320) | arXiv | 2024-06-24 | [Github](https://github.com/wrudman/NOTICE) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/maxdreyer/PURE) [**PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits**](https://arxiv.org/pdf/2404.06453v1) | XAI4CV@CVPR | 2024-04-09 | [Github](https://github.com/maxdreyer/PURE) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/AI4LIFE-GROUP/SpLiCE) [**Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)**](https://arxiv.org/pdf/2402.10376v1) | arXiv | 2024-02-16 | [Github](https://github.com/AI4LIFE-GROUP/SpLiCE) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/martinagvilas/vit-cls_emb) [**Analyzing Vision Transformers for Image Classification in Class Embedding Space**](https://openreview.net/pdf?id=hwjmEZ8561) | NeurIPS | 2023-09-21 | [Github](https://github.com/martinagvilas/vit-cls_emb) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability) [**Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP**](https://arxiv.org/pdf/2308.14179) | CLVL@ICCV | 2023-08-27 | [Github](https://github.com/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/brendel-group/imi) [**Scale Alone Does not Improve Mechanistic Interpretability in Vision Models**](https://arxiv.org/pdf/2307.05471) | NeurIPS | 2023-07-11 | [Github](https://github.com/brendel-group/imi) | [Blog](https://brendel-group.github.io/imi/) |

## Benchmarking Interpretability

| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**Benchmarking Mental State Representations in Language Models**](https://arxiv.org/pdf/2406.17513) | MechInterp@ICML | 2024-06-25 | - | - |
| [**A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains**](https://arxiv.org/pdf/2402.00559) | ACL | 2024-05-21 | [Dataset](https://huggingface.co/datasets/google/reveal) | [Blog](https://reveal-dataset.github.io/) |
| ![GitHub Repo stars](https://img.shields.io/github/stars/explanare/ravel) [**RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations**](https://arxiv.org/pdf/2402.17700) | arXiv | 2024-02-27 | [Github](https://github.com/explanare/ravel) | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/aryamanarora/causalgym) [**CausalGym: Benchmarking causal interpretability methods on linguistic tasks**](https://arxiv.org/pdf/2402.12560) | arXiv | 2024-02-19 | [Github](https://github.com/aryamanarora/causalgym) | - |

## Enhancing Interpretability

| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability**](https://arxiv.org/pdf/2401.03646) | arXiv | 2024-01-08 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/KindXiaoming/BIMT) [**Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability**](https://arxiv.org/pdf/2305.08746) | arXiv | 2023-06-06 | [Github](https://github.com/KindXiaoming/BIMT) | - |

## Others

| Title | Venue | Date | Code | Blog |
|:--------|:--------:|:--------:|:--------:|:--------:|
| [**An introduction to graphical tensor notation for mechanistic interpretability**](https://arxiv.org/pdf/2402.01790) | arXiv | 2024-02-02 | - | - |
| ![GitHub Repo stars](https://img.shields.io/github/stars/arjunkaruvally/emt_variable_binding) [**Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks**](https://arxiv.org/pdf/2310.02430) | arXiv | 2023-10-03 | [Github](https://github.com/arjunkaruvally/emt_variable_binding) | - |

# Other Awesome Interpretability Resources
- [**Daily Picks in Interpretability & Analysis of LMs**](https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9)
- ![GitHub Repo stars](https://img.shields.io/github/stars/JShollaj/awesome-llm-interpretability) [**Awesome LLM Interpretability**](https://github.com/JShollaj/awesome-llm-interpretability)
- ![GitHub Repo stars](https://img.shields.io/github/stars/zepingyu0512/awesome-llm-understanding-mechanism) [**awesome papers for understanding LLM mechanism**](https://github.com/zepingyu0512/awesome-llm-understanding-mechanism)
- ![GitHub Repo stars](https://img.shields.io/github/stars/IAAR-Shanghai/Awesome-Attention-Heads) [**Awesome-Attention-Heads**](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads)
- ![GitHub Repo stars](https://img.shields.io/github/stars/cooperleong00/Awesome-LLM-Interpretability) [**Awesome-LLM-Interpretability**](https://github.com/cooperleong00/Awesome-LLM-Interpretability)