awesome-llm-interpretability

A curated list of Large Language Model (LLM) Interpretability resources.
https://github.com/JShollaj/awesome-llm-interpretability

Table of Contents
- LLM Interpretability Tools
  - The Learning Interpretability Tool - An open-source platform for visualizing and understanding ML models. Supports classification, regression, and generative models (text and image data); includes saliency methods, attention attribution, counterfactuals, TCAV, embedding visualizations, and Facets-style data analysis.
  - Comgra - Helps you analyze and debug neural networks in PyTorch.
  - Pythia - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.
  - Phoenix - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
  - Floom
  - Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior.
  - Fmr.ai - AI interpretability and explainability platform.
  - Attention Analysis - Analyzing attention maps from BERT transformer.
  - SuperICL - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.
  - Git Re-Basin - Code release for "Git Re-Basin: Merging Models modulo Permutation Symmetries."
  - Functionary - Chat language model that can interpret and execute functions/plugins.
  - Sparse Autoencoder - Sparse Autoencoder for Mechanistic Interpretability.
  - Rome - Locating and editing factual associations in GPT.
  - Inseq - Interpretability for sequence generation models.
  - Vanna - Abstractions to use RAG to generate SQL with any LLM.
  - TransformerViz - Interactive tool for visualizing a transformer model through its latent space.
  - Awesome-Attention-Heads - A carefully compiled list that summarizes the diverse functions of the attention heads.
  - Neuron Viewer - Tool for viewing neuron activations and explanations.
  - LLM Visualization - Visualizing LLMs at a low level.
  - Copy Suppression - Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.
  - SpellGPT - Explores GPT-3's ability to spell its own token strings.
  - TransformerLens - A Library for Mechanistic Interpretability of Generative Language Models.
  - ecco - A Python library for exploring and explaining Natural Language Processing models using interactive visualizations.
- LLM Interpretability Papers
  - A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task - Identifies backward chaining circuits in a transformer trained to perform pathfinding in a tree.
  - Self-Influence Guided Data Reweighting for Language Model Pre-training - An application of training data attribution methods to re-weight training data and improve performance.
  - Data Similarity is Not Enough to Explain Language Model Performance - Discusses the limits of embedding models in explaining effective data selection.
  - Post Hoc Explanations of Language Models Can Improve Language Models - Evaluates whether language-model-generated explanations can also improve model quality.
  - Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models - Highlights the limits of Causal Tracing: how a fact is stored in an LLM can be changed by editing weights in a different location than where Causal Tracing suggests.
  - Finding Neurons in a Haystack: Case Studies with Sparse Probing - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
  - Copy Suppression: Comprehensively Understanding an Attention Head - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
  - Linear Representations of Sentiment in Large Language Models - Shows that sentiment is represented linearly in Large Language Models (LLMs).
  - Emergent world representations: Exploring a sequence model trained on a synthetic task - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
  - Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
  - The Quantization Model of Neural Scaling - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
  - A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations - Examines small neural networks to understand how they learn group compositions, using representation theory.
  - Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias - Causal mediation analysis as a method for interpreting neural models in natural language processing.
  - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
  - Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.
  - Language models can explain neurons in language models - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
  - Emergent Linear Representations in World Models of Self-Supervised Sequence Models - Linear representations in a world model of Othello-playing sequence models.
  - "Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model" - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
  - "Let's Verify Step by Step" - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
  - "Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory" - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
  - "Interpretability Illusions in the Generalization of Simplified Models" - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
  - "The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models" - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
  - "Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition" - Investigates how LLMs perform the task of mathematical addition.
  - "Measuring Feature Sparsity in Language Models" - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
  - Toy Models of Superposition - Investigates how models represent more features than dimensions, especially when features are sparse.
  - Spine: Sparse interpretable neural embeddings - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
  - Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors - Introduces a novel method for visualizing transformer networks using dictionary learning.
  - On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron - Critically examines the effectiveness of the "Sentiment Neuron."
  - Engineering monosemanticity in toy models - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
  - Polysemanticity and capacity in neural networks - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
  - An Overview of Early Vision in InceptionV1 - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
  - Visualizing and measuring the geometry of BERT - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
  - Neurons in Large Language Models: Dead, N-gram, Positional - An analysis of neurons in large language models, focusing on the OPT family.
  - Interpretability in the Wild: GPT-2 small (arXiv) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
  - Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
  - The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
  - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
  - Representation Engineering: A Top-Down Approach to AI Transparency - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
  - Explaining black box text modules in natural language with language models - Natural language explanations for LLM attention heads, evaluated using synthetic text.
  - N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models - Explains each LLM neuron as a graph.
  - Augmenting Interpretable Models with LLMs during Training - Uses LLMs to build interpretable classifiers of text data.
  - ChainPoll: A High Efficacy Method for LLM Hallucination Detection - ChainPoll, a novel hallucination detection methodology that substantially outperforms existing alternatives, and RealHall, a carefully curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.
  - "Successor Heads: Recurring, Interpretable Attention Heads In The Wild" - Introduces 'successor heads,' attention heads in LLMs that increment tokens with a natural ordering, such as numbers and days.
  - "Large Language Models Are Not Robust Multiple Choice Selectors" - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection bias."
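  - Emergent and Predictable Memorization in Large Language Models - Investigates whether memorization of specific training sequences by a large model can be predicted ahead of time by extrapolating from lower-compute trial runs, such as smaller models or partially trained checkpoints.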
  - Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition - based, metric-learner approximations of neural network models and hard-attention mechanisms that can be constructed with task-specific inductive biases for effective semi-supervised learning (i.e., feature detection). These mechanisms combine to yield effective methods for interpretability-by-exemplar over the representation space of neural models.
  - Similarity-Distance-Magnitude Universal Verification - aware verification and interpretability-by-exemplar as intrinsic properties. See the blog post ["The Determinants of Controllable AGI"](https://raw.githubusercontent.com/allenschmaltz/Resolute_Resolutions/master/volume5/volume5.pdf) for a high-level overview of the broader implications.
- LLM Survey Paper
  - A Survey of Large Language Models
- LLM Interpretability Articles
  - Do Machine Learning Models Memorize or Generalize? - An interactive visualization exploring the phenomenon known as grokking (VISxAI hall of fame).
  - What Have Language Models Learned? - An interactive visualization to understand how large language models work and the nature of their biases (VISxAI hall of fame).
  - A New Approach to Computation Reimagines Artificial Intelligence - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
  - Interpreting GPT: the logit lens - Explores how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions.
  - A Mechanistic Interpretability Analysis of Grokking - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
  - 200 Concrete Open Problems in Mechanistic Interpretability - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
  - Evaluating LLMs is a minefield - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.
  - Attribution Patching: Activation Patching At Industrial Scale - Method that uses gradients for a linear approximation of activation patching in neural networks.
  - Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
  - A circuit for Python docstrings in a 4-layer attention-only transformer - Examines a specific neural circuit within a 4-layer transformer model responsible for generating Python docstrings.
  - Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - Survey of methods for interpreting the internal components of deep neural networks.
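  - Discovering Latent Knowledge in Language Models Without Supervision - Introduces Contrast-Consistent Search (CCS), an unsupervised method that finds a direction in a model's activation space separating true from false statements by enforcing logical consistency, without using labels or model outputs.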
- LLM Interpretability Groups
  - Alignment Lab AI - Group of researchers focusing on AI alignment.
  - Nous Research - Research group discussing various topics on interpretability.
  - EleutherAI - Non-profit AI research lab that focuses on interpretability and alignment of large models.
