Awesome-Interpretability-in-Large-Language-Models
This repository collects relevant resources on interpretability in large language models (LLMs).
https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
Uncategorized
- 3Blue1Brown: How might LLMs store facts | Chapter 7, Deep Learning
- ICML24: Physics of Language Models
- NAACL24: Explanations in the Era of Large Language Models
- 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
- 3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning
- …source-sparse-autoencoders-for-all-residual-stream
- [TransformerLens](https://github.com/neelnanda-io/TransformerLens): a library for mechanistic interpretability of GPT-style language models. (Tutorial: chapter1-transformer-interp.streamlit.app/[1.2]_Intro_to_Mech_Interp, [Demo](https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Main_Demo.ipynb))
- nnsight: enables interpreting and manipulating the internals of deep learned models. ([Doc](https://nnsight.net/documentation/), [Tutorial](https://nnsight.net/tutorials/), [Paper](https://arxiv.org/abs/2407.14561))
- [pyreft](https://github.com/stanfordnlp/pyreft): a representation fine-tuning (ReFT) method. ([Paper](https://arxiv.org/pdf/2404.03592), [Demo](https://colab.research.google.com/github/stanfordnlp/pyreft/blob/main/main_demo.ipynb))
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- LXT (LRP-eXplains-Transformers): Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). ([Paper](https://arxiv.org/pdf/2402.05602), [Doc](https://lxt.readthedocs.io/en/latest/))
- 200 Concrete Open Problems in Mechanistic Interpretability
- Tuned Lens (tuned-lens): tools for understanding how transformer predictions are built layer-by-layer. ([Paper](https://arxiv.org/pdf/2303.08112), [Doc](https://tuned-lens.readthedocs.io/en/latest/))
- inseq: PyTorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. ([Paper](http://arxiv.org/abs/2302.13942), [Doc](https://inseq.org/en/latest/))
- SHAP: a game-theoretic approach to explaining the output of machine learning models. ([Doc](https://shap.readthedocs.io/en/latest/))
- A Barebones Guide to Mechanistic Interpretability Prerequisites
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability
- [Transformer-specific Interpretability](https://projects.illc.uva.nl/indeep/tutorial/): EACL 2024 tutorial. ([Github](https://github.com/interpretingdl/eacl2024_transformer_interpretability_tutorial))
- AI Alignment Forum
- LessWrong
- Transformer Debugger: investigate specific behaviors of small LLMs.
- **Neuronpedia**
- **ML Alignment & Theory Scholars (MATS)**
- [penzai](https://penzai.readthedocs.io/en/stable/): a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. ([Paper](https://openreview.net/attachment?id=KVSgEXrMDH&name=pdf), [Doc](https://penzai.readthedocs.io/en/stable/), [Tutorial](https://penzai.readthedocs.io/en/stable/notebooks/how_to_think_in_penzai.html))
- Mechanistic Interpretability Workshop 2024 @ ICML
- Attributing Model Behavior at Scale Workshop 2023 @ NeurIPS
- BlackboxNLP 2023 @ EMNLP
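Several resources above (the 3Blue1Brown videos in particular) explain the attention mechanism visually. As a minimal, library-free illustration of the same computation (a toy sketch, not code from any resource listed here), single-head scaled dot-product attention is a softmax-weighted average of value vectors:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is a weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy example: one query attending over two positions with 2-d vectors.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Real implementations batch this as matrix multiplications over learned query/key/value projections; the per-position loop here is only to keep the arithmetic visible.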
Survey Papers
- **Attention Heads of Large Language Models: A Survey** | 09-06 | [Github](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads)
- **From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP** | 06-18
- **A Primer on the Inner Workings of Transformer-based Language Models** | 05-02
- **Mechanistic Interpretability for AI Safety -- A Review** | 04-22
- **From Understanding to Utilization: A Survey on Explainability for Large Language Models** | 02-22
- **Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks** | 08-18
- **Knowledge Mechanisms in Large Language Models: A Survey and Perspective** | 10-06
- **Internal Consistency and Self-Feedback in Large Language Models: A Survey** | 07-22 | [Github](https://github.com/IAAR-Shanghai/ICSFSurvey) | [Paper List](https://www.yuque.com/zhiyu-n2wnm/ugzwgf/gmqfkfigd6xw26eg)
- **Relational Composition in Neural Networks: A Survey and Call to Action** | 07-15
Interpretable Analysis of LLMs
- **Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning** | 07-04
- **What Do the Circuits Mean? A Knowledge Edit View** | 06-25
- [kn_thesis](https://github.com/frankniujc/kn_thesis) | 03-16
- **Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models** | 2024-08-05 | [Github](https://github.com/liangzid/PromptExtractionEval)
- [ObservablePropagation](https://github.com/jacobdunefsky/ObservablePropagation) | 06-25
- [llm-transparency-tool](https://github.com/facebookresearch/llm-transparency-tool) | 10-01
- [footprints](https://github.com/sfeucht/footprints) | 06-28 | [Blog](https://footprints.baulab.info/)
- [**…-based Answer Attribution for Trustworthy Retrieval-Augmented Generation**](https://arxiv.org/pdf/2406.13663) | arXiv | 2024-07-01 | [Github](https://github.com/Betswish/MIRAGE)
- [**…-Fine-Tuning Weights of Generative Models**](https://arxiv.org/pdf/2402.10208) | ICML | 2024-07-01 | [Github](https://github.com/eliahuhorwitz/Spectral-DeTuning) | [Blog](https://vision.huji.ac.il/spectral_detuning/)
- [**…-property Steering of Large Language Models with Dynamic Activation Composition**](https://arxiv.org/pdf/2406.17563) | arXiv | 2024-06-25 | [Github](https://github.com/DanielSc4/Dynamic-Activation-Composition)
- **Confidence Regulation Neurons in Language Models** | 06-24
- [cross-lingual-detox](https://github.com/BatsResearch/cross-lingual-detox) | 06-23
- **Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models** | 06-23
- [keen_estimating_knowledge_in_llms](https://github.com/dhgottesman/keen_estimating_knowledge_in_llms) | 06-18
- **Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations** | 06-17
- [transcoder_circuits](https://github.com/jacobdunefsky/transcoder_circuits) | 06-17
- [Model-Editing-Hurt](https://github.com/JasonForJoy/Model-Editing-Hurt) | 06-16
- [measureLM](https://github.com/kdu4108/measureLM) | 06-16
- **Talking Heads: Understanding Inter-layer Communication in Transformer Language Models** | 06-13
- [MambaLRP](https://github.com/FarnoushRJ/MambaLRP) | 06-11
- [patchscopes](https://github.com/PAIR-code/interpretability/tree/master/patchscopes/code) | 06-06 | [Blog](https://pair-code.github.io/interpretability/patchscopes/)
- [comp-mech](https://github.com/francescortu/comp-mech) | 06-06
- **Learned feature representations are biased by complexity, learning order, position, and more** | 06-06 | [Demo](https://gist.github.com/lampinen-dm/b6541019ef4cf2988669ab44aa82460b)
- **Iteration Head: A Mechanistic Study of Chain-of-Thought** | 06-05
- **Activation Addition: Steering Language Models Without Optimization** | 06-04 | [Code](https://zenodo.org/records/8215277)
- **Interpretability Illusions in the Generalization of Simplified Models** | 06-04
- [**…-aware Explainability Method for Text Generation**](https://arxiv.org/pdf/2402.09259) | arXiv | 2024-06-03 | [Github](https://github.com/k-amara/syntax-shap) | [Blog](https://syntaxshap.ivia.ch/)
- **Calibrating Reasoning in Language Models with Internal Consistency** | 05-29
- **Black-Box Access is Insufficient for Rigorous AI Audits** | 05-29
- **Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting** | 05-28
- [nuclr-icml](https://github.com/samuelperezdi/nuclr-icml) | 05-27
- [GrokkedTransformer](https://github.com/OSU-NLP-Group/GrokkedTransformer) | 05-27
- [**…-Repair in Language Models**](https://arxiv.org/pdf/2402.15390v2) | ICML | 2024-05-26 | [Github](https://github.com/starship006/backup_research)
- **Emergence of a High-Dimensional Abstraction Phase in Language Transformers** | 05-24
- [**…-2's Multiple-Choice Questions**](https://arxiv.org/pdf/2405.03205v2) | arXiv | 2024-05-23 | [Github](https://github.com/ruizheliUOA/Anchored_Bias_GPT2)
- [MultiDimensionalFeatures](https://github.com/JoshEngels/MultiDimensionalFeatures) | 05-23
- **Using Degeneracy in the Loss Landscape for Mechanistic Interpretability** | 05-20
- [LLM-Microscope](https://github.com/AIRI-Institute/LLM-Microscope) | 05-19
- [**…-explanations from Large Language Models faithful?**](https://arxiv.org/pdf/2401.07927) | ACL | 2024-05-16 | [Github](https://github.com/AndreasMadsen/llm-introspection)
- **Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models** | 05-14
- [CAA](https://github.com/nrimsky/CAA) | 05-07
- [**…-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability**](https://arxiv.org/pdf/2405.04156) | AISTATS | 2024-05-07 | [Github](https://github.com/jgcarrasco/acronyms_paper)
- [**…-by-step: A mechanistic understanding of chain-of-thought reasoning**](https://arxiv.org/pdf/2402.18312) | arXiv | 2024-05-06 | [Github](https://github.com/joykirat18/How-To-Think-Step-by-Step)
- [circuit_reuse](https://github.com/jmerullo/circuit_reuse) | 05-06
- [**…-Explanations**](https://arxiv.org/pdf/2401.12576) | HCI+NLP@NAACL | 2024-04-24 | [Github](https://github.com/DFKI-NLP/LLMCheckup)
- **How to use and interpret activation patching** | 04-23
- **Understanding Addition in Transformers** | 04-23
- **Towards Uncovering How Large Language Model Works: An Explainability Perspective** | 04-15
- [**…-context learning circuits and their formation**](https://arxiv.org/pdf/2404.07129) | ICML | 2024-04-10 | [Github](https://github.com/aadityasingh/icl-dynamics)
- **Does Transformer Interpretability Transfer to RNNs?** | 04-09
- [romba](https://github.com/arnab-api/romba) | 04-04 | [Demo](https://github.com/arnab-api/romba/tree/master/demo)
- **Eliciting Latent Knowledge from Quirky Language Models** | FoMo@ICLR | 2024-04-03
- **Do language models plan ahead for future tokens?** | 04-01
- [feature-circuits](https://github.com/saprmarks/feature-circuits) | 03-31 | [Demo](https://feature-circuits.xyz/)
- **Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms** | 03-26
- [world-models](https://github.com/wesg52/world-models) | 03-04
- **AtP\*: An efficient and scalable method for localizing LLM behaviour to components** | 03-01
- **A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task** | 02-28
- [function_vectors](https://github.com/ericwtodd/function_vectors) | 02-25 | [Blog](https://functions.baulab.info/)
- **A Language Model's Guide Through Latent Space** | 02-22
- **Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model** | 02-22
- [**…-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking**](https://arxiv.org/pdf/2402.14811) | ICLR | 2024-02-22 | [Github](https://github.com/Nix07/finetuning) | [Blog](https://finetuning.baulab.info/)
- [**…-grained Hallucination Detection and Editing for Language Models**](https://arxiv.org/pdf/2401.06855) | arXiv | 2024-02-21 | [Github](https://github.com/abhika-m/FAVA) | [Blog](https://fine-grained-hallucination.github.io/)
- [Hallucination-Detection-Score-Aggregation](https://github.com/AnasHimmi/Hallucination-Detection-Score-Aggregation) | 02-20
- [model-editing-canonical-examples](https://github.com/john-hewitt/model-editing-canonical-examples) | 02-09
- **Identifying Semantic Induction Heads to Understand In-Context Learning** | 02-20
- **Backward Lens: Projecting Language Model Gradients into the Vocabulary Space** | 02-20
- [neural-verification](https://github.com/ejmichaud/neural-verification) | 02-07
- **Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models** | 02-12
- **INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection** | 02-06
- [**…-Context Language Learning: Architectures and Algorithms**](https://arxiv.org/pdf/2401.12973) | arXiv | 2024-01-30 | [Github](https://github.com/berlino/seq_icl)
- **Gradient-Based Language Model Red Teaming** | 01-30 | [Github](https://github.com/google-research/google-research/tree/master/gbrt)
- **The Calibration Gap between Model and Human Confidence in Large Language Models** | 01-24
- [universal-neurons](https://github.com/wesg52/universal-neurons) | 01-22
- **The mechanistic basis of data dependence and abrupt learning in an in-context classification task** | 01-16
- [dpo_toxic](https://github.com/ajyl/dpo_toxic) | 01-03
- [overthinking_the_truth](https://github.com/dannyallover/overthinking_the_truth) | 01-16
- **Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks** | 01-16
- **Feature emergence via margin maximization: case studies in algebraic tasks** | 01-16
- **Successor Heads: Recurring, Interpretable Attention Heads In The Wild** | 01-16
- **Towards Best Practices of Activation Patching in Language Models: Metrics and Methods** | 01-16
- [**…-2**](https://arxiv.org/pdf/2312.08793) | ATTRIB@NeurIPS | 2023-12-31 | [Github](https://github.com/ed1d1a8d/prompt-injection-interp) | [Blog](https://www.lesswrong.com/posts/Ei8q37PB3cAky6kaK/)
- [geometry-of-truth](https://github.com/saprmarks/geometry-of-truth) | 12-08 | [Blog](https://saprmarks.github.io/geometry-of-truth/dataexplorer/)
- [activation-patching-illusion](https://github.com/amakelov/activation-patching-illusion) | 12-06
- [**…-Solving Transformers**](https://arxiv.org/pdf/2312.02566) | UniReps@NeurIPS | 2023-12-05 | [Github](https://github.com/understanding-search/structured-representations-maze-transformers)
- **Generating Interpretable Networks using Hypernetworks** | 12-05
- [pizza](https://github.com/fjzzq2002/pizza) | 11-21
- [edge-attribution-patching](https://github.com/Aaquib111/edge-attribution-patching) | 11-20
- [tracr](https://github.com/google-deepmind/tracr) | 11-03
- [**…-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model**](https://arxiv.org/pdf/2305.00586) | NeurIPS | 2023-11-02 | [Github](https://github.com/hannamw/gpt2-greater-than)
- [TransformerPrograms](https://github.com/princeton-nlp/TransformerPrograms) | 10-31
- [**…-Step Reasoning Capabilities of Language Models**](https://arxiv.org/pdf/2310.14491) | EMNLP | 2023-10-23 | [Github](https://github.com/yifan-h/MechanisticProbe)
- [align-transformers](https://github.com/frankaging/align-transformers) | 09-21
- [**…-Time Intervention: Eliciting Truthful Answers from a Language Model**](https://arxiv.org/pdf/2306.03341) | NeurIPS | 2023-10-20 | [Github](https://github.com/likenneth/honest_llama)
- [progress-measures-paper](https://github.com/mechanistic-interpretability-grokking/progress-measures-paper) | 10-19 | [Blog](https://www.neelnanda.io/grokking-paper)
- [SERI-MATS-2023-Streamlit-pages](https://github.com/callummcdougall/SERI-MATS-2023-Streamlit-pages) | 10-06 | [Blog & Demo](https://copy-suppression.streamlit.app/)
- [**…-Based Localization vs. Knowledge Editing in Language Models**](https://openreview.net/pdf?id=EldbUlZtbd) | NeurIPS | 2023-09-21 | [Github](https://github.com/google/belief-localization)
- [**…-Supervised Sequence Models**](https://arxiv.org/pdf/2309.00941) | BlackboxNLP@EMNLP | 2023-09-07 | [Github](https://github.com/ajyl/mech_int_othelloGPT) | [Blog](https://www.alignmentforum.org/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world)
- [sparse-probing-paper](https://github.com/wesg52/sparse-probing-paper) | 06-02
- [Amortized-Interpretability](https://github.com/yangalan123/Amortized-Interpretability) | 05-31 | [Video](https://youtu.be/CsCRU8Hzpms?si=LMY4_TvEzi5D88OR)
- [rep-theory-mech-interp](https://github.com/bilal-chughtai/rep-theory-mech-interp) | 05-24
- **Localizing Model Behavior with Path Patching** | 05-16
- **Interpreting Neural Networks through the Polytope Lens** | 11-22
- **Language models can explain neurons in language models** | 05-09
- **N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models** | 04-22
- [**…-2 small**](https://arxiv.org/pdf/2211.00593) | ICLR | 2023-01-20 | [Github](https://github.com/redwoodresearch/Easy-Transformer)
- **Scaling Laws and Interpretability of Learning from Repeated Data** | 05-21
- **In-context Learning and Induction Heads** | 03-08
- **A Mathematical Framework for Transformer Circuits** | 12-22
- [RASP](https://github.com/tech-srl/RASP) | 07-19 | [Mini Tutorial](https://docs.google.com/presentation/d/1oIPHP_7qjsrnrDb3kdZIUZt-wQofkiQl/edit?usp=sharing&ouid=111912319459945992784&rtpof=true&sd=true)
- [guarantees-based-mechanistic-interpretability](https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/) | 06-24
- **Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions** | 10-23
- [seqcont_circuits](https://github.com/apartresearch/seqcont_circuits) | 10-04
- [LLM-IHS-Explanation](https://github.com/ydyjya/LLM-IHS-Explanation) | 10-01
- [**…-Level Domain-Specific Interpretation in Multimodal Large Language Model**](https://arxiv.org/pdf/2406.11193) | EMNLP | 2024-10-01 | [Github](https://github.com/Z1zs/MMNeuron)
- [arithmetic-mechanism](https://github.com/zepingyu0512/arithmetic-mechanism) | 09-12
- [Automatic-Circuit-Discovery](https://github.com/ArthurConmy/Automatic-Circuit-Discovery) | 10-28
- **Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically** | 07-15
- **Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks** | 07-15
- **How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching** | 07-15
- **Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models** | 07-15
- **What Makes and Breaks Safety Fine-tuning? Mechanistic Study** | 07-15
- **Using Degeneracy in the Loss Landscape for Mechanistic Interpretability** | 07-15
- **Loss in the Crowd: Hidden Breakthroughs in Language Model Training** | 07-15
- **Robust Knowledge Unlearning via Mechanistic Localizations** | 07-15
- **Language Models Linearly Represent Sentiment** | 07-15
- [eap-ig-faithfulness](https://github.com/hannamw/eap-ig-faithfulness) | 07-15
- **Learning and Unlearning of Fabricated Knowledge in Language Models** | 07-15
- **Faithful and Fast Influence Function via Advanced Sampling** | 07-15
- **Hypothesis Testing the Circuit Hypothesis in LLMs** | 07-15
- [LLM_Categorical_Hierarchical_Representations](https://github.com/KihoPark/LLM_Categorical_Hierarchical_Representations) | 07-15
- [**…-Purpose Method for Reading Information from Neural Activations**](https://openreview.net/attachment?id=P7MW0FahEq&name=pdf) | MechInterp@ICML | 2024-07-15 | [Github](https://github.com/huangxt39/InversionView)
- **Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks** | 07-15
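A technique recurring across the papers above is activation patching: copy an internal activation from a clean run into a corrupted run and check whether the behavior is restored, localizing which component carries it. A minimal, library-free sketch of the idea, using a hypothetical two-stage arithmetic "model" rather than a real transformer:

```python
def run_model(x, patch=None):
    """Toy two-stage 'model': returns the output and the hidden activation.
    If `patch` is given, the hidden activation is overwritten with it;
    this override is the essence of activation patching."""
    hidden = 2 * x + 1            # stage 1: compute an internal activation
    if patch is not None:
        hidden = patch            # intervene on the internal activation
    return 3 * hidden, hidden     # stage 2 output, plus the cached activation

# Clean and corrupted runs; cache the clean hidden activation.
clean_out, clean_hidden = run_model(1.0)
corrupt_out, _ = run_model(-1.0)

# Patch the clean activation into the corrupted run. If the clean output
# is restored, this activation site carries the behavior under study.
patched_out, _ = run_model(-1.0, patch=clean_hidden)
restored = patched_out == clean_out
```

In practice the same pattern is implemented with forward hooks on a real model (for example via the hook-based libraries listed earlier), sweeping the patch over layers and positions and measuring how much of the clean-vs-corrupt logit difference each site recovers.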
Position Papers
- **Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience** | 06-25
- **Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience** | 06-03
- **Interpretability Needs a New Paradigm** | 05-08
- **Position Paper: Toward New Frameworks for Studying Model Representations** | 02-06
- **Rethinking Interpretability in the Era of Large Language Models** | 01-30
SAE, Dictionary Learning and Superposition
- **Interpreting Attention Layer Outputs with Sparse Autoencoders** | 06-25 | [Demo](https://robertzk.github.io/circuit-explorer)
- **Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis** | 05-23
- **Automatically Identifying Local and Global Circuits with Linear Computation Graphs** | 05-22
- **Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet** | 05-21 | [Demo](https://transformer-circuits.pub/2024/scaling-monosemanticity/umap.html?targetId=34m_31164353)
- **Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models** | 05-21
- **Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control** | 05-20
- [**…-Relevant and Sparsely Interacting Features in Neural Networks**](https://arxiv.org/pdf/2405.10928) | arXiv | 2024-05-20 | [Github](https://github.com/ApolloResearch/rib)
- **Improving Dictionary Learning with Gated Sparse Autoencoders** | 04-30
- **Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers** | 04-29 | [Demo](https://sae-explorer.streamlit.app/)
- **Activation Steering with SAEs** | 04-19
- **SAE reconstruction errors are (empirically) pathological** | 03-29
- …autoencoders-find-composed-features-in-small-toy | LessWrong | 2024-03-14 | [Github](https://github.com/evanhanders/superposition-geometry-toys)
- …report-sparse-autoencoders-find-only-9-180-board | LessWrong | 2024-03-05 | [Github](https://github.com/RobertHuben/othellogpt_sparse_autoencoders)
- **Sparse Autoencoders Work on Attention Layer Outputs** | 01-16 | [Demo](https://colab.research.google.com/drive/10zBOdozYR2Aq2yV9xKs-csBH2olaFnsq?usp=sharing)
- **Do sparse autoencoders find "true features"?** | 02-12
- **Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT** | 02-19
- **Toward A Mathematical Framework for Computation in Superposition** | 01-18
- [sparse_coding](https://github.com/HoagyC/sparse_coding) | 01-16
- [codebook-features](https://github.com/taufeeque9/codebook-features) | 10-26 | [Demo](https://colab.research.google.com/github/taufeeque9/codebook-features/blob/main/tutorials/code_intervention.ipynb)
- [monosemantic-features](https://transformer-circuits.pub/2023/monosemantic-features/index.html) | Anthropic | 2023-10-04 | [Github](https://github.com/neelnanda-io/1L-Sparse-Autoencoder) | [Demo-1](https://transformer-circuits.pub/2023/monosemantic-features/vis/index.html), [Demo-2](https://transformer-circuits.pub/2023/monosemantic-features/vis/a1.html), [Tutorial](https://colab.research.google.com/drive/1u8larhpxy8w4mMsJiSBddNOzFGj7_RTn?usp=sharing)
- **Polysemanticity and Capacity in Neural Networks** | 07-12
- **Distributed Representations: Composition & Superposition** | 05-04
- **Superposition, Memorization, and Double Descent** | 01-05
- [toy_model_interpretability](https://github.com/adamjermyn/toy_model_interpretability) | 11-16
- [toy_model](https://transformer-circuits.pub/2022/toy_model/index.html) | Anthropic | 2022-09-14 | [Github](https://github.com/anthropics/toy-models-of-superposition) | [Demo](https://colab.research.google.com/github/anthropics/toy-models-of-superposition/blob/main/toy_models.ipynb)
- **Softmax Linear Units** | 06-27
- [TransformerVis](https://github.com/zeyuyun1/TransformerVis) | 03-29
- **Zoom In: An Introduction to Circuits** | 03-10
- **Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task** | 07-15
- **Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models** | 07-15
- [**…-to-End Sparse Dictionary Learning**](https://arxiv.org/pdf/2405.12241) | MechInterp@ICML | 2024-05-24 | [Github](https://github.com/ApolloResearch/e2e_sae)
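The sparse-autoencoder entries above share one core idea: reconstruct a model activation as a sparse, non-negative combination of learned dictionary directions. Below is a toy sketch with hand-picked (not trained) weights, tied to no specific paper; a real SAE would learn `W_enc`/`W_dec` by minimizing reconstruction error plus an L1 sparsity penalty on the feature activations:

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(m, x):
    # Row-major matrix-vector product over plain Python lists.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in m]

# Hand-picked illustrative weights: 2-d activations, 3 dictionary features.
W_enc = [[1.0, 0.0],        # 3 features x 2 dims (encoder)
         [0.0, 1.0],
         [1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.5],   # 2 dims x 3 features (decoder dictionary)
         [0.0, 1.0, 0.5]]

def sae(activation):
    # Encode: ReLU keeps feature activations non-negative and sparse.
    features = relu(matvec(W_enc, activation))
    # Decode: sum of dictionary directions weighted by feature activations.
    reconstruction = matvec(W_dec, features)
    return features, reconstruction

feats, recon = sae([1.0, -0.5])
```

The hope, as the papers above investigate, is that each dictionary direction corresponds to a single interpretable feature even when the model stores more features than dimensions in superposition.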
Interpretability in Vision LLMs
- [**…-free Text-Image Corruption and Evaluation**](https://arxiv.org/pdf/2406.16320) | arXiv | 2024-06-24 | [Github](https://github.com/wrudman/NOTICE)
- [PURE](https://github.com/maxdreyer/PURE) | 04-09
- [SpLiCE](https://github.com/AI4LIFE-GROUP/SpLiCE) | 02-16
- [vit-cls_emb](https://github.com/martinagvilas/vit-cls_emb) | 09-21
- [**Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP**](https://arxiv.org/pdf/2308.14179) | CLVL@ICCV | 2023-08-27 | [Github](https://github.com/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability)
- [imi](https://github.com/brendel-group/imi) | 07-11 | [Blog](https://brendel-group.github.io/imi/)
- **Dissecting Query-Key Interaction in Vision Transformers** | 06-25
- **Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models** | 06-25
- **Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP** | 06-25
- **The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision** | 06-25
- [fooling-feature-visualizations](https://github.com/google-research/fooling-feature-visualizations/) | 06-25
Benchmarking Interpretability
- **Benchmarking Mental State Representations in Language Models** | 06-25
- **A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains** | 05-21 | [Dataset](https://huggingface.co/datasets/google/reveal) | [Blog](https://reveal-dataset.github.io/)
- [ravel](https://github.com/explanare/ravel) | 02-27
- [causalgym](https://github.com/aryamanarora/causalgym) | 02-19
Enhancing Interpretability
- **Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability** | 01-08
- [**…-Inspired Modular Training for Mechanistic Interpretability**](https://arxiv.org/pdf/2305.08746) | arXiv | 2023-06-06 | [Github](https://github.com/KindXiaoming/BIMT)
Others
- **An introduction to graphical tensor notation for mechanistic interpretability** | 02-02
- [emt_variable_binding](https://github.com/arjunkaruvally/emt_variable_binding) | 10-03
- **Daily Picks in Interpretability & Analysis of LMs**
- …llm-interpretability
- …llm-understanding-mechanism
- [**Awesome-Attention-Heads**](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads)
- [**Awesome-LLM-Interpretability**](https://github.com/cooperleong00/Awesome-LLM-Interpretability)