# Awesome Sparse Autoencoders


This repository tracks the latest research on sparse autoencoders, specifically their use for [mechanistic interpretability](https://www.neelnanda.io/mechanistic-interpretability/quickstart). The goal is to offer a comprehensive list of papers and resources on the topic.
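
For readers new to the area, a quick orientation (not drawn from any single paper below): a sparse autoencoder in this context is typically a wide, single-hidden-layer autoencoder trained to reconstruct a model's internal activations under a sparsity penalty, so that each learned feature fires on a small and hopefully interpretable set of inputs. A minimal PyTorch sketch, with illustrative names and hyperparameters (`d_model`, `d_hidden`, `l1_coeff` are placeholders, not values from any paper here):

```python
# Minimal sparse autoencoder (SAE) sketch for LLM activations.
# Illustrative only; dimensions and the L1 coefficient are made up.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # activations -> features
        self.decoder = nn.Linear(d_hidden, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))       # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss


# Toy usage on a batch of fake residual-stream activations.
x = torch.randn(8, 768)
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
reconstruction, features = sae(x)
loss = sae_loss(x, reconstruction, features)
loss.backward()
```

Real implementations (including the repositories linked under individual papers below) add further details, such as decoder weight normalization, resampling of dead features, and alternative activation functions like TopK or JumpReLU.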

> [!NOTE]
> If you believe a paper, blog post, or other resource on sparse autoencoders is missing, or if you find a mistake, typo, or outdated information, please open an issue or submit a pull request. I will be happy to update the list.

## Papers

- [Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small](https://arxiv.org/abs/2409.04478)
  - Author(s): Maheep Chaudhary, Atticus Geiger
  - Date: 2024-09
  - Venue: -
  - Code: -
- [Residual Stream Analysis with Multi-Layer SAEs](https://arxiv.org/abs/2409.04185)
  - Author(s): Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
  - Date: 2024-09
  - Venue: -
  - Code: -
- [Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2](https://arxiv.org/abs/2408.05147)
  - Author(s): Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Disentangling Dense Embeddings with Sparse Autoencoders](https://arxiv.org/abs/2408.00657)
  - Author(s): Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models](https://arxiv.org/abs/2408.00113)
  - Author(s): Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery](https://arxiv.org/abs/2407.14499)
  - Author(s): Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele
  - Date: 2024-07
  - Venue: -
  - Code: -
- [Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders](https://arxiv.org/abs/2407.14435)
  - Author(s): Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
  - Date: 2024-07
  - Venue: -
  - Code: -
- [Interpreting Attention Layer Outputs with Sparse Autoencoders](https://arxiv.org/abs/2406.17759)
  - Author(s): Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Transcoders Find Interpretable LLM Feature Circuits](https://arxiv.org/abs/2406.11944)
  - Author(s): Jacob Dunefsky, Philippe Chlenski, Neel Nanda
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Scaling and evaluating sparse autoencoders](https://arxiv.org/abs/2406.04093)
  - Author(s): Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
  - Date: 2024-06
  - Venue: -
  - Code: [openai/sparse\_autoencoder](https://github.com/openai/sparse_autoencoder), [EleutherAI/sae](https://github.com/EleutherAI/sae)
- [Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents](https://arxiv.org/abs/2406.04028)
  - Author(s): Yoann Poupart
  - Date: 2024-06
  - Venue: -
  - Code: -
- [The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision](https://arxiv.org/abs/2406.03662)
  - Author(s): Liv Gorton
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Not All Language Model Features Are Linear](https://arxiv.org/abs/2405.14860)
  - Author(s): Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models](https://arxiv.org/abs/2405.12522)
  - Author(s): Charles O'Neill, Thang Bui
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control](https://arxiv.org/abs/2405.08366)
  - Author(s): Aleksandar Makelov, George Lange, Neel Nanda
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Improving Dictionary Learning with Gated Sparse Autoencoders](https://arxiv.org/abs/2404.16014)
  - Author(s): Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
  - Date: 2024-04
  - Venue: -
  - Code: -
- [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600)
  - Author(s): Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
  - Date: 2023-09
  - Venue: -
  - Code: -

## Blog Posts

- [Extracting SAE task features for in-context learning](https://www.lesswrong.com/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-icl)
  - Author(s): [Dmitrii Kharlapenko](https://www.lesswrong.com/users/dmitrii-kharlapenko?from=post_header), [neverix](https://www.lesswrong.com/users/neverix?from=post_header), [Neel Nanda](https://www.lesswrong.com/users/neel-nanda-1?from=post_header), [Arthur Conmy](https://www.lesswrong.com/users/arthur-conmy?from=post_header)
  - Date: 2024-08-13
- [Self-explaining SAE features](https://www.lesswrong.com/posts/8ev6coxChSWcxCDy8/self-explaining-sae-features)
  - Author(s): [Dmitrii Kharlapenko](https://www.lesswrong.com/users/dmitrii-kharlapenko?from=post_header), [neverix](https://www.lesswrong.com/users/neverix?from=post_header), [Neel Nanda](https://www.lesswrong.com/users/neel-nanda-1?from=post_header), [Arthur Conmy](https://www.lesswrong.com/users/arthur-conmy?from=post_header)
  - Date: 2024-08-06
- [A primer on sparse autoencoders](https://nickjiang.substack.com/p/a-primer-on-sparse-autoencoders)
  - Author(s): Nick Jiang
  - Date: 2024-07-03
- [An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability](https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html)
  - Author(s): [Adam Karvonen](https://adamkarvonen.github.io/)
  - Date: 2024-06-11
- [Finding Sparse Linear Connections between Features in LLMs](https://www.alignmentforum.org/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms)
  - Author(s): [Logan Riggs Smith](https://www.alignmentforum.org/users/elriggs), [Sam Mitchell](https://www.alignmentforum.org/users/sam-mitchell), [Adam Kaufman](https://www.alignmentforum.org/users/eccentricity)
  - Date: 2023-12-09
- [Sparse Autoencoders: Future Work](https://www.alignmentforum.org/posts/CkFBMG6A9ytkiXBDM/sparse-autoencoders-future-work)
  - Author(s): [Logan Riggs Smith](https://www.alignmentforum.org/users/elriggs), [Aidan Ewart](https://www.alignmentforum.org/users/aidan-ewart)
  - Date: 2023-09-21
- [Sparse Autoencoders Find Highly Interpretable Directions in Language Models](https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in)
  - Author(s): [Logan Riggs](https://www.lesswrong.com/users/elriggs), [Hoagy](https://www.lesswrong.com/users/hoagy), [Aidan Ewart](https://www.lesswrong.com/users/aidan-ewart), [Robert_AIZI](https://www.lesswrong.com/users/robert_aizi)
  - Date: 2023-09-21