# Awesome Sparse Autoencoders


This repository tracks the latest research on sparse autoencoders, specifically their use for [mechanistic interpretability](https://www.neelnanda.io/mechanistic-interpretability/quickstart). The goal is to offer a comprehensive list of papers and resources on the topic.
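
For readers new to the area, a quick orientation (not drawn from any single paper below): a sparse autoencoder in this context is typically a wide, single-hidden-layer autoencoder trained to reconstruct a model's internal activations under a sparsity penalty, so that each learned feature fires on a small and hopefully interpretable set of inputs. A minimal PyTorch sketch, with illustrative names and hyperparameters (`d_model`, `d_hidden`, `l1_coeff` are placeholders, not values from any paper here):

```python
# Minimal sparse autoencoder (SAE) sketch for LLM activations.
# Illustrative only; dimensions and the L1 coefficient are made up.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # activations -> features
        self.decoder = nn.Linear(d_hidden, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))       # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss


# Toy usage on a batch of fake residual-stream activations.
x = torch.randn(8, 768)
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
reconstruction, features = sae(x)
loss = sae_loss(x, reconstruction, features)
loss.backward()
```

Real implementations (including the repositories linked under individual papers below) add further details, such as decoder weight normalization, resampling of dead features, and alternative activation functions like TopK or JumpReLU.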

> [!NOTE]
> If you believe a paper, blog post, or other resource on sparse autoencoders is missing, or if you find a mistake, typo, or outdated information, please open an issue or submit a pull request. I will be happy to update the list.

## Papers

- [Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small](https://arxiv.org/abs/2409.04478)
  - Author(s): Maheep Chaudhary, Atticus Geiger
  - Date: 2024-09
  - Venue: -
  - Code: -
- [Residual Stream Analysis with Multi-Layer SAEs](https://arxiv.org/abs/2409.04185)
  - Author(s): Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
  - Date: 2024-09
  - Venue: -
  - Code: -
- [Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2](https://arxiv.org/abs/2408.05147)
  - Author(s): Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Disentangling Dense Embeddings with Sparse Autoencoders](https://arxiv.org/abs/2408.00657)
  - Author(s): Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models](https://arxiv.org/abs/2408.00113)
  - Author(s): Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
  - Date: 2024-08
  - Venue: -
  - Code: -
- [Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery](https://arxiv.org/abs/2407.14499)
  - Author(s): Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele
  - Date: 2024-07
  - Venue: -
  - Code: -
- [Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders](https://arxiv.org/abs/2407.14435)
  - Author(s): Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
  - Date: 2024-07
  - Venue: -
  - Code: -
- [Interpreting Attention Layer Outputs with Sparse Autoencoders](https://arxiv.org/abs/2406.17759)
  - Author(s): Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Transcoders Find Interpretable LLM Feature Circuits](https://arxiv.org/abs/2406.11944)
  - Author(s): Jacob Dunefsky, Philippe Chlenski, Neel Nanda
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Scaling and evaluating sparse autoencoders](https://arxiv.org/abs/2406.04093)
  - Author(s): Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
  - Date: 2024-06
  - Venue: -
  - Code: [openai/sparse\_autoencoder](https://github.com/openai/sparse_autoencoder), [EleutherAI/sae](https://github.com/EleutherAI/sae)
- [Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents](https://arxiv.org/abs/2406.04028)
  - Author(s): Yoann Poupart
  - Date: 2024-06
  - Venue: -
  - Code: -
- [The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision](https://arxiv.org/abs/2406.03662)
  - Author(s): Liv Gorton
  - Date: 2024-06
  - Venue: -
  - Code: -
- [Not All Language Model Features Are Linear](https://arxiv.org/abs/2405.14860)
  - Author(s): Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models](https://arxiv.org/abs/2405.12522)
  - Author(s): Charles O'Neill, Thang Bui
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control](https://arxiv.org/abs/2405.08366)
  - Author(s): Aleksandar Makelov, George Lange, Neel Nanda
  - Date: 2024-05
  - Venue: -
  - Code: -
- [Improving Dictionary Learning with Gated Sparse Autoencoders](https://arxiv.org/abs/2404.16014)
  - Author(s): Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
  - Date: 2024-04
  - Venue: -
  - Code: -
- [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600)
  - Author(s): Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
  - Date: 2023-09
  - Venue: -
  - Code: -

## Blog Posts

- [Extracting SAE task features for in-context learning](https://www.lesswrong.com/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-icl)
  - Author(s): [Dmitrii Kharlapenko](https://www.lesswrong.com/users/dmitrii-kharlapenko?from=post_header), [neverix](https://www.lesswrong.com/users/neverix?from=post_header), [Neel Nanda](https://www.lesswrong.com/users/neel-nanda-1?from=post_header), [Arthur Conmy](https://www.lesswrong.com/users/arthur-conmy?from=post_header)
  - Date: 2024-08-13
- [Self-explaining SAE features](https://www.lesswrong.com/posts/8ev6coxChSWcxCDy8/self-explaining-sae-features)
  - Author(s): [Dmitrii Kharlapenko](https://www.lesswrong.com/users/dmitrii-kharlapenko?from=post_header), [neverix](https://www.lesswrong.com/users/neverix?from=post_header), [Neel Nanda](https://www.lesswrong.com/users/neel-nanda-1?from=post_header), [Arthur Conmy](https://www.lesswrong.com/users/arthur-conmy?from=post_header)
  - Date: 2024-08-06
- [A primer on sparse autoencoders](https://nickjiang.substack.com/p/a-primer-on-sparse-autoencoders)
  - Author(s): Nick Jiang
  - Date: 2024-07-03
- [An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability](https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html)
  - Author(s): [Adam Karvonen](https://adamkarvonen.github.io/)
  - Date: 2024-06-11
- [Finding Sparse Linear Connections between Features in LLMs](https://www.alignmentforum.org/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms)
  - Author(s): [Logan Riggs Smith](https://www.alignmentforum.org/users/elriggs), [Sam Mitchell](https://www.alignmentforum.org/users/sam-mitchell), [Adam Kaufman](https://www.alignmentforum.org/users/eccentricity)
  - Date: 2023-12-09
- [Sparse Autoencoders: Future Work](https://www.alignmentforum.org/posts/CkFBMG6A9ytkiXBDM/sparse-autoencoders-future-work)
  - Author(s): [Logan Riggs Smith](https://www.alignmentforum.org/users/elriggs), [Aidan Ewart](https://www.alignmentforum.org/users/aidan-ewart)
  - Date: 2023-09-21
- [Sparse Autoencoders Find Highly Interpretable Directions in Language Models](https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in)
  - Author(s): [Logan Riggs](https://www.lesswrong.com/users/elriggs), [Hoagy](https://www.lesswrong.com/users/hoagy), [Aidan Ewart](https://www.lesswrong.com/users/aidan-ewart), [Robert_AIZI](https://www.lesswrong.com/users/robert_aizi)
  - Date: 2023-09-21