awesome-sparse-autoencoders
A resource repository of sparse autoencoders for large language models
https://github.com/chrisliu298/awesome-sparse-autoencoders
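Most of the papers and posts collected below build on the same basic recipe: train a wide, overcomplete autoencoder on a language model's internal activations (residual stream, MLP, or attention outputs) with a penalty that keeps the learned features sparse. As a rough orientation, here is a minimal PyTorch sketch of that vanilla ReLU-plus-L1 formulation; the dimensions, data, and hyperparameters are illustrative placeholders, not taken from any specific entry in this list.

```python
# Minimal sketch of a "vanilla" sparse autoencoder (ReLU latents + L1 penalty).
# All sizes and the activation batch below are placeholders for illustration only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # f: sparse feature activations; x_hat: reconstruction of the input activations
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)   # e.g. an 8x expansion factor
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                          # sparsity vs. reconstruction trade-off

acts = torch.randn(4096, 768)                            # stand-in for cached LLM activations
for batch in acts.split(256):
    x_hat, f = sae(batch)
    loss = (x_hat - batch).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Several of the papers below (TopK in "Scaling and evaluating sparse autoencoders", Gated SAEs, JumpReLU) replace the ReLU/L1 sparsity mechanism while keeping this reconstruction setup.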
Papers
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- Residual Stream Analysis with Multi-Layer SAEs
- Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
- Disentangling Dense Embeddings with Sparse Autoencoders
- openai/sparse_autoencoder
- Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
- The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- Not All Language Model Features Are Linear
- Interpreting Attention Layer Outputs with Sparse Autoencoders
- Transcoders Find Interpretable LLM Feature Circuits
- Scaling and evaluating sparse autoencoders
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
- Improving Dictionary Learning with Gated Sparse Autoencoders
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Blog Posts
- Extracting SAE task features for in-context learning — LessWrong
- Sparse Autoencoders Find Highly Interpretable Directions in Language Models
- Logan Riggs, Aidan Ewart, [Robert_AIZI](https://www.lesswrong.com/users/robert_aizi)
- Self-explaining SAE features
- Dmitrii Kharlapenko, [Neel Nanda](https://www.lesswrong.com/users/neel-nanda-1?from=post_header), [Arthur Conmy](https://www.lesswrong.com/users/arthur-conmy?from=post_header)
- A primer on sparse autoencoders - by Nick Jiang
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability
- Adam Karvonen
- Finding Sparse Linear Connections between Features in LLMs
- Logan Riggs Smith, Aidan Ewart
- Sparse Autoencoders: Future Work