Papers

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Chaudhary & Geiger (2024)

evaluation, factual-knowledge

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Lieberum, Rajamanoharan, Conmy et al. (DeepMind) (2024)

open-source, tooling, deepmind, gemma

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Braun, Taylor, Goldowsky-Dill, Sharkey (2024)

training-efficiency, end-to-end

Improving Dictionary Learning with Gated Sparse Autoencoders

Rajamanoharan, Conmy, Smith et al. (DeepMind) (2024)

architecture, gated-sae, deepmind

Improving Steering Vectors by Targeting Sparse Autoencoder Features

Chalnev, Siu, Conmy (2024)

steering, applications

InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

Simon & Zou (2024)

beyond-text, protein, biology

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Rajamanoharan, Lieberum, Sonnerat et al. (DeepMind) (2024)

architecture, jumprelu, deepmind

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

He et al. (2024)

open-source, tooling, llama

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Karvonen et al. (2024)

evaluation, benchmarks

Not All Language Model Features Are Linear

Engels, Liao, Michaud, Gurnee, Tegmark (2024)

representation-geometry, non-linear

SAELens: A library for training and analyzing sparse autoencoders

Bloom, Tigges, Chanin (2024)

open-source, tooling, library

Scaling Automatic Neuron Description

Choi, Huang, Meng et al. (Transluce) (2024)

automated-interpretability, scaling

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Templeton, Conerly, Marcus et al. (Anthropic) (2024)

foundational, scaling, anthropic

Scaling and Evaluating Sparse Autoencoders

Gao, Dupré la Tour, Tillman et al. (OpenAI) (2024)

foundational, scaling, evaluation, openai

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bussmann, Pearce, Leask, Bloom, Sharkey, Nanda (2024)

representation-geometry, meta-sae

Sparse Crosscoders for Cross-Layer Features and Model Diffing

Lindsey, Templeton, Marcus et al. (Anthropic) (2024)

crosscoder, anthropic

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks, Rager, Michaud, Belinkov, Bau, Mueller (2024)

circuits, causal-analysis, applications

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Makelov, Lange, Nanda (2024)

evaluation, benchmarks

Transcoders Find Interpretable LLM Feature Circuits

Dunefsky, Chlenski, Nanda (2024)

transcoder, circuits

Language Models Can Explain Neurons in Language Models

Bills, Cammarata, Mossing et al. (OpenAI) (2023)

automated-interpretability, openai