Papers

- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models. Shu, Wu, Zhao, Rai, Yao, Liu, Du (2025). Tags: survey
- Enhancing Automated Interpretability with Output-Centric Feature Descriptions. Gould et al. (2025). Tags: automated-interpretability
- From Mechanistic Interpretability to Mechanistic Biology: SAEs on Protein Language Models. Adams, Bai, Lee, Yu, AlQuraishi (2025). Tags: beyond-text, protein, biology
- Gemma Scope 2 (SAEs on Gemma 3 family). DeepMind (2025). Tags: open-source, tooling, deepmind, gemma
- Interpretability Illusions with Sparse Autoencoders. Anonymous (2025). Tags: critical-perspectives, robustness
- Learning Multi-Level Features with Matryoshka Sparse Autoencoders. Bussmann, Nabeshima, Karvonen, Nanda (2025). Tags: architecture, matryoshka
- Low-Rank Adapting Models for Sparse Autoencoders. Chen, Engels, Tegmark (2025). Tags: training-efficiency, lora
- Negative Results for SAEs on Downstream Tasks (GDM Mech Interp Team Progress Update 2). Smith et al. (DeepMind) (2025). Tags: critical-perspectives, negative-results, deepmind
- On the Biology of a Large Language Model (Attribution Graphs). Lindsey, Gurnee, Ameisen et al. (Anthropic) (2025). Tags: circuits, attribution, anthropic
- Open Problems in Mechanistic Interpretability. Sharkey, Chughtai, Batson et al. (2025). Tags: critical-perspectives, open-problems
- Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models. Demircan, Saanum, Jagadish, Binz, Schulz (2025). Tags: beyond-text, reinforcement-learning
- Sparse Autoencoders Trained on the Same Data Learn Different Features. Paulo & Belrose (2025). Tags: evaluation, reproducibility
- Transcoders Beat Sparse Autoencoders for Interpretability. Paulo, Mallen, Belrose (2025). Tags: transcoder, evaluation
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. Chanin, Wilken-Smith, Dulka, Bhatnagar, Bloom (2024). Tags: evaluation, feature-splitting
- Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. Farrell et al. (2024). Tags: unlearning, applications
- Automatically Interpreting Millions of Features in Large Language Models. Paulo, Mallen, Juang, Belrose (2024). Tags: automated-interpretability, scaling
- BatchTopK Sparse Autoencoders. Bussmann, Leask, Nanda (2024). Tags: architecture, topk
- Decomposing the Dark Matter of Sparse Autoencoders. Engels, Riggs, Tegmark (2024). Tags: representation-geometry, dark-matter
- Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups. Ghilardi, Belotti, Molinari (2024). Tags: training-efficiency, scaling
- Evaluating Feature Steering: A Case Study in Mitigating Social Biases. Durmus, Tamkin et al. (Anthropic) (2024). Tags: steering, bias, applications, anthropic