A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Shu, Wu, Zhao, Rai, Yao, Liu, Du (2025)
survey
Gould et al. (2025)
automated-interpretability
Adams, Bai, Lee, Yu, AlQuraishi (2025)
beyond-text, protein, biology
DeepMind (2025)
open-source, tooling, deepmind, gemma
Anonymous (2025)
critical-perspectives, robustness
Bussmann, Nabeshima, Karvonen, Nanda (2025)
architecture, matryoshka
Chen, Engels, Tegmark (2025)
training-efficiency, lora
Smith et al. (DeepMind) (2025)
critical-perspectives, negative-results, deepmind
Lindsey, Gurnee, Ameisen et al. (Anthropic) (2025)
circuits, attribution, anthropic
Sharkey, Chughtai, Batson et al. (2025)
critical-perspectives, open-problems
Demircan, Saanum, Jagadish, Binz, Schulz (2025)
beyond-text, reinforcement-learning
Paulo & Belrose (2025)
evaluation, reproducibility
Paulo, Mallen, Belrose (2025)
transcoder, evaluation
Chanin, Wilken-Smith, Dulka, Bhatnagar, Bloom (2024)
evaluation, feature-splitting
Farrell et al. (2024)
unlearning, applications
Paulo, Mallen, Juang, Belrose (2024)
automated-interpretability, scaling
Bussmann, Leask, Nanda (2024)
architecture, topk
Engels, Riggs, Tegmark (2024)
representation-geometry, dark-matter
Ghilardi, Belotti, Molinari (2024)
training-efficiency, scaling
Durmus, Tamkin et al. (Anthropic) (2024)
steering, bias, applications, anthropic