Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, Ewart, Riggs, Huben, Sharkey (2023)

Tags: foundational, interpretability

Abstract

We train sparse autoencoders on language model activations and find that the learned features are highly interpretable, providing evidence that sparse dictionary learning can extract meaningful representations.
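The method described in the abstract can be sketched minimally: encode an activation vector into an overcomplete, nonnegative feature vector, reconstruct the activation, and penalize the L1 norm of the features to encourage sparsity. The sizes, weight initialization, and penalty coefficient below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: d_model is the activation width, d_dict the
# overcomplete dictionary size (d_dict > d_model).
d_model, d_dict = 64, 256

# Randomly initialized encoder/decoder weights (a sketch, not trained).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features nonnegative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, lam=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = lam * np.mean(np.abs(f))
    return recon + sparsity

# A batch of stand-in "language model activations" (random, for illustration).
x = rng.normal(size=(8, d_model))
loss = sae_loss(x)
```

In practice the weights would be trained by gradient descent on this loss over real model activations; the key design choice is the L1 term, which trades reconstruction fidelity for sparse, and hence more interpretable, features.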