Sparse Autoencoders Find Highly Interpretable Features in Language Models
Cunningham, Ewart, Riggs, Huben, Sharkey (2023)
Tags: foundational, interpretability
Abstract
We train sparse autoencoders on language model activations and find that the learned features are highly interpretable, providing evidence that sparse dictionary learning can extract meaningful representations.
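The core recipe the abstract describes can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the dictionary sizes, initialization, and L1 coefficient here are hypothetical placeholders, and the paper trains much larger dictionaries on real language-model activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the paper uses real LM activation dims and larger dictionaries.
d_model, d_dict = 8, 32
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm dictionary columns


def sae_forward(x):
    """Encode an activation vector into a sparse code, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU gives nonnegative feature activations
    x_hat = W_dec @ f                       # reconstruction as a sparse sum of dictionary columns
    return f, x_hat


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the feature activations."""
    f, x_hat = sae_forward(x)
    recon = np.sum((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(f))
    return recon + sparsity


x = rng.normal(size=d_model)  # stand-in for a language-model activation vector
f, x_hat = sae_forward(x)
```

Training minimizes this loss over many activation vectors; the L1 term pushes most feature activations to zero, so each input is explained by a few dictionary directions, which is what makes the learned features candidates for interpretation.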