Sparse Autoencoders Find Highly Interpretable Features in Language Models
Cunningham, Ewart, Riggs, Huben, Sharkey (2023)
foundational, interpretability
Bricken, Templeton, Batson et al. (Anthropic) (2023)
foundational, dictionary-learning, anthropic
Elhage, Hume, Olsson et al. (Anthropic) (2022)
foundations, architecture, anthropic
Elhage, Hume, Olsson et al. (Anthropic) (2022)
foundations, superposition, anthropic
Makhzani & Frey (2013)
foundations, architecture