Papers

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, Ewart, Riggs, Huben, Sharkey (2023)

foundational, interpretability

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Bricken, Templeton, Batson et al. (Anthropic) (2023)

foundational, dictionary-learning, anthropic

Softmax Linear Units (SoLU)

Elhage, Hume, Olsson et al. (Anthropic) (2022)

foundations, architecture, anthropic

Toy Models of Superposition

Elhage, Hume, Olsson et al. (Anthropic) (2022)

foundations, superposition, anthropic

k-Sparse Autoencoders

Makhzani & Frey (2013)

foundations, architecture