Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken, Templeton, Batson et al. (Anthropic) (2023)
Tags: foundational, dictionary-learning, anthropic
Abstract
We apply sparse autoencoders to decompose the activations of a one-layer transformer into interpretable features, finding that many features correspond to clear, monosemantic concepts.
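
To make the idea concrete, the following is a minimal sketch of a sparse autoencoder's forward pass and training objective for a single activation vector. It assumes a ReLU encoder, a linear decoder, and a squared-error reconstruction loss plus an L1 penalty on the feature activations. The function names, dimensions, initialization, and the coefficient `l1_coeff` are illustrative assumptions, not values taken from the paper, and the code uses only the standard library for readability rather than speed.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into sparse features, then reconstruct it.

    W_enc has shape (d_dict, d_model); W_dec has shape (d_model, d_dict).
    Subtracting the decoder bias before encoding is an assumed design choice.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(W_enc, centered)
    f = [max(0.0, p + b) for p, b in zip(pre, b_enc)]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return recon + l1_coeff * sparsity


# Illustrative sizes only: the dictionary is wider than the activation space.
random.seed(0)
d_model, d_dict = 4, 16
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
W_enc = [[random.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
W_dec = [[random.gauss(0.0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
b_enc = [0.0] * d_dict
b_dec = [0.0] * d_model

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

In this sketch each entry of `f` plays the role of one candidate feature; training would adjust the weights so that few entries are active for any given input while `x_hat` stays close to `x`.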