Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Bricken, Templeton, Batson et al. (Anthropic) (2023)

Tags: foundational, dictionary-learning, anthropic

Abstract

We apply sparse autoencoders to decompose the MLP activations of a one-layer transformer into interpretable features, and find that many of these features are monosemantic: each corresponds to a single, clearly identifiable concept.
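
To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss in NumPy. It is illustrative only and is not the paper's training code: the dimensions, the L1 coefficient, the random weights, and the function names `sae_forward` and `sae_loss` are all assumptions chosen for the example, and random vectors stand in for real transformer activations.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for a batch of activations x."""
    # Encode: centre on the decoder bias, project into a wider feature space,
    # and apply ReLU so feature activations are non-negative.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: rebuild the activation as a weighted sum of dictionary rows.
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity


# Hypothetical sizes: 8-dimensional activations, 32 dictionary features.
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4

x = rng.normal(size=(batch, d_model))  # stand-in for real activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

In this sketch the feature space is wider than the activation space (32 versus 8), and the L1 term pushes most entries of `f` toward zero. The intent is that each input is explained by a few active features, which can then be inspected one at a time for interpretability.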