Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. (Anthropic) (2024)
Tags: foundational, scaling, anthropic
Abstract
We scale sparse autoencoders to Claude 3 Sonnet, a production language model, and extract millions of interpretable features, demonstrating that the approach works at scale.