Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Templeton, Conerly, Marcus et al. (Anthropic) (2024)

Tags: foundational, scaling, anthropic

Abstract

We scale sparse autoencoders to Claude 3 Sonnet and extract millions of interpretable features from a production language model, demonstrating that the approach remains effective on a model of this size.