Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Marks, Rager, Michaud, Belinkov, Bau, Mueller (2024)
Tags: circuits, causal-analysis, applications
Abstract
We use sparse autoencoder (SAE) features to discover interpretable causal circuits in language models, enabling targeted editing of model behavior through feature-level interventions.
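
To make "feature-level intervention" concrete, here is a minimal sketch of ablating a single SAE feature in an activation vector. It is a toy illustration, not the authors' code. The weights are random rather than trained, and the names (`W_enc`, `W_dec`, `encode`, `decode`, `ablate_feature`) and sizes are assumptions made for this example. The reconstruction error is added back so that only the chosen feature's contribution changes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes, chosen only for illustration

# Hypothetical SAE parameters; a real SAE would be trained on model activations.
W_enc = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
b_dec = np.zeros(d_model)


def encode(x):
    """Map an activation vector to non-negative feature activations (ReLU)."""
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)


def decode(f):
    """Map feature activations back to activation space."""
    return W_dec @ f + b_dec


def ablate_feature(x, idx):
    """Zero one feature and rebuild the activation, keeping the SAE error."""
    f = encode(x)
    error = x - decode(f)      # part of x the SAE does not reconstruct
    f_edited = f.copy()
    f_edited[idx] = 0.0        # the feature-level intervention
    return decode(f_edited) + error


x = rng.normal(size=d_model)           # stand-in for a model activation
idx = int(np.argmax(encode(x)))        # ablate the most active feature
x_edited = ablate_feature(x, idx)
```

In this sketch, `x_edited` differs from `x` only by the ablated feature's activation times its decoder column. In a real setting, the edited activation would be patched back into the model's forward pass to measure the effect on its output.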