Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Gould et al. (2025)
Tags: automated-interpretability
Abstract
We propose output-centric feature descriptions, which characterize sparse autoencoder (SAE) features by their effect on model outputs and complement traditional input-centric approaches.
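One possible reading of "effect on model outputs" is to ask which vocabulary tokens a feature promotes when its direction is added to the residual stream. The sketch below illustrates that idea on toy data, assuming a feature is represented by a decoder direction and that output logits are a linear map of that direction (any final normalization is ignored). The function name `top_promoted_tokens`, the toy vocabulary, and the matrix values are hypothetical and are not taken from the paper; this is not necessarily the method the authors use.

```python
from typing import List, Sequence


def top_promoted_tokens(
    decoder_direction: Sequence[float],
    unembedding: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 3,
) -> List[str]:
    """Return the k tokens whose logits rise most along the feature direction.

    Assumption: `unembedding` has one row per vocabulary token, each of the
    same length as `decoder_direction` (vocab_size x d_model).
    """
    # Logit change per token = dot product of the token's unembedding row
    # with the feature's decoder direction.
    scores = [
        sum(d * w for d, w in zip(decoder_direction, row))
        for row in unembedding
    ]
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy example with a 2-dimensional "model" (hypothetical values).
vocab = ["cat", "dog", "car", "tree"]
unembedding = [
    [1.0, 0.0],   # cat
    [0.9, 0.1],   # dog
    [-1.0, 0.2],  # car
    [0.0, 1.0],   # tree
]
feature_direction = [1.0, 0.0]

print(top_promoted_tokens(feature_direction, unembedding, vocab, k=2))
```

An input-centric description would instead be built from the text that most strongly activates the feature; the output-side token list here is meant only as a contrast to that.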