Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Gur-Arieh et al. (2025)

Tags: automated-interpretability

Abstract

We propose output-centric feature descriptions, which characterize sparse autoencoder (SAE) features by their effect on model outputs. They complement traditional input-centric approaches, which describe a feature by the inputs that activate it.
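
One way to make "effect on model outputs" concrete is to project a feature's direction through the model's unembedding matrix and read off the tokens whose logits it raises most. The sketch below illustrates that idea on toy data only; it is not taken from the abstract and should not be read as the authors' method. The function name `top_promoted_tokens`, the five-word vocabulary, and all matrix values are hypothetical.

```python
import numpy as np


def top_promoted_tokens(feature_direction, unembedding, vocab, k=3):
    """Return the k tokens whose logits the feature direction raises most.

    feature_direction: (d_model,) vector, e.g. an SAE decoder row (assumed).
    unembedding:       (d_model, vocab_size) matrix (assumed layout).
    """
    # Direct contribution of the feature direction to each token's logit.
    logit_effect = feature_direction @ unembedding
    # Sort by descending effect; a stable sort keeps ties in vocab order.
    top = np.argsort(-logit_effect, kind="stable")[:k]
    return [(vocab[i], float(logit_effect[i])) for i in top]


# Toy setup: d_model = 3, vocab_size = 5 (all values hypothetical).
vocab = ["cat", "dog", "car", "tree", "run"]
unembedding = np.array([
    [2.0, 1.5, -1.0, 0.0, 0.1],
    [0.0, 0.5, 2.0, -0.5, 0.0],
    [-1.0, 0.0, 0.0, 1.0, 3.0],
])
feature_direction = np.array([1.0, 0.0, 0.0])

result = top_promoted_tokens(feature_direction, unembedding, vocab, k=2)
print(result)
```

An output-centric description would then summarize the promoted tokens (here, animal words), whereas an input-centric description would summarize the text on which the feature fires.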