Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Makelov, Lange, Nanda (2024)
Tags: evaluation, benchmarks
Abstract
We develop principled evaluation methods for sparse autoencoders, measuring how well SAE features capture ground-truth concepts and enable model control.