Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Makelov, Lange, Nanda (2024)

Tags: evaluation, benchmarks

Abstract

We develop principled evaluation methods for sparse autoencoders (SAEs), measuring how well SAE features capture ground-truth concepts and enable model control.