Interpretability Illusions with Sparse Autoencoders

Anonymous (2025)

Tags: critical-perspectives, robustness

Abstract

We demonstrate that sparse autoencoders can produce interpretability illusions: features that appear interpretable but do not robustly represent the concepts they seem to capture.