Interpretability Illusions with Sparse Autoencoders
Anonymous (2025)
Tags: critical-perspectives, robustness
Abstract
We demonstrate that sparse autoencoders can produce interpretability illusions: features that appear interpretable but do not robustly represent the concepts they seem to capture.