Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Braun, Taylor, Goldowsky-Dill, Sharkey (2024)
Tags: training-efficiency, end-to-end
Abstract
We propose end-to-end sparse dictionary learning, which trains sparse autoencoders (SAEs) to identify features that are functionally important for the model's downstream behavior.
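To make "functionally important" concrete, here is a minimal sketch of what an end-to-end objective could look like. It assumes the loss is the KL divergence between the model's output distribution with and without the SAE spliced in, plus an L1 sparsity penalty on the feature activations. The toy "model" (a single unembedding matrix), all function names, and the sparsity coefficient are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def matvec(matrix, vec):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sae_forward(act, w_enc, b_enc, w_dec, b_dec):
    # ReLU encoder followed by a linear decoder.
    pre = [z + b for z, b in zip(matvec(w_enc, act), b_enc)]
    feats = [max(0.0, z) for z in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, feats), b_dec)]
    return feats, recon

def e2e_loss(act, w_unembed, sae_params, sparsity_coeff):
    # Hypothetical end-to-end loss: compare the downstream output
    # distributions, not the activations themselves.
    feats, recon = sae_forward(act, *sae_params)
    p_orig = softmax(matvec(w_unembed, act))    # model run on the clean activation
    p_sae = softmax(matvec(w_unembed, recon))   # model run on the SAE reconstruction
    kl = kl_divergence(p_orig, p_sae)
    l1 = sum(abs(f) for f in feats)
    return kl + sparsity_coeff * l1, kl, l1

# Toy usage: a 2-dimensional activation, 3 output classes, 2 SAE features.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
w_unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
act = [1.0, 2.0]
loss, kl, l1 = e2e_loss(act, w_unembed, (identity, zeros, identity, zeros), 0.1)
print(loss, kl, l1)
```

The design point is that the reconstruction is judged only by its effect on the output distribution: an SAE that changes the activation in ways the rest of the model ignores is not penalized by the KL term, whereas a plain reconstruction (MSE) objective would penalize it.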