Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Braun, Taylor, Goldowsky-Dill, Sharkey (2024)

Tags: training-efficiency, end-to-end

Abstract

We propose end-to-end (e2e) sparse dictionary learning, a method for training sparse autoencoders (SAEs) so that the features they learn are functionally important for the model's downstream behavior.
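
To make "functionally important" concrete, here is a minimal sketch of an end-to-end objective. It assumes the loss is the KL divergence between the model's output distribution on the original activation and on the SAE reconstruction, plus an L1 sparsity penalty on the feature activations. This form follows the published paper rather than the excerpt above. The names `sae_forward`, `e2e_loss`, `downstream`, and `sparsity_coeff` are hypothetical, and `downstream` stands in for the layers of the model after the SAE insertion point.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_forward(act, w_enc, b_enc, w_dec, b_dec):
    """ReLU encoder followed by a linear decoder (assumed architecture)."""
    pre = [z + b for z, b in zip(matvec(w_enc, act), b_enc)]
    feats = [max(0.0, z) for z in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, feats), b_dec)]
    return feats, recon


def e2e_loss(act, sae_params, downstream, sparsity_coeff=0.1):
    """KL between original and SAE-patched outputs, plus an L1 penalty.

    `downstream` is a hypothetical callable mapping an activation vector
    to output logits; it represents the rest of the model.
    """
    feats, recon = sae_forward(act, *sae_params)
    p = softmax(downstream(act))    # output with the original activation
    q = softmax(downstream(recon))  # output with the SAE reconstruction
    sparsity = sum(abs(f) for f in feats)
    return kl_divergence(p, q) + sparsity_coeff * sparsity


# Toy usage: a 2-dimensional activation and a 3-way linear "rest of model".
readout = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], identity, [0.0, 0.0])
loss = e2e_loss([1.0, 2.0], params, lambda a: matvec(readout, a))
```

In contrast to a plain reconstruction loss, the SAE here is only penalized for reconstruction errors that change the model's output, which is the sense in which the learned features are tied to downstream behavior.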