Automatically Interpreting Millions of Features in Large Language Models
Paulo, Mallen, Juang, Belrose (2024)
Tags: automated-interpretability, scaling
Abstract
We develop scalable methods for interpreting millions of sparse autoencoder (SAE) features in large language models, using automated generation of feature descriptions and automated scoring of those descriptions.