Automatically Interpreting Millions of Features in Large Language Models

Paulo, Mallen, Juang, Belrose (2024)

Tags: automated-interpretability, scaling

Abstract

We develop scalable methods for interpreting millions of sparse autoencoder (SAE) features in large language models by automatically describing each feature and scoring the resulting descriptions.