Transcoders Find Interpretable LLM Feature Circuits

Dunefsky, Chlenski, Nanda (2024)

Tags: transcoder, circuits

Abstract

We introduce transcoders, which learn to map MLP input activations to the corresponding MLP output activations, enabling the discovery of interpretable feature circuits in language models.
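
To make the input-to-output mapping concrete, here is a minimal NumPy sketch of what a transcoder's forward pass and training objective might look like. The abstract does not specify the architecture or loss, so everything here is an assumption for illustration: a single hidden layer with ReLU features, a squared-error term against the true MLP output, and an L1 sparsity penalty. The names `init_transcoder`, `transcoder_forward`, `transcoder_loss`, and `l1_coeff` are hypothetical, and the activations are random stand-ins rather than values from a real model.

```python
import numpy as np


def init_transcoder(d_model, d_hidden, seed=0):
    """Create randomly initialised parameters (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, 0.1, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }


def transcoder_forward(params, mlp_in):
    """Map MLP *input* activations to a prediction of the MLP *output*."""
    pre = mlp_in @ params["W_enc"] + params["b_enc"]
    features = np.maximum(pre, 0.0)  # ReLU keeps feature activations non-negative
    mlp_out_hat = features @ params["W_dec"] + params["b_dec"]
    return features, mlp_out_hat


def transcoder_loss(params, mlp_in, mlp_out, l1_coeff=1e-3):
    """Assumed objective: squared error on the MLP output plus an L1 penalty."""
    features, mlp_out_hat = transcoder_forward(params, mlp_in)
    mse = np.mean(np.sum((mlp_out_hat - mlp_out) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * l1


# Toy usage with random stand-ins for real MLP activations.
rng = np.random.default_rng(1)
mlp_in = rng.normal(size=(4, 8))    # 4 tokens, model width 8
mlp_out = rng.normal(size=(4, 8))   # what the real MLP produced for those tokens
params = init_transcoder(d_model=8, d_hidden=32)
features, mlp_out_hat = transcoder_forward(params, mlp_in)
loss = transcoder_loss(params, mlp_in, mlp_out)
```

The point of the sketch is the asymmetry: the encoder reads the MLP input while the loss compares the decoder's output to the MLP output, so the hidden features stand in for the MLP computation itself. The hidden width is set larger than the model width here only to illustrate an overcomplete feature basis; the right ratio is not stated in the abstract.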