Transcoders Find Interpretable LLM Feature Circuits
Dunefsky, Chlenski, Nanda (2024)
Tags: transcoder, circuits
Abstract
We introduce transcoders, which learn to map MLP input activations to the corresponding MLP output activations through a wide, sparsely activating hidden layer, enabling the discovery of interpretable feature circuits in language models.
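
To make the input-to-output mapping concrete, the following is a minimal sketch of a transcoder forward pass and training loss in plain Python. It assumes a single hidden layer with a ReLU nonlinearity, trained with a squared-error term against the original MLP's output plus an L1 sparsity penalty on the hidden features. The names (transcoder_forward, transcoder_loss, l1_coeff), the toy dimensions, and the random stand-in activations are all illustrative assumptions, not taken from the authors' code.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map one MLP-input vector x to (feature activations, predicted MLP output)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps feature activations non-negative
    y_hat = [p + b for p, b in zip(matvec(W_dec, z), b_dec)]
    return z, y_hat


def transcoder_loss(y_hat, y, z, l1_coeff):
    """Squared error against the real MLP output plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(y_hat, y))
    l1 = sum(abs(a) for a in z)
    return mse + l1_coeff * l1


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real setup would use a proper initializer."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): the feature layer is wider than the model dimension.
rng = random.Random(0)
d_model, d_features = 4, 16
W_enc = init_matrix(d_features, d_model, rng)
b_enc = [0.0] * d_features
W_dec = init_matrix(d_model, d_features, rng)
b_dec = [0.0] * d_model

# Random stand-ins for activations that would be recorded from a real model.
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # MLP input
y = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # MLP output (training target)

z, y_hat = transcoder_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = transcoder_loss(y_hat, y, z, l1_coeff=1e-3)
```

The point of the sketch is the training target: unlike a sparse autoencoder, which reconstructs its own input, the loss here compares the prediction with the MLP's output, so the hidden features stand in for the computation the MLP performs.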