Language Models Can Explain Neurons in Language Models
Bills, Cammarata, Mossing et al. (OpenAI) (2023)
Tags: automated-interpretability, openai
Abstract
We use GPT-4 to automatically generate and score explanations for neurons in GPT-2, demonstrating that language models can assist in interpreting neural network internals.
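As a rough illustration of what "scoring" an explanation can mean, the sketch below compares a neuron's real activations on some tokens with activations that a simulator model predicted from the explanation alone, and reports their Pearson correlation. This is a minimal sketch under assumptions, not the authors' code: the function name correlation_score, the toy activation values, and the choice to return 0.0 for constant inputs are all hypothetical.

```python
import math
from typing import Sequence


def correlation_score(real: Sequence[float], simulated: Sequence[float]) -> float:
    """Pearson correlation between real and simulated activations.

    Hypothetical helper: a higher value means the simulated activations
    (predicted from the explanation) track the real neuron more closely.
    """
    if len(real) != len(simulated):
        raise ValueError("real and simulated must have the same length")
    if len(real) == 0:
        raise ValueError("need at least one activation")

    n = len(real)
    mean_real = sum(real) / n
    mean_sim = sum(simulated) / n

    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((r - mean_real) * (s - mean_sim) for r, s in zip(real, simulated))
    var_real = sum((r - mean_real) ** 2 for r in real)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)

    # Assumption: a constant series carries no signal, so score it as 0.0.
    if var_real == 0 or var_sim == 0:
        return 0.0
    return cov / math.sqrt(var_real * var_sim)


if __name__ == "__main__":
    # Toy data (made up): per-token activations for one neuron.
    real_activations = [0.0, 0.1, 0.9, 0.0]
    simulated_activations = [0.0, 0.0, 8.0, 1.0]
    print(correlation_score(real_activations, simulated_activations))
```

Correlation is insensitive to the scale of the simulated values, so the simulator may predict on a different numeric range than the real neuron without affecting the score.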