Language Models Can Explain Neurons in Language Models

Bills, Cammarata, Mossing et al. (OpenAI) (2023)

Tags: automated-interpretability, openai

Abstract

We use GPT-4 to automatically generate and score explanations for neurons in GPT-2, demonstrating that language models can assist in interpreting neural network internals.
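
The abstract does not say how an explanation is scored, so the following is only a minimal sketch under one assumption: a simulator model predicts per-token activations from the explanation text, and the score is the Pearson correlation between those predictions and the neuron's real activations. The function name `correlation_score` and the toy activation values are hypothetical and serve only as illustration.

```python
import math
from typing import Sequence


def correlation_score(simulated: Sequence[float], actual: Sequence[float]) -> float:
    """Pearson correlation between simulated and real per-token activations.

    Assumption: a higher correlation means the explanation predicts the
    neuron's behavior better. This is an illustrative scoring rule only.
    """
    if len(simulated) != len(actual):
        raise ValueError("activation sequences must have the same length")
    if len(simulated) < 2:
        raise ValueError("need at least two tokens to compute a correlation")

    mean_s = sum(simulated) / len(simulated)
    mean_a = sum(actual) / len(actual)

    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)

    if var_s == 0 or var_a == 0:
        # A constant sequence carries no signal; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_s * var_a)


# Hypothetical toy data: real activations of one neuron on a five-token text.
actual_activations = [0.0, 0.1, 2.3, 0.0, 1.9]
# Activations a simulator might predict from a candidate explanation.
simulated_activations = [0.0, 0.0, 2.0, 0.0, 2.0]

score = correlation_score(simulated_activations, actual_activations)
```

In this toy case the simulated activations track the real ones closely, so the score is close to 1. An explanation whose simulated activations were unrelated to the real ones would score near 0.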