Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
Demircan, Saanum, Jagadish, Binz, Schulz (2025)
Tags: beyond-text, reinforcement-learning
Abstract
We use sparse autoencoders (SAEs) to discover that large language models implement computations resembling temporal difference (TD) learning, connecting LLM internals to reinforcement learning theory.
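
For orientation, temporal difference learning adjusts a value estimate in proportion to a prediction error: the reward plus the discounted value of the next state, minus the current estimate. The snippet below is a minimal sketch of the textbook tabular TD(0) update, not the paper's method or analysis code. The function name `td0_update`, the step size `alpha`, the discount `gamma`, and the two toy states are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the TD error.

    values: dict mapping state -> current value estimate (hypothetical toy table)
    alpha:  step size (assumed value, for illustration only)
    gamma:  discount factor (assumed value, for illustration only)
    """
    # TD error: reward plus discounted next-state value, minus current estimate.
    td_error = reward + gamma * values[next_state] - values[state]
    # Move the current estimate a fraction alpha toward the TD target.
    values[state] += alpha * td_error
    return td_error


# Toy example with two made-up states.
values = {"s0": 0.0, "s1": 1.0}
delta = td0_update(values, "s0", reward=0.5, next_state="s1")
```

In this toy case the TD error is 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so the estimate for `s0` moves from 0.0 to about 0.14.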