Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Karvonen et al. (2024)
Tags: evaluation, benchmarks
Abstract
We use board game models with known ground-truth features to measure progress in dictionary learning methods for interpretability.
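To illustrate what scoring against known ground truth could look like, the sketch below treats each learned dictionary feature as a binary classifier for one board-state property (for example, "a piece is on a given square") and reports its F1 score. This is a hypothetical illustration, not the paper's metric definition: the function names, the activation threshold, and the toy data are all assumptions.

```python
def feature_f1(activations, labels, threshold=0.0):
    """F1 of a thresholded feature as a classifier for a binary ground-truth property.

    activations: per-sample activations of one dictionary feature (assumed input).
    labels: per-sample booleans for a known board-state property (assumed input).
    """
    fired = [a > threshold for a in activations]
    tp = sum(f and l for f, l in zip(fired, labels))
    fp = sum(f and not l for f, l in zip(fired, labels))
    fn = sum((not f) and l for f, l in zip(fired, labels))
    if tp == 0:
        return 0.0  # feature never fires on a positive example
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_feature(feature_acts, labels, threshold=0.0):
    """Return (index, F1) of the feature that best matches the property.

    feature_acts: one activation list per feature (hypothetical layout).
    """
    scores = [feature_f1(acts, labels, threshold) for acts in feature_acts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data: four board positions, two candidate features.
labels = [True, True, False, False]
feature_acts = [
    [0.9, 0.0, 0.0, 0.4],  # fires on one positive and one negative
    [1.2, 0.7, 0.0, 0.0],  # fires exactly on the positives
]
index, score = best_feature(feature_acts, labels)
```

Because the board state is known exactly for every position, a score like this needs no human labeling, which is the appeal of using board game models as a testbed.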