Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Karvonen et al. (2024)

Tags: evaluation, benchmarks

Abstract

We use board game models with known ground-truth features to measure progress in dictionary learning methods for interpretability.