Evaluating Feature Steering: A Case Study in Mitigating Social Biases

Durmus, Tamkin et al. (Anthropic) (2024)

Tags: steering, bias, applications, anthropic

Abstract

We evaluate the use of sparse autoencoder (SAE) features to steer model behavior, focusing on the mitigation of social biases, and assess both the effectiveness and the limitations of feature-based steering.