Evaluating Feature Steering: A Case Study in Mitigating Social Biases
Durmus, Tamkin et al. (Anthropic) (2024)
Tags: steering, bias, applications, anthropic
Abstract
We evaluate the use of sparse autoencoder (SAE) features to steer model behavior, focusing on the mitigation of social biases, and assess both the effectiveness and the limitations of feature-based steering.