Applying Sparse Autoencoders to Unlearn Knowledge in Language Models

Farrell et al. (2024)

Tags: unlearning, applications

Abstract

We use sparse autoencoders to identify and remove specific knowledge from language models, demonstrating a new approach to machine unlearning.
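
As a rough illustration of what removing knowledge through a sparse autoencoder (SAE) can look like, the sketch below clamps selected SAE features in a single activation vector and decodes the result back into the model's activation space. It is a minimal toy under stated assumptions, not the procedure from the paper: the one-layer ReLU SAE, the function names (`encode`, `decode`, `ablate_features`), the `clamp_value` parameter, and the choice to add back the SAE reconstruction error are all hypothetical and introduced here only for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Assumed SAE encoder: ReLU(W_enc x + b_enc). w_enc is (n_features, d_model)."""
    return [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Assumed SAE decoder: W_dec f + b_dec. w_dec is (d_model, n_features)."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def ablate_features(
    x: Vector,
    w_enc: Matrix,
    b_enc: Vector,
    w_dec: Matrix,
    b_dec: Vector,
    feature_ids: Sequence[int],
    clamp_value: float = 0.0,
) -> Vector:
    """Clamp the chosen SAE features (only where active) and return the edited activation."""
    f = encode(x, w_enc, b_enc)
    x_hat = decode(f, w_dec, b_dec)
    # Keep whatever the SAE fails to reconstruct, so only the chosen features change.
    error = [xi - hi for xi, hi in zip(x, x_hat)]
    edited = list(f)
    for i in feature_ids:
        if edited[i] > 0.0:  # leave inactive features untouched
            edited[i] = clamp_value
    return [r + e for r, e in zip(decode(edited, w_dec, b_dec), error)]


# Toy usage: an identity "SAE" with two features over a 2-dimensional activation.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
edited_activation = ablate_features([2.0, 3.0], identity, zeros, identity, zeros, [1])
```

In a real setting the features to clamp would be chosen by checking which ones activate on the knowledge to be removed, and the edited activation would be written back into the model's forward pass. Both of those steps are outside this sketch.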