Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
Farrell et al. (2024)
Tags: unlearning, applications
Abstract
We use sparse autoencoders to identify and remove specific knowledge from language models, demonstrating a new approach to machine unlearning.
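As a rough illustration of what "removing" knowledge with a sparse autoencoder (SAE) can mean in practice, the sketch below edits one activation vector: it encodes the vector into SAE features, clamps a chosen set of features whenever they fire, and decodes the result while adding back the SAE's reconstruction error. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `ToySAE` and `suppress_features`, the ReLU encoder, the tiny hand-written weights, and the `clamp_value` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class ToySAE:
    """Hypothetical minimal SAE: ReLU encoder, linear decoder (an assumption)."""
    w_enc: Matrix  # shape [n_features][d_model]
    b_enc: Vector  # shape [n_features]
    w_dec: Matrix  # shape [n_features][d_model]
    b_dec: Vector  # shape [d_model]

    def encode(self, x: Vector) -> Vector:
        # Subtract the decoder bias, apply the encoder, then ReLU.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [
            sum(w * c for w, c in zip(row, centered)) + b
            for row, b in zip(self.w_enc, self.b_enc)
        ]
        return [max(0.0, p) for p in pre]

    def decode(self, f: Vector) -> Vector:
        # Weighted sum of decoder directions plus the decoder bias.
        out = list(self.b_dec)
        for fi, row in zip(f, self.w_dec):
            for j, w in enumerate(row):
                out[j] += fi * w
        return out


def suppress_features(
    sae: ToySAE,
    x: Vector,
    feature_ids: Sequence[int],
    clamp_value: float = 0.0,
) -> Vector:
    """Return x with the selected SAE features clamped when they are active."""
    f = sae.encode(x)
    # Keep whatever the SAE fails to reconstruct, so only the chosen
    # features change and the rest of the activation is preserved.
    error = [xi - ri for xi, ri in zip(x, sae.decode(f))]
    edited = list(f)
    for i in feature_ids:
        if edited[i] > 0.0:  # intervene only where the feature fires
            edited[i] = clamp_value
    return [ri + ei for ri, ei in zip(sae.decode(edited), error)]


# Toy usage: a 2-dimensional activation and an identity-like SAE.
toy = ToySAE(
    w_enc=[[1.0, 0.0], [0.0, 1.0]],
    b_enc=[0.0, 0.0],
    w_dec=[[1.0, 0.0], [0.0, 1.0]],
    b_dec=[0.0, 0.0],
)
edited_activation = suppress_features(toy, [2.0, 3.0], feature_ids=[0])
```

In a real model the edited vector would replace the original activation at the layer where the SAE was trained, and the features to clamp would be those found to fire on the knowledge to be unlearned. The choice of `clamp_value` (zero or a negative value) is left open here as a tunable assumption.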