Applying sparse autoencoders to unlearn knowledge in language models

Published 25 Oct 2024 in cs.LG and cs.AI | arXiv:2410.19278v2

Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from LLMs. We use the biology subset of the Weapons of Mass Destruction Proxy (WMDP) dataset and test on the gemma-2b-it and gemma-2-2b-it LLMs. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn a subset of WMDP-Bio questions with minimal side effects in domains other than biology. Our results suggest that negative scaling of feature activations is necessary and that zero-ablating features is ineffective. We find that intervening on multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side effects than the existing Representation Misdirection for Unlearning (RMU) technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to existing fine-tuning-based techniques.
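The abstract contrasts two interventions on SAE features: zero-ablation (removing a feature's contribution) and negative scaling (pushing the feature's direction past zero, into negative territory). A minimal sketch of this idea, assuming a simple ReLU SAE acting on a residual-stream vector (the function and variable names here are illustrative, not the paper's actual implementation):

```python
import numpy as np

def intervene(resid, W_enc, b_enc, W_dec, feature_ids, coeff=-2.0):
    """Scale selected SAE feature contributions in a residual-stream vector.

    When a target feature fires (activation > 0), add `coeff * activation`
    times its decoder direction to the residual. coeff = -1 zero-ablates the
    feature; coeff < -1 negatively scales it, which the paper's results
    suggest is necessary for unlearning.
    """
    acts = np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU SAE encoder
    out = resid.copy()
    for f in feature_ids:
        if acts[f] > 0:
            out = out + coeff * acts[f] * W_dec[f]
    return out

# Toy SAE: 4 features whose decoder directions are the standard basis.
W_dec = np.eye(4)
W_enc = W_dec.T
b_enc = np.zeros(4)
resid = np.array([3.0, 0.0, 0.0, 0.0])  # feature 0 fires with activation 3

zero_ablated = intervene(resid, W_enc, b_enc, W_dec, [0], coeff=-1.0)
neg_scaled = intervene(resid, W_enc, b_enc, W_dec, [0], coeff=-2.0)
# zero_ablated removes feature 0's contribution; neg_scaled flips it negative.
```

The distinction the sketch makes concrete: with `coeff=-1.0` the feature's contribution is merely cancelled, while `coeff=-2.0` actively pushes the representation away from the feature direction, which is the kind of negative scaling the abstract reports as necessary.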
