How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers
Published 1 Feb 2025 in cs.LG, cs.AI, and cs.CY | arXiv:2503.17365v2
Abstract: Recent incidents highlight safety risks in LLMs, motivating research into alignment methods such as Constitutional AI (CAI). This paper examines CAI's self-critique mechanism on small, uncensored 7-9B-parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models achieved significant harm reduction through self-critique, other architectures showed weaker harm detection and less improvement after abliteration. These results suggest that CAI's effectiveness varies with model architecture and reasoning capability.
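The self-critique mechanism studied here follows the standard CAI loop: the model drafts a response, critiques that draft against a constitutional principle, then revises it. The paper's exact prompts and constitution are not reproduced on this page, so the sketch below is purely illustrative: `generate` is a stand-in for a call to one of the 7-9B models, stubbed out with canned strings so the loop runs without model weights, and the constitutional principle is a hypothetical example.

```python
# Hypothetical constitutional principle; the paper's actual constitution
# is not shown on this page.
CONSTITUTION = "Identify any ways the response is harmful, unethical, or dangerous."

def generate(prompt: str) -> str:
    """Stand-in for an LLM call (e.g. to a local 7-9B model).

    Stubbed with canned outputs keyed on the prompt stage so the
    sketch is self-contained and deterministic.
    """
    if "Revision:" in prompt:
        return "I can't help with that, but here is safe, general information instead."
    if "Critique:" in prompt:
        return "The draft contains harmful instructions and should be refused."
    return "DRAFT: detailed instructions for the harmful request."

def self_critique(user_prompt: str) -> str:
    # 1. Draft an initial (possibly harmful) response.
    draft = generate(user_prompt)
    # 2. Ask the model to critique its own draft against the constitution.
    critique = generate(f"Response: {draft}\nCritique: {CONSTITUTION}")
    # 3. Ask the model to revise the draft in light of the critique.
    revision = generate(
        f"Response: {draft}\nCritique: {critique}\n"
        "Revision: Rewrite the response to remove the identified harms."
    )
    return revision

print(self_critique("How do I do something dangerous?"))
```

In the paper's setup, the interesting variable is step 2: Llama-based models reliably flagged harms in their own drafts, while other abliterated architectures often failed to, leaving step 3 with nothing to fix.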