
Understanding the Dark Side of LLMs' Intrinsic Self-Correction

Published 19 Dec 2024 in cs.CL (arXiv:2412.14959v1)

Abstract: Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We identify that intrinsic self-correction can (1) cause LLMs to waver in both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.

Summary

  • The paper finds that LLMs' intrinsic self-correction can decrease performance, with Llama-3.1-8B showing a 20.4% accuracy decline for Yes/No questions due to overturning correct answers.
  • Error analysis using mechanistic and token-level interpretability reveals internal wavering, prompt biases, and cognitive patterns resembling human overthinking in self-correction failures.
  • Proposed strategies like question repeating and task-focused supervised fine-tuning can alleviate specific self-correction limits, suggesting focused behavioral adjustments over knowledge expansion are needed.

Analyzing the Limitations of Intrinsic Self-Correction in LLMs

The paper "Understanding the Dark Side of LLMs' Intrinsic Self-Correction" critically examines the intrinsic self-correction capabilities of state-of-the-art LLMs, such as models from the ChatGPT and Llama families. Intrinsic self-correction—the process where LLMs attempt to rectify their responses based on internal feedback rather than external data—has been assumed to enhance model accuracy. However, this paper challenges this assumption by systematically analyzing failure cases across various tasks.
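To make the mechanism under discussion concrete, the loop below is a minimal sketch of intrinsic self-correction as described above: the model is re-prompted with its own answer and a generic feedback instruction, with no oracle labels involved. The prompt wording and the `toy_model` stand-in are illustrative assumptions, not the paper's actual templates or models; the toy model simply flips its Yes/No answer whenever asked to reconsider, mimicking the wavering failure mode the paper reports.

```python
def self_correct(model, question, rounds=2):
    """Intrinsic self-correction sketch: re-prompt the model with its
    own previous answer plus a generic feedback instruction, relying
    only on the model's inherent capability (no external labels)."""
    answer = model(question)
    for _ in range(rounds):
        feedback_prompt = (
            f"Q: {question}\n"
            f"Your previous answer: {answer}\n"
            "Review your previous answer, find any problems with it, "
            "and answer the question again."
        )
        answer = model(feedback_prompt)
    return answer


# Hypothetical stand-in for an LLM call: it flips a Yes/No answer
# every time it is asked to reconsider, illustrating how a correct
# first answer can be overturned by the feedback prompt itself.
def toy_model(prompt):
    return "No" if "previous answer: Yes" in prompt else "Yes"


print(self_correct(toy_model, "Is the sky blue?", rounds=1))  # → "No"
```

With this toy model, every extra round of "self-correction" reverses the answer, so accuracy depends on the parity of the round count rather than on the question, which is exactly the kind of oscillation the paper's mechanistic analysis observes in real models.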

Key Findings and Methodological Approach

The research identifies that intrinsic self-correction can degrade performance rather than improve it, introducing cognitive biases and prompt-related failure modes:

  1. Task Performance and Self-Correction Failures: The study evaluates multiple tasks, including simple factual questions and more complex tasks like decision-making, reasoning, and programming. In each case, intrinsic self-correction did not uniformly enhance performance. For instance, Llama-3.1-8B experienced a substantial 20.4% decline in accuracy for Yes/No questions, with 58.8% of correct answers overturned during self-correction processes.
  2. Interpretation Through Error Analysis: The authors employed three interpretability methods to understand the self-correction failures:
    • Mechanistic Interpretability: This approach showed that LLMs waver between intermediate answers, impacting the final output.
    • Token-Level Interpretability: It revealed prompt biases, where models attend to the self-correction feedback prompt at the expense of the original question.
    • Human-Like Cognitive Bias: The study identified patterns akin to human cognitive biases—such as overthinking, cognitive overload, and perfectionism—that manifest during complex task resolutions.
  3. Strategies for Alleviating Failures: The paper proposes two interventions:
    • Question Repeating: Attaching the original question to the end of the self-correction prompt, which reduced prompt bias and improved alignment with the task objective.
    • Supervised Fine-Tuning (SFT): Fine-tuning on a small number of task-focused samples, aimed at adjusting model behavior rather than expanding its knowledge, improved outcomes, and the gains transferred to complex task settings.
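The question-repeating mitigation amounts to a small change in how the feedback prompt is built. The sketch below contrasts a baseline self-correction prompt with the question-repeating variant; the wording is a paraphrase for illustration, not the paper's exact template.

```python
def feedback_prompt(question, answer):
    """Baseline self-correction prompt (illustrative wording)."""
    return (
        f"Q: {question}\n"
        f"Your previous answer: {answer}\n"
        "Review your previous answer and answer the question again."
    )


def feedback_prompt_with_repeat(question, answer):
    """Question repeating: re-attach the original question at the end
    of the self-correction prompt, so the model's attention lands on
    the task itself rather than on the feedback instruction."""
    return (
        feedback_prompt(question, answer)
        + f"\nOriginal question: {question}"
    )


print(feedback_prompt_with_repeat("Is the sky blue?", "Yes"))
```

The intuition, per the paper's token-level analysis, is that the trailing restatement counteracts the model's bias toward the most recent instruction text by making the original question the last thing it reads.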

Implications and Future Directions

The findings indicate critical pitfalls in relying solely on intrinsic self-correction to improve LLM reliability. The observation that models readily oscillate between answers because of internal biases and how they interpret feedback prompts calls for a reevaluation of LLM development strategies. Future work should focus on refining self-corrective processes with an emphasis on behavioral adjustment rather than knowledge expansion alone.

Selectively applying the proposed mitigations shows promise in addressing specific self-correction failures, suggesting that further fine-grained tuning of LLMs can extend their accuracy across diverse contexts. Researchers are encouraged to build on this analysis and to explore additional methods and frameworks that harness interpretability for the systematic improvement of LLMs' self-correction routines.
