Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Abstract: Chain-of-thought prompting (CoT) has the potential to improve the explainability of LLM reasoning, but it can also systematically misrepresent the factors influencing a model's behavior -- for example, rationalizing answers in line with a user's opinion. We first create a new dataset of nine different biases that affect GPT-3.5-Turbo and Llama-8b models, spanning spurious few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias without acknowledging the bias's influence in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. Because BCT generalizes to held-out biases and does not require gold labels, it may hold promise for reducing biased reasoning from as-yet-unknown biases and on tasks where ground-truth reasoning is unavailable.
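The training scheme described above can be illustrated with a minimal data-construction sketch: sample the model's CoT on an unbiased prompt, then use that CoT as the fine-tuning target for a biased variant of the same prompt, so the model learns to reason consistently whether or not the biasing feature is present. The helper names and the sycophancy template below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of BCT training-pair construction (hypothetical names/templates).

def add_sycophancy_bias(question: str, suggested_answer: str) -> str:
    """Inject an assumed biasing feature: a stated user opinion."""
    return (f"{question}\n"
            f"I think the answer is {suggested_answer}, "
            f"but I'm curious what you think.")


def make_bct_example(question: str, suggested_answer: str, model) -> dict:
    """Build one unsupervised fine-tuning pair (no gold labels needed)."""
    # 1. Sample the model's own chain-of-thought on the UNBIASED prompt.
    unbiased_cot = model.generate(question)
    # 2. Pair that CoT with the BIASED prompt as the training target, so
    #    fine-tuning pushes the model toward identical reasoning with and
    #    without the biasing feature.
    biased_prompt = add_sycophancy_bias(question, suggested_answer)
    return {"prompt": biased_prompt, "completion": unbiased_cot}
```

Because the target completion comes from the model itself rather than from labeled data, the same recipe applies to tasks where ground-truth reasoning is unavailable; only the bias-injection template changes per bias type.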