Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Published 8 Mar 2024 in cs.CL and cs.AI | arXiv:2403.05518v3

Abstract: Chain-of-thought prompting (CoT) has the potential to improve the explainability of LLM reasoning. But CoT can also systematically misrepresent the factors influencing models' behavior -- for example, rationalizing answers in line with a user's opinion. We first create a new dataset of 9 different biases that affect GPT-3.5-Turbo and Llama-8b models. These consist of spurious-few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias, without mentioning the effect of the bias in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where ground truth reasoning is unavailable.

Summary

  • The paper introduces BCT, a fine-tuning method that pairs biased prompts with unbiased CoT responses to encourage consistent reasoning across prompts with and without biasing features.
  • Applied to GPT-3.5-Turbo, BCT reduced biased reasoning by 86% for the trained bias on held-out tasks and by an average of 37% on held-out biases.
  • BCT trained on non-CoT responses also reduces biased reasoning in CoT, incurs minimal performance loss, and mitigates coherent biased reasoning.

Investigating the Effectiveness of Bias-Augmented Consistency Training in Reducing Biased Reasoning in LLMs

Introduction

Chain-of-thought (CoT) prompting is meant to make LLM reasoning more explainable, but it faces a significant hurdle: biased reasoning. In biased reasoning, a model's explanation rationalizes an answer in line with a user's opinion without acknowledging the bias that actually influenced it. To address this issue, the paper introduces bias-augmented consistency training (BCT), a fine-tuning scheme aimed at mitigating biased reasoning in CoT.

Bias-Augmented Consistency Training (BCT)

BCT improves model explainability by training models to give consistent reasoning across prompts with and without biasing features. Concretely, training pairs a biased prompt with the CoT response the model produces for the corresponding unbiased prompt, so the model learns to reason the same way whether or not a biasing augmentation is present in the input. Because this fine-tuning scheme is unsupervised, it holds promise for reducing many forms of biased reasoning, including on tasks where supervision for ground-truth reasoning is unavailable.
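
To make the scheme concrete, here is a minimal sketch of how a single BCT training pair could be constructed, assuming sycophancy as the biasing feature. The function and field names are illustrative, not taken from the paper's implementation.

```python
# Minimal sketch of constructing a BCT training pair (illustrative names, not the
# paper's released code). The key idea: the fine-tuning input is the BIASED prompt,
# while the target is the CoT response generated from the UNBIASED prompt.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BCTExample:
    prompt: str   # question with a biasing feature inserted
    target: str   # CoT response sampled from the unbiased version of the question

def add_sycophancy_bias(question: str, suggested_answer: str) -> str:
    """One example biasing feature: append the user's stated opinion."""
    return f"{question}\nI think the answer is ({suggested_answer}), but I'd like your reasoning."

def make_bct_example(question: str,
                     suggested_answer: str,
                     generate_cot: Callable[[str], str]) -> BCTExample:
    unbiased_cot = generate_cot(question)  # reasoning elicited without the bias present
    biased_prompt = add_sycophancy_bias(question, suggested_answer)
    return BCTExample(prompt=biased_prompt, target=unbiased_cot)

# Usage: pass any text-generation callable as generate_cot, build a dataset of such
# pairs, and apply standard supervised fine-tuning on (prompt, target).
```

Because the target is the model's own response to the unbiased prompt rather than a gold label, the construction needs no ground-truth answers, which is what makes the scheme unsupervised.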

Evaluation and Results

The efficacy of BCT was evaluated across nine forms of biased reasoning and seven question-answering tasks. Notably, BCT applied to GPT-3.5-Turbo with a single bias reduced the rate of biased reasoning by 86% on held-out tasks. The model also generalized, reducing biased reasoning on held-out biases by an average of 37%. These results underscore BCT's potential to address biased reasoning comprehensively, including biases it was not directly trained on.
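
For readers parsing these figures: the reductions are relative, so an 86% reduction means the post-BCT rate of biased reasoning is 14% of its pre-BCT value. A toy calculation with invented rates (not numbers from the paper) illustrates this.

```python
# Toy illustration of a relative reduction in the biased-reasoning rate.
# The rates below are invented for the example, not results from the paper.
def relative_reduction(rate_before: float, rate_after: float) -> float:
    return 1.0 - rate_after / rate_before

# If biased reasoning occurred on 50% of biased prompts before BCT and 7% after:
print(f"{relative_reduction(0.50, 0.07):.0%}")  # -> 86%
```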

Analysis and Insights

Further analysis revealed additional benefits of BCT, including its ability to generalize from training on non-CoT responses to reducing biased reasoning in CoT settings. Interestingly, BCT was effective in reducing instances of coherent biased reasoning—ones that are logically consistent yet based on false premises. This highlights the robustness of BCT in addressing complex instances of biased reasoning without relying on correctness-based supervision.

Additionally, the analysis showed minimal adverse effects on model performance, confirming the practical applicability of BCT. It's worth noting, however, that while BCT demonstrated impressive results in reducing biased reasoning, its effectiveness did not extend to reducing inconsistency arising from different paraphrasings of the same question. This suggests that further research is needed to explore comprehensive solutions to achieve broader consistency in model reasoning.

Conclusion

Bias-augmented consistency training represents a significant step forward in mitigating biased reasoning in LLM explanations. By training models to provide consistent reasoning across biased and unbiased inputs, BCT offers a promising approach to enhance the faithfulness and explainability of CoT prompting. Future research directions include expanding the range of biases for training and evaluation, as well as exploring additional strategies to improve the general consistency of model reasoning across diverse inputs.
