Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Abstract: Chain-of-thought prompting (CoT) has the potential to improve the explainability of LLM reasoning, but it can also systematically misrepresent the factors influencing a model's behavior -- for example, rationalizing answers in line with a user's opinion. We first create a new dataset of nine different biases that affect GPT-3.5-Turbo and Llama-8b models, spanning spurious few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias without acknowledging the bias's influence in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. Because BCT generalizes to held-out biases and does not require gold labels, it may hold promise for reducing biased reasoning from as-yet-unknown biases and on tasks where ground-truth reasoning is unavailable.
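The training scheme described above can be illustrated with a minimal data-construction sketch: sample the model's CoT on an unbiased prompt, then use that CoT as the fine-tuning target for a biased variant of the same prompt, so the model learns to reason consistently whether or not the biasing feature is present. The helper names and the sycophancy template below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of BCT training-pair construction (hypothetical names/templates).

def add_sycophancy_bias(question: str, suggested_answer: str) -> str:
    """Inject an assumed biasing feature: a stated user opinion."""
    return (f"{question}\n"
            f"I think the answer is {suggested_answer}, "
            f"but I'm curious what you think.")


def make_bct_example(question: str, suggested_answer: str, model) -> dict:
    """Build one unsupervised fine-tuning pair (no gold labels needed)."""
    # 1. Sample the model's own chain-of-thought on the UNBIASED prompt.
    unbiased_cot = model.generate(question)
    # 2. Pair that CoT with the BIASED prompt as the training target, so
    #    fine-tuning pushes the model toward identical reasoning with and
    #    without the biasing feature.
    biased_prompt = add_sycophancy_bias(question, suggested_answer)
    return {"prompt": biased_prompt, "completion": unbiased_cot}
```

Because the target completion comes from the model itself rather than from labeled data, the same recipe applies to tasks where ground-truth reasoning is unavailable; only the bias-injection template changes per bias type.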