An Evaluation Framework for Assessing the Robustness of Reasoning in LLMs
The opacity of decision-making in large language models (LLMs) presents substantial challenges in domains such as education and healthcare, where reliable reasoning is paramount. A paper by Enyi Jiang et al. addresses this problem by introducing MATCHA, a framework for evaluating the consistency and robustness of Chain-of-Thought (CoT) prompting in LLMs.
CoT prompting has been heralded for enhancing the reasoning capabilities of LLMs, enabling their application to complex tasks involving multi-step inference and mathematical proofs. However, the mechanisms underlying these models remain poorly understood, especially the alignment between generated reasoning and final answers. The authors identify key vulnerabilities in CoT reasoning under perturbations to the input, which can cause LLMs to produce incorrect or nonsensical reasoning while still arriving at correct answers.
MATCHA serves as a diagnostic tool for dissecting the fragility of LLM reasoning. It employs both token-level and embedding-level perturbations to systematically evaluate how alterations to the input can induce inconsistencies between reasoning and answers. The framework's primary contribution is a set of metrics that assess CoT robustness across different LLM architectures and reasoning scenarios.
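As a concrete illustration of the kind of metric such a framework might define, the sketch below computes a reasoning-answer misalignment rate: the fraction of perturbed examples where the answer survives but the chain of thought does not. The record schema and function name here are illustrative assumptions, not MATCHA's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Outcome of one perturbed example (hypothetical schema)."""
    answer_correct: bool   # did the model still give the right final answer?
    reasoning_valid: bool  # was the generated CoT judged sound?

def misalignment_rate(records):
    """Fraction of examples where the answer survives the perturbation
    but the reasoning does not -- one plausible way to quantify the
    reasoning-answer inconsistency the framework probes for."""
    if not records:
        return 0.0
    broken = sum(1 for r in records if r.answer_correct and not r.reasoning_valid)
    return broken / len(records)

records = [
    EvalRecord(answer_correct=True,  reasoning_valid=True),
    EvalRecord(answer_correct=True,  reasoning_valid=False),  # answer right, CoT broken
    EvalRecord(answer_correct=False, reasoning_valid=False),
    EvalRecord(answer_correct=True,  reasoning_valid=False),
]
print(misalignment_rate(records))  # 0.5
```

Judging `reasoning_valid` is itself a hard problem (human annotation or an LLM judge); the metric above only aggregates whatever judgment is supplied.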
MATCHA’s token-level perturbations are crafted through random token insertion followed by gradient-informed replacement, targeting the input regions with the highest impact on reasoning accuracy. Embedding-level perturbations, by contrast, manipulate the vector-space representation of the input imperceptibly, preserving the correct answer while distorting the internal reasoning process. With these methods, MATCHA targets reasoning-specific vulnerabilities of LLMs, providing insights that extend beyond traditional adversarial attacks focused on output correctness alone.
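A minimal, model-free sketch of the two-stage token perturbation described above: random tokens are inserted, then each inserted slot is greedily replaced with the candidate that maximizes a loss score. In MATCHA that score would come from model gradients; here `loss_fn` is an arbitrary scalar scorer standing in as a proxy for gradient saliency, so the function and its signature are illustrative assumptions rather than the paper's algorithm.

```python
import random

def token_attack(tokens, vocab, loss_fn, n_inserts=2, rng=None):
    """Toy two-stage token perturbation: (1) insert random tokens at
    random positions, (2) greedily replace each inserted token with the
    vocabulary candidate that maximizes a loss proxy."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    positions = []
    # Stage 1: random insertion, keeping earlier positions up to date.
    for _ in range(n_inserts):
        pos = rng.randrange(len(tokens) + 1)
        tokens.insert(pos, rng.choice(vocab))
        positions = [p + 1 if p >= pos else p for p in positions]
        positions.append(pos)
    # Stage 2: greedy replacement, scored by the (proxy) loss.
    for pos in positions:
        best = max(vocab, key=lambda tok: loss_fn(tokens[:pos] + [tok] + tokens[pos + 1:]))
        tokens[pos] = best
    return tokens

# Toy loss proxy: reward occurrences of a distractor token.
loss = lambda toks: toks.count("xx")
out = token_attack(["2", "+", "2", "=", "?"], vocab=["xx", "yy"], loss_fn=loss)
print(out.count("xx"))  # 2 -- both inserted slots converge to the distractor
```

The structure (random placement, then greedy score-driven substitution) is what matters here; swapping the toy proxy for a gradient of the model's loss with respect to input embeddings recovers the usual gradient-informed attack recipe.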
The paper reports an evaluation of MATCHA across multiple open-source LLMs, including Llama-3-8B, Mistral-7B, and DeepSeek-R1-7B, revealing pronounced sensitivity to both token- and embedding-level perturbations on multi-step and commonsense reasoning tasks. Notably, the perturbations break reasoning-answer alignment at varying success rates, indicating that different LLM architectures have distinct weaknesses under systematic perturbation.
Moreover, the study challenges existing assumptions about answer-reasoning consistency within LLMs by demonstrating non-trivial transfer rates of adversarial examples from open-source models to black-box LLMs such as GPT-3.5-turbo and GPT-4. This cross-model transferability underscores the pervasive nature of the vulnerabilities identified and calls for advancements in training methodologies to bolster reasoning-answer alignment.
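The transfer experiment reduces to a simple conditional rate: of the adversarial examples that broke an open-source source model, how many also break a black-box target. A hedged sketch of that computation, with illustrative data rather than the paper's results:

```python
def transfer_rate(source_successes, target_successes):
    """Among examples where the attack succeeded on the source model,
    the fraction that also succeeded on the target model. Field names
    and data are illustrative, not taken from the paper."""
    transferred = [t for s, t in zip(source_successes, target_successes) if s]
    return sum(transferred) / len(transferred) if transferred else 0.0

src = [True, True, False, True]   # attack broke the open-source model?
tgt = [True, False, False, True]  # same example also broke the black-box model?
print(round(transfer_rate(src, tgt), 3))  # 0.667
```

Conditioning on source success matters: examples that never worked on the source model say nothing about transferability and would dilute the rate if included.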
The findings from MATCHA carry implications for both practical application and theoretical development in AI research. Understanding and addressing the reasoning vulnerabilities exposed by perturbations is crucial for improving the trustworthiness and robustness of LLMs, especially in high-stakes fields that demand reliable decision-making. As the paper argues, the pursuit of reasoning-driven architectures demands continued exploration and refinement of CoT techniques to ensure the reliability and efficacy of LLM systems.
In conclusion, Jiang et al.'s work contributes valuable insight into the fragility of LLM reasoning, offering a framework that not only assesses robustness but also suggests methodologies to guide the evolution of reasoning-enhanced AI systems. Future work in this area may involve refining perturbation techniques, developing better coherence-assessment metrics, and investigating architectural modifications that harden reasoning mechanisms against semantic perturbations while maintaining answer integrity.