CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations

Published 19 Apr 2025 in cs.AI and cs.SE | arXiv:2504.14119v2

Abstract: LLMs have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains insufficiently explored. We present CodeCrash, a comprehensive stress-testing benchmark comprising 1,279 questions from two established datasets, CruxEval and LiveCodeBench, designed to evaluate model reasoning reliability under non-standard coding environments. We systematically evaluate 17 LLMs across input and output prediction tasks using direct and Chain-of-Thought prompting approaches, revealing that LLMs are particularly vulnerable to disorganized code and overly reliant on natural language cues: aggregated structural perturbations result in over 14 percentage points (pp) of degradation, while textual perturbations cause a performance drop of over 11 pp. Moreover, self-reflective mechanisms in state-of-the-art reasoning models significantly increase token usage by 2-3 times, reduce output confidence, and even lead to catastrophic reasoning failures when faced with targeted perturbations -- for instance, QwQ-32B generates over 12,000 redundant tokens under reasoning-level perturbations. CodeCrash provides a rigorous benchmark for evaluating robustness in code understanding, guiding future research toward more reliable and resilient LLMs in code reasoning. The benchmark code, perturbed datasets, and full leaderboard are publicly available at https://cuhk-arise.github.io/CodeCrash/ .
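The abstract refers to input and output prediction tasks drawn from CruxEval and LiveCodeBench. As a rough illustration of that task format, the following is a made-up example, not an item from either benchmark, and the prompt wording used in CodeCrash may differ:

```python
# Hypothetical output-prediction item in the CruxEval style (illustrative only).
def f(s):
    # Reverse the string and append the original.
    return s[::-1] + s

# Output prediction: given the call below, the model must predict the result.
# Input prediction reverses the direction: given the output "baab",
# the model must supply an input that produces it.
assert f("ab") == "baab"
```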

Summary

Analyzing the Impact of Structural and Semantic Perturbations on LLM Reasoning: A Study of CodeCrash

The paper "CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations" investigates how robustly Large Language Models (LLMs) reason about code when that code is subjected to structural and semantic perturbations. Researchers from The Chinese University of Hong Kong and Johns Hopkins University aim to characterize how resilient LLM reasoning remains when inputs contain anomalies that can misdirect it.

Methodological Approach

The authors introduce a systematic framework for evaluating LLM responses to controlled perturbations of code and its accompanying text. The novelty lies in the dual focus on structural perturbations, which reorganize the code without changing what it computes, and textual (semantic) perturbations, which leave the code's structure and behavior intact but alter the natural-language cues, such as comments and identifier names, on which models tend to rely. The benchmark extends two established datasets, CruxEval and LiveCodeBench, into 1,279 perturbed questions that simulate realistic scenarios in which such deviations occur; models are evaluated on input and output prediction with both direct and Chain-of-Thought prompting. A minimal sketch of the two perturbation families appears below.
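The sketch below is illustrative only: the specific operators shown (identifier obfuscation and a misleading comment) are assumptions chosen to match the abstract's description of "disorganized code" and "natural language cues", not CodeCrash's actual transformations.

```python
# Minimal sketch of the two perturbation families; the concrete operators
# below are illustrative assumptions, not CodeCrash's actual transformations.

def sum_even_squares(numbers):
    """Return the sum of the squares of the even numbers."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total

# Structural perturbation (hypothetical): the code is disorganized --
# descriptive identifiers replaced with opaque ones, docstring removed --
# while the computed result stays identical.
def f(a):
    b = 0
    for c in a:
        if c % 2 == 0:
            b += c * c
    return b

# Textual perturbation (hypothetical): structure and behavior are untouched,
# but a misleading natural-language cue is injected; a model that trusts the
# comment rather than the code will mispredict the output.
def sum_even_squares_misleading(numbers):
    # Returns the product of all odd numbers (deliberately wrong hint).
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total

assert f([1, 2, 3, 4]) == sum_even_squares([1, 2, 3, 4]) == 20
```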

Core Findings and Numerical Results

The study reveals how differently LLMs are affected by each perturbation family. Aggregated structural perturbations are the most damaging, costing over 14 percentage points (pp) of accuracy and showing that models are highly vulnerable to disorganized code, while textual perturbations cause a drop of over 11 pp, implicating models' heavy reliance on natural-language cues for accurate reasoning. The authors further report that self-reflective mechanisms in state-of-the-art reasoning models inflate token usage by 2-3 times, reduce output confidence, and can trigger catastrophic reasoning failures under targeted perturbations; QwQ-32B, for example, generates over 12,000 redundant tokens under reasoning-level perturbations.

Discussion and Implications

This paper's implications extend to deploying LLMs in environments where inputs may be unpredictable or intentionally manipulated. By understanding these models' limitations, researchers and practitioners can devise more robust defensive strategies and improve the reliability of LLM deployments in dynamic, real-world contexts. Furthermore, the insights gained from this study could inform the iterative development of LLM architectures that inherently mitigate vulnerability to perturbation.

Future Perspectives in AI

Given the ongoing advancements in LLM capabilities, this research suggests the potential for developing hybrid models that can dynamically adjust their processing pathways in response to detected perturbations, enhancing robustness without significant human oversight. This paper lays the groundwork for future exploration into automated strategies for managing uncertainties and imperfections in input data, an increasing necessity as LLMs are integrated into cross-disciplinary applications involving critical decision-making processes.

In conclusion, the paper provides a substantial contribution to the understanding of LLM behavior in the presence of input anomalies. It outlines a clear path towards greater resilience of AI models, which is crucial for their broader acceptance and utility across varied application domains. This research stands as a compelling reminder of the complexities inherent in code comprehension and reasoning, and of the need for ongoing scrutiny and enhancement of model robustness.
