Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code

Published 25 Nov 2025 in cs.SE | (2511.20933v1)

Abstract: LLMs are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family ($14$B, $32$B, $70$B) under varying levels of guidance, from simple \textit{Verification} to \textit{Guided} and \textit{Open-ended Generation}, while varying contextual noise by injecting distractor elements. While models exhibit a solid baseline understanding of both concepts in ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over $50\%$. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation. However, this resilience also fails when all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing \textit{cognitive shortcutting} for coupling versus a more exhaustive (yet still failing) analysis for cohesion. To summarize, while LLMs can provide reliable assistance for recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program understanding capabilities.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates that DeepSeek-R1 models achieve robust cohesion detection (F1-score up to 0.825) but experience significant performance drops for coupling in open-ended tasks.
The methodology uses controlled noise injection and varied prompting strategies based on cognitive load to simulate realistic software design challenges in Java code.
Findings emphasize the need for enhanced reasoning frameworks to enable LLMs to autonomously handle complex software design evaluations.

Hierarchical Evaluation of Software Design Capabilities of LLMs of Code

Introduction

The rapid integration of LLMs in software engineering tasks has prompted researchers to assess their understanding of critical software design principles. The study "Hierarchical Evaluation of Software Design Capabilities of LLMs of Code" (2511.20933) seeks to elucidate the proficiency of LLMs in understanding two fundamental design principles: cohesion and coupling. This paper critically evaluates the DeepSeek-R1 model family (14B, 32B, 70B), identifying areas of strength and fragility in reasoning, particularly under varying levels of cognitive guidance and contextual complexity.

Methodology

This study employs a structured, multifaceted experimental approach encompassing several methodological steps:

Data Collection and Code Transformation: Utilizing a selection of open-source Java projects, the authors systematically generate code exhibiting defined manifestations of high coupling and low cohesion. This controlled experimental setup minimizes data contamination risks and offers a platform for reproducible results.
Figure 1: An overview of the transformation logic for producing loosely cohesive classes.
Prompt Design: Three prompting strategies align with cognitive load variations: Verification, Guided Generation, and Open-ended Generation. This framework, inspired by Bloom's Taxonomy, enables the exploration of models' reasoning capabilities across distinct levels of complexity.
Controlled Distortion: By injecting noise into the code through irrelevant methods or distractor classes, the authors simulate realistic environments that test the robustness of LLMs in more challenging contexts.
Evaluation Metrics: The study employs F1-score and Adjusted Rand Index (ARI) to measure efficacy in different tasks, effectively capturing the models' performance across varying task formulations.

Results

Basic Understanding

The results reveal that under optimal, low-noise conditions, the LLMs demonstrate a reasonable understanding of cohesion and coupling. The $70$B variant achieves an F1-score of up to $0.825$ for cohesion and $0.899$ for coupling, indicating a solid baseline capability for recognizing poorly designed code in simple contexts.

Cognitive Load Impact

Performance varies significantly with the level of guidance. As tasks progress from verification to open-ended generation, models exhibit increasing difficulty, particularly in scenarios requiring full autonomy. Notably, the $70$B variant's performance on coupling drops by over $31\%$ between Verification and Open-ended tasks (Figure 2).

Figure 2: Impact of the cognitive load on the performance of each LLM on the cohesion experiments.

Robustness to Noise

Analysis underlines the models' susceptibility to contextual noise. Coupling tasks, where reasoning about inter-module dependencies is tested, reveal F1 scores plummeting by more than $50\%$ as distortion ratios escalate. Conversely, models maintain stability on cohesion tasks, except in highly autonomous, noisy settings (Figure 3).

Figure 3: Performance of DeepSeek-R1 14B, 32B and 70B on the cohesion task under each distortion level.

Reasoning Trace Analysis

Cognitive shortcutting is evident in reasoning tasks, with models generating inexplicably shorter Chain-of-Thought traces in open-ended scenarios for coupling. In cohesion tasks, trace lengths increase with cognitive demand, reflecting an effort to grapple with complexity.

Implications and Future Directions

While LLMs exhibit foundational knowledge necessary to assist developers in maintaining code quality, the study identifies clear limitations in their reasoning capabilities, particularly in scaling complexity. This highlights the need for enhanced program understanding frameworks to guide LLMs in environments rich with contextual noise.

Future research should focus on extending these evaluations to other programming languages and paradigms, aiming to develop heuristics enabling LLMs to effectively manage combinatorial challenges. Further studies on leveraging these models for active problem-solving tasks, such as autonomous code refactoring, could reveal deeper insights into their potential roles in software engineering workflows.

Conclusion

The evaluation of DeepSeek-R1 models accentuates their current utility as powerful but constrained tools in software design, limited by their reliance on guidance and vulnerability to noise. Understanding and addressing these constraints marks a pivotal step toward enabling LLMs as fully autonomous software engineering agents.