A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code

Published 19 Nov 2025 in cs.SE | (2511.15817v5)

Abstract: Recent advances in LLMs have accelerated their adoption in software engineering contexts. However, concerns persist about the structural quality of the code they produce. In particular, LLMs often replicate poor coding practices, introducing code smells (i.e., patterns that hinder readability, maintainability, or design integrity). Although prior research has examined the detection or repair of smells, we still lack a clear understanding of how and when these issues emerge in generated code. This paper addresses this gap by systematically measuring, explaining and mitigating smell propensity in LLM-generated code. We build on the Propensity Smelly Score (PSC), a probabilistic metric that estimates the likelihood of generating particular smell types, and establish its robustness as a signal of structural quality. Using PSC as an instrument for causal analysis, we identify how generation strategy, model size, model architecture and prompt formulation shape the structural properties of generated code. Our findings show that prompt design and architectural choices play a decisive role in smell propensity and motivate practical mitigation strategies that reduce its occurrence. A user study further demonstrates that PSC helps developers interpret model behavior and assess code quality, providing evidence that smell propensity signals can support human judgement. Taken together, our work lays the groundwork for integrating quality-aware assessments into the evaluation and deployment of LLMs for code.


Summary

The paper introduces the Propensity Smelly Score (PSC) to quantify and address code smells in LLM-generated code.
It employs a structural causal model to evaluate factors like prompt formulation and model architecture that impact code quality.
Experimental results indicate that prompt-based mitigation strategies reduce smell propensity, and that PSC captures structural quality signals that BLEU and CodeBLEU miss.

Summary of "A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code" (2511.15817)

The paper investigates how and when code smells arise in code generated by LLMs, and proposes a causal framework to measure, explain, and mitigate them using a metric termed the Propensity Smelly Score (PSC). It identifies the factors that most influence smell generation and proposes prompt-based interventions, grounded in the causal analysis, to improve code quality.

Introduction

LLMs are increasingly used to automate software engineering tasks ranging from code completion to test generation. However, the generated code often replicates poor coding practices, resulting in code smells that detract from readability and maintainability. Traditional evaluations rely largely on similarity metrics such as BLEU and CodeBLEU, which compare generated code against a reference and do not capture structural and design quality concerns such as code smells. The summary attributes the prevalence of these smells to patterns inherited from training corpora and to architectural decisions.

Propensity Smelly Score (PSC)

PSC is introduced as a probabilistic metric that estimates the likelihood that an LLM generates a specific type of code smell. It aggregates token-level probabilities within a code snippet into a continuous measure of smell propensity. The robustness of PSC is validated with semantic-preserving transformations: the score should remain stable when the code is rewritten without changing its meaning, which supports the claim that it captures a structural quality signal rather than surface form. PSC therefore offers insight beyond surface-level correctness into how the model behaves during generation.

Figure 1: Propensity Smelly Score computation with examples of code smells.
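The aggregation described above can be illustrated with a minimal sketch. It assumes that the per-token probability comes from a softmax over the model's logits at each generation step and that the aggregate is a plain mean over the tokens covered by the smell; the summary does not specify the paper's exact aggregation, so the function names, the span convention and the choice of mean are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probabilities(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Probability the model assigned to each token that was actually emitted."""
    return [softmax(logits)[tid] for logits, tid in zip(step_logits, token_ids)]


def propensity_score(token_probs: Sequence[float], span: Tuple[int, int]) -> float:
    """Mean probability of the tokens inside a smell span (assumed aggregation).

    `span` is a half-open (start, end) range of token indices that a smell
    detector flagged; this convention is hypothetical.
    """
    start, end = span
    selected = token_probs[start:end]
    if not selected:
        raise ValueError("smell span is empty")
    return sum(selected) / len(selected)


# Toy usage: four generated tokens, with tokens 1 and 2 forming the smell.
probs = [0.9, 0.5, 0.7, 0.1]
score = propensity_score(probs, (1, 3))
```

A higher score under this sketch means the model was more confident while emitting the smelly tokens. A robustness check in the spirit of the paper would recompute the score after a semantic-preserving rewrite and compare the two values.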

Methodology

The methodology revolves around four research questions: how to measure smell propensity, which generation factors explain it, how it can be mitigated, and whether PSC has practical value for developers. A structural causal model (SCM) treats model architecture, model size, generation strategy, and prompt type as interventions and quantifies their causal effect on PSC. The results of this analysis informed the design of prompt-based mitigation strategies that steer models toward cleaner code.

Figure 2: Overview of the structural causal model and its treatment interventions.
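One common way to quantify such a causal effect is an average treatment effect (ATE) estimated with backdoor adjustment over a confounder. The sketch below stratifies on a single discrete confounder and averages the per-stratum difference in mean PSC. It is a generic illustration, not the paper's estimator: the binary treatment (for example, a mitigation prompt versus a baseline prompt), the confounder labels and the toy numbers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# (treatment, confounder stratum, PSC outcome); treatment is 0 or 1.
Record = Tuple[int, Hashable, float]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def adjusted_ate(records: Iterable[Record]) -> float:
    """Stratified (backdoor-adjusted) ATE of a binary treatment on PSC.

    Strata that lack either treated or control samples are skipped,
    because no within-stratum contrast can be formed for them.
    """
    strata: Dict[Hashable, Dict[int, List[float]]] = defaultdict(
        lambda: {0: [], 1: []}
    )
    for treatment, stratum, outcome in records:
        strata[stratum][treatment].append(outcome)

    usable = [g for g in strata.values() if g[0] and g[1]]
    total = sum(len(g[0]) + len(g[1]) for g in usable)
    if total == 0:
        raise ValueError("no stratum contains both treated and control samples")

    ate = 0.0
    for groups in usable:
        weight = (len(groups[0]) + len(groups[1])) / total
        ate += weight * (mean(groups[1]) - mean(groups[0]))
    return ate


# Toy data: the hypothetical treatment lowers PSC by 0.2 in every stratum.
toy_records = [
    (0, "small", 0.8), (1, "small", 0.6),
    (0, "large", 0.7), (1, "large", 0.5),
    (0, "large", 0.7), (1, "large", 0.5),
]
effect = adjusted_ate(toy_records)
```

A negative value here would mean the treatment reduces smell propensity on average. Stratification only removes bias from the confounders that are actually observed and included, so the result depends on the assumed causal graph.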

Experimental Results

The experiments show that PSC carries information about the presence of smells that BLEU and CodeBLEU do not, indicating that PSC aligns more closely with structural quality. The causal analyses identify prompt formulation and model architecture as the most influential factors in smell propensity, which motivates prompt-based mitigation as a practical intervention. Varying model size, by contrast, has limited effect.

Figure 3: Information gain results comparing PSC with traditional metrics.
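Information gain, in its standard form, measures how much knowing a metric reduces uncertainty about a label. The sketch below computes it for a single threshold split of a continuous score against a binary smell label. The summary does not say how the paper discretizes scores or defines the label, so the threshold split and the toy data are assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Reduction in label entropy after splitting scores at `threshold`."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal length")
    below = [y for s, y in zip(scores, labels) if s < threshold]
    above = [y for s, y in zip(scores, labels) if s >= threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (below, above))
    return entropy(labels) - conditional


# Toy comparison: one score separates smelly from clean snippets, one does not.
smelly = [0, 0, 1, 1]
informative_scores = [0.1, 0.2, 0.8, 0.9]
uninformative_scores = [0.1, 0.9, 0.1, 0.9]

gain_informative = information_gain(informative_scores, smelly, 0.5)
gain_uninformative = information_gain(uninformative_scores, smelly, 0.5)
```

Under this toy setup the first score removes all uncertainty about the label and the second removes none, which is the kind of contrast a comparison between PSC and similarity metrics would look for.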

Discussion

The findings detail varying impacts of prompt structure, model architecture, and decoding strategies on code quality, establishing prompt-based strategies as particularly effective. Notably, architectural adjustments are shown to influence internal token distributions, affecting structural decisions in code synthesis.

Figure 4: Results from user study indicating the practical influence of PSC on developer judgement.

Conclusions and Future Work

The study establishes PSC as a robust measure for evaluating LLM-generated code for smells and advocates for its use to guide interventions in model design and generation strategies. Future extensions should explore PSC across diverse languages and model architectures, facilitating integrated quality assessments beyond code smells, and broadening its applicability to various programming contexts.

Overall, this work contributes to understanding generation-induced code quality issues and offers actionable insights for improving the structural integrity of LLM-generated software. Refining the mitigation strategies toward comprehensive, multi-dimensional code evaluation remains a further direction.
