Multi-Turn Code Generation
- Multi-turn code generation is an iterative process that decomposes complex coding tasks into sequential, feedback-driven steps for modular refinement.
- It leverages dependency-aware decomposition and explicit feedback loops to incrementally enhance functional correctness and constraint satisfaction.
- Empirical benchmarks show that multi-turn workflows can incur a 20–27 point drop in pass rates compared to single-turn, highlighting the need for robust model architectures.
Multi-turn code generation refers to the iterative, stepwise paradigm in which code is produced and refined across a series of conversational or feedback-driven interactions. This process mirrors real-world software engineering, where solutions are seldom authored monolithically but are decomposed into modular components, incrementally implemented, and repeatedly revised based on local tests, user feedback, or imposed constraints. Multi-turn workflows have become essential for benchmarking, analyzing, and improving LLMs in contexts ranging from competitive programming to full-stack application development. This article presents the foundational principles, benchmark methodologies, key empirical results, and ongoing research challenges in multi-turn code generation.
1. Formalization and Core Principles
In multi-turn code generation, a task is decomposed into a sequence of intermediate states, where each turn involves the model receiving context from previous turns and producing new or modified code segments. The canonical formalization, as instantiated in "CodeFlowBench," is as follows: suppose a problem $P$ is factored into subproblems $p_1, \dots, p_n$, each with specification $s_i$. At turn $t$, the model $\mathcal{M}$ receives the current function signature $f_t$, any dependent helper signatures, prior implementations $c_1, \dots, c_{t-1}$, and problem background $B$, and must synthesize the next implementation $c_t$:

$$c_t = \mathcal{M}(s_t, f_t, c_1, \dots, c_{t-1}, B).$$

By contrast, single-turn code generation attempts to emit all code segments in a single decoding step.
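The turn-by-turn process can be sketched as a simple loop. This is an illustrative skeleton, not any benchmark's actual harness; `generate` stands in for a hypothetical LLM call.

```python
# Minimal sketch of the multi-turn loop: one subproblem per turn, each call
# conditioned on the spec s_t, signature f_t, prior implementations c_{<t},
# and background B. `generate` is a hypothetical stand-in for an LLM call.
def multi_turn_generate(subproblems, background, generate):
    """Produce implementations c_1..c_n in dependency (topological) order."""
    implementations = []
    for spec, signature in subproblems:
        context = {
            "spec": spec,                          # s_t
            "signature": signature,                # f_t
            "prior": list(implementations),        # c_{<t}
            "background": background,              # B
        }
        implementations.append(generate(context))  # c_t = M(s_t, f_t, c_{<t}, B)
    return implementations
```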
Variants of this paradigm extend to settings involving external feedback (e.g., compiler errors, unit test failures, visual artifacts), multi-modal inputs (paired textual/visual instructions), or dynamic user queries over large codebases or front-end interfaces (Wang et al., 30 Apr 2025, Wu et al., 5 Dec 2025).
Multi-turn code generation is fundamentally distinguished by:
- Sequential conditioning on conversational or event-driven context.
- Explicit modeling of code dependencies, structural composition, and iterative revision.
- The capacity for localized correction, self-repair, and longitudinal constraint adherence across turns.
2. Benchmarking Frameworks and Task Construction
Modern benchmarks canonically implement multi-turn code generation via structured pipelines capable of decomposing tasks, synthesizing feedback, and evaluating both functional correctness and higher-order requirements. The following characterize prevailing methodologies:
- Dependency-aware decomposition: "CodeFlowBench" implements AST-based analysis to decompose Codeforces solutions into function-level subproblems (up to 5,258 tasks), extracting topological orders and constructing dependency graphs. Each function is paired with a specification and evaluated with dedicated unit tests (Wang et al., 30 Apr 2025).
- Real-world context and containerization: "CodeAssistBench" sources multi-turn dialogues from real GitHub issues, reconstructs full environments and codebases, and evaluates model interventions via Dockerized containers and simulated user–maintainer interactions. Satisfaction conditions are explicitly tracked (Kim et al., 14 Jul 2025).
- Hierarchical, constraint-centric task synthesis: "MultiCodeIF" defines a taxonomy of 9 constraint categories and 27 types, supporting both single- and multi-level instruction chains, and injects iterative feedback through an automated validator-driven loop. Tasks span 14 languages and capture both functional and non-functional constraints (Duan et al., 1 Jul 2025).
- Multi-modal and usability-centric settings: "FronTalk" constructs dialogues combining textual and visual instructions per turn (e.g., annotated screenshots), and employs a web agent for validation and pairwise usability evaluation, measuring both feature implementation and UX quality (Wu et al., 5 Dec 2025).
- Security and code-diff workflows: "MT-Sec" converts existing single-turn secure coding tasks into multi-turn expansions, editing, or refactoring sequences, preserving original semantic requirements and test harnesses for correctness and security validation. Code-diff settings are specifically benchmarked for their heightened risk of specification drift (Rawal et al., 13 Oct 2025).
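The dependency-aware decomposition step described above can be approximated with standard Python tooling. The sketch below is not CodeFlowBench's implementation; it assumes the solution is a single Python source string and uses `ast` to find caller/callee relations among top-level functions and `graphlib` to order them helpers-first.

```python
import ast
from graphlib import TopologicalSorter

def function_dependencies(source: str) -> dict:
    """Map each top-level function name to the set of sibling functions it calls."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    deps = {}
    for name, node in funcs.items():
        called = {c.func.id for c in ast.walk(node)
                  if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
        # Keep only calls to other top-level functions (the dependency edges).
        deps[name] = called & funcs.keys()
    return deps

def subproblem_order(source: str) -> list:
    """Topological order: helpers before their callers, i.e. the turn order."""
    return list(TopologicalSorter(function_dependencies(source)).static_order())
```

Each function in the resulting order would then be paired with its specification and unit tests to form one turn of the task.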
Central metrics include Pass@k (unit test success), constraint satisfaction rates, average pass depth (APD), forgetting rate, usability scores, and security outcomes, typically measured after the final turn or iteratively across rounds.
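For Pass@k, the standard unbiased estimator is used: given $n$ generations per task of which $c$ pass all unit tests, the probability that at least one of $k$ sampled generations is correct is $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples drawn
    from n generations (c of which pass all unit tests) is correct."""
    if n - c < k:
        return 1.0  # fewer failures than samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```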
3. Algorithmic Methods and Learning Paradigms
A spectrum of algorithmic approaches—and their associated learning theories—shape the state-of-the-art in multi-turn code generation:
- Single-step recoverable MDPs: Both μCode and COBALT leverage the observation that multi-turn code generation is a one-step recoverable process: from any intermediate code state, a correct solution is in principle reachable in one step. This enables off-policy contextual bandit optimization or imitation learning, circumventing the need for full-horizon RL and allowing direct utilization of single-step rewards (Jain et al., 27 Feb 2025, Chen et al., 3 Feb 2026).
- Reflective optimization and self-correction: "Murphy" extends group-based relative policy optimization (GRPO) with multi-turn rollouts and max-reward credit assignment, ensuring that models learn to expect and act on intermediate qualitative and quantitative feedback, propagating final-turn rewards up the rollout tree (Ekbote et al., 11 Nov 2025).
- Tree-structured program search: "Tree-of-Code" eschews strictly sequential multi-turn actions in favor of growing a tree of complete code programs, with successively refined branches prompted by failures. Each node reflects on errors and proposes new end-to-end candidates, improving both solution quality and diversity (Ni et al., 2024).
- Prompt engineering for style and robustness: Prompting strategies—combining abstract instructions ("minimal code only") and explicit exemplars ("example function")—exhibit significant influence on both code style persistence ("compression" and "expansion discipline") and accuracy across turns (Bohr, 17 Nov 2025, Zheng et al., 2024).
- Multi-modal and feedback-aware agents: Agent-based systems, such as "AceCoder," integrate critique loops over all historical instructions, employ web agent feedback for longitudinal feature validation, and iteratively re-prompt to suppress forgetting and drift in multi-turn UIs (Wu et al., 5 Dec 2025).
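Of the mechanisms above, max-reward credit assignment is the easiest to make concrete. The sketch below is a simplified illustration (not Murphy's implementation): each node in a rollout tree holds the reward observed at that turn, and the best reward found anywhere in a subtree is propagated back up, so earlier turns receive credit for the best continuation they enabled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    reward: float                  # reward observed at this turn's rollout
    children: list = field(default_factory=list)

def assign_max_reward(node: Node) -> float:
    """Overwrite each node's training signal with the maximum reward in its
    subtree, propagating final-turn rewards up the rollout tree."""
    best = node.reward
    for child in node.children:
        best = max(best, assign_max_reward(child))
    node.reward = best
    return best
```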
4. Empirical Results and Model Performance
Comprehensive experimentation across diverse benchmarks consistently demonstrates severe performance degradation in multi-turn code generation relative to single-turn baselines. Key empirical findings include:
| Model/Scenario | Multi-turn Pass@1 | Single-turn Pass@1 | Forgetting/Constraint Drop | Context |
|---|---|---|---|---|
| o1-mini (CFB) | 20.8% | 37.8% | — | (Wang et al., 30 Apr 2025) |
| GPT-4o-mini (CFB) | 13.8% | 22.0% | — | (Wang et al., 30 Apr 2025) |
| Deepseek-R1 (CFB) | 20.5% | 46.1% | — | (Wang et al., 30 Apr 2025) |
| Claude-3-7-Sonnet (MultiCodeIF) | 63.0% → 83.4% (4 turns) | — | +20.4 pts (via repair) | (Duan et al., 1 Jul 2025) |
| GPT-4o (FronTalk, PR) | 56.0% | — | 21.4% forgetting | (Wu et al., 5 Dec 2025) |
| Aider+GPT-5T (MT-Sec, CCS) | — | ~53% (ST) | –23% (MT vs ST) | (Rawal et al., 13 Oct 2025) |
| ChatGPT 4.1 Mini (CAB, Recent) | ≤16.49% | 70–83% (StackOverflow) | — | (Kim et al., 14 Jul 2025) |
Notably, models exhibit a rapid collapse in pass rate as the dependency graph becomes more complex (Dependency Structure Complexity > 1.2) or as hierarchical constraint counts grow (e.g., MultiCodeIF HSR drops from 54.5% to 18.8% in multi-level tasks). Integrated feedback loops and multi-turn repair, when employed, significantly boost constraint satisfaction (e.g., +20.4 points over 4 repair rounds for Claude-3-7-Sonnet). However, functional correctness often degrades by 20–27 points from single- to multi-turn on security-sensitive and code-diff workflows (Rawal et al., 13 Oct 2025).
5. Error Taxonomy and Failure Modes
Careful analyses across benchmarks have consistently identified three dominant multi-turn failure types (Wang et al., 30 Apr 2025, Duan et al., 1 Jul 2025, Wu et al., 5 Dec 2025):
- Incomplete Reasoning (IR): Models generate "happy-path" logic, neglecting edge case handling or system-wide optimality.
- Insufficient Globalization (IG): Local solutions omit shared state management, required imports, or propagate stale context.
- Instruction Misinterpretation (IM): Misuse of previously defined helpers or misalignment with evolving requirements.
Forgetting previously satisfied features, especially in multi-modal or UI-driven workflows, is a persistent issue, with baseline rates as high as 21.4% for GPT-4o in FronTalk, reduced to 0.4% via agent-based critique (Wu et al., 5 Dec 2025). Insecure or functionally incorrect code diffs often increase as iterative patching propagates undetected errors (Rawal et al., 13 Oct 2025).
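One plausible operationalization of the forgetting rate (an assumption for illustration, not necessarily FronTalk's exact definition): track the set of features passing after each turn, and report the fraction of features that passed at some earlier turn but fail at the final one.

```python
def forgetting_rate(history: list) -> float:
    """history[t] is the set of feature IDs passing after turn t.
    Returns the fraction of previously-passing features lost by the end."""
    if len(history) < 2:
        return 0.0
    ever_passed = set().union(*history[:-1])
    if not ever_passed:
        return 0.0
    forgotten = ever_passed - history[-1]
    return len(forgotten) / len(ever_passed)
```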
Most error modes—particularly IG and IM—emerge uniquely in multi-turn settings and are not remediated by switching to single-turn paradigms. The IR Remediation Rate (fraction of IR errors fixable via single-turn rewriting) remains low (≲17%) (Wang et al., 30 Apr 2025).
6. Modeling Strategies and Future Research Directions
Emerging strategies and open research questions include:
- Dependency-Aware Training: Explicitly reward correct multi-step composition and cross-turn dependency tracking during pretraining or RL fine-tuning (Wang et al., 30 Apr 2025, Duan et al., 1 Jul 2025).
- Memory-Augmented Agentic Scaffolds: Track and serialize program state, routine specifications, and context windows to maintain coherence and enable global reasoning (Ni et al., 2024, Wang et al., 30 Apr 2025).
- Robust Style and Constraint Control: Combine instruction and example-based prompting to instill persistent stylistic or non-functional properties, as well as manage expansion discipline in enhanced implementations (Bohr, 17 Nov 2025).
- Offline Bandit Learning and Single-Step Recovery: Leverage one-step recoverability in the underlying MDP to train with contextual bandit objectives using offline-logged partial trajectories, avoiding instability of full online RL (Jain et al., 27 Feb 2025, Chen et al., 3 Feb 2026).
- Adversarial and Feedback-aware Model Validation: Systematically challenge models with perturbations (e.g., perturbed execution feedback) to prevent in-context reward hacking, and use composite feedback modalities to build generalizable robustness (Chen et al., 3 Feb 2026, Han et al., 27 Feb 2025).
- Hierarchical Constraint Embedding: Structure prompts and evaluator curricula to expose fine-grained, layered constraint schemas, increasing transparency and aiding in specification alignment across multiple turns (Duan et al., 1 Jul 2025).
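Several of these directions share a common skeleton: a validator-driven repair loop that re-prompts with failure feedback until all constraints pass or a round budget is exhausted. A minimal sketch, assuming hypothetical `validators` (each returning `None` on success or an error message) and a hypothetical `revise` LLM call:

```python
def repair_loop(code, validators, revise, max_rounds=4):
    """Iterate: validate, collect failures, re-prompt with feedback.
    Returns (final_code, all_constraints_satisfied)."""
    for _ in range(max_rounds):
        failures = [msg for check in validators
                    if (msg := check(code)) is not None]
        if not failures:
            return code, True
        code = revise(code, failures)  # hypothetical feedback-conditioned call
    return code, all(check(code) is None for check in validators)
```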
7. Significance, Limitations, and Outlook
Multi-turn code generation benchmarks and algorithms now constitute the primary mechanism for assessing LLMs in tasks that more closely resemble real-world development and scientific workflows. The field has revealed that many state-of-the-art LLMs, while successful in one-shot code synthesis, face steep degradation in iterative and dependency-rich scenarios (20–27 point drops common across settings). Structured feedback, explicit dependency scaffolding, and prompt discipline are now known to substantially mitigate but not eliminate these gaps.
Key research directions include scaling benchmarks to repository-level assembly (hundreds of functions), expanding language and domain coverage, integrating semantic and security validation loops at every turn, and developing agentic architectures capable of robust, incremental, and constraint-compliant program synthesis over extended interactions (Wang et al., 30 Apr 2025, Duan et al., 1 Jul 2025, Rawal et al., 13 Oct 2025).
The persistent challenges of context tracking, cross-turn specification maintenance, and secure composition position multi-turn code generation as a central unsolved problem for code-focused LLM research. Advances in these areas will likely define the next wave of progress in automated software engineering and AI-assisted development.