Generalization, Depth, and Agentic Performance of Murphy

Murphy is a multi-turn reflective reinforcement learning framework that extends Group Relative Policy Optimization (GRPO) by conditioning optimization on execution feedback. Ascertain whether it generalizes to settings with noisier feedback signals; investigate its behavior and efficacy under deeper multi-turn refinement chains; and characterize its impact on broader dimensions of agentic performance, specifically robustness and alignment with developer intent.

Background

Murphy is proposed as a multi-turn RLVR algorithm that extends GRPO by incorporating iterative execution feedback during training for code generation tasks. The paper evaluates Murphy primarily in settings with structured, verifiable feedback (e.g., unit tests) and relatively shallow multi-turn refinement, focusing on improvements in pass@1 on coding benchmarks.
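To make the training setup concrete, the following is a minimal sketch of the two ingredients the paragraph describes: a GRPO-style group-relative advantage (rewards normalized within a sampling group, with no learned critic) and a multi-turn refinement rollout that conditions regeneration on execution feedback from unit tests. All function names (`generate`, `run_tests`, `multi_turn_rollout`) are hypothetical placeholders, not the paper's actual implementation:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's scalar reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def multi_turn_rollout(generate, run_tests, prompt, max_turns=3):
    """Hypothetical multi-turn refinement loop: regenerate code
    conditioned on execution feedback until the tests pass or the
    turn budget is exhausted; returns (verifiable reward, turns used)."""
    feedback = None
    for turn in range(max_turns):
        code = generate(prompt, feedback)
        passed, feedback = run_tests(code)
        if passed:
            return 1.0, turn + 1
    return 0.0, max_turns
```

In this framing, the reward is verifiable but binary per rollout; the open questions above concern what happens when `run_tests` is replaced by a noisier signal, or when `max_turns` grows well beyond the shallow chains evaluated in the paper.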

The authors note that their experiments are confined to structured feedback in code generation. This raises questions about the method's applicability in environments with noisier or less reliable feedback, its scalability to longer refinement chains, and its performance on broader agentic criteria beyond benchmark accuracy, such as robustness and alignment with developer intent. The paper's Limitations section explicitly identifies these aspects as open questions.

References

Our experiments are limited to structured feedback in code generation, leaving open questions about generalization to noisier feedback, deeper refinement chains, and broader notions of agentic performance such as robustness and alignment with developer intent.

MURPHY: Multi-Turn GRPO for Self Correcting Code Generation  (2511.07833 - Ekbote et al., 11 Nov 2025) in Section: Limitations