Generalization, Depth, and Agentic Performance of Murphy
Determine whether Murphy, a multi-turn reflective reinforcement learning framework that extends Group Relative Policy Optimization (GRPO) by conditioning optimization on execution feedback, generalizes to settings with noisier feedback signals. Investigate its behavior and efficacy under deeper multi-turn refinement chains, and characterize its effect on broader dimensions of agentic performance, specifically robustness and alignment with developer intent.
References
Our experiments are limited to structured feedback in code generation, leaving open questions about generalization to noisier feedback, deeper refinement chains, and broader notions of agentic performance such as robustness and alignment with developer intent.
— MURPHY: Multi-Turn GRPO for Self Correcting Code Generation
(2511.07833 - Ekbote et al., 11 Nov 2025) in Section: Limitations