- The paper introduces priority ranking to directly evaluate harness optimizer decisions, revealing frequent detrimental updates and poor self-assessment.
- It demonstrates that 44.8–48.2% of optimizer modifications degrade performance, underscoring the need for step-level evaluation over traditional agent-centric metrics.
- The study shows a significant correlation between ranking performance and multi-step agent improvement, offering a cost-effective screening tool for CI/CD pipelines.
Direct Evaluation of Harness Optimizers: The Priority Ranking Approach
Background and Motivation
Harness optimization for LLM-based agents automates the iterative improvement of agent scaffolds (including prompts, tool configurations, memory mechanisms, and workflows), enabling substantial agent performance gains across domains such as software engineering, text-to-SQL, and customer support. However, prior evaluation paradigms have been agent-centric, relying exclusively on the end-task performance improvement (end-SR) of target agents. This indirect metric neglects the optimizer's intermediate actions, often masking detrimental updates and providing no insight into whether optimization is guided by informed decisions or trial-and-error.
Empirical analyses in this work demonstrate that optimizer missteps are frequent and persistent: nearly half of harness updates degrade agent performance, and over 94% of non-prompt-related errors at intermediate optimization steps remain uncorrected in final agent configurations. State-of-the-art harness optimizers also exhibit near-random accuracy in predicting whether their own updates will be beneficial. These findings undercut the assumption that agent-centric evaluation is a reliable proxy for optimizer quality and establish the need for direct, step-level assessment of optimizer decision-making.
Priority Ranking: Methodological Design
To address the lack of actionable, direct metrics for harness optimizer evaluation, the paper introduces priority ranking. This evaluation protocol requires the optimizer to rank harness components (prompt, tool, memory, workflow) by their relative potential to impact agent performance if updated at the current optimization step. This reframing removes the need for expensive full rollouts or manual harness audits, turning the evaluation into a tractable, non-iterative language modeling task.
The core innovation lies in the SHOR dataset: 182 human-verified optimization scenarios spanning multiple domains and optimization stages. Each scenario is annotated by consensus among state-of-the-art coding agents and systematically filtered for inter-annotator consistency and meaningful performance gaps between ranked components.
Given a harness and its optimization trajectory, an optimizer is evaluated on its ability to recover the consensus priority ranking for the set of editable components. Metrics include Acc@1 (identifying the single highest-priority component) and NDCG (the quality of the entire ranking permutation).
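As a rough illustration of the two metrics, the sketch below scores one predicted ranking against a consensus ranking. The function names and the linear gain assignment (top consensus component gets the highest gain) are assumptions made here for illustration; this summary does not specify the paper's exact NDCG formulation.

```python
import math
from typing import Sequence


def acc_at_1(predicted: Sequence[str], consensus: Sequence[str]) -> float:
    """Return 1.0 if the top predicted component matches the consensus top-1."""
    return 1.0 if predicted[0] == consensus[0] else 0.0


def ndcg(predicted: Sequence[str], consensus: Sequence[str]) -> float:
    """NDCG of a predicted ranking against the consensus ranking.

    Assumption: relevance is linear in consensus rank, so with n components
    the top consensus component has gain n-1 and the last has gain 0.
    """
    n = len(consensus)
    relevance = {comp: n - 1 - i for i, comp in enumerate(consensus)}

    def dcg(order: Sequence[str]) -> float:
        # Standard logarithmic position discount: log2(position + 1), 1-indexed.
        return sum(relevance[c] / math.log2(pos + 2) for pos, c in enumerate(order))

    ideal = dcg(consensus)
    return dcg(predicted) / ideal if ideal > 0 else 0.0


# Hypothetical scenario: the component names follow the four categories above.
consensus = ["tool", "workflow", "prompt", "memory"]
predicted = ["workflow", "tool", "prompt", "memory"]
print(acc_at_1(predicted, consensus), round(ndcg(predicted, consensus), 3))
```

Averaging both scores over all scenarios would give dataset-level Acc@1 and NDCG under these assumptions.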
Empirical Analysis and Findings
Error Analysis and Limitations of End-Improvement Evaluation
Detailed trajectory analyses confirm that:
- Erroneous Update Frequency: 44.8–48.2% of optimizer-initiated harness modifications are detrimental to agent performance.
- Error Persistence: Intermediate errors in workflow, tool, and memory components persist to the final agent configuration in over 94% of cases.
- Poor Update Awareness: Optimizers' accuracy in predicting whether their own updates will help is uniformly close to random (0.33–0.56 depending on domain and model).
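The first and third statistics above can be computed from a logged optimization trajectory. The sketch below is a minimal version under stated assumptions: the `UpdateStep` record, its field names, and the binary "helpful or not" framing of the optimizer's forecast are hypothetical and may differ from the paper's setup; the numbers in the example are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpdateStep:
    sr_before: float         # agent success rate before the update
    sr_after: float          # agent success rate after the update
    predicted_helpful: bool  # optimizer's own forecast for this update


def detrimental_rate(steps: List[UpdateStep]) -> float:
    """Fraction of updates that lowered the agent's success rate."""
    return sum(s.sr_after < s.sr_before for s in steps) / len(steps)


def self_assessment_accuracy(steps: List[UpdateStep]) -> float:
    """Fraction of updates whose forecast matched the observed outcome."""
    correct = sum(s.predicted_helpful == (s.sr_after > s.sr_before) for s in steps)
    return correct / len(steps)


# Illustrative trajectory only; these values are not from the paper.
trajectory = [
    UpdateStep(0.50, 0.60, True),
    UpdateStep(0.60, 0.55, True),
    UpdateStep(0.55, 0.50, False),
    UpdateStep(0.50, 0.70, False),
]
print(detrimental_rate(trajectory), self_assessment_accuracy(trajectory))
```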
Therefore, the prevalent evaluation schema is fundamentally insufficient for ensuring harness optimizers' reliability and correctness.
Optimizers' performance on priority ranking reveals several key insights:
- General Weakness: SOTA optimizers struggle to identify the appropriate component(s) for prioritized updates, with the leading top-1 accuracy (Acc@1) at only 0.305, and even the highest overall NDCG values not indicating robust ranking competence.
- Lack of Cross-domain Generalization: Optimizers with high performance in one domain (e.g., software engineering) fail to transfer ranking performance to others (e.g., customer support or text-to-SQL).
- Invalidation of Agent Harness Quality as Optimizer Harness Proxy: Strong agent performance under a harness does not imply that the same harness is an effective configuration for the optimizer itself; a configuration beneficial for a target agent may be suboptimal when used as an optimizer harness.
Predictive Value and Efficiency
Crucially, the paper establishes a statistically significant correlation (Pearson r=0.60, p=0.038) between an optimizer's priority ranking performance and its realized ability to improve agent performance in multi-step optimization. This correlation holds in-domain, out-of-domain, and across optimization stages (it is strongest for mid-stage harnesses, i.e., T ∈ [6, 10] iterations).
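For readers who want to reproduce this kind of check on their own optimizers, the sketch below computes Pearson's r and the associated t-statistic from paired scores (one pair per optimizer: ranking score, realized end-SR gain). The function names are hypothetical and the example values are made up; converting the t-statistic to a p-value requires a Student-t distribution with n-2 degrees of freedom, which is left out here to keep the sketch dependency-free.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t-statistic for testing r != 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up paired scores, for illustration only.
ranking_scores = [1.0, 2.0, 3.0, 4.0]
end_sr_gains = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(ranking_scores, end_sr_gains)
print(round(r, 3), round(t_statistic(r, len(ranking_scores)), 3))
```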
From a practical perspective, priority ranking is also efficient:
- Cost and Time: At least 8x cheaper and 17x faster than full-harness optimization rollouts for comparative evaluation.
- Predictive Utility: Offers a low-latency, actionable screening tool for optimizer selection and monitoring in CI/CD pipelines, supporting industrial agent deployment lifecycles.
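One way such a screening step might look in a pipeline is a simple regression gate on ranking scores. The sketch below is an assumption-laden illustration, not a procedure from the paper: the function name `passes_ranking_gate`, the `max_regression` tolerance, and its default value are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def passes_ranking_gate(
    candidate_scores: Sequence[float],
    baseline_scores: Sequence[float],
    max_regression: float = 0.02,  # hypothetical tolerance, tune per project
) -> bool:
    """Pass when the candidate optimizer's mean priority-ranking score is not
    more than `max_regression` below the baseline optimizer's mean score."""
    return mean(candidate_scores) >= mean(baseline_scores) - max_regression


# Per-scenario scores (e.g. NDCG) for two optimizer versions; values are made up.
baseline = [0.31, 0.31]
print(passes_ranking_gate([0.30, 0.32], baseline))  # within tolerance
print(passes_ranking_gate([0.20, 0.20], baseline))  # clear regression
```

A CI job could call this after scoring a new optimizer version on the ranking scenarios and fail the build when it returns `False`, reserving full optimization rollouts for candidates that pass.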
Actionable Insights for Optimizer Development
A key experimental result is that providing optimizers with oracle information about which harness component contains a flaw improves their error resolution rate by 17–51 percentage points (domain-dependent, up to 72 points). This implies that optimizers possess the architectural and functional capacity to execute repairs when given explicit prioritization cues, but lack effective diagnostic mechanisms for self-directed prioritization.
Consequently, the design and training of future harness optimizers should explicitly incorporate priority prediction submodules that decouple the identification of optimization targets from the execution of updates, in order to improve resilience, generalization, and effectiveness.
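The decoupling described above can be pictured as a two-stage step in which one callable chooses the target component and another edits it. The sketch below is only an interface illustration: `Harness`, `optimize_step`, `rank_components`, and `apply_update` are hypothetical names, and the stub implementations stand in for what would be LLM calls in practice.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: component name -> current configuration text.
Harness = Dict[str, str]


def optimize_step(
    harness: Harness,
    trajectory: List[str],
    rank_components: Callable[[Harness, List[str]], List[str]],
    apply_update: Callable[[str, str, List[str]], str],
) -> Tuple[str, Harness]:
    """Run one optimization step with diagnosis and repair kept separate."""
    ranking = rank_components(harness, trajectory)  # stage 1: prioritize
    target = ranking[0]
    updated = dict(harness)  # leave the input harness unmodified
    updated[target] = apply_update(target, harness[target], trajectory)  # stage 2: edit
    return target, updated


# Stub stages for illustration only.
def stub_ranker(harness: Harness, trajectory: List[str]) -> List[str]:
    return ["tool", "workflow", "prompt", "memory"]


def stub_editor(name: str, config: str, trajectory: List[str]) -> str:
    return config + " [revised]"


harness = {"prompt": "p0", "tool": "t0", "memory": "m0", "workflow": "w0"}
target, new_harness = optimize_step(harness, [], stub_ranker, stub_editor)
print(target, new_harness["tool"])
```

Keeping the ranker behind its own interface means it can be evaluated on priority ranking in isolation, or replaced by an oracle, without touching the update logic.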
Implications for Theory and Industrial Practice
The transition from black-box end-improvement evaluation to step-level, component-wise prioritization establishes a new theoretical substrate for studying optimizer competence, bridging the gap between agentic system design and introspective optimizer diagnostics.
From an applied perspective, the results highlight silent failure risks in production automation scenarios: agents deployed with harnesses refined under agent-centric protocols that ignore intermediate optimizer decisions may inherit undetected, persistent faults. Priority ranking can mitigate these risks, both as a pre-deployment screening gate and as a continuous quality-control measure in production workflows.
Moreover, as harness optimization becomes more prevalent in closed- and open-source agentic systems, the disconnect between agent harness quality and optimizer harness quality will have direct consequences for the reproducibility and portability of agentic research and benchmarks.
Limitations and Future Directions
Limitations include the granularity of harness component decomposition (four broad categories), single-agent focus (multi-agent/multi-harness systems remain unaddressed), and annotator homogeneity (limited to a set of SOTA coding agents). Further work should generalize the SHOR evaluation pipeline to finer-grained harness architectures, collaborative or competitive agent environments, and annotator pools with greater diversity.
Additionally, the paper's framework does not address the mitigation of emergent safety risks in self-evolving agent systems, an area of growing importance in practical deployments.
Conclusion
This work argues for direct, step-level evaluation of harness optimizers in agentic LLM systems and introduces a task to carry it out. The proposed priority ranking task, underpinned by the SHOR dataset, enables scalable, interpretable, and accurate appraisal of optimizer competence while substantially reducing evaluation burden. Priority ranking serves as an efficient proxy for multi-turn optimization ability, and it also points to actionable methods for developing next-generation optimizers with explicit prioritization and diagnostic capabilities. These findings have immediate implications for theory development as well as for the design, deployment, and maintenance of industrial AI agent systems.
Reference: "Towards Direct Evaluation of Harness Optimizers via Priority Ranking" (2605.22505)