Attribution of provider-specific EvoScore preferences to training strategies

Determine whether the observed cross-provider differences in how SWE-CI EvoScore rankings respond to the gamma parameter are causally attributable to differences in the providers’ model training strategies. Some providers’ language models rank better when short-term gains are emphasized (gamma < 1), while others rank better when long-term gains are emphasized (gamma > 1). Also ascertain whether the within-provider stability of these preferences indeed indicates stable internal training pipelines.

Background

In SWE-CI, EvoScore uses a gamma parameter to control how heavily later iterations are weighted relative to earlier ones: gamma < 1 emphasizes early iterations (short-term gains), while gamma > 1 emphasizes later iterations (long-term gains). This enables analysis of short-term versus long-term code maintenance performance. The authors vary gamma to examine how model rankings shift, revealing that some providers’ models perform better when early iterations are emphasized, whereas others excel when later iterations dominate.
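
To make the effect of gamma on rankings concrete, here is a minimal sketch. It assumes a simple form that is not taken from the paper: a weighted mean of per-iteration scores in which iteration i receives weight gamma**i. The function names `evo_score` and `rank`, the model names, and all score values are hypothetical and chosen only for illustration.

```python
def evo_score(scores, gamma):
    """Weighted mean of per-iteration scores.

    ASSUMPTION: iteration i gets weight gamma**i. This is an illustrative
    stand-in, not necessarily the exact EvoScore definition used in SWE-CI.
    """
    weights = [gamma ** i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def rank(models, gamma):
    """Return model names sorted from best to worst under the given gamma."""
    return sorted(models, key=lambda name: evo_score(models[name], gamma), reverse=True)


# Hypothetical per-iteration scores (made up for this sketch, not measured data).
models = {
    "fast_starter": [0.9, 0.7, 0.5, 0.3],  # strong early, degrades over iterations
    "slow_builder": [0.3, 0.5, 0.7, 0.9],  # weak early, improves over iterations
}

for gamma in (0.5, 1.0, 2.0):
    print(gamma, rank(models, gamma))
```

Under these toy numbers, `fast_starter` ranks first at gamma = 0.5 and `slow_builder` ranks first at gamma = 2.0, while the two tie at gamma = 1.0. That is the kind of ranking shift the gamma sweep is meant to expose.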

They observe that these tendencies vary across providers but are relatively consistent within a provider’s model family. Based on this, they conjecture that cross-provider variation may stem from differences in training strategies, while within-provider consistency may reflect stable internal training pipelines. This conjecture invites verification and causal attribution.
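
One possible first step toward verification is to check whether gamma preference clusters by provider more tightly than chance would predict. The sketch below is an assumption-laden illustration, not a method from the paper: it summarizes each model by a hypothetical scalar "preference" (for example, its score at gamma > 1 minus its score at gamma < 1) and runs a label-permutation test on within-provider spread. The provider labels and preference values are synthetic. A small p-value would only support clustering by provider; it would not by itself establish that training strategies are the cause.

```python
import random
from statistics import mean


def within_provider_spread(prefs, labels):
    """Mean absolute deviation of each model's preference from its provider's mean."""
    groups = {}
    for p, lab in zip(prefs, labels):
        groups.setdefault(lab, []).append(p)
    return mean(abs(p - mean(groups[lab])) for p, lab in zip(prefs, labels))


def permutation_p_value(prefs, labels, n_perm=2000, seed=0):
    """Fraction of label shufflings whose within-provider spread is at most the observed one."""
    rng = random.Random(seed)
    observed = within_provider_spread(prefs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if within_provider_spread(prefs, shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# SYNTHETIC values for illustration only; not results from SWE-CI.
labels = ["A", "A", "A", "B", "B", "B"]
prefs = [0.30, 0.28, 0.33, -0.25, -0.31, -0.27]

p_value = permutation_p_value(prefs, labels)
print(within_provider_spread(prefs, labels), p_value)
```

With only six models there are few distinct labelings, so the p-value cannot get very small. A real analysis would need more models per provider and, for causal attribution, information about the training pipelines themselves.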

References

We conjecture that this reflects differences in training strategies adopted by different providers, while the relative consistency within each provider suggests that their internal training pipelines remain largely stable.

— SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration (2603.03823 - Chen et al., 4 Mar 2026) in Observation 2, Section 4 (Results)
