Family-level pattern behind GLM-5’s early underperformance

Determine whether GLM-5’s initial underperformance relative to its predecessor GLM-4.7 in the 3-day Kalshi paper-trading snapshot reflects a persistent family-level pattern within the GLM series or merely short-horizon variance. Resolving this requires evaluating GLM-5 over a sufficiently long horizon.

Background

In the 3-day Cohort 2 paper-trading snapshot, GLM-5 returned −4.09%, the weakest result among active next-generation models. Its predecessor, GLM-4.7, performed strongly by comparison in the primary cohort, finishing first overall on Kalshi.

The authors state that it is unclear whether this discrepancy indicates a broader provider-level or family-level pattern, and they suggest that a longer evaluation is needed before drawing conclusions.
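
To illustrate why a 3-day window may be too short to separate a persistent gap from noise, the following minimal sketch estimates how many daily observations a simple z-test would need. It is not the authors' method. It assumes independent, identically distributed daily returns with a normal approximation, and the function names `standard_error` and `days_needed`, as well as every numeric input, are hypothetical; the source reports no daily volatility figures.

```python
import math


def standard_error(daily_vol: float, n_days: int) -> float:
    """Standard error of the mean daily return over n_days.

    Assumes i.i.d. daily returns with standard deviation daily_vol
    (an illustrative assumption, not a figure from the paper).
    """
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    return daily_vol / math.sqrt(n_days)


def days_needed(daily_gap: float, daily_vol: float, z: float = 1.96) -> int:
    """Smallest n for which |daily_gap| is at least z standard errors.

    Solves |daily_gap| >= z * daily_vol / sqrt(n) for n and rounds up.
    All inputs are hypothetical placeholders.
    """
    if daily_gap == 0:
        raise ValueError("a zero gap cannot be distinguished from noise")
    if daily_vol < 0:
        raise ValueError("daily_vol must be non-negative")
    return math.ceil((z * daily_vol / abs(daily_gap)) ** 2)
```

Under these assumptions, a hypothetical mean daily gap of 1% against a hypothetical daily volatility of 3% gives `days_needed(0.01, 0.03)` of 35 days, far more than the 3 days in the snapshot. Real trading returns are unlikely to be i.i.d., so the figure only indicates the order of magnitude.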

References

Whether this represents a family-level pattern is unclear: its predecessor GLM-4.7 was tied for the best Phase 1 return among non-Grok models (−7.2%) and ultimately finished first in overall Kalshi standings (−16.0%).

— Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets (2604.07355 - Zhang et al., 28 Mar 2026) in Section 8, Cross-Generation Preliminary Comparison
