Persistence of provider-level characteristics across model generations

Determine whether provider-level characteristics persist across model generations in the Prediction Arena evaluation by extending the Cohort 2 assessment from the initial 3-day snapshot to a full 30-day period.

Background

The paper provides only a brief, 3-day paper-trading snapshot for Cohort 2 models and cautions against drawing strong conclusions from such a short window.

The authors explicitly state that the limited data cannot establish whether provider-level characteristics (e.g., tendencies or performance patterns) persist across generations, and they plan a full 30-day evaluation of Cohort 2.

References

Whether provider-level characteristics are consistently expressed across generations cannot be concluded from this window alone; a full 30-day evaluation of Cohort 2 is planned.

Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets (2604.07355 - Zhang et al., 28 Mar 2026) in Section 11, Conclusion