PredictionMarketBench: Reproducible Trading Backtest
- PredictionMarketBench is a benchmark framework that standardizes reproducible, execution-realistic backtesting for both classical and LLM-based trading agents in prediction markets.
- It simulates market microstructure with detailed orderbook dynamics, real-world fee models, and agent-specific APIs, using episodes derived from Kalshi exchange data across crypto, weather, and sports domains.
- The framework delivers systematic metrics—such as equity curves, Sharpe ratios, drawdown, and fill ratios—that enable precise, apples-to-apples comparisons of diverse trading strategies.
PredictionMarketBench is a SWE-bench-style benchmark and framework designed for reproducible, execution-realistic backtesting of algorithmic and LLM-based trading agents operating on binary-outcome (YES/NO) prediction-market contracts. By standardizing episode construction, simulator architecture, agent APIs, and evaluation metrics, PredictionMarketBench enables systematic comparison of agent strategies under historical market microstructure, including limit-order book (LOB) dynamics, fee structures, and settlement realities. The framework ships with real-world episodes derived from Kalshi exchange data encompassing cryptocurrency, weather, and sports contracts, and provides baseline results for classical and LLM agents to highlight the impact of fee-awareness, execution method, and volatility on trading performance (Arora et al., 28 Jan 2026).
1. Design Principles and Framework Structure
PredictionMarketBench's design objective is to facilitate fair, version-controlled backtesting of trading agents by closely replicating actual trading conditions observed in live prediction markets. The framework enforces:
- Portability and Replayability: Each evaluation episode is fully encapsulated as a directory tree containing all required metadata, LOB snapshots, trade records, and settlement outcomes. This supports deterministic, replayable experiments and seamless integration of new episodes or data sources.
- Execution Realism: The framework implements a deterministic, event-driven simulator. Market replay respects all orderbook and trade event ordering, applies exchange-realistic maker/taker fee models, and enforces LOB queueing effects.
- Agent Abstraction: Agents—classical, rule-based, or tool-calling LLMs—interact with the simulation exclusively via a versioned AgentContext API, ensuring reproducibility and separation of concerns between agent logic and simulation internals.
- Standardized Metrics: Each run yields detailed trade logs, equity curves, return statistics, drawdown, Sharpe ratio, fees, slippage, and fill ratios, enabling apples-to-apples comparisons.
This structure enables systematic study of agent behaviors under identical market dynamics and friction models, controlling for confounding factors such as asynchronous data feeds or mismatched microstructure (Arora et al., 28 Jan 2026).
2. Episode Construction and Data Pipeline
Episodes are defined by ingesting three primary data streams (from Kalshi or compatible exchanges):
- Orderbook Updates: Time-sequenced snapshots of bid/ask prices and depths at multiple price levels.
- Trade Prints: All executed trades, including aggressor (taker) and resting (maker) identifiers, supporting passive and active order matching.
- Lifecycle and Settlement Events: Market opening/closure, trading halts, and binary-resolution settlement for each contract.
The pipeline executes the following transformation:
- Alignment: All records are mapped to a unified UTC timeline, using exchange timestamps and sequence numbers to ensure causal order.
- Windowing: Records are grouped by event identifier into episode windows (from contract open through settlement).
- Normalization: All price data are discretized to integer ticks (e.g., cents), and contract sizes normalized.
- Episode Directory Construction:
| File Name | Content | Purpose |
|---|---|---|
| metadata.json | Market/ticker info, config, bankroll, fees | Full episode and simulator parameterization |
| orderbook.parquet | Timestamped LOB snapshots, levels, sizes | Ground-truth for LOB state at agent step boundaries |
| trades.parquet | All historical fills, aggressor flags | Enables passive fill modeling, reconstructs microstructure |
| settlement.json | Terminal {YES,NO} resolution per ticker | Deterministic PnL and position marking at settlement |
The extensible format allows the addition of further data streams (cancellations, amendments) while preserving backward compatibility.
3. Execution-Realistic Simulator and Market Microstructure
The simulator core is an event-driven loop, advancing at discrete agent decision intervals (Δt, typically 300s). Each cycle:
- Ingests all market events since the previous agent step, updates LOB state, and maintains order queue positions.
- Exposes AgentContext to the agent, providing market state, cash, positions, and outstanding orders.
- Accepts new order/cancel submissions, applies standard maker/taker semantics, and executes LOB matching per pseudocode:
```
function handle_incoming_order(order):
    if order.type in {MARKET, IOC}:
        # Immediate taker: cross the orderbook, reducing qty level by level
    elif order.type in {GTC, POST_ONLY}:
        if crosses_spread(order.price):
            # Partial fill as taker; remainder posts as maker (unless post-only)
        else:
            # Pure maker: join the queue at (price, timestamp)
```
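A minimal executable sketch of this matching logic is shown below. It is deliberately simplified: one book side (resting asks), price-time priority via FIFO queues, and no fees; the class and method names are illustrative, not the framework's internals.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    side: str    # "buy" or "sell"
    price: int   # limit price in integer ticks (ignored for MARKET)
    qty: int
    type: str    # "MARKET", "IOC", "GTC", "POST_ONLY"

class MiniBook:
    """Toy single-contract book: resting ask queues keyed by price tick."""
    def __init__(self):
        self.asks: dict[int, deque] = {}  # price -> FIFO queue of resting qtys

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def handle_incoming_buy(self, order: Order) -> int:
        """Apply taker-then-maker semantics; returns filled quantity."""
        filled = 0
        while order.qty > 0 and self.asks:
            ask = self.best_ask()
            if order.type != "MARKET" and order.price < ask:
                break  # limit no longer crosses the spread
            if order.type == "POST_ONLY":
                return filled  # post-only must never take liquidity
            queue = self.asks[ask]
            take = min(order.qty, queue[0])  # front of queue has time priority
            queue[0] -= take
            order.qty -= take
            filled += take
            if queue[0] == 0:
                queue.popleft()
            if not queue:
                del self.asks[ask]
        # An unfilled GTC/POST_ONLY remainder would post to the bid side
        # as a maker; bid-side bookkeeping is omitted here for brevity.
        return filled
```

For example, a market buy for 7 contracts against asks of 5 and 3 resting at the same price fills 5 then 2, leaving 1 contract resting, while a marketable post-only order is rejected with zero fill.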
Maker and taker fees are charged per contract according to the configured exchange fee schedule, with distinct maker and taker rates. PnL and cash changes respect side, execution price, and role; open positions are marked to terminal value at resolution ($1$ or $0$ per contract upon settlement).
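To illustrate how per-contract fees and settlement marking interact, the sketch below uses a generic quadratic fee curve of the form fee = rate × contracts × p × (1 − p), a common shape for prediction-exchange fee schedules; the rate constants and rounding convention here are placeholders, not the framework's actual parameters.

```python
import math

def trade_fee(contracts: int, price_cents: int, rate: float) -> float:
    """Fee on a quadratic curve, rounded up to the cent (placeholder rounding).
    price_cents: execution price in ticks, 1..99."""
    p = price_cents / 100.0
    return math.ceil(rate * contracts * p * (1 - p) * 100) / 100

def settlement_pnl(contracts: int, entry_cents: int,
                   yes_resolved: bool, side: str) -> float:
    """Mark an open position to terminal value ($1 or $0 per contract).
    side="YES" is a long YES position; side="NO" models short YES at entry."""
    terminal = 100 if yes_resolved else 0
    per_contract = (terminal - entry_cents) if side == "YES" \
        else (entry_cents - terminal)
    return contracts * per_contract / 100.0

# Example: buy 100 YES at 43c as taker (placeholder taker rate 0.07);
# market resolves YES -> +$57.00 gross at settlement, minus the entry fee.
fee = trade_fee(100, 43, rate=0.07)
pnl = settlement_pnl(100, 43, yes_resolved=True, side="YES")
```

Note that the quadratic curve makes fees largest near 50c, which is exactly where prediction-market spreads are tightest, so fee-aware execution matters most for at-the-money contracts.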
4. Agent API and Strategy Evaluation
Agents implement a single on_step(context: AgentContext) method and interact through the context's API tools to query market state, submit or cancel orders, and manage trading decisions. The API strictly controls what each agent can observe, exposing:
- Active tickers and top-of-book quotes
- Full depth-N orderbook snapshot
- Current net position and cash
- Limit/market order placement, with time-in-force and amount
- Order cancellation
All agent actions are timestamped for strict causal replay. The simulator logs all activity, producing detailed trade and equity histories.
A sample Bollinger Bands strategy is included; it trades when the mid-price deviates by more than two standard deviations from the rolling mean, placing post-only limit orders for fee efficiency and risk management.
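The Bollinger Bands baseline can be sketched as follows. The AgentContext method names used here (get_mid, place_limit_order) are illustrative assumptions, not the framework's published API.

```python
from collections import deque
import statistics

class BollingerAgent:
    """Trades when the mid-price leaves a +/- k-sigma band around a rolling
    mean, using post-only limit orders for fee efficiency."""
    def __init__(self, window: int = 20, k: float = 2.0, size: int = 10):
        self.prices = deque(maxlen=window)  # rolling mid-price history
        self.k, self.size = k, size

    def on_step(self, context):
        mid = context.get_mid()             # assumed API: mid-price in ticks
        self.prices.append(mid)
        if len(self.prices) < self.prices.maxlen:
            return                          # warm-up: not enough history yet
        mean = statistics.fmean(self.prices)
        sd = statistics.pstdev(self.prices)
        if sd == 0:
            return                          # flat market: no signal
        if mid > mean + self.k * sd:        # stretched high: fade it
            context.place_limit_order(side="SELL", price=mid,
                                      qty=self.size, post_only=True)
        elif mid < mean - self.k * sd:      # stretched low: buy the dip
            context.place_limit_order(side="BUY", price=mid,
                                      qty=self.size, post_only=True)
```

Posting at the current mid rather than crossing the spread keeps the agent on the maker side of the fee schedule, at the cost of uncertain fills.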
5. Benchmark Episodes and Baseline Analyses
Four representative episodes, derived from Kalshi orderbooks and settlements (January 2026), are provided:
| Episode | Domain | Tickers | LOB Snapshots | Trades | Duration (hrs) |
|---|---|---|---|---|---|
| KXBTCD-26JAN2017 | Crypto | 23 | 312k | 6.3k | 37.4 |
| KXHIGHNY-26JAN20 | Weather | 6 | 50k | 8k | 37.4 |
| KXNCAAF-26 | College Football | 2 | 8.3k | 171.8k | 37.4 |
| KXNFLGAME-26JAN11BUFJAC | NFL | 2 | 8k | 111.2k | 67.4 |
Evaluation metrics for each agent and episode include cumulative return, maximum drawdown (MaxDD), Sharpe ratio (computed on standardized periodic returns), and fill ratio (executed/submitted contracts).
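These metrics can be computed directly from an equity curve, as in the minimal sketch below; the framework's own reporting may differ in return conventions and Sharpe annualization.

```python
import statistics

def summarize(equity: list[float], executed: int, submitted: int) -> dict:
    """Cumulative return, max drawdown, per-period Sharpe, and fill ratio
    from an equity curve sampled at agent decision steps."""
    returns = [equity[i] / equity[i - 1] - 1 for i in range(1, len(equity))]
    cum_return = equity[-1] / equity[0] - 1
    peak, max_dd = equity[0], 0.0
    for v in equity:                       # drawdown vs running peak
        peak = max(peak, v)
        max_dd = max(max_dd, (peak - v) / peak)
    mu = statistics.fmean(returns)
    sd = statistics.pstdev(returns)
    sharpe = mu / sd if sd > 0 else float("nan")  # per-period, not annualized
    return {"return": cum_return, "max_dd": max_dd,
            "sharpe": sharpe, "fill_ratio": executed / submitted}

# Example: a $1,000 bankroll over four decision steps, 47 of 50 fills.
stats = summarize([1000, 1010, 990, 1020], executed=47, submitted=50)
```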
Baseline results for three strategies (starting bankroll \$1,000, 5-min cadence):
- RandomAgent: Low intensity (~20 trades/episode), negligible returns (–0.13%), near-zero drawdown, fee drag dominates.
- LLM Agent (gpt-4.1-nano): High frequency, large settlement losses and taker fees, total PnL –2.77%, large MaxDD (36%).
- Bollinger Bands (fee-aware): Post-only limit orders with 1.75% maker fee; +1.67% PnL overall, 3.18% MaxDD, strong performance in volatile BTC episode, near breakeven in others.
Analyses confirm that transaction costs and settlement exposure rapidly erode naive agent performance, and maker/fee management are critical for alpha preservation. High fill rates (>94%) were achieved by limit-order strategies; aggressive agents suffered from partial fills and fee-related losses.
6. Reproducibility, Extension, and Best Practices
PredictionMarketBench is fully open and version-controlled. Standard procedure for reproducibility and extension comprises:
- Clone the repository:

```
git clone https://github.com/Oddpool/PredictionMarketBench.git
```

- Install dependencies and download the episodes archive into `./episodes/`.
- Execute the harness:

```
python run_benchmark.py \
  --agent path/to/MyAgent.py \
  --episodes ./episodes/ \
  --cadence 300 \
  --initial-bankroll 1000
```

- Outputs are structured in `./results/`, including trade CSVs, equity curves, and a summary report.
- To add new episodes, populate a directory with `{metadata.json, orderbook.parquet, trades.parquet, settlement.json}` conforming to the schema.
- Agents are implemented by subclassing `Agent` in the designated module, overriding `on_step()`, and registering via the module init.
The benchmark structure supports modular addition of new contract types, markets, fee models, and agent paradigms.
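For orientation, a hypothetical `metadata.json` consistent with the episode parameterization described above might look like the following; the field names and fee rates are assumptions inferred from this article, not the published schema.

```json
{
  "ticker": "KXBTCD-26JAN2017",
  "domain": "crypto",
  "tick_size_cents": 1,
  "cadence_seconds": 300,
  "initial_bankroll": 1000.0,
  "fees": {
    "maker_rate": 0.0175,
    "taker_rate": 0.07
  }
}
```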
7. Experimental Design Considerations and Microstructure Insights
PredictionMarketBench's microstructure realism is complemented by a legacy of prior experimental work combining controlled market simulations with explicit liquidity and information-flow modeling (Brahma et al., 2010). Foundational results from experiments comparing the logarithmic market scoring rule (LMSR) and Bayesian market maker (BMM) underscore the importance of bounded loss, price stability, and adaptability to shocks:
- LMSR:
  - Cost function: $C(\mathbf{q}) = b \ln \sum_{i} e^{q_i/b}$, with spot prices given by the softmax $p_i = e^{q_i/b} / \sum_j e^{q_j/b}$
  - Bounded loss: worst-case market-maker loss is at most $b \ln n$ for $n$ outcomes
- BMM:
  - Spot price: Bayesian update of belief mean
  - Adaptivity via sliding window over trade-inferred value intervals; unbounded loss in adversarial conditions
Performance metrics such as RMSD from true value, PnL, shock-response time, and fill ratios provide rigorous comparison of agent and market-maker behaviors. The experimental symmetry and episode-encapsulation methods in PredictionMarketBench parallel best practices from this literature, ensuring robust, unbiased benchmarking (Brahma et al., 2010, Arora et al., 28 Jan 2026).
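The LMSR bounded-loss property can be checked numerically with the standard formulas, as in the standalone sketch below (independent of either benchmark's code).

```python
import math

def lmsr_cost(q: list[float], b: float) -> float:
    """LMSR cost function C(q) = b * ln(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def lmsr_price(q: list[float], b: float, i: int) -> float:
    """Instantaneous price of outcome i: softmax of q / b."""
    z = sum(math.exp(qj / b) for qj in q)
    return math.exp(q[i] / b) / z

b, n = 100.0, 2
q0 = [0.0, 0.0]                 # fresh binary market: both outcomes at 0.5

# Worst case for the market maker: traders buy heavily into one outcome,
# which then wins. Loss = payout owed minus revenue collected, and it
# stays below b * ln(n) no matter how far q is pushed.
q_final = [10_000.0, 0.0]
revenue = lmsr_cost(q_final, b) - lmsr_cost(q0, b)
loss = q_final[0] - revenue     # payout of 1 per winning share, minus revenue
```

Here the loss approaches, but never exceeds, the $b \ln n \approx 69.31$ bound, which is the property that makes LMSR attractive as a subsidized market maker.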