PredictionMarketBench: Reproducible Trading Backtest
- PredictionMarketBench is a benchmark framework that standardizes reproducible, execution-realistic backtesting for both classical and LLM-based trading agents in prediction markets.
- It simulates market microstructure with detailed orderbook dynamics, real-world fee models, and agent-specific APIs, using episodes derived from Kalshi exchange data across crypto, weather, and sports domains.
- The framework delivers systematic metrics—such as equity curves, Sharpe ratios, drawdown, and fill ratios—that enable precise, apples-to-apples comparisons of diverse trading strategies.
PredictionMarketBench is a SWE-bench-style benchmark and framework designed for reproducible, execution-realistic backtesting of algorithmic and LLM-based trading agents operating on binary-outcome (YES/NO) prediction-market contracts. By standardizing episode construction, simulator architecture, agent APIs, and evaluation metrics, PredictionMarketBench enables systematic comparison of agent strategies under historical market microstructure, including limit-order book (LOB) dynamics, fee structures, and settlement realities. The framework ships with real-world episodes derived from Kalshi exchange data encompassing cryptocurrency, weather, and sports contracts, and provides baseline results for classical and LLM agents to highlight the impact of fee-awareness, execution method, and volatility on trading performance (Arora et al., 28 Jan 2026).
1. Design Principles and Framework Structure
PredictionMarketBench's design objective is to facilitate fair, version-controlled backtesting of trading agents by closely replicating actual trading conditions observed in live prediction markets. The framework enforces:
- Portability and Replayability: Each evaluation episode is fully encapsulated as a directory tree containing all required metadata, LOB snapshots, trade records, and settlement outcomes. This supports deterministic, replayable experiments and seamless integration of new episodes or data sources.
- Execution Realism: The framework implements a deterministic, event-driven simulator. Market replay respects all orderbook and trade event ordering, applies exchange-realistic maker/taker fee models, and enforces LOB queueing effects.
- Agent Abstraction: Agents—classical, rule-based, or tool-calling LLMs—interact with the simulation exclusively via a versioned AgentContext API, ensuring reproducibility and separation of concerns between agent logic and simulation internals.
- Standardized Metrics: Each run yields detailed trade logs, equity curves, return statistics, drawdown, Sharpe ratio, fees, slippage, and fill ratios, enabling apples-to-apples comparisons.
This structure enables systematic study of agent behaviors under identical market dynamics and friction models, controlling for confounding factors such as asynchronous data feeds or mismatched microstructure (Arora et al., 28 Jan 2026).
2. Episode Construction and Data Pipeline
Episodes are defined by ingesting three primary data streams (from Kalshi or compatible exchanges):
- Orderbook Updates: Time-sequenced snapshots of bid/ask prices and depths at multiple price levels.
- Trade Prints: All executed trades, including aggressor (taker) and resting (maker) identifiers, supporting passive and active order matching.
- Lifecycle and Settlement Events: Market opening/closure, trading halts, and binary-resolution settlement for each contract.
The pipeline executes the following transformation:
- Alignment: All records are mapped to a unified UTC timeline, using exchange timestamps and sequence numbers to ensure causal order.
- Windowing: Records are grouped by event identifier into episode windows (from contract open through settlement).
- Normalization: All price data are discretized to integer ticks (e.g., cents), and contract sizes normalized.
- Episode Directory Construction:
| File Name | Content | Purpose |
|---|---|---|
| metadata.json | Market/ticker info, config, bankroll, fees | Full episode and simulator parameterization |
| orderbook.parquet | Timestamped LOB snapshots, levels, sizes | Ground-truth for LOB state at agent step boundaries |
| trades.parquet | All historical fills, aggressor flags | Enables passive fill modeling, reconstructs microstructure |
| settlement.json | Terminal {YES,NO} resolution per ticker | Deterministic PnL and position marking at settlement |
The extensible format allows the addition of further data streams (cancellations, amendments) while preserving backward compatibility.
3. Execution-Realistic Simulator and Market Microstructure
The simulator core is an event-driven loop, advancing at discrete agent decision intervals (Δt, typically 300s). Each cycle:
- Ingests all market events since the previous agent step, updates LOB state, and maintains order queue positions.
- Exposes AgentContext to the agent, providing market state, cash, positions, and outstanding orders.
- Accepts new order/cancel submissions, applies standard maker/taker semantics, and executes LOB matching per pseudocode:
```
function handle_incoming_order(order):
    if order.type in {MARKET, IOC}:
        # Immediate taker: cross the orderbook, reducing qty level by level
    elif order.type in {GTC, POST_ONLY}:
        if crosses_spread(order.price):
            # Partial fill as taker; remainder posts as maker (unless post-only)
        else:
            # Pure maker: join the queue at (price, timestamp)
```
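A minimal executable sketch of this matching logic is shown below. It is deliberately simplified: one book side (resting asks), price-time priority via FIFO queues, and no fees; the class and method names are illustrative, not the framework's internals.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    side: str    # "buy" or "sell"
    price: int   # limit price in integer ticks (ignored for MARKET)
    qty: int
    type: str    # "MARKET", "IOC", "GTC", "POST_ONLY"

class MiniBook:
    """Toy single-contract book: resting ask queues keyed by price tick."""
    def __init__(self):
        self.asks: dict[int, deque] = {}  # price -> FIFO queue of resting qtys

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def handle_incoming_buy(self, order: Order) -> int:
        """Apply taker-then-maker semantics; returns filled quantity."""
        filled = 0
        while order.qty > 0 and self.asks:
            ask = self.best_ask()
            if order.type != "MARKET" and order.price < ask:
                break  # limit no longer crosses the spread
            if order.type == "POST_ONLY":
                return filled  # post-only must never take liquidity
            queue = self.asks[ask]
            take = min(order.qty, queue[0])  # front of queue has time priority
            queue[0] -= take
            order.qty -= take
            filled += take
            if queue[0] == 0:
                queue.popleft()
            if not queue:
                del self.asks[ask]
        # An unfilled GTC/POST_ONLY remainder would post to the bid side
        # as a maker; bid-side bookkeeping is omitted here for brevity.
        return filled
```

For example, a market buy for 7 contracts against asks of 5 and 3 resting at the same price fills 5 then 2, leaving 1 contract resting, while a marketable post-only order is rejected with zero fill.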
Maker and taker fees are charged per contract according to the configured exchange fee schedule, with distinct maker and taker rates. PnL and cash changes respect side, execution price, and role; open positions are marked to terminal value at resolution ($1$ or $0$ per contract upon settlement).
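To illustrate how per-contract fees and settlement marking interact, the sketch below uses a generic quadratic fee curve of the form fee = rate × contracts × p × (1 − p), a common shape for prediction-exchange fee schedules; the rate constants and rounding convention here are placeholders, not the framework's actual parameters.

```python
import math

def trade_fee(contracts: int, price_cents: int, rate: float) -> float:
    """Fee on a quadratic curve, rounded up to the cent (placeholder rounding).
    price_cents: execution price in ticks, 1..99."""
    p = price_cents / 100.0
    return math.ceil(rate * contracts * p * (1 - p) * 100) / 100

def settlement_pnl(contracts: int, entry_cents: int,
                   yes_resolved: bool, side: str) -> float:
    """Mark an open position to terminal value ($1 or $0 per contract).
    side="YES" is a long YES position; side="NO" models short YES at entry."""
    terminal = 100 if yes_resolved else 0
    per_contract = (terminal - entry_cents) if side == "YES" \
        else (entry_cents - terminal)
    return contracts * per_contract / 100.0

# Example: buy 100 YES at 43c as taker (placeholder taker rate 0.07);
# market resolves YES -> +$57.00 gross at settlement, minus the entry fee.
fee = trade_fee(100, 43, rate=0.07)
pnl = settlement_pnl(100, 43, yes_resolved=True, side="YES")
```

Note that the quadratic curve makes fees largest near 50c, which is exactly where prediction-market spreads are tightest, so fee-aware execution matters most for at-the-money contracts.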
4. Agent API and Strategy Evaluation
Agents implement a single on_step(context: AgentContext) method and interact through the context's API tools to query market state, submit or cancel orders, and manage trading decisions. The API strictly controls what each agent can observe, exposing:
- Active tickers and top-of-book quotes
- Full depth-N orderbook snapshot
- Current net position and cash
- Limit/market order placement, with time-in-force and amount
- Order cancellation
All agent actions are timestamped for strict causal replay. The simulator logs all activity, producing detailed trade and equity histories.
A sample Bollinger Bands strategy is included; it trades when the mid-price deviates by more than two standard deviations from the rolling mean, placing post-only limit orders for fee efficiency and risk management.
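The Bollinger Bands baseline can be sketched as follows. The AgentContext method names used here (get_mid, place_limit_order) are illustrative assumptions, not the framework's published API.

```python
from collections import deque
import statistics

class BollingerAgent:
    """Trades when the mid-price leaves a +/- k-sigma band around a rolling
    mean, using post-only limit orders for fee efficiency."""
    def __init__(self, window: int = 20, k: float = 2.0, size: int = 10):
        self.prices = deque(maxlen=window)  # rolling mid-price history
        self.k, self.size = k, size

    def on_step(self, context):
        mid = context.get_mid()             # assumed API: mid-price in ticks
        self.prices.append(mid)
        if len(self.prices) < self.prices.maxlen:
            return                          # warm-up: not enough history yet
        mean = statistics.fmean(self.prices)
        sd = statistics.pstdev(self.prices)
        if sd == 0:
            return                          # flat market: no signal
        if mid > mean + self.k * sd:        # stretched high: fade it
            context.place_limit_order(side="SELL", price=mid,
                                      qty=self.size, post_only=True)
        elif mid < mean - self.k * sd:      # stretched low: buy the dip
            context.place_limit_order(side="BUY", price=mid,
                                      qty=self.size, post_only=True)
```

Posting at the current mid rather than crossing the spread keeps the agent on the maker side of the fee schedule, at the cost of uncertain fills.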
5. Benchmark Episodes and Baseline Analyses
Four representative episodes, derived from Kalshi orderbooks and settlements (January 2026), are provided:
| Episode | Domain | Tickers | LOB Snapshots | Trades | Duration (hrs) |
|---|---|---|---|---|---|
| KXBTCD-26JAN2017 | Crypto | 23 | 312k | 6.3k | 37.4 |
| KXHIGHNY-26JAN20 | Weather | 6 | 50k | 8k | 37.4 |
| KXNCAAF-26 | College Football | 2 | 8.3k | 171.8k | 37.4 |
| KXNFLGAME-26JAN11BUFJAC | NFL | 2 | 8k | 111.2k | 67.4 |
Evaluation metrics for each agent and episode include cumulative return, maximum drawdown (MaxDD), Sharpe ratio (computed on standardized periodic returns), and fill ratio (executed/submitted contracts).
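These metrics can be computed directly from an equity curve, as in the minimal sketch below; the framework's own reporting may differ in return conventions and Sharpe annualization.

```python
import statistics

def summarize(equity: list[float], executed: int, submitted: int) -> dict:
    """Cumulative return, max drawdown, per-period Sharpe, and fill ratio
    from an equity curve sampled at agent decision steps."""
    returns = [equity[i] / equity[i - 1] - 1 for i in range(1, len(equity))]
    cum_return = equity[-1] / equity[0] - 1
    peak, max_dd = equity[0], 0.0
    for v in equity:                       # drawdown vs running peak
        peak = max(peak, v)
        max_dd = max(max_dd, (peak - v) / peak)
    mu = statistics.fmean(returns)
    sd = statistics.pstdev(returns)
    sharpe = mu / sd if sd > 0 else float("nan")  # per-period, not annualized
    return {"return": cum_return, "max_dd": max_dd,
            "sharpe": sharpe, "fill_ratio": executed / submitted}

# Example: a $1,000 bankroll over four decision steps, 47 of 50 fills.
stats = summarize([1000, 1010, 990, 1020], executed=47, submitted=50)
```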
Baseline results for three strategies (starting bankroll \$1,000, 5-min cadence):
- RandomAgent: Low intensity (~20 trades/episode), negligible returns (–0.13%), near-zero drawdown, fee drag dominates.
- LLM Agent (gpt-4.1-nano): High frequency, large settlement losses and taker fees, total PnL –2.77%, large MaxDD (36%).
- Bollinger Bands (fee-aware): Post-only limit orders with 1.75% maker fee; +1.67% PnL overall, 3.18% MaxDD, strong performance in volatile BTC episode, near breakeven in others.
Analyses confirm that transaction costs and settlement exposure rapidly erode naive agent performance, and maker/fee management are critical for alpha preservation. High fill rates (>94%) were achieved by limit-order strategies; aggressive agents suffered from partial fills and fee-related losses.
6. Reproducibility, Extension, and Best Practices
PredictionMarketBench is fully open and version-controlled. Standard procedure for reproducibility and extension comprises:
- Clone the repository:

```
git clone https://github.com/Oddpool/PredictionMarketBench.git
```

- Install dependencies and download the episodes archive into `./episodes/`.
- Execute the harness:

```
python run_benchmark.py \
  --agent path/to/MyAgent.py \
  --episodes ./episodes/ \
  --cadence 300 \
  --initial-bankroll 1000
```

- Outputs are structured in `./results/`, including trade CSVs, equity curves, and a summary report.
- To add new episodes, populate a directory with `{metadata.json, orderbook.parquet, trades.parquet, settlement.json}` conforming to the schema.
- Agents are implemented by subclassing `Agent` in the designated module, overriding `on_step()`, and registering via the module init.
The benchmark structure supports modular addition of new contract types, markets, fee models, and agent paradigms.
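For orientation, a hypothetical `metadata.json` consistent with the episode parameterization described above might look like the following; the field names and fee rates are assumptions inferred from this article, not the published schema.

```json
{
  "ticker": "KXBTCD-26JAN2017",
  "domain": "crypto",
  "tick_size_cents": 1,
  "cadence_seconds": 300,
  "initial_bankroll": 1000.0,
  "fees": {
    "maker_rate": 0.0175,
    "taker_rate": 0.07
  }
}
```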
7. Experimental Design Considerations and Microstructure Insights
PredictionMarketBench's microstructure realism is complemented by a legacy of prior experimental work combining controlled market simulations with explicit liquidity and information-flow modeling (Brahma et al., 2010). Foundational results from experiments comparing the logarithmic market scoring rule (LMSR) and Bayesian market maker (BMM) underscore the importance of bounded loss, price stability, and adaptability to shocks:
- LMSR:
  - Cost function: $C(\mathbf{q}) = b \ln \sum_{i} e^{q_i/b}$, with spot prices given by the softmax $p_i = e^{q_i/b} / \sum_j e^{q_j/b}$
  - Bounded loss: worst-case market-maker loss is at most $b \ln n$ for $n$ outcomes
- BMM:
  - Spot price: Bayesian update of belief mean
  - Adaptivity via sliding window over trade-inferred value intervals; unbounded loss in adversarial conditions
Performance metrics such as RMSD from true value, PnL, shock-response time, and fill ratios provide rigorous comparison of agent and market-maker behaviors. The experimental symmetry and episode-encapsulation methods in PredictionMarketBench parallel best practices from this literature, ensuring robust, unbiased benchmarking (Brahma et al., 2010, Arora et al., 28 Jan 2026).
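The LMSR bounded-loss property can be checked numerically with the standard formulas, as in the standalone sketch below (independent of either benchmark's code).

```python
import math

def lmsr_cost(q: list[float], b: float) -> float:
    """LMSR cost function C(q) = b * ln(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def lmsr_price(q: list[float], b: float, i: int) -> float:
    """Instantaneous price of outcome i: softmax of q / b."""
    z = sum(math.exp(qj / b) for qj in q)
    return math.exp(q[i] / b) / z

b, n = 100.0, 2
q0 = [0.0, 0.0]                 # fresh binary market: both outcomes at 0.5

# Worst case for the market maker: traders buy heavily into one outcome,
# which then wins. Loss = payout owed minus revenue collected, and it
# stays below b * ln(n) no matter how far q is pushed.
q_final = [10_000.0, 0.0]
revenue = lmsr_cost(q_final, b) - lmsr_cost(q0, b)
loss = q_final[0] - revenue     # payout of 1 per winning share, minus revenue
```

Here the loss approaches, but never exceeds, the $b \ln n \approx 69.31$ bound, which is the property that makes LMSR attractive as a subsidized market maker.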