Kalshi-based episodes are a suite of trading environment reconstructions derived from real Kalshi market data, offering deterministic, event-driven simulations.
They integrate orderbook, trade, and lifecycle data using precise timestamp ordering to emulate realistic market microstructure and fee structures.
This platform enables researchers to rigorously backtest trading strategies and assess performance under transaction costs, settlement constraints, and volatile conditions.
Kalshi-based episodes are a standardized suite of trading environment reconstructions derived from the Kalshi CFTC-regulated U.S. prediction market, as implemented in the PredictionMarketBench framework. These episodes offer a deterministic, event-driven platform for evaluating algorithmic and LLM-based trading agents using replayed historical limit-order-book and trade data. Each episode encapsulates a distinct prediction market context—spanning cryptocurrency, weather, and sports—for the systematic backtesting of agent behaviors under realistic market microstructure, transaction cost, and settlement constraints (Arora et al., 28 Jan 2026).
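<h2 class='paper-heading' id='overview-of-kalshi-based-episodes'>1. Overview of Kalshi-based Episodes</h2>
<p>The four Kalshi-based episodes are constructed directly from raw Kalshi market data streams, with sampling concentrated in January 2026. Each episode is characterized by its prediction domain, ticker structure, and event window:</p>
<table>
<thead>
<tr><th>Episode ID</th><th>Domain</th><th>Tickers</th><th>Duration</th><th>Orderbook snapshots</th><th>Trades</th></tr>
</thead>
<tbody>
<tr><td>KXBTCD-26JAN2017</td><td>Crypto</td><td>23</td><td>37.4 h</td><td>311,998</td><td>6,283</td></tr>
<tr><td>KXHIGHNY-26JAN20</td><td>Weather</td><td>6</td><td>37.4 h</td><td>50,231</td><td>8,044</td></tr>
<tr><td>KXNCAAF-26</td><td>Sports (CFB)</td><td>2</td><td>37.4 h</td><td>8,320</td><td>171,786</td></tr>
<tr><td>KXNFLGAME-26JAN11BUFJAC</td><td>Sports (NFL)</td><td>2</td><td>67.4 h</td><td>8,047</td><td>111,160</td></tr>
</tbody>
</table>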
<ul>
<li><strong>KXBTCD-26JAN2017</strong>: Bitcoin daily high threshold prediction with 23 YES/NO contracts (“Did BTC close above X?”) in a ≈37.4 hour window.</li>
<li><strong>KXHIGHNY-26JAN20</strong>: NYC weather episode with 6 discrete high-temperature thresholds (“Will NYC high exceed T°?”), spanning the same interval.</li>
<li><strong>KXNCAAF-26</strong>: College Football season-long futures with 2 championship outcome tickers; same time span, but a far higher trade count (171,786 trades).</li>
<li><strong>KXNFLGAME-26JAN11BUFJAC</strong>: NFL single-game spread bet (Buffalo vs. Jacksonville) over ≈67.4 hours.</li>
</ul>
<p>Each episode encompasses raw orderbook updates, trade prints, and settlement streams for its associated tickers. All content is pre-sharded by event identifier and organized under an episode directory structure (metadata.json, orderbook.parquet, trades.parquet, settlement.json).</p>
<h2 class='paper-heading' id='data-extraction-and-state-construction'>2. Data Extraction and State Construction</h2>
<p>Episodes are built from three market data streams (orderbook updates, trades, and lifecycle events), aligned to a global UTC timestamp and further disambiguated by sequence numbers. The episode's state at each agent decision time $t$ (with agent cadence $\Delta t$, e.g., 5 minutes) comprises, for each ticker $i$:</p>
<ul>
<li>$\mathrm{best\_bid}_i(t)$, $\mathrm{best\_ask}_i(t)$</li>
<li>Mid-price: $m_i(t) = \frac{\mathrm{best\_bid}_i(t) + \mathrm{best\_ask}_i(t)}{2}$</li>
<li>Top-$N$ orderbook level depths: volumes at adjacent ticks around the best bid and offer (BBO)</li>
<li>History vector: last $M$ mid-price values $[m_i(t - j\,\Delta t)]_{j=1}^{M}$</li>
<li>Agent's open positions $\mathrm{pos}_i(t)$, cash balance $\mathrm{cash}(t)$</li>
<li>List of active, unfilled orders (ID, side, price, size, time-in-force)</li>
</ul>
<p>Feature engineering supports derived metrics such as:</p>
<ul>
<li>Simple moving average (SMA) of mid-price over window $W$:</li>
</ul>
<p>$$\mathrm{SMA}_i(t) = \frac{1}{W}\sum_{j=0}^{W-1} m_i(t - j\,\Delta t)$$</p>
<ul>
<li>Rolling standard deviation $\sigma_i(t)$ over the same window.</li>
</ul>
<p>Raw data is presented as Parquet and JSON files. Timestamps and sequence IDs ensure unambiguous global event ordering.</p>
<h2 class='paper-heading' id='agent-actions-execution-and-reward-structure'>3. Agent Actions, Execution, and Reward Structure</h2>
<p>The action interface (via the AgentContext API) exposes:</p>
<ul>
<li><code>submit_limit_order(ticker, side ∈ {BUY, SELL}, price, size, tif ∈ {IOC, GTC, POST_ONLY})</code></li>
<li><code>submit_market_order(ticker, side, size)</code></li>
<li><code>cancel_order(order_id)</code></li>
</ul>
<p>Execution uses maker/taker fee modeling:</p>
<ul>
<li>Taker fee at price $p$ (in cents): $f_{\mathrm{taker}}(p) = 0.07 \times p \times (1 - p/100)$</li>
<li>Maker fee at price $p$ (in cents): $f_{\mathrm{maker}}(p) = 0.0175 \times p \times (1 - p/100)$</li>
<li>Transaction cost for size $Q$ at price $p$:</li>
</ul>
<p>$$\mathrm{Cost} = f_{\mathrm{role}}(p) \times p \times Q$$</p>
<p>with $\mathrm{role} \in \{\text{maker}, \text{taker}\}$.</p>
<p>Reward at each timestep is:</p>
<p>$$r_t = \sum_i \bigl(\mathrm{pos}_i(t) - \mathrm{pos}_i(t-\Delta t)\bigr)\, m_i(t) - \mathrm{fees}_t$$</p>
<p>At terminal settlement, all open positions are settled at outcome $o_i \in \{0,1\}$ with:</p>
<p>$$\mathrm{SettlementPL} = \sum_i \mathrm{pos}_i(T)\, \bigl(o_i - m_i(T^-)\bigr)$$</p>
<p>Total episodic reward is $\sum_t r_t + \mathrm{SettlementPL}$ and incorporates mark-to-market P&L, transaction costs, and settlement corrections.</p>
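<p>As a worked illustration of Section 3, the sketch below evaluates the fee function $f_{\mathrm{role}}(p)$ and the settlement term for a few hand-picked numbers. The function names and sample values are assumptions made for illustration and are not part of the PredictionMarketBench API. The settlement helper additionally assumes that the final mid-prices are expressed on the same 0 to 1 scale as the outcomes.</p>

```python
# Coefficients quoted in the fee definitions (price p in cents).
FEE_COEFFICIENTS = {"taker": 0.07, "maker": 0.0175}


def fee_rate(price_cents: float, role: str) -> float:
    """Evaluate f_role(p) = c_role * p * (1 - p/100) for p in cents."""
    if role not in FEE_COEFFICIENTS:
        raise ValueError(f"unknown role: {role!r}")
    return FEE_COEFFICIENTS[role] * price_cents * (1 - price_cents / 100)


def settlement_pl(final_pos: dict, outcomes: dict, last_mids: dict) -> float:
    """SettlementPL = sum_i pos_i(T) * (o_i - m_i(T^-)).

    Assumption: last_mids are on the same 0..1 scale as the outcomes.
    """
    return sum(final_pos[k] * (outcomes[k] - last_mids[k]) for k in final_pos)


# The fee term is largest at p = 50 and vanishes toward p = 0 or p = 100.
for p in (5, 50, 95):
    print(p, round(fee_rate(p, "taker"), 4), round(fee_rate(p, "maker"), 4))

# Long 10 contracts that resolve YES, short 5 contracts that resolve NO.
example_pl = settlement_pl(
    final_pos={"A": 10, "B": -5},
    outcomes={"A": 1, "B": 0},
    last_mids={"A": 0.6, "B": 0.3},
)
print(round(example_pl, 4))
```

<p>Because the maker coefficient is one quarter of the taker coefficient, resting orders pay a quarter of the taker fee term at any given price under these formulas.</p>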
<h2 class='paper-heading' id='deterministic-replay-and-simulation-pipeline'>4. Deterministic Replay and Simulation Pipeline</h2>
<p>The environment employs a deterministic, event-driven simulator to ensure reproducibility and fair comparison across trading agents. The canonical replay pipeline executes as follows:</p>
<pre><code>for episode_dir in episodes:
    # Load episode metadata and the three raw event streams.
    meta = load(metadata.json)
    OB_events = stream(orderbook.parquet)
    Trade_events = stream(trades.parquet)
    Life_events = stream(settlement.json + lifecycle info)

    # Single global ordering: timestamp first, sequence number breaks ties.
    E = merge_and_sort([OB_events, Trade_events, Life_events], key=(timestamp, sequence_number))

    sim = Simulator(meta.fee_model, meta.execution_mode)
    current_time = meta.start_time
    event_ptr = 0

    while not sim.settlement_processed():
        next_decision = current_time + meta.cadence
        # Replay every event up to the next decision time (bounds-checked).
        while event_ptr < len(E) and E[event_ptr].timestamp <= next_decision:
            sim.process_event(E[event_ptr])
            event_ptr += 1
        obs = sim.get_observation(current_time=next_decision)
        actions = agent.step(obs)
        sim.apply_actions(actions)
        current_time = next_decision

    # Drain any events left after settlement, then close remaining positions.
    while event_ptr < len(E):
        sim.process_event(E[event_ptr])
        event_ptr += 1
    sim.close_all_positions()
</code></pre>
<p>All episodic data is strictly partitioned by event identifier. Sequence numbers resolve any tie in event timestamps. This design enables precise event ordering and strict replay determinism, and it supports both classical and tool-calling <a href="https://www.emergentmind.com/topics/llm-agents" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">LLM agents</a> with reproducible trajectories.</p>
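<p>The tie-breaking rule can be sketched with the standard library alone. The <code>Event</code> record and <code>merge_streams</code> helper below are hypothetical stand-ins, not the framework's own types; the sketch assumes each stream is already sorted and that sequence numbers are unique across streams.</p>

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float        # UTC seconds (hypothetical representation)
    sequence_number: int    # assumed unique across all streams
    stream: str             # "orderbook", "trade", or "lifecycle"


def merge_streams(*streams):
    """Merge pre-sorted streams into one list ordered by (timestamp, sequence_number)."""
    return list(heapq.merge(*streams, key=lambda e: (e.timestamp, e.sequence_number)))


orderbook = [Event(1.0, 1, "orderbook"), Event(2.0, 4, "orderbook")]
trades = [Event(1.0, 2, "trade"), Event(2.0, 3, "trade")]
lifecycle = [Event(3.0, 5, "lifecycle")]

merged = merge_streams(orderbook, trades, lifecycle)
for event in merged:
    print(event.timestamp, event.sequence_number, event.stream)
```

<p>With unique sequence numbers the merged order does not depend on the order in which the streams are passed, which is the property replay determinism relies on. If two events shared both timestamp and sequence number, <code>heapq.merge</code> would fall back to argument order.</p>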
<h2 class='paper-heading' id='usage-api-access-and-key-statistics'>5. Usage, API Access, and Key Statistics</h2>
<p>Researchers interact with Kalshi-based episodes programmatically via the PredictionMarketBench Python API. Core workflow:</p>
<pre><code>from predictionmarketbench import BenchmarkHarness, AgentContext, Simulator

episodes = BenchmarkHarness.list_episodes()  # ['KXBTCD-...', ...]
epi = BenchmarkHarness.load_episode('KXBTCD-26JAN2017')  # loads all data
sim = Simulator(fee_model=epi.metadata.fee_model, execution_mode=epi.metadata.execution_mode)
ctx = AgentContext(sim)

t = epi.metadata.start_time
while not sim.settlement_processed():
    t_next = t + epi.metadata.agent_cadence
    sim.replay_until(t_next)  # process all events up to the next decision time
    obs = ctx.get_observation(time=t_next)  # dict: market/bbo/depth/pos/cash
    actions = my_agent.policy(obs)  # my_agent: user-supplied agent object
    for a in actions:
        ctx.place_order(**a)
    t = t_next

pnl, trade_log, equity_curve = sim.get_results()
</code></pre>
<p>Observations are Python dicts mapping tickers to current quotes, depth arrays, positions, and cash. Actions are lists of dicts specifying ticker, side, order type, price, size, and TIF. The simulator outputs deterministic logs, timestamped fills, transactional fees, and detailed P&L records for reproduction and offline analysis.</p>
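<p>To make the observation and action shapes concrete, here is a toy policy over a hand-built observation. Every dictionary key used below (<code>markets</code>, <code>best_bid</code>, <code>best_ask</code>, <code>cash</code>, <code>order_type</code>, <code>tif</code>) is a hypothetical name chosen for illustration; the actual schema and the keyword arguments accepted by <code>place_order</code> should be taken from the framework itself.</p>

```python
def toy_policy(obs, max_spread=4, size=1):
    """Return order dicts for tickers whose bid/ask spread is at most max_spread cents."""
    actions = []
    for ticker, quote in obs["markets"].items():
        bid, ask = quote["best_bid"], quote["best_ask"]
        if bid is None or ask is None:
            continue  # one-sided book: no usable spread, skip this ticker
        if ask - bid <= max_spread and obs["cash"] >= bid * size:
            actions.append({
                "ticker": ticker,
                "side": "BUY",
                "order_type": "limit",
                "price": bid,        # join the best bid
                "size": size,
                "tif": "POST_ONLY",  # rest on the book rather than cross the spread
            })
    return actions


# Hand-built observation (hypothetical schema; prices and cash in cents).
sample_obs = {
    "markets": {
        "A": {"best_bid": 48, "best_ask": 50},
        "B": {"best_bid": 20, "best_ask": 40},
        "C": {"best_bid": None, "best_ask": 5},
    },
    "cash": 1000,
}

sample_actions = toy_policy(sample_obs)
print(sample_actions)
```

<p>Only ticker A qualifies here: B's spread is too wide and C has no bid. The policy is deliberately simplistic and is not a baseline from the benchmark.</p>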
<p>Key statistics for each episode (duration, orderbook snapshots, trade volume, and ticker count) are summarized in the Section 1 table. The number of decision steps per episode is determined by the episode duration and the agent cadence (e.g., 37.4 h at a 5-minute cadence gives ≈448 steps). Aggregate volatility can be computed offline as $\sigma_{\mathrm{episode}} = \mathrm{std}(\Delta m_i(t))$, the standard deviation of mid-price changes.</p>
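<p>The step-count arithmetic and the volatility metric can be checked with a few lines of standard-library Python. The helper names are illustrative, the mid-price series is made up, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not specify which is intended.</p>

```python
import math
import statistics


def decision_steps(duration_hours: float, cadence_minutes: float) -> int:
    """Number of full decision intervals that fit in the episode window."""
    return math.floor(duration_hours * 60 / cadence_minutes)


def mid_change_volatility(mids) -> float:
    """Sample standard deviation of consecutive mid-price changes."""
    changes = [later - earlier for earlier, later in zip(mids, mids[1:])]
    return statistics.stdev(changes)


steps_short = decision_steps(37.4, 5)   # the three 37.4 h episodes
steps_long = decision_steps(67.4, 5)    # the 67.4 h NFL episode
print(steps_short, steps_long)

# Made-up mid-price path in cents; changes are [2, -1, 4].
volatility = mid_change_volatility([50, 52, 51, 55])
print(round(volatility, 4))
```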
<h2 class='paper-heading' id='research-implications-and-observed-dynamics'>6. Research Implications and Observed Dynamics</h2>
<p>The standardized Kalshi-based episodes offer a unique backtesting corpus with fee and settlement mechanisms characteristic of real prediction markets. Baseline analyses demonstrate that naive trading agents can underperform due to cumulative transaction costs and adverse settlement effects, while algorithmic, fee-aware agents display robustness in volatile regimes (Arora et al., 28 Jan 2026). This property highlights the critical influence of microstructure and execution modeling in algorithmic market design and validation. A plausible implication is that agents relying solely on directional signal without transaction cost modeling will systematically underperform relative to microstructure-sensitive strategies.</p>
<p>The tool supports studies into agent adaptivity, liquidity provision, settlement risk management, and the development of testable, reproducible results across artificial and learned agent classes. The strict replay determinism and event-partitioned design of Kalshi-based episodes represent a methodological advance aligning with best practices in empirical market microstructure and reinforcement learning benchmark design.</p>