DSLOB: Distributional Shift LOB Benchmark
- DSLOB is a synthetic benchmark dataset that uses agent-based simulation to generate labeled in-distribution and out-of-distribution limit order book data.
- It represents LOB states via 100 consecutive snapshots of the top 10 bid/ask levels, with regime labels distinguishing normal conditions from small and large shocks.
- Empirical results reveal that models unaware of distribution shifts struggle under stress, while shift-aware architectures like AdaRNN demonstrate enhanced robustness.
DSLOB (Distributional Shift Limit Order Book) is a synthetic benchmark dataset designed for the systematic evaluation of forecasting algorithms on limit order book (LOB) data under controlled distributional shift scenarios. Developed using a multi-agent market simulator, DSLOB provides labeled in-distribution (IID) and out-of-distribution (OOD) market stress samples to facilitate rigorous, repeatable comparisons of model robustness in high-frequency financial time series forecasting. This resource enables the quantification of generalization performance and adaptation strategies for LOB-based methods, which is otherwise infeasible with real-world, unlabeled LOB datasets (Cao et al., 2022).
1. Synthetic Data Generation and Agent-Based Market Simulation
DSLOB is constructed using the ABIDES event-based simulator, which emulates a modern NASDAQ-style, FIFO-matching electronic exchange populated by multiple agent types. The agent population for DSLOB comprises:
- Noise Agents (N_noise = 50): These place buy/sell limit orders of randomly selected sizes and directions, with inter-arrival times drawn from a uniform distribution over 1–100 ns.
- Momentum Agents (N_mom = 10): Each agent wakes deterministically every T_MOM seconds and places market orders based on the sign of the difference between moving averages of the mid-price over a short window (T_min = 20) and a long window (T_max = 50): buy if MA_{T_min} > MA_{T_max}, sell otherwise.
- Market-Maker Agent (N_MM = 1): Wakes every T_MM=5 s to place symmetric bid/ask quotes.
- Value Agents (N_value = 100): Each agent observes a noisy private estimate of a fundamental value r_t following an Ornstein–Uhlenbeck process, dr_t = κ(r̄ − r_t) dt + σ dW_t, with arrival times modeled by a (non-)homogeneous Poisson process with baseline arrival rate λ per second. Observation noise is added to each agent's estimate.
- Regime-Shifting Shocks: To create OOD conditions, a shock is injected at time t_s, consisting of a mid-price jump of magnitude Δp and a subsequent post-shock surge in the value-agent arrival rate from λ to λ'. The magnitudes of Δp and λ' differentiate the small- and large-shock regimes.
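The shock mechanics above can be sketched as follows. This is a toy illustration, not the paper's simulator: the parameter names (`kappa`, `theta`, `sigma`, `jump`, the arrival rates) and their values are assumptions, and the jump is applied to the simulated fundamental as a stand-in for the mid-price shock.

```python
# Hypothetical sketch: OU fundamental with an injected shock at t_shock,
# plus a post-shock surge in the (Poisson) value-agent arrival rate.
import numpy as np

def simulate_fundamental(T=1000, dt=1.0, kappa=0.05, theta=100.0, sigma=0.2,
                         t_shock=500, jump=-5.0, seed=0):
    """Euler-Maruyama simulation of dX = kappa*(theta - X)*dt + sigma*dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = theta
    for t in range(1, T):
        x[t] = x[t - 1] + kappa * (theta - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()
        if t == t_shock:
            x[t] += jump  # exogenous shock: instantaneous jump at t_shock

    # Post-shock surge in value-agent arrival rate (illustrative 1 -> 5 events/s)
    lam = np.where(np.arange(T) < t_shock, 1.0, 5.0)
    arrivals = rng.poisson(lam * dt)  # per-second arrival counts
    return x, arrivals
```

Varying `jump` and the post-shock rate is what would separate small- and large-shock regimes in this toy setting.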
This controlled simulation paradigm enables the generation of realistic, labeled OOD LOB data, overcoming the limitations of unlabeled distributional shifts in public datasets.
2. LOB Representation and Dataset Specification
At each time step t, the LOB state s_t is defined by the top 10 levels on both bid and ask sides:
s_t = (p^a_{t,i}, v^a_{t,i}, p^b_{t,i}, v^b_{t,i}), i = 1, …, 10,
where p^{a/b}_{t,i} and v^{a/b}_{t,i} are the price and volume at ask/bid level i.
Prediction input sequences are constructed as X_t = (s_{t−99}, …, s_t), i.e., 100 consecutive snapshots. The mid-price,
m_t = (p^a_{t,1} + p^b_{t,1}) / 2,
serves as the regression target y_t. Each data point also includes a regime label indicating in-distribution (IID: no shock) or OOD (small/large shock).
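A minimal sketch of this windowing, assuming a flat per-snapshot feature layout with best ask price and best bid price in the first and third columns (the exact column order in DSLOB's files is not specified here):

```python
# Illustrative construction of (lookback, target) pairs from raw snapshots.
import numpy as np

L, N_LEVELS = 100, 10            # 100 snapshots, top-10 bid/ask levels
FEATS = 4 * N_LEVELS             # (ask price, ask vol, bid price, bid vol) per level

def make_windows(snapshots, mid_prices):
    """snapshots: (T, FEATS) array; returns X: (T-L, L, FEATS) and y: mid-price."""
    T = snapshots.shape[0]
    X = np.stack([snapshots[t - L:t] for t in range(L, T)])
    y = mid_prices[L:]           # 1-step-ahead regression target
    return X, y

# Mid-price from best ask/bid (columns 0 and 2 in this assumed layout)
snaps = np.random.rand(300, FEATS).astype(np.float32)
mid = (snaps[:, 0] + snaps[:, 2]) / 2.0
X, y = make_windows(snaps, mid)
```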
The full dataset comprises 365 simulated trading days (∼1 year), sampled at 1 Hz, totaling 8.55 million snapshots. Regime breakdown is: 182 IID days, 91 small-shock days, 91 large-shock days. Data are stored in HDF5 format with explicit fields:
- "X": float32
- "y_mid": float32
- "regime": uint8 in (IID, small, large)
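Given that layout, a regime-filtered loader might look like the following sketch. The dataset names match the fields above; the file path and helper name are illustrative, not part of the released tooling.

```python
# Hypothetical loader for the HDF5 layout described above.
import h5py
import numpy as np

REGIMES = {0: "IID", 1: "small", 2: "large"}  # assumed uint8 encoding

def load_regime(path, regime):
    """Return (X, y_mid) restricted to one regime label."""
    with h5py.File(path, "r") as f:
        mask = f["regime"][:] == regime
        return f["X"][:][mask], f["y_mid"][:][mask]
```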
3. Controlled Evaluation Regimes and Distributional Shift Design
DSLOB partitions days into three regimes to isolate the effect of distribution shifts:
- IID: No exogenous shock; behavior reflects "normal" conditions.
- Small-Shock OOD: exogenous shocks with the smaller mid-price jump magnitude and post-shock arrival-rate surge.
- Large-Shock OOD: shocks with the larger jump magnitude and arrival-rate surge.
For each regime, the data distribution has the same conditional P(y | X) but a distinct marginal P(X), due to the altered arrival rates and shock responses. The dataset structure supports explicit quantification of OOD generalization by comparing algorithm performance on held-out OOD splits, an evaluation unattainable with unannotated real LOB data.
This design enables direct computation of shift magnitudes, e.g., using divergences between price-return histograms or Wasserstein distances, although these are not computed in the paper.
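One such shift measure can be computed directly from mid-price series; the sketch below uses the 1-Wasserstein distance between 1-step return samples (the paper itself does not report these numbers, and the helper name is illustrative):

```python
# Quantifying shift magnitude between two days/regimes via the 1-Wasserstein
# distance between their one-step mid-price return distributions.
import numpy as np
from scipy.stats import wasserstein_distance

def shift_magnitude(mid_a, mid_b):
    """Wasserstein distance between return distributions of two mid-price series."""
    r_a = np.diff(mid_a) / mid_a[:-1]   # simple one-step returns
    r_b = np.diff(mid_b) / mid_b[:-1]
    return wasserstein_distance(r_a, r_b)
```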
4. Benchmark Targets, Training Protocols, and Metrics
The primary forecasting task is 1-step mid-price regression, given the past 100 LOB snapshots. Models f_θ are trained with mean squared error (MSE):
L(θ) = (1/N) Σ_{i=1}^{N} (f_θ(X_i) − y_i)².
Performance is quantified using root-mean-squared error (RMSE) on the in-distribution (IID) test set and on each OOD shock regime. Mean absolute error may also be computed for auxiliary analysis.
Best practices, as recommended, require that models train exclusively on IID days. Small- and large-shock periods are held out for OOD/out-of-sample evaluation. Regime labels must not be used for direct training on large shock data unless explicitly adopting a domain adaptation framework.
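The recommended protocol can be expressed as a simple regime-aware split. The sketch below is illustrative: the 80/20 chronological split of IID data and the function name are assumptions, while the train-on-IID-only, evaluate-per-regime structure follows the text.

```python
# Regime-aware split: train only on IID samples, hold out both shock regimes.
import numpy as np

def protocol_split(X, y, regime):
    """regime: per-sample uint8 labels (0=IID, 1=small shock, 2=large shock)."""
    splits = {}
    idx = np.flatnonzero(regime == 0)
    cut = int(0.8 * idx.size)            # chronological 80/20 split of IID data
    splits["train"] = (X[idx[:cut]], y[idx[:cut]])
    splits["iid_test"] = (X[idx[cut:]], y[idx[cut:]])
    # Shock regimes are evaluation-only: never seen during training
    splits["ood_small"] = (X[regime == 1], y[regime == 1])
    splits["ood_large"] = (X[regime == 2], y[regime == 2])
    return splits
```

Under a domain adaptation framework, a limited slice of the OOD splits could additionally be exposed to the model, but that choice should be reported explicitly.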
5. Baseline Models and Empirical Results
Three baseline models, representative of state-of-the-art time series and LOB forecasting methodologies, are evaluated:
| Model | IID Test RMSE | Small Shock RMSE | Large Shock RMSE |
|---|---|---|---|
| AdaRNN | 1.02 ± 2.2e–4 | 1.00 ± 8.1e–5 | 0.99 ± 3.6e–5 |
| Transformer | 0.87 ± 1.98e–3 | 1.02 ± 6.3e–3 | 1.08 ± 0.01 |
| DeepLOB | 0.66 ± 0.11 | 1.08 ± 0.07 | 2.25 ± 0.15 |
- AdaRNN (Temporal Distribution Characterization + Temporal Distribution Matching): Maintains robust RMSE across shock regimes, reflecting its explicit adaptation to distributional shift.
- Transformer-Encoder: Competitive IID RMSE (0.87), but exhibits degradation (roughly 17%–24% RMSE increase) under small and large shocks, respectively.
- DeepLOB (1D CNN + Inception + LSTM): Best IID performance (RMSE 0.66), but RMSE increases roughly 3.4× under large shocks.
Key finding: Distribution-shift-agnostic models generalize poorly under regime change, while AdaRNN's robustness highlights the importance of shift-aware architectures.
6. Usage Guidelines and Evaluation Protocols
- Regime-Specific Splits: Train only on IID days, evaluate on both IID and each OOD regime. Report results for all regimes separately.
- Adaptation vs. Generalization: Use regime labels to distinguish between domain adaptation (limited OOD used for model adjustment) and domain generalization (no OOD exposure during training).
- Distribution Matching Losses: Incorporate penalties such as Maximum Mean Discrepancy or Wasserstein distance between pre- and post-shock features for enhanced robustness:
  L_total = L_MSE + λ · D(Z_pre, Z_post),
  where D is a discrepancy metric between pre-shock and post-shock feature distributions Z_pre and Z_post, and λ weights the penalty.
- Quantifying Shift: Compute divergences (e.g., KL divergence or Wasserstein distance) between return histograms for cross-day comparisons of shift magnitude.
- Software Support: All data loaders and evaluation scripts are available in Python (PyTorch), with Jupyter notebook examples included to expedite reproducibility and adoption.
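As a concrete instance of the distribution-matching penalty above, the sketch below adds a biased RBF-kernel MMD² estimate between pre- and post-shock feature batches to the MSE loss. The kernel bandwidth `gamma` and weight `lam` are illustrative assumptions, not values from the paper.

```python
# Sketch: RBF-kernel MMD^2 penalty between pre- and post-shock features,
# added to the MSE objective with weight lam.
import numpy as np

def rbf_mmd2(a, b, gamma=1.0):
    """Biased MMD^2 estimate between sample batches a (n, d) and b (m, d)."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()

def total_loss(mse, feats_pre, feats_post, lam=0.1):
    """L_total = L_MSE + lam * D(Z_pre, Z_post) with D = RBF-MMD^2."""
    return mse + lam * rbf_mmd2(feats_pre, feats_post)
```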
7. Context, Limitations, and Research Implications
DSLOB addresses the longstanding challenge of controlled OOD benchmarking for LOB forecasting. By providing labeled, regime-aware splits, it enables a new class of generalization and robustness studies in algorithmic finance. However, as a synthetic benchmark, it relies on the representativeness of agent models and regime dynamics. The labeling of OOD events, as well as the adoption of specific fundamental processes and shock dynamics, is a modeling choice; real-market OODs may exhibit additional complexities.
A plausible implication is that, while DSLOB sharpens quantitative assessments of robustness to stress-regime shifts, absolute generalization to real-market OODs remains contingent on the fidelity of agent-based simulations. Further, the existing agent composition and shock types could be extended to capture richer market microstructure phenomena.
For benchmarking, DSLOB is positioned as the de facto standard for systematic, regime-controlled OOD analysis in high-frequency limit order book modeling (Cao et al., 2022).