Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance

Published 2 Mar 2026 in q-fin.TR and cs.LG | (2603.01820v1)

Abstract: We present a large scale benchmark of modern deep learning architectures for a financial time series prediction and position sizing task, with a primary focus on Sharpe ratio optimization. Evaluating linear models, recurrent networks, transformer based architectures, state space models, and recent sequence representation approaches, we assess out of sample performance on a daily futures dataset covering commodities, equity indices, bonds, and FX from 2010 to 2025. Our evaluation goes beyond average returns and includes statistical significance, downside and tail risk measures, breakeven transaction cost analysis, robustness to random seed selection, and computational efficiency. We find that models explicitly designed to learn rich temporal representations consistently outperform linear benchmarks and generic deep learning models, which often lead the ranking in standard time series benchmarks. Hybrid models such as VSN with LSTM, a combination of Variable Selection Networks (VSN) and LSTMs, achieve the highest overall Sharpe ratio, while VSN with xLSTM and LSTM with PatchTST exhibit superior downside adjusted characteristics. xLSTM demonstrates the largest breakeven transaction cost buffer, indicating improved robustness to trading frictions.

Summary

The paper introduces a large-scale benchmark that evaluates deep learning architectures for financial time series by directly optimizing the Sharpe ratio.
The methodology employs an end-to-end pipeline with engineered features, sequence models, and risk-normalized training for robust per-asset signal generation.
Empirical results reveal that hybrid and gated architectures like VLSTM and xLSTM outperform standard models with higher Sharpe ratios and lower drawdowns.

Benchmarking Deep Learning Architectures for Financial Time Series: Risk-Adjusted Performance

Introduction

This study presents a systematic and large-scale empirical benchmark of modern deep learning (DL) architectures for financial time-series prediction and position sizing, emphasizing risk-adjusted performance via Sharpe ratio maximization (2603.01820). The evaluation covers a 15-year window (2010–2025) over a diversified universe of futures contracts—commodities, energy, equity indices, bonds, and FX. The analysis prioritizes not only mean-variance metrics but also downside risk, tail exposure, robustness to experimental variation, and computational efficiency. The research delineates the impact of architectural inductive biases, evaluates cross-class model variants (linear, recurrent, SSM, transformer, hybrid), and investigates risk and cost robustness under realistic, non-stationary market conditions.

Modeling Framework

The unified end-to-end pipeline consists of two main stages: a sequence-processing model transforms engineered features (statistical and technical indicators derived from historical prices) into hidden representations; a projection head (linear/tanh) outputs per-asset signals bounded in $[-1,1]$ for position sizing. Training is performed by direct minimization of negative annualized Sharpe ratio over rolling windows, with volatility targeting for risk normalization across instruments. Ticker embeddings are included for enhanced per-asset adaptation.

Figure 1: The pipeline for end-to-end portfolio optimization; technical indicators are mapped to position signals via DL models and direct Sharpe-ratio-based training.
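The training objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the 252-day annualization factor, and the use of plain Python lists instead of a differentiable tensor library are all assumptions made for readability.

```python
import math

TRADING_DAYS = 252  # assumed annualization factor for daily data


def tanh_signal(raw_score):
    """Projection head: squash a raw model score into a position in [-1, 1]."""
    return math.tanh(raw_score)


def negative_annualized_sharpe(signals, scaled_returns):
    """Loss = -(annualized Sharpe) of the strategy returns.

    signals[t] is the position held while scaled_returns[t] is realized;
    the returns are assumed to be volatility-scaled already.
    """
    pnl = [s * r for s, r in zip(signals, scaled_returns)]
    n = len(pnl)
    mean = sum(pnl) / n
    var = sum((x - mean) ** 2 for x in pnl) / n
    std = math.sqrt(var)
    if std == 0.0:
        return 0.0  # Sharpe is undefined; treated as neutral in this sketch
    return -math.sqrt(TRADING_DAYS) * mean / std


# Hypothetical raw scores and next-period returns for one asset.
raw_scores = [0.5, -1.2, 2.0, 0.1]
next_returns = [0.01, -0.02, 0.015, -0.005]
signals = [tanh_signal(x) for x in raw_scores]
loss = negative_annualized_sharpe(signals, next_returns)
```

In an actual training loop the same arithmetic would be written with an autodiff framework so that gradients flow from the loss back through the tanh head into the sequence model.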

Architecture Classes

Baselines: Linear Models

AR(1) and DLinear/NLinear serve as reference points. Despite competitive performance in isolated, high-volatility regimes, instability and regime-sensitivity limit their persistent risk-adjusted utility.

Transformer-Based Architectures

Standard and specialized transformers (iTransformer, PatchTST) are included. While these have excelled on high-SNR, seasonal benchmarks, their utility in low-SNR financial data is inconsistent: temporal abstraction is not explicit, which leads to overfitting and instability, with moderate or subpar risk-adjusted returns.

State-Space Models (Mamba/Mamba2)

Selective SSM variants (Mamba and Mamba2, with static HiPPO transitions), theoretically efficient for long-range dependencies, are probed for empirical robustness. Without explicit feature selection or dynamic state adaptation, empirical gains are limited and regime dependence is pronounced.

Recurrent Models and Gated Extensions

LSTM, xLSTM, and PsLSTM form the core of recurrent, adaptively-gated state models. xLSTM incorporates exponential gating, improved memory normalization, and matrix-valued memory, mitigating vanishing gradients and enhancing rare-signal retention.
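To make "exponential gating" and "memory normalization" concrete, here is a scalar sketch of one recurrent step in the spirit of the sLSTM cell from the xLSTM literature. It is an illustration under assumptions: the forget gate is taken to be exponential (a sigmoid forget gate is also used in practice), the matrix-valued memory variant is not shown, and the function and argument names are hypothetical rather than taken from the paper's code.

```python
import math


def slstm_step(c_prev, n_prev, m_prev, z, i_pre, f_pre, o_pre):
    """One scalar step with exponential input/forget gates.

    c: cell state, n: normalizer state, m: log-space stabilizer.
    z is the (already squashed) candidate input; *_pre are gate pre-activations.
    """
    log_i = i_pre            # input gate is exp(i_pre)
    log_f = f_pre            # assumed exponential forget gate, exp(f_pre)
    # The stabilizer keeps every exponent <= 0, so exp() cannot overflow.
    m = max(log_f + m_prev, log_i)
    i_gate = math.exp(log_i - m)
    f_gate = math.exp(log_f + m_prev - m)
    c = f_gate * c_prev + i_gate * z
    n = f_gate * n_prev + i_gate       # normalizer accumulates gate mass
    o_gate = 1.0 / (1.0 + math.exp(-o_pre))
    h = o_gate * c / n if n != 0.0 else 0.0  # guard for the empty-state edge case
    return c, n, m, h


# A very large input-gate pre-activation would overflow a naive exp();
# with the stabilizer it is handled without error.
c, n, m, h = slstm_step(0.0, 0.0, 0.0, 0.5, 500.0, 0.0, 0.0)
```

Dividing the cell state by the normalizer is what keeps the hidden state bounded even when the exponential input gate assigns a very large weight to a rare observation.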

Hybrid Models

Hybrids embed variable selection networks (VSN) or combine recurrent encoders with patching/attention (VLSTM, VSN+Mamba2, LPatchTST, VxLSTM). These achieve robust representation compression, feature selection, and structure-aware temporal state encoding.
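The feature-selection idea behind a VSN can be illustrated with a deliberately simplified gate. In the Temporal Fusion Transformer formulation, the selection logits and the per-feature transformations are produced by gated residual networks; the sketch below skips those sub-networks and takes the logits as given, so it shows only the softmax weighting step. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_features(features, selection_logits):
    """Weight each input feature by its softmax selection weight.

    Simplification: a real VSN learns the logits (and a nonlinear
    transform of each feature) from the inputs themselves.
    """
    weights = softmax(selection_logits)
    gated = [w * f for w, f in zip(weights, features)]
    return weights, gated


# Three hypothetical indicator values; the second is favoured by the logits.
weights, gated = select_features([0.8, -1.5, 0.2], [0.1, 2.0, -1.0])
```

Because the weights sum to one, a feature can only gain influence at the expense of the others, which is the mechanism that suppresses noisy indicators.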

Temporal Fusion Transformer

TFT fuses explicit recurrence (LSTM), feature selection, and attention, enhancing interpretability and robustness—providing a competitive composite inductive bias for multi-horizon financial time-series modeling.

Empirical Performance

Regime-Robustness and Aggregate Results

Nonlinear sequence models significantly outperform linear baselines in both average and downside-adjusted measures, with improved intertemporal stability and less regime-sensitivity.

Figure 2: Comparative PnL paths (volatility-rescaled) for all model classes—hybrid gated/sequence models dominate cumulative returns and Sharpe.

VLSTM is empirically dominant, with a Sharpe ratio of 2.40 for 2010–2025, an annualized return of 24%, and a HAC $t$-stat of 8.81. LPatchTST and TFT also decisively outperform standard transformers, SSMs, and LSTM alone, each offering a Sharpe ratio above 2 and consistent risk-adjusted excess over passive benchmarks.
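For readers unfamiliar with HAC statistics, the following sketch computes a Newey-West style $t$-statistic for the mean of a return series using Bartlett kernel weights. The lag truncation used in the paper is not stated in this summary, so `max_lag` is an assumption the caller must choose, and the function name is illustrative.

```python
import math


def hac_t_stat(returns, max_lag):
    """t-statistic of the mean with a Newey-West (Bartlett) long-run variance."""
    n = len(returns)
    mean = sum(returns) / n
    dev = [r - mean for r in returns]

    def autocov(k):
        # Biased (1/n) autocovariance estimate at lag k.
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    long_run_var = autocov(0)
    for k in range(1, max_lag + 1):
        weight = 1.0 - k / (max_lag + 1)   # Bartlett kernel weight
        long_run_var += 2.0 * weight * autocov(k)
    return mean / math.sqrt(long_run_var / n)


sample = [0.01, -0.005, 0.02, 0.0, 0.015]
t_plain = hac_t_stat(sample, max_lag=0)   # reduces to the ordinary t-stat
t_hac = hac_t_stat(sample, max_lag=2)
```

With `max_lag=0` the statistic ignores serial correlation entirely; positive autocorrelation in strategy returns would inflate that naive value, which is why a HAC correction is reported.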

Generic transformers and naive SSMs are unstable, even with theoretical advantages in memory scaling, and deliver inferior average and downside-adjusted results compared to specialized gated or hybrid architectures.

xLSTM achieves the largest realized breakeven transaction cost buffer due to lower turnover, indicating improved cost-robustness. Gated memory adaptation affords LSTM-family models resilience not only in average performance but also in tail exposure and turnover sensitivity.

Downside Risk, Tail Behavior, and Robustness

VLSTM, xLSTM, and LPatchTST consistently achieve lower drawdowns and superior Calmar ratios (VLSTM: 1.15, xLSTM: 1.35, LPatchTST: 1.47) compared to other DL and linear strategies. Minimum annual Sharpe ratios and worst-period metrics confirm tail-robustness across volatile regimes.

Figure 3: Distribution of daily returns—revealing pronounced leptokurtosis and highlighting the tail risks intrinsic to the financial dataset.

Figure 4: Realized volatility distribution (log scale) exhibits marked right skew and upper-tail concentration.
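The downside and tail measures referred to in this subsection can be computed as follows. This is a sketch under stated assumptions: cumulative PnL is treated as additive (not compounded), which is a common convention for volatility-targeted strategies but is not confirmed by this summary, and CVaR is taken as the mean of the worst `alpha` fraction of daily returns.

```python
def max_drawdown(returns):
    """Largest peak-to-trough decline of the additive cumulative PnL."""
    cumulative, peak, worst = 0.0, 0.0, 0.0
    for r in returns:
        cumulative += r
        peak = max(peak, cumulative)
        worst = max(worst, peak - cumulative)
    return worst


def calmar_ratio(returns, periods_per_year=252):
    """Annualized mean return divided by maximum drawdown."""
    annual_return = periods_per_year * sum(returns) / len(returns)
    drawdown = max_drawdown(returns)
    return annual_return / drawdown if drawdown > 0 else float("inf")


def cvar(returns, alpha=0.05):
    """Mean of the worst alpha fraction of returns (negative for losses)."""
    k = max(1, int(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


daily = [0.02, -0.01, -0.03, 0.04, 0.01]
mdd = max_drawdown(daily)
calmar = calmar_ratio(daily)
tail = cvar(daily)
```

Calmar rewards strategies whose worst cumulative loss is small relative to their average return, so two models with equal Sharpe ratios can still differ materially on this measure.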

Model rankings are stable under reduced seed averaging, indicating gains are not due to excessive hyperparameterization or initialization luck, but reflect persistent architectural advantages.
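Seed averaging, as described elsewhere on this page, keeps the top S runs by validation loss and averages their signals. A minimal sketch of that selection-and-averaging step is shown below; the data layout (one validation loss and one signal list per seed) and the function name are assumptions for illustration.

```python
def ensemble_top_seeds(val_losses, seed_signals, top_s):
    """Average the signals of the top_s seeds with the lowest validation loss.

    val_losses[i]   : validation loss of seed i (lower is better)
    seed_signals[i] : that seed's signal series, all of equal length
    """
    ranked = sorted(range(len(val_losses)), key=lambda i: val_losses[i])
    chosen = ranked[:top_s]
    horizon = len(seed_signals[0])
    averaged = [
        sum(seed_signals[i][t] for i in chosen) / len(chosen)
        for t in range(horizon)
    ]
    return chosen, averaged


# Four hypothetical seeds; losses are negative Sharpe ratios.
losses = [-1.2, -2.0, -0.5, -1.8]
signals_by_seed = [[1.0, 1.0], [0.5, -0.5], [-1.0, -1.0], [0.1, 0.3]]
chosen, averaged = ensemble_top_seeds(losses, signals_by_seed, top_s=2)
```

Averaging across seeds tends to shrink idiosyncratic signal swings, which is one reason ensembling can lower turnover as well as variance.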

Cross-Asset and Category-Level Analysis

Hybrid and xLSTM-type models deliver robust cross-sectional profitability across asset classes—not just localized to specific contracts or categories. Per-asset PnL traces and Sharpe distributions demonstrate consistent alpha generation without concentration risk.

Figure 5: VLSTM cumulative PnL per asset in Commodities Futures—no single asset dominates, indicating broad-based outperformance.

Figure 6: VLSTM Sharpe ratio distribution (box plot) for Commodities Futures—median and spread reflect cross-asset robustness.

xLSTM achieves high breakeven transaction costs on liquid contracts (e.g., ES, ZN) due to substantially reduced turnover.

Figure 7: xLSTM per-asset PnL across Commodities Futures, indicating consistent performance in a high-volatility subuniverse.

Figure 8: xLSTM annual Sharpe ratio (box plot) for Commodities Futures—tight central mass, limited downside dispersion.

Practical and Theoretical Implications

Results demonstrate that structured temporal abstraction, adaptive gating, and robust feature selection are key for denoising, regime adaptation, and risk-adjusted learning in non-stationary, low-SNR financial environments. Asymptotic memory/compute efficiency (e.g., SSMs, transformers) does not translate into risk-adjusted economic superiority without property-aligned inductive bias.

For systematic trading or quantitative asset allocation, models such as VLSTM, xLSTM, and robust hybrids achieve favorable trade-offs between average return, tail exposure, and implementation cost. Their risk-adjusted advantage is rooted in denoising capability, asset-specific memory, and adaptive attention to structural shifts.

The research highlights open directions: (i) further architectural development targeting noise suppression under regime switches, (ii) application to alternative asset classes and higher-frequency data, (iii) integration with market microstructure features for execution-aware modeling, and (iv) studying generalization under explicit cost and liquidity constraints.

Conclusion

This benchmark establishes that linear models, while occasionally competitive in isolated market conditions, are insufficient for persistent, risk-adjusted alpha generation in financial time series due to non-stationarity and structural breaks. The most effective DL architectures are those embodying explicit state persistence, structured gating, and adaptive compression of feature information, as realized in hybrids such as VLSTM and xLSTM. These models outperform both standard transformers and naïve SSMs not only in Sharpe but also in tail-risk management and cost-resilience. Performance stability under reduced experimental budgets supports the robustness and real-world implementability of the proposed approaches. The study provides a rigorous empirical foundation for informed architectural advances in deep learning for quantitative finance.

Explain it Like I'm 14

What is this paper about?

This paper tests many modern AI models to see which ones are best at turning noisy financial data into trading signals that make steady, risk-aware profits. The authors focus on “risk-adjusted” performance, especially a score called the Sharpe ratio, which rewards higher returns but penalizes bumpy, risky ups and downs.

In short: they compare lots of deep learning models on 15 years of daily market data (2010–2025) to find out which ones make the most reliable trading decisions across many kinds of markets (commodities, stock indexes, bonds, and currencies).

The main questions the researchers asked

Which kinds of models are best at reading financial time series (numbers that change over time, like prices) and producing good trading signals?
Do fancy, general-purpose AI models (like Transformers) work well in finance, or do models designed to handle time and noise work better?
How stable are the results over different market periods (calm years vs. wild years)?
How do trading costs (paying a “fee” each time you change your position) affect the profits?
Are the results robust, meaning they don’t depend on lucky random starts or special settings?

How did they do the study?

The data

They used daily information from futures markets across:

Commodities (like metals or agriculture),
Energy,
Stock indexes,
Government bonds,
Currencies (FX).

Futures are standardized contracts that bet on the future price of something. The data covered 2010–2025, including calm and turbulent periods. Financial data is hard: it’s noisy (lots of randomness), patterns are weak, and what worked last year might not work this year.

The models they tested

They compared simple and advanced approaches. Here’s a plain-English guide:

Linear models (like simple rules): fast and simple, but not very flexible.
Recurrent neural networks (RNNs), especially LSTMs and xLSTMs: these have “memory,” so they can remember patterns over time.
Transformers and patch-based Transformers (like PatchTST): models that look across the whole sequence at once; patching means grouping days together to smooth noise.
State-space models (like Mamba/Mamba2): models that keep a compact “summary” of the past, designed to handle long sequences efficiently.
Hybrids (like VSN+LSTM, LSTM+PatchTST, VSN+xLSTM): combinations that first pick the most useful features (using a Variable Selection Network, or VSN) and then apply a strong time-learning model.

A useful analogy:

Think of the data as a long, messy song.
- Linear models are like using a single, straight beat to follow the rhythm.
- LSTMs are like musicians with good memory, remembering earlier themes.
- Transformers listen to many parts of the song at once.
- State-space models keep a neat notebook summary of what’s happened.
- VSN is like a volume knob that turns down noisy instruments and turns up the helpful ones.

How the trading signals worked

Each model looks at a window of past days and outputs a signal between -1 and +1.
- +1 means “go fully long” (bet the price will go up).
- -1 means “go fully short” (bet the price will go down).
They used “volatility targeting,” which is like cruise control for risk: if a market is very jumpy, the position size is reduced; if it’s calmer, it can be larger. This keeps overall risk more consistent.
They trained models to directly maximize the Sharpe ratio, not just to predict tomorrow’s price. The Sharpe ratio is like “points per risk”: higher is better if you can get returns without too much shakiness.
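Here is a small sketch of the "cruise control" idea. It estimates how jumpy a market has been with an exponentially weighted moving average (EWMA) of squared returns and then scales the position so the expected risk stays near a target. The 10% target and the use of EWMA are mentioned elsewhere on this page; the decay of 0.94, the 252-day year, and the function names are assumptions made for the example.

```python
import math


def ewma_volatility(returns, decay=0.94, periods_per_year=252):
    """Annualized EWMA volatility; vols[t] uses returns up to and including t."""
    var = returns[0] ** 2  # seed the recursion with the first squared return
    vols = []
    for r in returns:
        var = decay * var + (1.0 - decay) * r * r
        vols.append(math.sqrt(var * periods_per_year))
    return vols


def scaled_positions(signals, returns, target_vol=0.10, decay=0.94):
    """Scale each signal by target_vol / yesterday's volatility estimate."""
    vols = ewma_volatility(returns, decay)
    positions = [0.0]  # no volatility estimate exists before the first return
    for t in range(1, len(signals)):
        prev_vol = vols[t - 1]  # lagged by one day to avoid look-ahead
        scale = target_vol / prev_vol if prev_vol > 0 else 0.0
        positions.append(signals[t] * scale)
    return positions


rets = [0.01, -0.01, 0.01]
sigs = [1.0, 1.0, -1.0]
pos = scaled_positions(sigs, rets)
```

When recent returns are large the volatility estimate rises and the scale factor shrinks, so the same signal of +1 turns into a smaller position in a jumpy market.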

How they judged the models

They went beyond simple average returns. They looked at:

Sharpe ratio and statistical tests (to check if results are real, not luck),
Downside risk (how bad the worst stretches were),
Tail risk (how bad rare, extreme losses could be),
Turnover and transaction costs (how much trading you do and how much that would cost),
Robustness (do results hold across different years and random starting points?),
Speed/efficiency (how heavy the models are to run).

They also checked “breakeven transaction cost,” which is the highest trading fee you could pay before the model’s profits disappear. Think of it as: how much “friction” can this strategy handle and still make money?
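One simple way to compute such a breakeven cost is to divide the total profit before costs by the total amount of trading. The sketch below assumes a cost that is proportional to the size of each position change; the paper's exact definition may differ, and the function name is made up for this example.

```python
def breakeven_cost(positions, returns):
    """Cost per unit of turnover at which net profit falls to zero.

    Assumes cost_t = c * |position_t - position_{t-1}|, starting from a flat book.
    """
    gross = sum(p * r for p, r in zip(positions, returns))
    previous = [0.0] + list(positions[:-1])
    turnover = sum(abs(cur - prev) for prev, cur in zip(previous, positions))
    if turnover == 0.0:
        return float("inf") if gross > 0 else 0.0
    return gross / turnover


held = [1.0, 1.0, -1.0, -1.0]
rets = [0.01, 0.02, -0.01, -0.005]
c_star = breakeven_cost(held, rets)
```

A strategy that earns the same profit with fewer position changes has a higher breakeven cost, which is why lower turnover makes a model more robust to trading fees.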

What did they find?

Models that are built to handle time well and reduce noise consistently beat simple linear models and many general-purpose deep models.
The best results came from hybrids that first select useful features and then use strong time-memory tools.
- VSN + LSTM (often called VLSTM) had the highest overall Sharpe ratio across 2010–2025.
- LSTM + PatchTST and VSN + xLSTM also did very well, especially on downside risk (they handled bad times better).
- xLSTM had the largest “breakeven transaction cost,” meaning it was more resilient to trading fees and frequent rebalancing.
Linear models sometimes did okay in certain years (especially very volatile ones) but were not reliable over the whole period.
Generic Transformers and some state-space models weren’t consistently strong in this financial setting. In finance, the signal is weak and noisy, so models that can selectively remember and filter noise have an edge.
Results stayed strong even when the team reduced the number of random runs (“seeds”), which suggests the findings are robust, not just due to lucky training runs.

Why this matters: it shows that in finance, where real patterns are faint and change over time, architecture choices that highlight useful signals and store them wisely are key.

Why it matters and what could happen next

For researchers and practitioners: If you want steadier, risk-aware trading performance, favor models that:
- Select important features (VSN),
- Keep adaptive, long-term memory (LSTM/xLSTM),
- Smooth out noise (patching or preprocessing),
- And train directly on what you ultimately care about (risk-adjusted returns like the Sharpe ratio).
For real-world trading: xLSTM’s stronger buffer against transaction costs hints it could be more practical when fees and slippage matter, while VLSTM’s high Sharpe suggests strong overall “quality” of signals.
For future work: The study used a particular dataset and setup. Testing on other markets, different time scales (like hourly or minute data), and changing cost assumptions would help confirm how universal these findings are.

In one sentence: The paper shows that smart “memory-and-selection” deep learning models, trained to optimize steady returns rather than pure prediction accuracy, can make more robust trading decisions in the noisy world of finance.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper provides a broad benchmark, but several aspects remain uncertain or unexplored. Future work could address the following gaps:

External validity limited to a single vendor’s cross-asset futures/FX dataset at daily frequency (2010–2025); no tests on other markets (e.g., single stocks, options), regions, or intraday data to assess portability.
Universe construction and survivorship bias are not fully specified (e.g., inclusion rules, handling of delistings/contract switches); sensitivity to alternative continuous-contract methodologies (ratio vs. difference back-adjustment, front-month selection rules) is not examined.
Feature scope appears confined to price-based statistical/technical indicators; the incremental value of exogenous information (macro, carry/roll/premium signals, COT positioning, fundamentals, alternative data) is not evaluated.
No comparison to strong finance baselines (e.g., canonical time-series momentum, carry, or simple trend-following rules with the same volatility targeting), leaving the incremental alpha over standard rule-based strategies unclear.
Portfolio construction is effectively per-asset with scalar signals, equal risk allocation, and no cross-asset optimization; benefits of multi-output models that jointly consider covariance, sector structure, or cross-asset lead–lag effects are not assessed.
Training objective optimizes a pooled sample Sharpe using batch-aggregated returns; sensitivity to objective design (per-asset vs. pooled objectives) and comparisons to alternative differentiable risk objectives (Sortino, CVaR, drawdown penalties, utility-based losses) are missing.
Transaction costs are set to c_k=0 during training and primary evaluation; only post-hoc, constant per-asset breakeven costs are reported. Realistic, time-varying execution frictions (bid–ask spreads, slippage, market impact, fees), capacity limits, and net performance under plausible cost models are not analyzed.
High turnover (often ~700–1000 annualized) raises questions about feasibility under liquidity, tick size, and lot-size constraints; capacity/scalability by asset and the trade-off between turnover and performance are not quantified.
Execution assumptions (signal at close, trade effective next bar) are not stress-tested against close-to-open execution, overnight gaps, limit moves, or partial fills; latency and slippage sensitivity remain unmeasured.
Volatility targeting is fixed at 10% using EWMA; the dependence of results on target volatility, EWMA decay, and alternative risk estimators (e.g., GARCH, realized vol) is not explored.
Statistical inference reports HAC t-stats but does not correct for multiple model comparisons or seed selection (no SPA/Reality Check/DM tests); uncertainty around model ranking and the probability of outperformance is not quantified with bootstraps or confidence intervals.
Seed selection ensembles the top S seeds by validation loss (from 50 or 25 runs), introducing selection bias; the single-seed performance distribution and deployable variance are not fully characterized.
Hyperparameter fairness and tuning parity across architectures are under-specified (lookback L, hidden size H, depth, patch length, parameter counts, learning rates, early stopping); negative results for transformers/SSMs may reflect suboptimal tuning rather than inherent limits.
Compute efficiency is discussed qualitatively, but training time, memory footprint, and energy cost per model are not reported; practical deployability at scale is unclear.
Model- and component-level ablations are limited: the incremental contributions of VSN, LSTM denoising, patching, ticker embeddings, and other architectural choices are not isolated in controlled experiments.
SSM implementation choices (e.g., static HiPPO transitions without horizon jitter) are restrictive; performance of alternative SSMs (S4/DSS/GSS, adaptive horizons, diagonal/low-rank variants) and their tuning space remain unexplored.
Non-stationarity handling is implicit; there are no explicit tests for regime-change robustness (e.g., covariate shift diagnostics, rolling re-tuning cadence comparisons, pre-registered forward-only tests with frozen hyperparameters).
Interpretability is not pursued: feature/ticker importances from VSN, temporal saliency, or factor attribution (exposure to trend, carry, equity/bond betas) are not analyzed; economic plausibility of learned patterns remains opaque.
Sensitivity analyses are sparse: key design choices (lookback window L, patch length, tanh output bounds, EWMA λ, target volatility) and their effects on both performance and turnover are not reported.
Cross-sectional dependence is not explicitly modeled (beyond shared parameters and embeddings); whether attention across assets or covariance-aware decoders improves performance is an open question.
Robustness to data quality issues (missing values, roll date conventions, calendar mismatches, holidays) and preprocessing choices (standardization windows, de-meaning) is not evaluated.
Risk and financing frictions (futures margining, collateral returns, exchange/clearing fees, funding costs) are omitted; whether rankings persist under realistic financing environments is unknown.
Generalization across horizons and tasks is untested (only one-step-ahead daily signals); multi-horizon forecasting, event-conditioned strategies, and alternative targets (directional accuracy, volatility forecasting) are left open.
Reproducibility is limited: code is “available upon request,” and several implementation specifics (exact architecture configs, training schedules, rolling window lengths, evaluation splits) are not exhaustively documented for independent replication.
Comparisons to strong non-deep ML baselines (e.g., gradient-boosted trees, random forests, regularized linear models with engineered features) are absent; the necessity of deep architectures versus shallow, interpretable models is not established.
Training stability and regularization (gradient variance under Sharpe loss, weight decay, dropout, early stopping criteria) are not reported; convergence behavior and failure modes under different seeds are unclear.
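As an illustration of how the inference gap noted above (no bootstrapped uncertainty around Sharpe ratios) might be addressed, the sketch below builds a confidence interval with a circular block bootstrap. This is not something the paper reports; the block length, number of resamples, and function names are all assumptions, and the resulting interval would not by itself correct for multiple model comparisons.

```python
import math
import random


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio using the population standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return math.sqrt(periods_per_year) * mean / math.sqrt(var) if var > 0 else 0.0


def block_bootstrap_sharpe_ci(returns, block_len=20, n_boot=1000,
                              level=0.95, seed=0):
    """Percentile confidence interval for the Sharpe ratio.

    Resamples contiguous blocks (wrapping around the end of the series)
    so that short-range serial dependence is preserved.
    """
    rng = random.Random(seed)
    n = len(returns)
    stats = []
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:
            start = rng.randrange(n)
            sample.extend(returns[(start + j) % n] for j in range(block_len))
        stats.append(sharpe(sample[:n]))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    return stats[lo_idx], stats[hi_idx]
```

A call such as `block_bootstrap_sharpe_ci(daily_returns, block_len=20)` returns a lower and an upper bound; overlapping intervals between two models would suggest that their ranking is not statistically settled.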

Practical Applications

Practical Applications of “Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance”

Below are concrete, real-world applications derived from the paper’s findings, methods, and innovations. They are grouped into Immediate Applications and Long-Term Applications. Each item notes sectors, examples of tools/products/workflows, and key assumptions/dependencies affecting feasibility.

Immediate Applications

These can be deployed now with realistic effort in professional and academic environments.

Sharpe-optimized signal engines for cross-asset futures trading
- Sectors: Finance (buy-side quant, CTAs/managed futures, prop trading)
- Tools/workflows: Adopt VLSTM, LPatchTST, or xLSTM as core signal generators; implement the modular pipeline (feature extraction → sequence model → tanh projection → volatility targeting); train via negative Sharpe loss with pooled-batch Sharpe as per DeePM; deploy daily EOD rebalancing.
- Assumptions/dependencies: Availability of high-quality continuous futures data (2010–2025 analogues), adequate compute for training/seed ensembling, risk controls for position bounds and leverage, awareness that results are conditional on daily data and this backtesting protocol; training ignores transaction costs (gross), so turnover management is critical in production.
Turnover-aware instrument selection using breakeven transaction cost (c*) screening
- Sectors: Finance (portfolio managers, execution desks)
- Tools/workflows: Post-hoc c* computation per asset to filter or size allocations; prioritize assets/contracts with higher c* buffers; pair with turnover dashboards and seed-ensemble position smoothing.
- Assumptions/dependencies: Stationarity of turnover patterns and liquidity; realistic mapping from c* to live costs; scalability limits in illiquid contracts (e.g., Lumber, Oats, Milk III).
Robust signal ensembling to improve stability and reduce trading frictions
- Sectors: Finance (quant research, systematic trading), Software (platforms for quant MLOps)
- Tools/workflows: Train multiple seeds, select top S by validation Sharpe, average signals to lower variance and turnover; integrate with CI/CD pipelines for research-to-production promotion.
- Assumptions/dependencies: Sufficient compute budget for multi-seed training; monitoring for overfitting in seed selection; governance to prevent “seed mining.”
Regime-robust backtesting and reporting standards
- Sectors: Finance (risk, validation), Policy (regulatory guidelines), Academia (methodology pedagogy)
- Tools/workflows: Standardize reports including HAC-adjusted t-stats, subperiod Sharpe, downside/tail metrics (MaxDD, Calmar, CVaR), and passive-relative diagnostics; include seed-robustness experiments as part of model approval.
- Assumptions/dependencies: Organizational buy-in to extend backtesting templates; data coverage across multiple regimes (e.g., post-GFC, 2020 volatility).
Upgrades to momentum/trend-following systems with hybrid recurrent + patching
- Sectors: Finance (managed futures, multi-asset)
- Tools/workflows: Replace traditional time-series momentum rules with LPatchTST or VLSTM-based signals; retain volatility targeting and risk budgets; evaluate strategy overlays on existing CTA stacks.
- Assumptions/dependencies: Consistency of performance when mapped to live execution; realistic slippage and carry/roll adjustments; position scaling under [-1, 1] bounds may need modification.
Academic benchmarking and curriculum modules for low-SNR time series
- Sectors: Academia (ML for finance), Education
- Tools/workflows: Use the paper’s architecture suite (linear, RNN, transformer, SSM, hybrids) and evaluation protocol to teach pitfalls of generic benchmarks; assign Sharpe-optimized training labs; encourage ticker embeddings and feature selection (VSN) for asset-specific learning.
- Assumptions/dependencies: Access to similar datasets (or public proxies), code access (available upon request), compute in teaching labs.
Buy-side quant research workflow templates emphasizing inductive bias
- Sectors: Finance (quant R&D), Software (internal libraries)
- Tools/workflows: Template repositories implementing VLSTM/xLSTM/LPatchTST with ticker embeddings, volatility targeting, and pooled Sharpe loss; comparison harness across regimes; ablation frameworks for inductive bias (feature selection vs. attention vs. recurrence).
- Assumptions/dependencies: Team familiarity with PyTorch/JAX; data engineering for EOD features; integration with existing backtesting engines.
Risk dashboards prioritizing tail behavior and downside robustness
- Sectors: Finance (risk management, product oversight), Investor relations
- Tools/workflows: Integrate worst-3m Sharpe, min-annual Sharpe, CVaR(5%), and Calmar into live dashboards; flag strategy behavior in volatile regimes; compare LSTM variants vs. hybrids on risk-adjusted basis.
- Assumptions/dependencies: Histories long enough to compute stable tail estimates; regular recalibration under regime shifts.
Sell-side/fintech analytics products for model diagnostics
- Sectors: Finance (sell-side research, data vendors), Software
- Tools/workflows: Offer “Sharpe-optimized model checkups” that score client strategies against benchmark architectures; provide turnover vs. performance frontiers and c* estimates; produce per-asset alpha maps.
- Assumptions/dependencies: Client data sharing and confidentiality; clear disclaimers on generalization beyond backtests.
Corporate treasury and asset–liability overlays for hedging programs
- Sectors: Corporate finance (treasury), Commodities/FX hedgers
- Tools/workflows: Use xLSTM/VLSTM-derived signals as overlays for timing hedge ratios (reduce/increase hedge when signals are strong/weak) with strict risk controls; focus on liquid instruments with favorable c*.
- Assumptions/dependencies: Strict governance and limits; careful mapping from speculative signals to hedging objectives; conservative turnover targets.

Long-Term Applications

These require further research, scaling, or development before broad deployment.

Differentiable transaction-cost–aware training and multi-objective optimization
- Sectors: Finance (systematic trading, execution research), Software (AutoML for finance)
- Tools/workflows: Extend loss to include differentiable proxies for turnover/slippage; jointly optimize Sharpe, drawdown, and CVaR; adaptively penalize turnover by asset liquidity.
- Assumptions/dependencies: Reliable cost models and differentiable approximations; robust execution–signal feedback loops; nonstationary cost regimes.
Real-time and intraday extensions with microstructure features
- Sectors: Finance (intraday/short-horizon trading), Market making
- Tools/workflows: Adapt VLSTM/xLSTM/LPatchTST to intraday bars or LOB features; handle non-synchronous multi-asset data; latency-aware deployment.
- Assumptions/dependencies: High-frequency data infrastructure; strict latency and reliability constraints; risk of strategy crowding.
Foundation models and representation learning for financial sequences
- Sectors: Finance (platform research groups), Academia (time-series foundation work), Software (model hubs)
- Tools/workflows: Pretrain hybrid recurrent models (VSN+LSTM/xLSTM) on large multi-asset, multi-frequency corpora; fine-tune for specific tasks (trend-following, mean reversion, risk forecasting); explore few-shot generalization (building on X-Trend lineage).
- Assumptions/dependencies: Large-scale curated datasets and compute; prevention of data leakage and survivorship bias; stability under regime shifts.
Cross-domain applications in energy, commodities, and supply chain risk
- Sectors: Energy/utilities, Commodity trading/hedging, Manufacturing procurement
- Tools/workflows: Apply patching + recurrence architectures to forecast/hedge exposures (e.g., fuels, power, FX); integrate with procurement schedules and storage constraints; optimize hedge timing akin to portfolio weights.
- Assumptions/dependencies: Domain-specific features (weather, load, shipment data); reliable mapping from forecasts to actionable hedges; operational constraints.
AI model governance frameworks for capital markets
- Sectors: Policy/regulation, Financial compliance and model risk management (MRM)
- Tools/workflows: Codify requirements for seed-robustness tests, subperiod reporting, HAC statistics, and c* analysis; develop standardized disclosures for risk-adjusted optimization and turnover profiles.
- Assumptions/dependencies: Regulatory consensus and industry adoption; clarity on how Sharpe-optimized training interacts with investor suitability and risk disclosures.
AutoML for low-SNR time-series with inductive bias search
- Sectors: Software (quant platforms), Finance (research tooling), Academia
- Tools/workflows: Automated pipeline that searches across inductive biases (feature selection, patching, recurrence, state-space) and tunes objective trade-offs; delivers ready-to-validate candidates with risk reports.
- Assumptions/dependencies: Guardrails to prevent overfitting/data snooping; computational budgets; reproducibility and audit trails.
Adaptive portfolio construction beyond tanh-limited scalar signals
- Sectors: Finance (multi-asset portfolio management), Insurance (ALM), Endowments
- Tools/workflows: Move from scalar signals and simple volatility targeting to multi-output position vectors with constraints, dynamic risk budgets, and cross-asset interactions; integrate with optimizer layers.
- Assumptions/dependencies: Differentiable optimizers with constraints; robust estimation of cross-asset covariance; interpretable allocation logic for committees.
Stress-testing frameworks driven by learned regime embeddings
- Sectors: Finance (risk), Policy (macroprudential oversight)
- Tools/workflows: Use model states/ticker embeddings to infer regime labels; replay performance under synthetic shocks; assess model sensitivity to volatility clusters and heavy tails.
- Assumptions/dependencies: Stable mapping from latent states to interpretable regimes; methods to generate plausible shocks; validation data for tail events.
Retail and advisory integration via simplified, low-turnover variants
- Sectors: Wealth management, Robo-advisors
- Tools/workflows: Distill xLSTM-based signals into low-turnover ETF overlays; client-specific risk targets; periodic (e.g., monthly) rebalancing with turnover caps.
- Assumptions/dependencies: Reduced-complexity models that retain efficacy; clear explanations for clients; alignment with suitability and fee structures.
Cross-asset corporate exposure management and dynamic hedging assistants
- Sectors: Corporate treasury, Airlines/shipping, Agri-business
- Tools/workflows: Assist treasurers with dynamic hedge ratio recommendations informed by low-SNR–oriented models; integrate forecasts with cashflow calendars and risk limits.
- Assumptions/dependencies: Accurate exposure measurement; governance for model overrides; robust backtesting against business KPIs.
Open, standardized benchmark suite for low-SNR time series
- Sectors: Academia, Open-source software, Industry consortia
- Tools/workflows: Public datasets and code (when license permits), standardized metrics (Sharpe, HAC t-stats, downside), and established baselines (linear, RNN, transformer, hybrid) to evaluate new methods beyond seasonal, high-SNR benchmarks.
- Assumptions/dependencies: Data licensing, reproducibility across institutions, community maintenance.
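
As an illustration of one standardized metric named in the items above (HAC t-stats), the following is a minimal sketch of a Newey-West style t-statistic for the mean return. It is not the authors' implementation: the function name `hac_t_stat`, the `max_lag` parameter, and the Bartlett-kernel weighting are assumptions chosen for illustration, and the paper's actual lag selection is not stated in this summary.

```python
import numpy as np

def hac_t_stat(returns, max_lag=5):
    """t-statistic of the mean return using a Newey-West (Bartlett-kernel)
    long-run variance. `max_lag` is a hypothetical choice, not the paper's."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    d = r - r.mean()
    # Lag-0 autocovariance (divides by n).
    long_run_var = d @ d / n
    for lag in range(1, max_lag + 1):
        gamma = d[lag:] @ d[:-lag] / n           # lag-`lag` autocovariance
        weight = 1.0 - lag / (max_lag + 1)       # Bartlett weight
        long_run_var += 2.0 * weight * gamma
    return r.mean() / np.sqrt(long_run_var / n)
```

With `max_lag=0` the function reduces to the ordinary t-statistic of the mean (using the population variance), which gives a simple sanity check before comparing models.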

Notes on feasibility across items:

Many immediate applications depend on robust data pipelines (clean continuous futures, feature engineering), compute for training/ensembling, and risk/compliance processes.
The paper’s results are explicitly conditional on daily, cross-asset futures (2010–2025), zero-cost training (gross), volatility targeting at 10%, and the authors’ backtesting choices; generalization to other markets, frequencies, and cost regimes requires revalidation.
Hybrid architectures with strong inductive bias (e.g., VSN+LSTM, LPatchTST, xLSTM) showed the best balance of performance, stability, and turnover—making them priority candidates for near-term adoption.
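
The 10% volatility targeting mentioned in these notes can be sketched as follows. This is a hedged illustration rather than the paper's code: the EWMA `span` of 60 days, the initialisation of the variance, the 252-day annualisation, and the function names are all assumptions, and the leverage is written as target volatility divided by estimated volatility.

```python
import numpy as np

def ewma_volatility(returns, span=60):
    """Daily EWMA volatility; element t uses returns up to and including day t.
    `span` is a hypothetical setting, not taken from the paper."""
    r = np.asarray(returns, dtype=float)
    alpha = 2.0 / (span + 1.0)
    var = np.empty_like(r)
    var[0] = r[0] ** 2                       # assumed initialisation
    for t in range(1, r.size):
        var[t] = (1.0 - alpha) * var[t - 1] + alpha * r[t] ** 2
    return np.sqrt(var)

def vol_scaled_returns(returns, annual_target=0.10, span=60, periods=252):
    """Scale each day's return by leverage computed from the previous day's
    volatility estimate, so the sizing is ex-ante."""
    r = np.asarray(returns, dtype=float)
    sigma_annual = ewma_volatility(r, span) * np.sqrt(periods)
    leverage = annual_target / np.maximum(sigma_annual, 1e-8)  # floor avoids division by zero
    # The position held over day t+1 uses the estimate known at the close of day t.
    return leverage[:-1] * r[1:]
```

Lagging the leverage by one day is the design choice that matters here: sizing with a same-day volatility estimate would leak information from the return being scaled.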


Glossary

AR1x: A per-feature autoregressive model of order 1 used as a minimal temporal benchmark in time-series forecasting. "The AR1x model \cite{ar1} serves as a minimal temporal benchmark, capturing short-term autocorrelation in returns."
AR(n)x: A per-feature autoregressive model of order n serving as a linear baseline in the study. "linear specifications such as AR1x, AR$n$x, DLinear, and NLinear occasionally achieve strong single-year Sharpe ratios"
Back-adjusted (ratio-adjusted backwards) methodology: A procedure for constructing continuous futures contracts by adjusting for roll-induced jumps. "using a ratio-adjusted backwards methodology (i.e., back-adjusted to remove roll-induced price jumps)"
Basis points: A unit equal to one hundredth of a percent (0.01%), commonly used to express transaction costs or yields. "The tables report annualised, volatility-rescaled gross and net returns together with annualised turnover and the implied breakeven transaction cost $c^*$ in basis points."
Breakeven transaction cost: The maximum constant trading friction a strategy can bear before its cumulative profit becomes zero. "we conduct a post-hoc, asset-level breakeven transaction cost analysis"
Breakeven transaction cost buffer: The margin by which a strategy’s performance can absorb transaction costs before profits vanish. "xLSTM demonstrates the largest breakeven transaction cost buffer, indicating improved robustness to trading frictions."
CAGR (Compound Annual Growth Rate): The rate of return that would produce the observed cumulative growth if profits were compounded annually. "presenting compound annual growth rates (CAGR)~\cite{elton_mpt}"
Calmar ratio: A performance metric defined as annualized return divided by maximum drawdown. "including maximum drawdown (Max DD), Calmar ratio (Calmar), worst three-month Sharpe ratio (Worst 3m Sharpe), minimum annual Sharpe ratio (Min Ann. Sharpe), and 5\% conditional value at risk (CVaR 5\%)."
Conditional Value at Risk (CVaR 5%): The expected loss conditional on the loss exceeding the 5% Value-at-Risk threshold, i.e., the average outcome in the worst 5% of cases. "including maximum drawdown (Max DD), Calmar ratio (Calmar), worst three-month Sharpe ratio (Worst 3m Sharpe), minimum annual Sharpe ratio (Min Ann. Sharpe), and 5\% conditional value at risk (CVaR 5\%)."
DeePM framework (regime-robust): A training approach computing objectives over pooled returns to better proxy out-of-sample Sharpe. "Following the regime-robust DeePM framework \cite{deepm_regime_robust}, we compute the Sharpe ratio objective on pooled portfolio returns concatenating all sequences in the batch"
EWMA (Exponentially Weighted Moving Average): A volatility estimator that weights recent observations more heavily than older ones. "We estimate the ex-ante conditional volatility $\sigma_{t,k}$ for each asset using an Exponentially Weighted Moving Average (EWMA) estimator"
Ex-ante conditional volatility: A volatility estimate formed only from data available before the period it applies to, here computed with an EWMA estimator. "We estimate the ex-ante conditional volatility $\sigma_{t,k}$ for each asset using an Exponentially Weighted Moving Average (EWMA) estimator"
FX: Foreign exchange markets or instruments. "foreign-exchange (FX) futures"
HAC (heteroskedasticity and autocorrelation consistent) t-statistics: Inference statistics robust to time-varying variance and serial correlation. "heteroskedasticity and autocorrelation consistent $t$-statistics ($t$ HAC)~\cite{newey_west}"
Heavy-tailed return distributions: Return distributions with higher probability of extreme values than Gaussian. "including heavy-tailed return distributions, volatility clustering, and strong deviations from Gaussianity."
HiPPO (High-order Polynomial Projection Operators): A class of operators that maintain compressed summaries of past information in evolving models. "HiPPO (High-order Polynomial Projection Operators) matrices \cite{hippo}, which maintain compressed representations of past information as the model evolves."
Hit rate (Pesaran–Timmermann): The proportion of correct directional predictions, often evaluated with the Pesaran–Timmermann test. "hit rate (Hit)~\cite{pesaran_timmermann}"
Hyperbolic tangent activation: A bounded activation function mapping real inputs to the open interval (-1, 1), used to constrain trading signals. "followed by a hyperbolic tangent ($\tanh$) activation function to bound the output"
Information ratio: Excess return relative to a benchmark divided by tracking error. "the information ratio (Info. Ratio), HAC $t$-statistic relative to passive ($t$ HAC v Passive), and correlation with passive returns (Corr. v Passive)"
iTransformer: An architecture that applies attention across features instead of time, viewing each feature as a token. "The inverted Transformer \cite{itransformer} applies attention across feature dimensions rather than time, treating each feature as a token."
Leverage factor: A scaling term derived from volatility estimates to equalize risk across assets. "This estimation induces a time-varying leverage factor, defined as $\text{vs\_factor}_{t,k} = \frac{1}{\sigma_{t,k}}$, which dynamically scales position sizes in response to shifting market regimes."
Linear-attention-like behavior: Attention computations with linear complexity in sequence length, improving scalability. "Mamba2 \cite{mamba2} claims to achieve a mathematically principled architecture with linear-attention-like behavior \cite{lin_attn} while supporting arbitrarily large lookback windows."
Lookback window: The fixed number of past timesteps used as input for forecasting. "Given a fixed lookback window of length $L$"
LPatchTST: A hybrid architecture combining LSTM-based denoising with PatchTST’s patch-based attention. "LSTM + PatchTST (LPatchTST). This architecture combines explicit recurrence with attention by using an LSTM as a channel-wise temporal denoiser prior to PatchTST."
Mamba: A selective state-space model with implicit temporal recurrence and linear-time complexity. "Mamba models \cite{mamba, mamba2} belong to the class of selective state-space models (SSMs)~\ref{appdx:mamba}, which maintain a latent state that is updated recursively over time."
Mamba2: A refined Mamba variant with simplified state transitions and increased head dimensionality for stability and speed. "Mamba2 refines this formulation by simplifying the state transition structure and increasing head dimensionality, leading to improved numerical stability and throughput."
Maximum drawdown: The largest peak-to-trough decline experienced by a strategy. "including maximum drawdown (Max DD), Calmar ratio (Calmar), worst three-month Sharpe ratio (Worst 3m Sharpe)"
NLinear: A non-recurrent linear model operating on normalized inputs, used as a time-series baseline. "DLinear and NLinear \cite{nlinear} are non-recurrent linear models that apply learned linear mappings to fixed-length input windows."
PatchTST: A transformer that processes sequences as temporal patches to improve robustness and receptive field. "PatchTST \cite{patchtst} segments the input sequence into temporal patches, which are embedded and processed via self-attention."
PnL (Profit and Loss): The cumulative or per-period monetary gains or losses of a strategy. "before its cumulative PnL is driven to zero."
Projection head: The final module mapping hidden representations to scalar outputs or signals. "The second stage is a unified projection head applied to the terminal hidden state $h_t$."
PsLSTM: A patch-based sLSTM variant combining patching with exponential-gated recurrence for noise robustness. "Patch sLSTM \cite{pslstm} integrates the patching strategy of PatchTST with the recurrent inductive bias of sLSTM."
Regime (market regime): Distinct market conditions or periods with differing dynamics (e.g., volatility or trends). "across time and economic regimes"
Rolling windows (training): A training procedure that iteratively updates models using consecutive windows of data. "All models are trained using rolling windows and evaluated in a fully out-of-sample trading framework."
Sharpe ratio: Risk-adjusted performance metric defined as mean return divided by return volatility. "The loss function $\mathcal{L}(\theta)$ is defined as the negative differentiable annualized Sharpe Ratio"
Sharpe-ratio optimization: Directly optimizing model parameters to maximize Sharpe rather than minimize prediction error. "with a primary focus on Sharpe-ratio optimization."
Signal-to-noise ratio: The relative strength of the predictive signal compared to noise in the data. "These datasets exhibit strong seasonality and a high signal-to-noise ratio, in contrast to financial time series"
sLSTM: Scalar-memory xLSTM variant using normalized exponential gates for long-range retention. "xLSTM comprises two variants: scalar LSTM (sLSTM), which maintains a scalar memory state updated via normalized exponential gates"
State Space Models (SSMs): Models that maintain and update latent states over time to capture sequence dynamics. "belong to the class of selective state-space models (SSMs)~\ref{appdx:mamba}, which maintain a latent state that is updated recursively over time."
TFT (Temporal Fusion Transformer): A model integrating gated recurrence, attention, and variable selection for robust forecasting. "TFT \cite{tft} integrates gated recurrent layers with interpretable attention mechanisms."
Ticker embeddings: Learned representations for asset identifiers to enable per-asset conditioning. "We added ticker embeddings to all the models to enhance the learning per individual ticker/asset~\cite{entity_embeddings}."
Turnover: The amount of portfolio rebalancing over time, often measured in notional units or as xGMV. "turnover, turnover as a multiple of gross market value (xGMV)~\cite{vol_target}"
Variable Selection Networks (VSN): Modules that perform dynamic, feature-wise soft selection and embedding of covariates. "Variable Selection Networks (VSNs)~\ref{appdx:vsn}, inspired by the Temporal Fusion Transformer, are used to perform feature-wise nonlinear embedding and dynamic soft selection of relevant covariates at each time step."
Volatility clustering: The empirical tendency for high-volatility periods to cluster over time. "including heavy-tailed return distributions, volatility clustering, and strong deviations from Gaussianity."
Volatility targeting: Scaling positions to equalize risk contributions by targeting a specified portfolio volatility. "we employ a volatility targeting framework to equalize risk contributions across the universe~\cite{tsmom,vol_target,enhancing_tsmom}."
Volatility-rescaled gross PnL: Profit and loss scaled to a target volatility level for comparability across strategies. "Performance comparison across models 10\% volatility-rescaled gross PnL."
VLSTM: A hybrid of VSN and LSTM used to build compact temporal representations for financial forecasting. "VLSTM combines a VSN with an LSTM encoder to construct a compact temporal state representation."
VxLSTM: A hybrid where VSN-selected features feed into an xLSTM with matrix-valued memory. "VSN + xLSTM (VxLSTM). In this hybrid, VSN-selected representations are directly fed into an xLSTM."
Worst three-month Sharpe: The lowest Sharpe ratio measured over any three-month interval. "including maximum drawdown (Max DD), Calmar ratio (Calmar), worst three-month Sharpe ratio (Worst 3m Sharpe)"
xGMV (turnover as a multiple of gross market value): Turnover normalized by gross market value, indicating trading intensity relative to portfolio size. "turnover as a multiple of gross market value (xGMV)~\cite{vol_target}"
xLSTM: An LSTM variant with exponential gating and memory normalization, enabling improved long-range retention via matrix-valued memory. "xLSTM replaces LSTM’s traditional sigmoid gating with exponential gating and a normalization term, while also employing a memory matrix rather than a scalar value."
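
Several of the performance terms defined in this glossary (Sharpe ratio, maximum drawdown, Calmar ratio, CVaR, breakeven transaction cost) can be made concrete with a short sketch. The function names, the 252-day annualisation, the additive (non-compounded) PnL convention, and the breakeven formula as cumulative gross PnL divided by cumulative turnover are assumptions for illustration, not the authors' exact definitions.

```python
import numpy as np

def sharpe_ratio(returns, periods=252):
    """Annualised mean return divided by annualised volatility."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods) * r.mean() / r.std(ddof=1)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the additive cumulative PnL."""
    pnl = np.concatenate([[0.0], np.cumsum(returns)])
    return float(np.max(np.maximum.accumulate(pnl) - pnl))

def calmar_ratio(returns, periods=252):
    """Annualised return divided by maximum drawdown."""
    r = np.asarray(returns, dtype=float)
    return periods * r.mean() / max_drawdown(r)

def cvar(returns, level=0.05):
    """Mean return over the worst `level` fraction of periods
    (negative when those periods are losses)."""
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.floor(level * r.size)))
    return float(r[:k].mean())

def breakeven_cost_bps(gross_returns, turnover):
    """Constant cost per unit of turnover, in basis points, at which
    cumulative net PnL equals zero (assumed linear cost model)."""
    return 1e4 * float(np.sum(gross_returns)) / float(np.sum(turnover))
```

For example, a strategy with cumulative gross PnL of 0.01 and cumulative turnover of 4 units has a breakeven cost of 25 basis points under this linear cost assumption.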
