- The paper finds that treating the temporal window as a tunable hyperparameter significantly impacts cumulative rewards, with the optimal window length depending on how input features are arranged.
- It demonstrates that a 2D CNN policy with features rearranged by company can exploit longer temporal windows, capturing longer-term dependencies more effectively than the default feature layout.
- The research highlights that optimizing DRL configurations through window size adjustment and feature grouping can enhance real-world automated trading strategies.
This paper explores how the amount of historical data (the "temporal window") fed into a Deep Reinforcement Learning (DRL) model affects its performance in stock trading simulations (2502.12537). The core idea is to treat the length of this temporal window not as a fixed value, but as a tunable hyperparameter for a 2D Convolutional Neural Network (CNN) acting as the policy network within the DRL agent. The study uses the FinRL framework and Proximal Policy Optimization (PPO) for training.
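As a concrete sketch of the windowing idea (the helper name and array layout here are illustrative assumptions, not the paper's code), the 2D observation fed to the CNN for a window of W trading days can be built by slicing the last W daily state vectors:

```python
import numpy as np

# Hypothetical helper: build the 2D observation a CNN policy would see.
# `history` is a (T, num_features) array of daily state vectors; W is the
# tunable temporal window (e.g., 10 trading days for a 2-week window).
def windowed_observation(history: np.ndarray, t: int, W: int) -> np.ndarray:
    """Return the (num_features, W) tensor covering days t-W+1 .. t."""
    assert t + 1 >= W, "not enough history for the requested window"
    return history[t - W + 1 : t + 1].T  # transpose -> (features, time)

# Example: 261 features (SMA dataset), 60 days of history, 2-week window.
history = np.random.rand(60, 261)
obs = windowed_observation(history, t=59, W=10)
print(obs.shape)  # (261, 10)
```

Sweeping W then only changes the width of this tensor, which is what makes it a clean hyperparameter to tune.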
Methodology and Implementation
- DRL Framework: The trading problem is modeled as a Markov Decision Process (MDP).
- State (s): Includes current account balance, prices (p) and holdings (h) for D stocks, and technical/fundamental features (f).
- Action (a): Buy, sell, or hold for each stock.
- Reward (r): Change in portfolio value (stock value pᵀh plus balance b).
- Policy (π(s)): Provided by the CNN, mapping state s to action probabilities.
- Optimization: PPO is used to train the CNN policy to maximize cumulative rewards.
- CNN Policy Network: A 2D CNN is used to process the state information over the selected temporal window.
- Input: The state information for the past W days (where W is the window size in days) is stacked to form a 2D tensor. The exact shape depends on the feature arrangement (see below).
- Architecture: The paper details a specific CNN architecture (Figure 1, Table 1) with multiple convolutional layers (using kernel sizes like 8xN and 4xN with strides), Batch Normalization, ReLU activations, Max Pooling, and fully connected layers at the end. The 2D convolutions process both the feature dimension and the temporal dimension.
- Example Input Tensor (Non-Rearranged, SMA Dataset): For a window size W (e.g., 10 days for a 2-week window), the input might be shaped (num_features, W). With 261 features (Amount, 29 Prices, 29 Holdings, 8*29 Indicators), the input tensor dimension for the CNN would be (261, W). The convolutional kernels slide across this 2D grid.
- Example Input Tensor (Rearranged, SMA Dataset): Features are grouped by company. If there are D companies (e.g., 29) and F features per company (e.g., Price, Holding, MACD, RSI...), the input might be shaped (D, F, W) or flattened to (D * F, W). The paper's Figure 2 shows the rearranged structure visually, grouping columns per company. This allows the CNN kernels to capture interactions within a company's features over time more easily.
- Iterative Window Expansion: The key experiment involves varying the observation window size W from 2 weeks (10 trading days) up to 12 weeks (60 trading days) in 2-week increments. The DRL agent is trained and evaluated for each window size.
- Feature Rearrangement: Two scenarios are tested for each window size and dataset:
- Without Rearrangement: Features are likely ordered by type (e.g., all opening prices, then all closing prices, then all MACD values, etc.).
- With Rearrangement: Features are grouped by company ticker. All features for company 1 appear consecutively, then all features for company 2, and so on (as shown in Figure 2, left). This changes the spatial locality presented to the CNN kernels.
- Datasets: Two datasets based on Dow Jones Index companies are used, differing in their features:
- SMA Dataset: Includes OHLCV prices, volume, day, MACD, Bollinger Bands, RSI, CCI, DX, SMAs, VIX, turbulence. (Table 2, Feature Vector size 261).
- Technical Indicator Dataset: Includes OHLCV prices, volume, and various financial ratios like profit margins, ROA/ROE, liquidity ratios, turnover ratios, debt ratios, PE/PB ratios, dividend yield. (Table 4, Feature Vector size 511).
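The two feature layouts can be illustrated with a small NumPy sketch (the dimensions and the `type_grouped`/`company_grouped` names are illustrative assumptions, not the paper's code; the global Amount feature is omitted for simplicity):

```python
import numpy as np

D, F, W = 29, 8, 10  # companies, per-company features, window length

# Type-grouped layout: all companies' values for feature 0, then all
# values for feature 1, etc.  Shape (F * D, W); row index = f * D + d.
type_grouped = np.arange(F * D * W).reshape(F * D, W)

# Company-grouped ("rearranged") layout: all F features of company 0,
# then company 1, etc.  Shape (D * F, W); row index = d * F + f.
company_grouped = (
    type_grouped.reshape(F, D, W)  # split rows into (feature, company)
    .transpose(1, 0, 2)            # -> (company, feature, time)
    .reshape(D * F, W)
)

# Row d*F + f of the rearranged tensor is row f*D + d of the original.
assert np.array_equal(company_grouped[3 * F + 2], type_grouped[2 * D + 3])
```

After this reshuffle, a small 2D kernel sliding over adjacent rows sees related features of the same company, rather than the same feature across unrelated companies.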
Results and Practical Implications
- Optimal Window Varies: The study finds that the best temporal window size depends heavily on how the input features are arranged.
- Without Rearrangement: Shorter windows (specifically 2 weeks) consistently yielded the highest cumulative rewards for both datasets (173.8 for SMA, 155.9 for Technical Indicators; see Table 6). This suggests that without grouping features by company, the CNN struggles to effectively utilize longer histories, possibly due to "information overload" or difficulty correlating related features spread across the input tensor. Recent data holds more predictive power in this setup.
- With Rearrangement: Grouping features by company allows the CNN to perform better with longer windows. The optimal window shifted to 4 weeks for the SMA dataset (reward 181.8) and 10 weeks for the Technical Indicator dataset (reward 121.6). Rearranging features makes the spatial structure of the input more meaningful for the 2D CNN, enabling it to better capture longer-term dependencies and interactions within each company's data.
- Implementation Takeaway: When using CNNs for time-series forecasting or DRL in finance:
- Treat Window Size as Hyperparameter: Do not assume a fixed window size is optimal. Experiment with different lengths.
- Consider Feature Arrangement: The way features are ordered in the input tensor significantly impacts performance, especially for 2D CNNs. Grouping related features (e.g., by asset) can help the CNN leverage spatial correlations and potentially benefit from longer observation windows.
- Trade-offs: Longer windows require more memory and computation but might capture longer trends if the model architecture and input structure allow it. Shorter windows are faster but might miss longer patterns. Rearranging features adds a preprocessing step but can unlock better performance with longer windows.
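The "window size as hyperparameter" takeaway amounts to a sweep over W for each feature layout. A minimal skeleton, where `train_and_evaluate` is a hypothetical stand-in for the actual PPO training run and backtest (here it just returns a placeholder score):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder: in a real FinRL setup this would train the CNN/PPO agent
# with window W and the chosen feature layout, then backtest it and
# return the cumulative reward.
def train_and_evaluate(W: int, rearranged: bool) -> float:
    return float(rng.normal(loc=150.0 + (5.0 if rearranged else 0.0), scale=10.0))

# 2 to 12 weeks in 2-week increments, expressed in trading days.
windows = list(range(10, 61, 10))

best = {}
for rearranged in (False, True):
    rewards = {W: train_and_evaluate(W, rearranged) for W in windows}
    best[rearranged] = max(rewards, key=rewards.get)
print(best)
```

The paper's finding is that the argmax differs between the two layouts: short windows win without rearrangement, longer ones with it.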
Applications
The findings suggest that optimizing the temporal window and feature structure can lead to more effective DRL trading agents. The paper shows (Figure 3) that the best-performing configuration significantly outperformed benchmark ETFs like GURU (which mimics hedge fund holdings) and the DIA (Dow Jones ETF), demonstrating potential for developing advanced, adaptive trading strategies. This approach could be applied to:
- Automated portfolio management.
- High-frequency trading strategy development (by testing even shorter windows).
- Developing custom indices or ETFs based on DRL strategies.
Limitations
The study acknowledges the computational cost of testing numerous window sizes and the need for funding for larger-scale experiments. Future work could involve testing more complex architectures, different markets, incorporating alternative data (like sentiment), and exploring real-time data streams.