LSTM-Based Recurrent Policies in Deep RL
- LSTM-based recurrent policies are deep reinforcement learning architectures that integrate sequential information to handle partial observability and environmental noise.
- They employ LSTM modules to unroll observation-action histories, enhancing policy and value estimation in both value-based and actor–critic methods.
- Empirical studies confirm these methods boost performance in diverse tasks such as continuous control, pixel-level games, and structured scheduling under noisy conditions.
Long Short-Term Memory (LSTM)-based recurrent policies constitute a class of deep reinforcement learning (DRL) architectures designed for environments characterized by partial observability, stochastic disturbances, and non-Markovian hidden state dynamics. Leveraging the intrinsic memory and gating mechanisms of LSTM units, these policies efficiently integrate sequential information, enabling agents to approximate the optimal mapping from historical observations and actions to control decisions. LSTM-based recurrent policies underpin advances in both value-based and actor–critic algorithms—such as Recurrent Deep Q-Networks (DRQN), Recurrent Deterministic Policy Gradient (RDPG), Twin Delayed DDPG (TD3) with recurrence, and hybrid approaches uniting supervised sequence modeling with RL—across discrete, continuous, high-dimensional, and structured control domains.
1. Architectures for Recurrent Policies
The fundamental architecture of LSTM-based recurrent policies incorporates an LSTM module that processes temporal sequences. Typical input at time $t$ includes the current observation $o_t$ and often the previous action $a_{t-1}$ (Heess et al., 2015, Yang et al., 2021). Architectures fall into four categories:
- Pure Recurrent Policy-Critic Modules: Both actor and critic networks unroll an LSTM over observation–action history; the hidden state $h_t$ encodes task-specific memory. Action output may be deterministic ($a_t = \mu(h_t)$) or stochastic ($a_t \sim \pi(\cdot \mid h_t)$) (Heess et al., 2015, Meng et al., 2021, Omi et al., 2023).
- Feature-Extracting Recurrent Heads: LSTM processes concatenated history; output is merged via feed-forward layers with current observation and action features, enhancing instantaneous perception–memory integration (Meng et al., 2021, Omi et al., 2023).
- Hybrid Deep Network Compositions: Architectures couple LSTMs with graph neural networks (GNNs) for structured input (scheduling), or fuse supervised LSTM sequence models with RL for state representation (Altundas et al., 2023, Li et al., 2015).
- DRQN for Pixel-level Inputs: The convolutional encoder of DQN is retained, while the first fully-connected layer is replaced with an LSTM module, extending memory over high-dimensional observations (Hausknecht et al., 2015).
Critics, Q-networks, or value estimators typically replicate the recurrent structure to leverage history-dependent value estimation under partial observability.
2. Memory Handling and Dynamics
LSTM cell equations follow the canonical gating and update rules at each time $t$:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t),$$

where the input $x_t$ may include the current observation, the previous action $a_{t-1}$, or other features (Heess et al., 2015, Hausknecht et al., 2015, Meng et al., 2021, Omi et al., 2023). Zero initialization of hidden and cell states occurs at sequence or episode start; truncated backpropagation through time (BPTT) enables optimization over bounded sequence windows.
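The canonical gating and update rules above can be sketched directly in NumPy. The layer sizes and the observation–action input concatenation below are illustrative assumptions, not values taken from any of the cited papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCell:
    """Canonical LSTM cell: input, forget, and output gates plus cell update."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the four gate pre-activations [i; f; o; g].
        self.W = rng.normal(0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x_t, h_prev, c_prev):
        z = self.W @ np.concatenate([x_t, h_prev]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c_t = f * c_prev + i * np.tanh(g)   # cell update
        h_t = o * np.tanh(c_t)              # hidden state read by the policy head
        return h_t, c_t

# Unroll over a short observation-action history, zero-initialized as at episode start.
obs_dim, act_dim, hidden_dim, T = 3, 1, 8, 5
cell = LSTMCell(obs_dim + act_dim, hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
rng = np.random.default_rng(1)
for t in range(T):
    x_t = np.concatenate([rng.normal(size=obs_dim), rng.normal(size=act_dim)])
    h, c = cell.step(x_t, h, c)
print(h.shape)  # (8,)
```

The final hidden state `h` is what a recurrent actor or critic head would consume; note it is strictly bounded in $(-1, 1)$ because it is a product of a sigmoid gate and a $\tanh$.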
Effective memory representation depends on input structure and history length; recurrence integrates information over durations necessary for task-specific estimations—velocity inference, long-horizon credit assignment, or delayed reward states. Including past actions alongside observations demonstrably improves denoising and temporal pattern recovery under disturbance (Omi et al., 2023).
3. Integration with Reinforcement Learning Algorithms
LSTM-based policies extend standard RL formulations to partial observability. The LSTM hidden state $h_t$ serves as a sufficient statistic of the agent's history for decision-making:
- Policy Gradient Methods (RDPG, SVG):
- Deterministic: $a_t = \mu_\theta(h_t)$, trained with the deterministic policy gradient $\nabla_\theta J \approx \mathbb{E}\big[\nabla_a Q(h_t, a)\big|_{a=\mu_\theta(h_t)}\, \nabla_\theta \mu_\theta(h_t)\big]$ (Heess et al., 2015, Yang et al., 2021).
- Stochastic: Reparameterization $a_t = \pi_\theta(h_t, \epsilon_t)$ with noise $\epsilon_t \sim \rho(\epsilon)$; the gradient is estimated via sampled trajectories and noise (Heess et al., 2015).
- Twin Delayed DDPG (TD3) Recurrent Extensions: Separate LSTM-based actor and twin critics; delayed actor updates; target smoothing via noise; BPTT over history (Meng et al., 2021, Omi et al., 2023, Yang et al., 2021).
- Deep Q-Learning (DRQN): Q-values are read off via a linear mapping from the LSTM hidden vector $h_t$; target Bellman loss and gradient clipping stabilize training (Hausknecht et al., 2015).
- Hybrid Supervised/Reinforcement Learning: Joint training of LSTM (for observation/reward prediction) and DQN (for control optimization); total loss is a weighted combination (Li et al., 2015).
- Graph–LSTM Policy Networks: HybridNet propagates agent–state embeddings forward through LSTMCells to simulate schedule consequences; policy selection via softmax heads operates on LSTM-updated latent vectors (Altundas et al., 2023).
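The stochastic reparameterization used in the policy-gradient variants above can be illustrated with a Gaussian head over the LSTM hidden state. The helper name, weight shapes, and clipping range here are hypothetical:

```python
import numpy as np

def reparameterized_action(h_t, W_mu, W_log_std, rng):
    """Sample a_t = mu(h_t) + sigma(h_t) * eps, with eps ~ N(0, I).

    Because eps is drawn independently of the policy parameters, the pathwise
    gradient of a_t flows through mu and sigma (SVG/SAC-style estimation).
    """
    mu = W_mu @ h_t
    std = np.exp(np.clip(W_log_std @ h_t, -5, 2))  # keep sigma in a sane range
    eps = rng.standard_normal(mu.shape)
    return np.tanh(mu + std * eps)                 # squash to bounded actions

rng = np.random.default_rng(0)
hidden_dim, act_dim = 8, 2
h_t = rng.standard_normal(hidden_dim)
W_mu = 0.1 * rng.standard_normal((act_dim, hidden_dim))
W_log_std = 0.1 * rng.standard_normal((act_dim, hidden_dim))
a_t = reparameterized_action(h_t, W_mu, W_log_std, rng)
print(a_t.shape)  # (2,)
```

In a full implementation, `h_t` is the output of the recurrent trunk and gradients are backpropagated through both the head and the LSTM unroll.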
Experience replay and off-policy updates sample sequences (not just single transitions) from the buffer; policy and value gradients propagate through full or truncated history, with soft target network updates.
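A minimal sketch of sequence-level replay, assuming full episodes are stored and fixed-length windows are drawn at update time (the window length, buffer layout, and class name are illustrative):

```python
import numpy as np
from collections import deque

class SequenceReplayBuffer:
    """Stores full episodes; samples fixed-length (obs, act, rew) windows."""
    def __init__(self, capacity_episodes=1000, window=8):
        self.episodes = deque(maxlen=capacity_episodes)
        self.window = window

    def add_episode(self, obs, acts, rews):
        # obs: (T, obs_dim), acts: (T, act_dim), rews: (T,)
        self.episodes.append((np.asarray(obs), np.asarray(acts), np.asarray(rews)))

    def sample(self, batch_size, rng):
        batch = []
        for _ in range(batch_size):
            obs, acts, rews = self.episodes[rng.integers(len(self.episodes))]
            start = rng.integers(0, len(rews) - self.window + 1)
            sl = slice(start, start + self.window)
            # Hidden states are zero-initialized for each sampled window, then
            # unrolled (optionally with a burn-in prefix) before BPTT.
            batch.append((obs[sl], acts[sl], rews[sl]))
        return batch

rng = np.random.default_rng(0)
buf = SequenceReplayBuffer(window=8)
T, obs_dim, act_dim = 50, 3, 1
buf.add_episode(rng.normal(size=(T, obs_dim)),
                rng.normal(size=(T, act_dim)),
                rng.normal(size=T))
batch = buf.sample(4, rng)
print(len(batch), batch[0][0].shape)  # 4 (8, 3)
```

Sampling windows rather than single transitions is what lets the recurrent critic and actor see enough context to reconstruct latent state during each update.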
4. Empirical Evaluations and Benchmarking
LSTM-based recurrent policies have undergone extensive benchmarking in various domains:
- Partially Observed Continuous Control: Tasks include pendulum/cartpole swing-up (with missing velocity or randomized system parameters), reacher and gripper tasks with targets revealed only at initial steps, and pixel-based control (visual occlusion, frame-drop) (Heess et al., 2015, Hausknecht et al., 2015, Yang et al., 2021, Meng et al., 2021).
- Disturbance/Noise Robustness: Sinusoidal, Gaussian, and random disturbances, as well as missing or hidden sensors, were evaluated in continuous-control environments. Including actions in the history and using longer LSTM windows significantly improved normalized returns and noise resistance (Omi et al., 2023, Meng et al., 2021).
- Long-term Memory and Credit Assignment: Morris water maze task, delayed reward benchmarks, direct mailing CRM simulated with long interaction history—demonstrated recovery of latent state and strategic planning beyond instantaneous observation (Heess et al., 2015, Li et al., 2015).
- Pixel-level Atari and Flickering Games: DRQN matches DQN in fully observable settings, but outperforms when the agent sees only a single (flickered) frame each step, and generalizes gracefully to varying degrees of partial observability (Hausknecht et al., 2015).
- Structured Scheduling and Coordination: HybridNet, utilizing graph+LSTM propagators, attains superior feasibility and makespan in human–robot team scheduling problems compared to purely feed-forward graph encoders (Altundas et al., 2023).
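The flickering regime in these benchmarks can be mimicked with a simple wrapper that blanks the observation with probability $p$; this stand-alone sketch is an assumption about the setup, not the original Atari pipeline:

```python
import numpy as np

def flicker(obs, p, rng):
    """With probability p, return a zeroed-out (dropped) observation frame.

    On dropped steps a feed-forward policy sees nothing useful, while a
    recurrent policy can fall back on its hidden-state memory.
    """
    if rng.random() < p:
        return np.zeros_like(obs)
    return obs

rng = np.random.default_rng(0)
obs = np.ones((4, 4))
dropped = sum(np.all(flicker(obs, 0.5, rng) == 0) for _ in range(1000))
print(dropped)  # roughly half the 1000 frames are blanked
```

Sweeping `p` at evaluation time is one way to probe how gracefully a trained policy degrades under increasing partial observability.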
Approximate performance gains (returns; higher is better) are summarized below for select domains:
| Algorithm | CartPole ↑ | Pendulum ↑ | Reacher ↑ | PuckPush ↑ | Ant-POMDP-FLK ↑ |
|---|---|---|---|---|---|
| DDPG | 110±15 | –100±5 | 25±4 | 10±3 | ~100 |
| TD3 | 200±10 | –50±8 | 40±6 | 20±4 | ~100 |
| LSTM-TD3 | — | — | — | — | ~2000 |
| RSAC | 260±4 | –10±4 | 80±3 | 55±4 | — |
LSTM-based methods consistently outperform their feed-forward counterparts in environments where stepwise observability is broken, attribution is delayed, or dynamics are non-Markovian (Heess et al., 2015, Meng et al., 2021, Yang et al., 2021).
5. Training Protocols, Hyperparameters, and Design Principles
Key engineering practices for LSTM-based recurrent RL include:
- Truncated BPTT: Unroll and train over limited sequence windows, with burn-in initialization to stabilize hidden state inference (Hausknecht et al., 2015, Yang et al., 2021, Meng et al., 2021).
- Replay Buffer: Store full episodes or reconstruct fixed-length histories on update; for high-dimensional/long-memory domains sequence sampling methods (bootstrapped random updates) enhance diversity (Hausknecht et al., 2015).
- Layer Sizes: Typical hidden state dimensions: LSTM up to $256$ units, FC layers $128$–$512$ units; two recurrent layers strike a balance of expressivity and trainability (Yang et al., 2021, Omi et al., 2023).
- Architectural Variants: Ablations confirm that single-headed LSTM streams (merging history and current features) outperform multiple-headed splits in disturbance-heavy domains (Omi et al., 2023).
- Optimization and Regularization: Adam or ADADELTA optimizers, gradient clipping, periodic target network updates, dropout, and weight decay are standard (Hausknecht et al., 2015, Li et al., 2015).
- Training Losses: RL losses are computed over all time steps in a sampled sequence window; hybrid schemes combine SL-prediction and RL Q-value optimization with task-specific weighting (Li et al., 2015).
- Exploration: Entropy-regularization and action noise improve exploration, but systematic coverage of memory-dependent subspaces remains challenging in extremely sparse-reward environments (Yang et al., 2021).
- Replay and Initialization: Zeroing hidden/cell states at episode or sequence start is universal to ensure consistent history propagation.
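The soft target-network updates referenced in these practices follow the standard Polyak rule $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$; the parameter shapes and value of $\tau$ below are illustrative:

```python
import numpy as np

def soft_update(online_params, target_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    Applied after each gradient step to the recurrent critic(s) and, in
    TD3-style schemes, to the (delayed) recurrent actor as well.
    """
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

rng = np.random.default_rng(0)
online = [rng.normal(size=(4, 4)), rng.normal(size=4)]
target = [np.zeros((4, 4)), np.zeros(4)]
target = soft_update(online, target, tau=0.5)
# After one update with tau=0.5, the target sits at the midpoint
# between its old value (zero) and the online parameters.
```

Small values of $\tau$ (e.g. $0.005$) keep the bootstrapped targets slowly moving, which is especially important when the critic's value estimates depend on an evolving recurrent state encoder.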
6. Impact, Generalization, and Limitations
Empirical analyses across domains establish that LSTM-based recurrent policies:
- Robustly solve POMDPs with missing, noisy, or temporally correlated disturbances (Heess et al., 2015, Meng et al., 2021, Omi et al., 2023).
- Generalize from trained disturbances to unseen temporally correlated perturbations, but not to uncorrelated white noise (Omi et al., 2023).
- Display no systematic advantage over feed-forward methods in fully observable or short-memory MDPs, while incurring higher computational and memory overhead (Hausknecht et al., 2015).
- Optimize long-horizon rewards in tasks demanding both short and long-term memory integration, as verified by ablation and hybrid training studies (Li et al., 2015).
- Improve both sample efficiency and final returns in temporal credit-assignment and delayed or obscured control domains when paired with off-policy RL algorithms (TD3, SAC) (Yang et al., 2021).
Challenges persist in exploration for sparse-reward, memory-dependent regions and scaling to larger LSTM parameterizations or longer windows; hybrid methodologies and cross-module state sharing (H-TD3) partially alleviate computational cost (Omi et al., 2023).
7. Extensions and Structured Recurrent Policies
Recent research extends LSTM-based recurrent policies to settings requiring structured, graph-theoretic state embeddings and schedule propagation (e.g., coordination in human–robot teams). HybridNet, for instance, merges heterogeneous graph attention encoding with recurrent LSTM-based schedule propagators, resulting in scalable and expressive scheduling policies that outperform GNN-only baselines in both deterministic and stochastic human performance scenarios (Altundas et al., 2023).
Recurrent architectures are also employed in joint supervised/reinforcement learning training, where the LSTM provides domain-independent hidden-state representations for downstream RL optimization, yielding improvements in real-world business process optimization (Li et al., 2015).
---
LSTM-based recurrent policies, via history-conditioned representation learning and robust integration in DRL frameworks, constitute a highly effective paradigm for partially observed, noisy, and temporally extended control tasks. Their architectural flexibility supports both canonical RL and hybrid methodologies over discrete, continuous, high-dimensional, and structured domains, with strong empirical evidence highlighting their utility wherever environmental state is not fully accessible at each timestep.