
LSTM-TD3: RL Agent for POMDP Challenges

Updated 5 February 2026
  • LSTM-TD3 is a reinforcement learning method that augments standard TD3 with an LSTM memory module to recover latent state information in POMDPs.
  • It integrates actor and critic architectures with dedicated LSTM subnetworks that capture temporal dependencies from action-observation histories.
  • Empirical results on PyBulletGym benchmarks show that LSTM-TD3 significantly outperforms standard TD3 and windowing methods in environments with noisy and missing observations.

The Twin Delayed Deep Deterministic Policy Gradient with Long Short-Term Memory (LSTM-TD3) agent is a reinforcement learning algorithm that augments the standard TD3 architecture with explicit memory integration via an LSTM, targeting the resolution of Partially Observable Markov Decision Processes (POMDPs). In POMDPs, the observable agent input at each timestep provides only a partial and potentially noisy view of the true system state. LSTM-TD3 introduces a learned memory subsystem to extract temporal dependencies and reconstruct latent states, thus enabling the agent to perform robustly in real-world scenarios where missing or corrupted sensory input is common (Meng et al., 2021).

1. Network Architecture and Memory Integration

LSTM-TD3 extends the canonical TD3 actor–critic framework by integrating a memory-extraction LSTM subnetwork into both the actor and each critic, which operate as follows:

  • Actor (\mu_{\theta^\mu}):
    • Receives a length-l history h_t^l = \{o_{t-l}, a_{t-l}, \ldots, o_{t-1}, a_{t-1}\}, processed by an LSTM to yield the memory vector m_t = \mu^{me}(h_t^l).
    • The current observation o_t is embedded via a compact MLP ("current-feature extractor," denoted \mu^{cf}) to yield f_t = \mu^{cf}(o_t).
    • The concatenated vector [m_t; f_t] is passed through an MLP ("perception integration," \mu^{pi}), whose output specifies the continuous action a_t.
  • Critics (Q_j, j = 1, 2):
    • Use a parallel LSTM structure (with weights distinct from the actor's) to process h_t^l, yielding m_t = Q^{me}(h_t^l).
    • The pair (o_t, a_t) is projected by an MLP ("current-feature extractor," Q^{cf}) to f_t = Q^{cf}(o_t, a_t).
    • [m_t; f_t] feeds into a final MLP (Q^{pi}) producing the Q-value estimate Q_j(o_t, a_t, h_t^l).

Both actor and critics employ two-layer ReLU-activated MLPs analogous in size to standard TD3 (e.g., 256–256 units) and an LSTM cell size of 128.

2. Mathematical Formulation of POMDPs

The environment is formalized as a tuple (S, A, P, R, O, \Omega), with latent state s_t \in S, continuous action a_t \in A, and partial observation o_t \in O. Transitions follow s_{t+1} \sim P(s_{t+1} \mid s_t, a_t) and observations follow o_{t+1} \sim \Omega(o_{t+1} \mid s_{t+1}). The policy receives the l-step history h_t^l (filled with dummy zero entries for t < l) and maximizes the expected discounted return:

\mathbb{E}\Big[\sum_t \gamma^t r_t\Big], \qquad r_t = R(s_t, a_t, s_{t+1})
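The zero-padded history h_t^l mentioned above can be sketched as a small helper; the function name and dimensions here are illustrative, not from the paper:

```python
import numpy as np

l, obs_dim, act_dim = 5, 3, 1   # small illustrative sizes

def history_window(trajectory, t, l):
    """h_t^l: the l most recent (o, a) pairs before step t, zero-padded for t < l."""
    pairs = trajectory[max(0, t - l):t]
    pad = [(np.zeros(obs_dim), np.zeros(act_dim))] * (l - len(pairs))
    return pad + pairs

# Two steps into an episode: three dummy entries precede the two real pairs.
traj = [(np.ones(obs_dim) * k, np.ones(act_dim) * k) for k in range(2)]
h = history_window(traj, t=2, l=5)
print(len(h))  # 5
```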

3. Forward Pass and Actor-Critic Computation

Let x_k = (o_k, a_k) denote the k-th entry of the history, fed to both the actor's and the critics' LSTMs (each with its own weights). The LSTM processes the input recursively via the standard gates:

  • i_k = \sigma(W_i x_k + U_i h_{k-1} + b_i)
  • f_k = \sigma(W_f x_k + U_f h_{k-1} + b_f)
  • o_k = \sigma(W_o x_k + U_o h_{k-1} + b_o)
  • \tilde{g}_k = \tanh(W_c x_k + U_c h_{k-1} + b_c)
  • c_k = f_k \circ c_{k-1} + i_k \circ \tilde{g}_k
  • h_k = o_k \circ \tanh(c_k)

The memory vector is m_t = h_l. The actor outputs a_t = \mu^{pi}([m_t; \mu^{cf}(o_t)]); the critics yield Q_j(o_t, a_t, h_t^l) = Q_j^{pi}([m_t; Q^{cf}(o_t, a_t)]).
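A minimal NumPy sketch of this forward pass, using tiny illustrative dimensions; all weight matrices are random stand-ins for trained parameters, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

obs_dim, act_dim, lstm_h, l = 8, 2, 16, 5     # tiny illustrative sizes
in_dim = obs_dim + act_dim                    # each history entry x_k = (o_k, a_k)

# Random stand-ins for the actor's LSTM (mu^me) parameters.
W = rng.standard_normal((4 * lstm_h, in_dim)) * 0.1   # input-to-gate weights
U = rng.standard_normal((4 * lstm_h, lstm_h)) * 0.1   # recurrent weights
b = np.zeros(4 * lstm_h)

def lstm_memory(history):
    """Run the gate equations over h_t^l; the final hidden state is m_t."""
    h, c = np.zeros(lstm_h), np.zeros(lstm_h)
    for x in history:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h

# Random stand-ins for mu^cf (current-feature extractor) and mu^pi (head).
W_cf = rng.standard_normal((lstm_h, obs_dim)) * 0.1
W_pi = rng.standard_normal((act_dim, 2 * lstm_h)) * 0.1

def actor(o_t, history):
    m_t = lstm_memory(history)                # memory vector from h_t^l
    f_t = np.maximum(0.0, W_cf @ o_t)         # ReLU features of o_t
    return np.tanh(W_pi @ np.concatenate([m_t, f_t]))   # bounded action

history = [rng.standard_normal(in_dim) for _ in range(l)]
a_t = actor(rng.standard_normal(obs_dim), history)
print(a_t.shape)  # (2,)
```

The critics follow the same pattern, except that the current-feature extractor consumes (o_t, a_t) and the head emits a scalar Q-value.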

4. Optimization Procedures and Update Mechanisms

  • Critic Loss: Each Q_j minimizes the MSE to the clipped double-Q target,

L(\theta^{Q_j}) = \mathbb{E}\left[\left(Q_j(o_t, a_t, h_t^l) - \hat{Y}_t\right)^2\right],

where \hat{Y}_t = r_t + \gamma (1 - d_t) \min_{k=1,2} Q_k^-(o_{t+1}, a_{t+1}^-, h_{t+1}^l) and a_{t+1}^- = \mu^-(o_{t+1}, h_{t+1}^l) + \mathrm{clip}(\epsilon, -c, c), with \epsilon \sim \mathcal{N}(0, \sigma).

  • Actor Loss: The policy update maximizes Q_1, i.e.

L(\theta^\mu) = -\mathbb{E}\left[Q_1(o_t, \mu(o_t, h_t^l), h_t^l)\right]

  • Delayed Policy Update and Target Networks: As in TD3, the policy and target networks are updated every d_\mu steps (commonly d_\mu = 2). Target networks undergo soft updates:

\theta^{Q_j^-} \leftarrow \tau \theta^{Q_j} + (1 - \tau) \theta^{Q_j^-}

\theta^{\mu^-} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu^-}

with \tau \approx 0.005.
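The target computation and soft updates above can be sketched numerically; the transition values and critic outputs below are illustrative, not from the paper:

```python
import numpy as np

gamma, sigma, c, tau = 0.99, 0.2, 0.5, 0.005
rng = np.random.default_rng(1)

# --- Clipped double-Q target for one (illustrative) transition ---
r, done = 1.0, 0.0
a_next = 0.3 + np.clip(rng.normal(0.0, sigma), -c, c)  # target policy smoothing
q1_next, q2_next = 10.0, 9.5        # target-critic values at (o', a_next, h')
y = r + gamma * (1.0 - done) * min(q1_next, q2_next)
print(round(y, 3))  # 10.405

# --- Soft (Polyak) target-network update, applied parameter-wise ---
def soft_update(target, online, tau=tau):
    """theta^- <- tau * theta + (1 - tau) * theta^-."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

print([round(v, 3) for v in soft_update([1.0, 2.0], [3.0, 4.0])])  # [1.01, 2.01]
```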

5. Training Algorithm Pseudocode

Initialize θ^{Q₁},θ^{Q₂},θ^μ randomly
θ^{Q₁⁻}←θ^{Q₁}, θ^{Q₂⁻}←θ^{Q₂}, θ^{μ⁻}←θ^μ
Replay buffer D ← ∅
h⁰ ← zeros, o₁←env.reset()
for t=1…T do
  aₜ ← μ_{θ^μ}(oₜ, hₜˡ) + ϵ,  ϵ∼N(0,σ)
  observe rₜ,oₜ₊₁,dₜ ← env.step(aₜ)
  store (oₜ,aₜ,rₜ,oₜ₊₁,dₜ) in D
  if dₜ then
    hₜ₊₁ˡ←zeros, oₜ₊₁←env.reset()
  else
    hₜ₊₁ˡ ← (hₜˡ minus oldest (o,a)) ∪ (oₜ,aₜ)
  end
  if |D|>batch_size then
    sample N tuples and their histories {hᶦ, oᶦ, aᶦ, rᶦ, o'ᶦ, dᶦ}ᵢ₌₁ⁿ
    for j in {1,2}:
      Ŷᶦ ← rᶦ + γ(1−dᶦ)·min_k Q_k⁻(o'ᶦ, μ⁻(o'ᶦ,hᶦ_next), hᶦ_next)
      Lⱼ ← (1/N)∑ᶦ [Qⱼ(oᶦ,aᶦ,hᶦ) − Ŷᶦ]²
      θ^{Qⱼ} ← Adam(∇_{θ^{Qⱼ}} Lⱼ)
    end
    if t mod d_μ ==0 then
      L^μ ← −(1/N)∑ᶦ Q₁(oᶦ, μ(oᶦ,hᶦ),hᶦ)
      θ^μ ← Adam(∇_{θ^μ} L^μ)
      for j in {1,2}:
        θ^{Qⱼ⁻}←τθ^{Qⱼ}+(1−τ)θ^{Qⱼ⁻}
      end
      θ^{μ⁻}←τθ^μ+(1−τ)θ^{μ⁻}
    end
  end
end
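The history bookkeeping in the loop above (zero entries after a reset, dropping the oldest pair otherwise) is easiest to see in a replay buffer that reconstructs histories at sample time. The sketch below is our own minimal version under that assumption; the class and method names are hypothetical:

```python
import random
from collections import deque

class HistoryReplayBuffer:
    """Stores transitions and rebuilds length-l histories when sampling.

    Histories never cross an episode boundary: entries before a terminal
    flag are discarded, and the window is zero-padded near episode starts.
    """
    def __init__(self, capacity, l, obs_dim, act_dim):
        self.data = deque(maxlen=capacity)
        self.l, self.obs_dim, self.act_dim = l, obs_dim, act_dim

    def add(self, o, a, r, o2, d):
        self.data.append((o, a, r, o2, d))

    def history(self, idx):
        """h^l for the transition at idx, as a list of l (o, a) pairs."""
        zero = ([0.0] * self.obs_dim, [0.0] * self.act_dim)
        hist = []
        for k in range(idx - self.l, idx):
            if k < 0 or self.data[k][4]:   # before the buffer, or terminal
                hist = []                  # restart after an episode boundary
                continue
            hist.append((self.data[k][0], self.data[k][1]))
        return [zero] * (self.l - len(hist)) + hist

    def sample(self, n):
        idxs = random.sample(range(len(self.data)), n)
        return [(self.history(i), *self.data[i]) for i in idxs]

buf = HistoryReplayBuffer(capacity=10_000, l=5, obs_dim=2, act_dim=1)
for k in range(3):
    buf.add([float(k)] * 2, [float(k)], 0.0, [float(k + 1)] * 2, False)
print(len(buf.history(2)))  # 5
```

Storing plain transitions and padding lazily keeps the buffer layout identical to standard TD3, at the cost of reassembling windows on every sample.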

6. Hyperparameters and Memory Ablation

Principal hyperparameters include:

  • History length l = 5 (additionally l = 0, 3, 10 tested)
  • Replay buffer size: 10^6
  • Batch size N = 100
  • Discount factor \gamma = 0.99
  • Policy noise \sigma = 0.2, noise clip c = 0.5
  • Policy delay d_\mu = 2
  • Target network update \tau = 0.005
  • Actor/critic learning rates: 3 \times 10^{-4} (Adam)
  • MLP architecture: [256, 256]
  • LSTM hidden size: 128
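For reference, the settings above can be collected in a single configuration mapping; the key names here are ours, not the paper's:

```python
# LSTM-TD3 hyperparameters as reported above; key names are illustrative.
LSTM_TD3_CONFIG = {
    "history_length": 5,          # l (ablations also used 0, 3, 10)
    "replay_buffer_size": 1_000_000,
    "batch_size": 100,            # N
    "gamma": 0.99,
    "policy_noise": 0.2,          # sigma
    "noise_clip": 0.5,            # c
    "policy_delay": 2,            # d_mu
    "tau": 0.005,
    "lr_actor": 3e-4,
    "lr_critic": 3e-4,
    "mlp_hidden": [256, 256],
    "lstm_hidden": 128,
}
print(len(LSTM_TD3_CONFIG))  # 12
```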

Ablation studies reveal:

  • Removing the double-critic structure destabilizes learning (yielding LSTM-DDPG/RDPG).
  • Omitting target policy smoothing produces a milder performance drop.
  • Excluding the current-feature extractor degrades MDP performance severely.
  • Removing past-action inputs from the history significantly impairs POMDP handling; both actor and critic require both oo and aa in their respective histories.

7. Empirical Evaluation and Baseline Comparisons

LSTM-TD3 was evaluated on five PyBulletGym benchmarks: HalfCheetah, Ant, Walker2D, Hopper, and InvertedDoublePendulum. Scenarios included:

  • MDP: Full observations.
  • POMDP-RV: Velocity entries removed.
  • POMDP-FLK: Entire observations zeroed at random (p_{flk} = 0.2).
  • POMDP-RN: Additive Gaussian noise (\sigma_{rn} = 0.1).
  • POMDP-RSM: Individual entries zeroed randomly (p_{rsm} = 0.1).
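The four corruptions are straightforward observation wrappers; a sketch of how they might be implemented is below (function names and the velocity-index argument are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def pomdp_rv(o, vel_idx):
    """POMDP-RV: remove the velocity entries from the observation."""
    return np.delete(o, vel_idx)

def pomdp_flk(o, p=0.2):
    """POMDP-FLK: zero the entire observation with probability p."""
    return np.zeros_like(o) if rng.random() < p else o

def pomdp_rn(o, sigma=0.1):
    """POMDP-RN: add independent Gaussian noise to every entry."""
    return o + rng.normal(0.0, sigma, size=o.shape)

def pomdp_rsm(o, p=0.1):
    """POMDP-RSM: zero individual entries independently with probability p."""
    return o * (rng.random(o.shape) >= p)

o = np.ones(6)
print(pomdp_rv(o, [3, 4, 5]).shape)  # (3,)
```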

Baselines comprised DDPG, SAC, vanilla TD3, TD3-OW (the most recent l observations concatenated), and TD3-OW+PA (the most recent l actions also concatenated).

For HalfCheetah (after 1M steps, l = 5):

Scenario      TD3            LSTM-TD3(5)
MDP           11,200 ± 300   10,900 ± 250
POMDP-RV       9,800 ± 400   10,300 ± 320
POMDP-FLK      1,200 ± 500    9,500 ± 400
POMDP-RN       4,000 ± 800    9,800 ± 350
POMDP-RSM      3,200 ± 700    9,200 ± 410

On pure MDPs, LSTM-TD3 matches state-of-the-art (TD3/SAC); on POMDPs with missing, noisy, or corrupted observations, LSTM-TD3 outperforms all baselines, sometimes by more than a factor of two. On tasks where underlying latent variables (e.g., velocity) are removed from observations, the memory module supports estimation via the action-observation sequence, recovering most of the performance lost by conventional architectures except possibly in high-frequency environments where the history window is too short for reliable inference. TD3-OW (observation windowing) slightly improves over naive TD3 in some POMDPs but fails catastrophically in high-noise/flickering settings, and TD3-OW+PA is usually inferior to TD3-OW in both MDP and POMDP regimes.

A plausible implication is that the explicit LSTM-based memory extraction enables true temporal inference necessary for POMDPs, a capability unattainable with mere windowing or static memory concatenation. TD3's architectural components—double critic, policy smoothing, delayed updates—remain critical to stability and sample efficiency under partial observability. Proximal Policy Optimization (PPO) was not included; on MuJoCo-style tasks, PPO requires 2–5x more samples to reach similar returns, so under the 1M step constraint it was not competitive (Meng et al., 2021).

References

  1. Meng, L., Gorbet, R., & Kulić, D. (2021). Memory-based Deep Reinforcement Learning for POMDPs. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
