
LSTM-TD3: RL Agent for POMDP Challenges

Updated 5 February 2026
  • LSTM-TD3 is a reinforcement learning method that augments standard TD3 with an LSTM memory module to recover latent state information in POMDPs.
  • It integrates actor and critic architectures with dedicated LSTM subnetworks that capture temporal dependencies from action-observation histories.
  • Empirical results on PyBulletGym benchmarks show that LSTM-TD3 significantly outperforms standard TD3 and windowing methods in environments with noisy and missing observations.

The Twin Delayed Deep Deterministic Policy Gradient with Long Short-Term Memory (LSTM-TD3) agent is a reinforcement learning algorithm that augments the standard TD3 architecture with explicit memory integration via an LSTM, targeting the resolution of Partially Observable Markov Decision Processes (POMDPs). In POMDPs, the observable agent input at each timestep provides only a partial and potentially noisy view of the true system state. LSTM-TD3 introduces a learned memory subsystem to extract temporal dependencies and reconstruct latent states, thus enabling the agent to perform robustly in real-world scenarios where missing or corrupted sensory input is common (Meng et al., 2021).

1. Network Architecture and Memory Integration

LSTM-TD3 extends the canonical TD3 actor–critic framework by integrating a memory-extraction LSTM subnetwork into both the actor and each critic, which operate as follows:

  • Actor (\mu_{\theta^\mu}):
    • Receives a length-l history h_t^l = \{o_{t-l}, a_{t-l}, \ldots, o_{t-1}, a_{t-1}\}, processed by an LSTM to yield the memory vector m_t = \mu^{me}(h_t^l).
    • The current observation o_t is embedded via a compact MLP ("current-feature extractor," denoted \mu^{cf}) to yield f_t = \mu^{cf}(o_t).
    • The concatenated vector [m_t; f_t] is passed through an MLP ("perception integration," \mu^{pi}), whose output specifies the continuous action a_t.
  • Critics (Q_j, j = 1, 2):
    • Use a parallel LSTM structure (with weights distinct from the actor's) to process h_t^l, yielding m_t = Q^{me}(h_t^l).
    • The pair (o_t, a_t) is projected by an MLP ("current-feature extractor," Q^{cf}) to f_t = Q^{cf}(o_t, a_t).
    • [m_t; f_t] feeds into a final MLP (Q^{pi}) producing the Q-value estimate Q_j(o_t, a_t, h_t^l).

Both actor and critics employ two-layer ReLU-activated MLPs analogous in size to standard TD3 (e.g., 256–256 units) and an LSTM cell size of 128.

2. Mathematical Formulation of POMDPs

The environment is formalized as a tuple (S, A, P, R, O, \Omega), with latent state s_t \in S, continuous action a_t \in A, and partial observation o_t \in O. Transitions follow s_{t+1} \sim P(s_{t+1} \mid s_t, a_t) and observations follow o_{t+1} \sim \Omega(o_{t+1} \mid s_{t+1}). The policy receives the l-step history h_t^l (filled with dummy zero entries for t < l) and maximizes the expected discounted return:

\mathbb{E}\Big[\sum_t \gamma^t r_t\Big], \qquad r_t = R(s_t, a_t, s_{t+1})
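The zero-padded history h_t^l mentioned above can be sketched as a small helper; the function name and dimensions here are illustrative, not from the paper:

```python
import numpy as np

l, obs_dim, act_dim = 5, 3, 1   # small illustrative sizes

def history_window(trajectory, t, l):
    """h_t^l: the l most recent (o, a) pairs before step t, zero-padded for t < l."""
    pairs = trajectory[max(0, t - l):t]
    pad = [(np.zeros(obs_dim), np.zeros(act_dim))] * (l - len(pairs))
    return pad + pairs

# Two steps into an episode: three dummy entries precede the two real pairs.
traj = [(np.ones(obs_dim) * k, np.ones(act_dim) * k) for k in range(2)]
h = history_window(traj, t=2, l=5)
print(len(h))  # 5
```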

3. Forward Pass and Actor-Critic Computation

Let x_k = (o_k, a_k) denote the k-th entry of the history, fed to both the actor's and the critics' LSTMs (each with its own weights). The LSTM processes the input recursively via the standard gates:

  • i_k = \sigma(W_i x_k + U_i h_{k-1} + b_i)
  • f_k = \sigma(W_f x_k + U_f h_{k-1} + b_f)
  • o_k = \sigma(W_o x_k + U_o h_{k-1} + b_o)
  • \tilde{g}_k = \tanh(W_c x_k + U_c h_{k-1} + b_c)
  • c_k = f_k \circ c_{k-1} + i_k \circ \tilde{g}_k
  • h_k = o_k \circ \tanh(c_k)

The memory vector is m_t = h_l. The actor outputs a_t = \mu^{pi}([m_t; \mu^{cf}(o_t)]); the critics yield Q_j(o_t, a_t, h_t^l) = Q_j^{pi}([m_t; Q^{cf}(o_t, a_t)]).
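A minimal NumPy sketch of this forward pass, using tiny illustrative dimensions; all weight matrices are random stand-ins for trained parameters, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

obs_dim, act_dim, lstm_h, l = 8, 2, 16, 5     # tiny illustrative sizes
in_dim = obs_dim + act_dim                    # each history entry x_k = (o_k, a_k)

# Random stand-ins for the actor's LSTM (mu^me) parameters.
W = rng.standard_normal((4 * lstm_h, in_dim)) * 0.1   # input-to-gate weights
U = rng.standard_normal((4 * lstm_h, lstm_h)) * 0.1   # recurrent weights
b = np.zeros(4 * lstm_h)

def lstm_memory(history):
    """Run the gate equations over h_t^l; the final hidden state is m_t."""
    h, c = np.zeros(lstm_h), np.zeros(lstm_h)
    for x in history:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h

# Random stand-ins for mu^cf (current-feature extractor) and mu^pi (head).
W_cf = rng.standard_normal((lstm_h, obs_dim)) * 0.1
W_pi = rng.standard_normal((act_dim, 2 * lstm_h)) * 0.1

def actor(o_t, history):
    m_t = lstm_memory(history)                # memory vector from h_t^l
    f_t = np.maximum(0.0, W_cf @ o_t)         # ReLU features of o_t
    return np.tanh(W_pi @ np.concatenate([m_t, f_t]))   # bounded action

history = [rng.standard_normal(in_dim) for _ in range(l)]
a_t = actor(rng.standard_normal(obs_dim), history)
print(a_t.shape)  # (2,)
```

The critics follow the same pattern, except that the current-feature extractor consumes (o_t, a_t) and the head emits a scalar Q-value.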

4. Optimization Procedures and Update Mechanisms

  • Critic Loss: Each Q_j minimizes the MSE to the clipped double-Q target,

L(\theta^{Q_j}) = \mathbb{E}\left[\left(Q_j(o_t, a_t, h_t^l) - \hat{Y}_t\right)^2\right],

where \hat{Y}_t = r_t + \gamma (1 - d_t) \min_{k=1,2} Q_k^-(o_{t+1}, a_{t+1}^-, h_{t+1}^l) and a_{t+1}^- = \mu^-(o_{t+1}, h_{t+1}^l) + \mathrm{clip}(\epsilon, -c, c), with \epsilon \sim \mathcal{N}(0, \sigma).

  • Actor Loss: The policy update maximizes Q_1, i.e.

L(\theta^\mu) = -\mathbb{E}\left[Q_1(o_t, \mu(o_t, h_t^l), h_t^l)\right]

  • Delayed Policy Update and Target Networks: As in TD3, the policy and target networks are updated every d_\mu steps (commonly d_\mu = 2). Target networks undergo soft updates:

\theta^{Q_j^-} \leftarrow \tau \theta^{Q_j} + (1 - \tau) \theta^{Q_j^-}

\theta^{\mu^-} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu^-}

with \tau \approx 0.005.
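The target computation and soft updates above can be sketched numerically; the transition values and critic outputs below are illustrative, not from the paper:

```python
import numpy as np

gamma, sigma, c, tau = 0.99, 0.2, 0.5, 0.005
rng = np.random.default_rng(1)

# --- Clipped double-Q target for one (illustrative) transition ---
r, done = 1.0, 0.0
a_next = 0.3 + np.clip(rng.normal(0.0, sigma), -c, c)  # target policy smoothing
q1_next, q2_next = 10.0, 9.5        # target-critic values at (o', a_next, h')
y = r + gamma * (1.0 - done) * min(q1_next, q2_next)
print(round(y, 3))  # 10.405

# --- Soft (Polyak) target-network update, applied parameter-wise ---
def soft_update(target, online, tau=tau):
    """theta^- <- tau * theta + (1 - tau) * theta^-."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

print([round(v, 3) for v in soft_update([1.0, 2.0], [3.0, 4.0])])  # [1.01, 2.01]
```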

5. Training Algorithm Pseudocode

Initialize θ^{Q₁},θ^{Q₂},θ^μ randomly
θ^{Q₁⁻}←θ^{Q₁}, θ^{Q₂⁻}←θ^{Q₂}, θ^{μ⁻}←θ^μ
Replay buffer D ← ∅
h⁰ ← zeros, o₁←env.reset()
for t=1…T do
  aₜ ← μ_{θ^μ}(oₜ, hₜˡ) + ϵ,  ϵ∼N(0,σ)
  observe rₜ,oₜ₊₁,dₜ ← env.step(aₜ)
  store (oₜ,aₜ,rₜ,oₜ₊₁,dₜ) in D
  if dₜ then
    hₜ₊₁ˡ←zeros, oₜ₊₁←env.reset()
  else
    hₜ₊₁ˡ ← (hₜˡ minus oldest (o,a)) ∪ (oₜ,aₜ)
  end
  if |D|>batch_size then
    sample N tuples and their histories {hᶦ, oᶦ, aᶦ, rᶦ, o'ᶦ, dᶦ}ᵢ₌₁ⁿ
    for j in {1,2}:
      Ŷᶦ ← rᶦ + γ(1−dᶦ)·min_k Q_k⁻(o'ᶦ, μ⁻(o'ᶦ,hᶦ_next), hᶦ_next)
      Lⱼ ← (1/N)∑ᶦ [Qⱼ(oᶦ,aᶦ,hᶦ) − Ŷᶦ]²
      θ^{Qⱼ} ← Adam(∇_{θ^{Qⱼ}} Lⱼ)
    end
    if t mod d_μ ==0 then
      L^μ ← −(1/N)∑ᶦ Q₁(oᶦ, μ(oᶦ,hᶦ),hᶦ)
      θ^μ ← Adam(∇_{θ^μ} L^μ)
      for j in {1,2}:
        θ^{Qⱼ⁻}←τθ^{Qⱼ}+(1−τ)θ^{Qⱼ⁻}
      end
      θ^{μ⁻}←τθ^μ+(1−τ)θ^{μ⁻}
    end
  end
end
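The history bookkeeping in the loop above (zero entries after a reset, dropping the oldest pair otherwise) is easiest to see in a replay buffer that reconstructs histories at sample time. The sketch below is our own minimal version under that assumption; the class and method names are hypothetical:

```python
import random
from collections import deque

class HistoryReplayBuffer:
    """Stores transitions and rebuilds length-l histories when sampling.

    Histories never cross an episode boundary: entries before a terminal
    flag are discarded, and the window is zero-padded near episode starts.
    """
    def __init__(self, capacity, l, obs_dim, act_dim):
        self.data = deque(maxlen=capacity)
        self.l, self.obs_dim, self.act_dim = l, obs_dim, act_dim

    def add(self, o, a, r, o2, d):
        self.data.append((o, a, r, o2, d))

    def history(self, idx):
        """h^l for the transition at idx, as a list of l (o, a) pairs."""
        zero = ([0.0] * self.obs_dim, [0.0] * self.act_dim)
        hist = []
        for k in range(idx - self.l, idx):
            if k < 0 or self.data[k][4]:   # before the buffer, or terminal
                hist = []                  # restart after an episode boundary
                continue
            hist.append((self.data[k][0], self.data[k][1]))
        return [zero] * (self.l - len(hist)) + hist

    def sample(self, n):
        idxs = random.sample(range(len(self.data)), n)
        return [(self.history(i), *self.data[i]) for i in idxs]

buf = HistoryReplayBuffer(capacity=10_000, l=5, obs_dim=2, act_dim=1)
for k in range(3):
    buf.add([float(k)] * 2, [float(k)], 0.0, [float(k + 1)] * 2, False)
print(len(buf.history(2)))  # 5
```

Storing plain transitions and padding lazily keeps the buffer layout identical to standard TD3, at the cost of reassembling windows on every sample.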

6. Hyperparameters and Memory Ablation

Principal hyperparameters include:

  • History length l = 5 (additionally l = 0, 3, 10 tested)
  • Replay buffer size: 10^6
  • Batch size N = 100
  • Discount factor \gamma = 0.99
  • Policy noise \sigma = 0.2, noise clip c = 0.5
  • Policy delay d_\mu = 2
  • Target network update \tau = 0.005
  • Actor/critic learning rates: 3 \times 10^{-4} (Adam)
  • MLP architecture: [256, 256]
  • LSTM hidden size: 128
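For reference, the settings above can be collected in a single configuration mapping; the key names here are ours, not the paper's:

```python
# LSTM-TD3 hyperparameters as reported above; key names are illustrative.
LSTM_TD3_CONFIG = {
    "history_length": 5,          # l (ablations also used 0, 3, 10)
    "replay_buffer_size": 1_000_000,
    "batch_size": 100,            # N
    "gamma": 0.99,
    "policy_noise": 0.2,          # sigma
    "noise_clip": 0.5,            # c
    "policy_delay": 2,            # d_mu
    "tau": 0.005,
    "lr_actor": 3e-4,
    "lr_critic": 3e-4,
    "mlp_hidden": [256, 256],
    "lstm_hidden": 128,
}
print(len(LSTM_TD3_CONFIG))  # 12
```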

Ablation studies reveal:

  • Removing the double-critic structure destabilizes learning (yielding LSTM-DDPG/RDPG).
  • Omitting target policy smoothing produces a milder performance drop.
  • Excluding the current-feature extractor degrades MDP performance severely.
  • Removing past-action inputs from the history significantly impairs POMDP handling; both actor and critic require both oo and aa in their respective histories.

7. Empirical Evaluation and Baseline Comparisons

LSTM-TD3 was evaluated on five PyBulletGym benchmarks: HalfCheetah, Ant, Walker2D, Hopper, and InvertedDoublePendulum. Scenarios included:

  • MDP: Full observations.
  • POMDP-RV: Velocity entries removed.
  • POMDP-FLK: Entire observations zeroed at random (p_{flk} = 0.2).
  • POMDP-RN: Additive Gaussian noise (\sigma_{rn} = 0.1).
  • POMDP-RSM: Individual entries zeroed randomly (p_{rsm} = 0.1).
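The four corruptions are straightforward observation wrappers; a sketch of how they might be implemented is below (function names and the velocity-index argument are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def pomdp_rv(o, vel_idx):
    """POMDP-RV: remove the velocity entries from the observation."""
    return np.delete(o, vel_idx)

def pomdp_flk(o, p=0.2):
    """POMDP-FLK: zero the entire observation with probability p."""
    return np.zeros_like(o) if rng.random() < p else o

def pomdp_rn(o, sigma=0.1):
    """POMDP-RN: add independent Gaussian noise to every entry."""
    return o + rng.normal(0.0, sigma, size=o.shape)

def pomdp_rsm(o, p=0.1):
    """POMDP-RSM: zero individual entries independently with probability p."""
    return o * (rng.random(o.shape) >= p)

o = np.ones(6)
print(pomdp_rv(o, [3, 4, 5]).shape)  # (3,)
```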

Baselines comprised DDPG, SAC, vanilla TD3, TD3-OW (the most recent l observations concatenated), and TD3-OW+PA (the most recent l actions also concatenated).

For HalfCheetah (after 1M steps, l = 5):

Scenario      TD3            LSTM-TD3(5)
MDP           11,200 ± 300   10,900 ± 250
POMDP-RV       9,800 ± 400   10,300 ± 320
POMDP-FLK      1,200 ± 500    9,500 ± 400
POMDP-RN       4,000 ± 800    9,800 ± 350
POMDP-RSM      3,200 ± 700    9,200 ± 410

On pure MDPs, LSTM-TD3 matches state-of-the-art (TD3/SAC); on POMDPs with missing, noisy, or corrupted observations, LSTM-TD3 outperforms all baselines, sometimes by more than a factor of two. On tasks where underlying latent variables (e.g., velocity) are removed from observations, the memory module supports estimation via the action-observation sequence, recovering most of the performance lost by conventional architectures except possibly in high-frequency environments where the history window is too short for reliable inference. TD3-OW (observation windowing) slightly improves over naive TD3 in some POMDPs but fails catastrophically in high-noise/flickering settings, and TD3-OW+PA is usually inferior to TD3-OW in both MDP and POMDP regimes.

A plausible implication is that the explicit LSTM-based memory extraction enables true temporal inference necessary for POMDPs, a capability unattainable with mere windowing or static memory concatenation. TD3's architectural components—double critic, policy smoothing, delayed updates—remain critical to stability and sample efficiency under partial observability. Proximal Policy Optimization (PPO) was not included; on MuJoCo-style tasks, PPO requires 2–5x more samples to reach similar returns, so under the 1M step constraint it was not competitive (Meng et al., 2021).

References

  1. Meng, L., Gorbet, R., & Kulić, D. (2021). Memory-based Deep Reinforcement Learning for POMDPs. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
