
LED-WM: Language-Aware Encoder for World Model

Updated 3 December 2025
  • The paper introduces a novel language-aware encoder (LED) that grounds natural language descriptions to grid observations, enhancing policy generalization.
  • It replaces the standard CNN encoder with a cross-modal attention module that aligns linguistic and entity dynamics, all trained end-to-end.
  • Empirical results show perfect name grounding and competitive OOD generalization on MESSENGER benchmarks, outperforming prior model-based methods.

The Language-aware Encoder for Dreamer World Model (LED-WM) is a model-based reinforcement learning system designed to enhance policy generalization by explicitly grounding language descriptions of environmental dynamics to visual entity observations. Built atop DreamerV3’s Recurrent State-Space Model (RSSM), LED-WM introduces a modular, cross-modal attention mechanism that aligns natural language “manuals” describing environment entities with grid-based symbolic input, thus enabling robust, zero-shot generalization on out-of-distribution (OOD) environment dynamics and linguistic descriptions. LED-WM eliminates the need for inference-time planning and expert demonstrations, with all components trained end-to-end and policy optimization executed entirely in latent space. The architecture and training regime enable the agent to interpret not just “what to do” but “how the environment behaves,” as shown by empirical results in challenging compositional generalization settings such as MESSENGER and MESSENGER-WM (Nguyen et al., 28 Nov 2025).

1. Model Structure and Design

LED-WM builds directly on DreamerV3’s RSSM, which maintains both a deterministic hidden state $h_t \in \mathbb{R}^{d_h}$ and a stochastic latent $z_t \in \mathbb{R}^{d_z}$ at each time step $t$. Architectural components include (1) a sequence model $h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1})$; (2) a posterior encoder $z_t \sim q_\phi(z_t \mid h_t, x_t)$; (3) a transition prior $\hat z_t \sim p_\phi(z_t \mid h_{t-1}, a_{t-1})$; and (4) decoders for reward $\hat r_t$, continuation $\hat c_t$, and (optionally) observation $\hat x_t$.
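Under the definitions above, one RSSM step can be sketched as follows. This is a minimal illustration, not the paper's implementation: random linear maps stand in for the learned networks $f_\phi$, $q_\phi$, and $p_\phi$, and all dimensions ($d_h$, $d_z$, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_z, d_a, d_x = 32, 16, 4, 64  # hypothetical sizes

# Random linear maps stand in for the learned networks.
W_seq   = rng.normal(size=(d_h + d_z + d_a, d_h)) * 0.1
W_post  = rng.normal(size=(d_h + d_x, 2 * d_z)) * 0.1
W_prior = rng.normal(size=(d_h, 2 * d_z)) * 0.1

def sequence_model(h, z, a):
    """h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1}): deterministic recurrent update."""
    return np.tanh(np.concatenate([h, z, a]) @ W_seq)

def as_gaussian(params):
    mean, log_std = np.split(params, 2)
    return mean, np.exp(log_std)

def posterior(h, x):
    """q_phi(z_t | h_t, x_t): conditions on the encoded observation x_t."""
    return as_gaussian(np.concatenate([h, x]) @ W_post)

def prior(h):
    """p_phi(z_t | h_t): predicts z_t without access to x_t."""
    return as_gaussian(h @ W_prior)

# One step of the recurrence.
h, z, a = np.zeros(d_h), np.zeros(d_z), np.zeros(d_a)
x = rng.normal(size=d_x)                    # feature from the observation encoder
h = sequence_model(h, z, a)
mu_q, std_q = posterior(h, x)
z = mu_q + std_q * rng.normal(size=d_z)     # reparameterized posterior sample
mu_p, std_p = prior(h)                      # transition prior, used by the KL losses
```

The posterior and prior share the hidden state $h_t$, so the KL terms in the training objective can compare them directly.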

LED-WM’s primary innovation is the replacement of the vanilla CNN encoder with a bespoke “Language-aware Encoder for Dreamer” (LED):

  • Input: Grid observation $o_t \in \{0,1\}^{10 \times 10 \times |\mathcal{S}|}$ (entities and agent), a set of $N$ sentences (the manual $L = \{L_1, \dots, L_N\}$), and the time index $t$.
  • Sentence Embedding: Each sentence $L_i$ is embedded via a frozen T5 encoder: $s_i = \mathrm{T5}_{\text{enc}}(L_i) \in \mathbb{R}^{d_s}$, $i \in 1..N$.
  • Entity Embedding: Each entity symbol receives a learned embedding $sb_i \in \mathbb{R}^{d_{sb}}$; the agent has its own embedding.
  • Movement History Feature: Defined as $D_i^t = \frac{p_i^t - p_a^t}{\|p_i^t - p_a^t\|} \cdot \frac{p_i^t - p_i^{t-1}}{\|p_i^t - p_i^{t-1}\|}$, tracking spatial and temporal entity dynamics.
  • Cross-modal Attention: For entity-sentence pairs, form the query $q_i = \mathrm{MLP}_q([sb_i \,\|\, D_i^t])$ and keys $k_j = \mathrm{MLP}_k(s_j)$, then compute attention weights $\alpha_{i,j} = \mathrm{softmax}_j\!\left( \frac{q_i \cdot k_j}{\sqrt{d}} \right)$ and the entity-specific “grounded” embedding $e_i = \sum_j \alpha_{i,j} v_j$.
  • Grid Construction: Replace each grid cell containing entity $i$ with $e_i$ and the agent cell with the agent embedding, leaving empty cells zero-padded; the output is $G_\ell \in \mathbb{R}^{10 \times 10 \times d_{\text{val}}}$.
  • Feature Extraction: $G_\ell$ is processed via a CNN, followed by flattening and concatenation with a time embedding; the output $x_t$ is used in the world model.
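The steps above can be sketched end-to-end. The snippet below is a toy illustration with assumed dimensions, made-up entity positions, and single linear layers in place of $\mathrm{MLP}_q$/$\mathrm{MLP}_k$; the product in $D_i^t$ is read here as a scalar dot product of the two unit vectors, which is one plausible reading of the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_s, d_sb, d = 3, 8, 6, 16        # hypothetical: 3 manual sentences, toy dims

s  = rng.normal(size=(N, d_s))        # frozen T5 sentence embeddings s_1..s_N
sb = rng.normal(size=(N, d_sb))       # learned entity-symbol embeddings sb_i

# Single linear layers stand in for MLP_q, MLP_k, and the value projection.
W_q = rng.normal(size=(d_sb + 1, d)) * 0.1
W_k = rng.normal(size=(d_s, d)) * 0.1
W_v = rng.normal(size=(d_s, d)) * 0.1

def movement_feature(p_i_t, p_i_prev, p_a_t, eps=1e-8):
    """D_i^t: alignment of the entity->agent direction with the entity's last step."""
    u = (p_i_t - p_a_t) / (np.linalg.norm(p_i_t - p_a_t) + eps)
    v = (p_i_t - p_i_prev) / (np.linalg.norm(p_i_t - p_i_prev) + eps)
    return np.array([u @ v])

def ground_entity(sb_i, D_i):
    """Attend over all manual sentences to build the grounded embedding e_i."""
    q = np.concatenate([sb_i, D_i]) @ W_q          # query from symbol + motion
    k, v = s @ W_k, s @ W_v                        # keys/values from sentences
    logits = k @ q / np.sqrt(d)
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                           # softmax over the N sentences
    return alpha @ v

# Grid construction: write e_i into each entity's cell (positions are made up).
pos   = {0: (2, 3), 1: (7, 7), 2: (5, 1)}
prev  = {0: (2, 2), 1: (6, 7), 2: (5, 2)}
agent = np.array([4.0, 4.0])
grid = np.zeros((10, 10, d))
for i, (r, c) in pos.items():
    D_i = movement_feature(np.array(pos[i], float), np.array(prev[i], float), agent)
    grid[r, c] = ground_entity(sb[i], D_i)
# grid would then be passed through a CNN and concatenated with a time embedding.
```

Because each cell carries a sentence-grounded embedding rather than a raw symbol, the downstream CNN sees the same representation whenever the same described role appears, regardless of the entity's surface name.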

Notably, the observation-reconstruction decoder $p_\phi(x_t \mid h_t, z_t)$ is removed, as empirical evidence suggests its presence impairs generalization. Multi-step rollout predictions for reward and continuation provide additional learning signals.

2. Training Regime and Objectives

World-model training follows an ELBO-style loss over trajectories $\tau = (o_{1:T}, a_{1:T}, r_{1:T}, c_{1:T})$:

$$\mathcal{L}_{\mathrm{WM}}(\phi) = \sum_{t=1}^{T} \left[ \beta_\mathrm{rep}\, \mathcal{L}_\mathrm{rep}(t) + \beta_\mathrm{dyn}\, \mathcal{L}_\mathrm{dyn}(t) + \beta_\mathrm{pred}\, \mathcal{L}_\mathrm{pred}(t) \right]$$

Where:

  • Representation Loss: $\mathcal{L}_\mathrm{rep} = \mathrm{KL}\!\left[ q_\phi(z \mid h, x) \,\|\, p_\phi(z \mid h) \right]$
  • Dynamics Loss: $\mathcal{L}_\mathrm{dyn} = -\mathbb{E}_{q_\phi(z \mid h, x)}\!\left[ \ln p_\phi(z \mid h) \right]$
  • Multi-step Prediction: Sum over 1-step and $H$-step rollouts for reward/continuation; discount factor $\lambda = 0.9$ for stochastic environments.

Hyperparameters are $\beta_\mathrm{dyn} = 1$, $\beta_\mathrm{rep} = 0.1$, and $\beta_\mathrm{pred} = 1$.

T5 encoder weights remain frozen; all cross-modal attention components are optimized via end-to-end backpropagation through $\mathcal{L}_{\mathrm{WM}}$.
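Assuming diagonal-Gaussian latents, the per-step loss can be illustrated numerically as below. This is a toy sketch, not the training code: the prediction term is a placeholder scalar, the latent statistics are random, and only the $\beta$ weights are taken from the text.

```python
import numpy as np

def kl_diag_gauss(mu_q, std_q, mu_p, std_p):
    """KL[ N(mu_q, std_q^2) || N(mu_p, std_p^2) ] for diagonal Gaussians."""
    return np.sum(np.log(std_p / std_q)
                  + (std_q**2 + (mu_q - mu_p)**2) / (2 * std_p**2) - 0.5)

def neg_log_prob(z, mu_p, std_p):
    """-ln p_phi(z | h) under a diagonal Gaussian prior."""
    return np.sum(0.5 * ((z - mu_p) / std_p)**2
                  + np.log(std_p) + 0.5 * np.log(2 * np.pi))

beta_rep, beta_dyn, beta_pred = 0.1, 1.0, 1.0   # weights reported in the text

rng = np.random.default_rng(0)
mu_q, std_q = rng.normal(size=16), np.full(16, 0.5)   # posterior q(z|h,x)
mu_p, std_p = rng.normal(size=16), np.full(16, 0.6)   # prior p(z|h)
z = mu_q + std_q * rng.normal(size=16)
pred_loss = 0.3   # placeholder for the reward/continuation prediction losses

loss_t = (beta_rep * kl_diag_gauss(mu_q, std_q, mu_p, std_p)
          + beta_dyn * neg_log_prob(z, mu_p, std_p)
          + beta_pred * pred_loss)
```

The low $\beta_\mathrm{rep}$ relative to $\beta_\mathrm{dyn}$ pushes most of the alignment work onto the prior, a balance DreamerV3-style models use to keep the posterior informative.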

Policy and value (actor $\pi_\theta$, critic $V_\psi$) are trained Dreamer-style by “imagining” latent-space rollouts, using sampled states and model transitions. The policy loss is:

$$\mathcal{L}_\pi(\theta) = -\mathbb{E}[G_t]$$

where $G_t$ is the imagined cumulative return over $K$ steps. The critic loss minimizes $\mathbb{E}[(V_\psi(h_t, z_t) - G_t)^2]$. Learning rates are $3 \times 10^{-4}$ (world model), $2 \times 10^{-4}$ (actor), and $1 \times 10^{-4}$ (critic).
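The imagined-return objective can be illustrated numerically. The horizon, discount, and reward/value arrays below are hypothetical stand-ins for quantities produced by latent rollouts, and the simple discounted bootstrap omits the λ-return weighting Dreamer-style agents typically use.

```python
import numpy as np

rng = np.random.default_rng(0)
K, gamma = 15, 0.99              # hypothetical imagination horizon and discount

# Stand-ins for rewards and values predicted along an imagined latent rollout.
rewards = rng.uniform(size=K)
values  = rng.uniform(size=K + 1)

def imagined_return(rewards, bootstrap, gamma):
    """Discounted return G_t over the rollout, bootstrapped with the last value."""
    G = bootstrap
    for r in reversed(rewards):
        G = r + gamma * G
    return G

G = imagined_return(rewards, values[-1], gamma)
actor_loss  = -G                         # L_pi(theta) = -E[G_t]
critic_loss = (values[0] - G) ** 2       # (V_psi(h_t, z_t) - G_t)^2
```

Because both losses are computed entirely from model-predicted quantities, no environment interaction is needed during policy optimization.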

3. Policy Execution and Amortization

LED-WM learns a latent generative model of transitions $p_\phi(z_{t+1}, r_t, c_t \mid h_t, z_t, a_t)$, enabling policy and value-function optimization without inference-time planning. Once trained, the agent outputs actions as a function of $(h_t, z_t)$, computed in constant time per step by the LED encoder and RSSM, with no iterative planning and no dependence on expert demonstrations.

This approach yields a fully amortized policy capable of instantaneous execution in real-world settings.

4. Experimental Evaluation and Performance

LED-WM is evaluated in two primary environments:

  • MESSENGER: A $10 \times 10$ entity grid with three entities (enemy, messenger, goal) plus the agent; each entity is described by manual sentences specifying its name, movement type, and role. Four test stages probe name grounding (S1), new dynamics combinations (S2/S2-dev), and complex linguistic OOD generalization (S3, with distractors, synonyms, and overlapping names).
  • MESSENGER-WM: Extends MESSENGER S2 with combinatorial splits (NewCombo, NewAttr, NewAll) along entity, movement, and role assignment axes.

Key policy generalization metrics are win-rate (MESSENGER, over 1,000 episodes) and average score (MESSENGER-WM, over 1,000 games, 60 trials each).

MESSENGER Results (win-rate %):

| Method               | S1        | S2        | S2-dev   | S3         |
|----------------------|-----------|-----------|----------|------------|
| Dynalang             | 0.03±0.02 | 0.04±0.05 | —        | 0.03±0.05  |
| CRL                  | 88±2.5    | 76±5      | —        | 32±1.9     |
| EMMA (no curriculum) | 85±1.4    | 45±12     | —        | 10±0.8     |
| EMMA (curriculum)    | 88±2.3    | 95±0.4    | —        | 22±3.8     |
| LED-WM               | 100±0     | 51.6±2.7  | 96.6±1.0 | 34.97±1.73 |

LED-WM achieves a perfect win-rate on the name-grounding stage (S1), is competitive with the CRL baseline on the most complex OOD split (S3), and outperforms all model-based alternatives.

MESSENGER-WM Results (average score):

| Method                 | NewCombo  | NewAttr   | NewAll    |
|------------------------|-----------|-----------|-----------|
| EMMA-LWM (Online IL)   | 1.01±0.12 | 0.96±0.17 | 0.62±0.21 |
| EMMA-LWM (Filtered BC) | 1.18±0.10 | 0.75±0.20 | 0.44±0.18 |
| LED-WM                 | 1.31±0.05 | 1.15±0.08 | 1.16±0.02 |

World-model Finetuning: For novel test configurations, the policy can be further optimized on synthetic trajectories generated by LED-WM, with modest gains observed, most notably on the S2-dev split: 1.4478±0.01 before finetuning vs. 1.4513±0.01 after.

5. Mechanisms for Language Grounding and Generalization

The explicit cross-modal attention mechanism of LED-WM aligns each manual sentence with the appropriate entity symbol, informed by both linguistic descriptors and observed movement patterns. This grounding is essential for resolving semantic ambiguities and enables the model to condition transition dynamics on language-inferred roles. In contrast, baselines such as Dynalang (which concatenates language embeddings with a CNN encoder) fail in OOD scenarios, registering near-zero scores, and ablating the attention module in LED-WM collapses performance similarly.

Distinct latent transitions $p(z_{t+1}, r \mid h_t, z_t, a_t)$ are learned conditional on grounded entity roles, a key factor for robust compositional generalization in MESSENGER-type environments.

6. Ablation Studies, Limitations, and Future Directions

Ablation experiments demonstrate that removing the attention module eliminates OOD generalization in all test stages, while reinstating DreamerV3’s observation decoder impairs both generalization and leave-one-out accuracy.

LED-WM’s current limitations include:

  • Suboptimal performance on S2 relative to CRL, attributed to “single-movement-combo bias.” CRL’s explicit debiasing of name-to-role correlations is identified as a promising direction for LED-WM enhancement.
  • Frozen language encoder; no end-to-end language fine-tuning. Incorporation of a fine-tuned LLM may improve linguistic robustness, especially under paraphrase.
  • Grid-symbol abstraction: LED-WM does not operate on raw pixel images, thus translation to full vision settings remains a non-trivial extension.
  • Synthetic finetuning yields only incremental performance gains; improved transfer from imagined to real trajectories is an open research area.

A plausible implication is that improved grounding and cross-modal mapping architectures may further enhance generalization in dynamic, linguistically-rich RL tasks.

7. Significance and Context

LED-WM integrates a lightweight, attention-based language grounding module with the DreamerV3 latent world model, providing a scalable method for policy generalization in compositional RL settings where agents must interpret environmental dynamics as described by language manuals. The system demonstrates state-of-the-art performance among model-based approaches and offers a foundation for future work in scaling to vision inputs and more expressive LLMs (Nguyen et al., 28 Nov 2025).
