LED-WM: Language-Aware Encoder for World Model
- The paper introduces a novel language-aware encoder (LED) that grounds natural language descriptions to grid observations, enhancing policy generalization.
- It replaces the standard CNN encoder with a cross-modal attention module that aligns linguistic and entity dynamics, all trained end-to-end.
- Empirical results show perfect name grounding and competitive OOD generalization on MESSENGER benchmarks, outperforming prior model-based methods.
The Language-aware Encoder for Dreamer World Model (LED-WM) is a model-based reinforcement learning system designed to enhance policy generalization by explicitly grounding language descriptions of environmental dynamics to visual entity observations. Built atop DreamerV3’s Recurrent State-Space Model (RSSM), LED-WM introduces a modular, cross-modal attention mechanism that aligns natural language “manuals” describing environment entities with grid-based symbolic input, thus enabling robust, zero-shot generalization on out-of-distribution (OOD) environment dynamics and linguistic descriptions. LED-WM eliminates the need for inference-time planning and expert demonstrations, with all components trained end-to-end and policy optimization executed entirely in latent space. The architecture and training regime enable the agent to interpret not just “what to do” but “how the environment behaves,” as shown by empirical results in challenging compositional generalization settings such as MESSENGER and MESSENGER-WM (Nguyen et al., 28 Nov 2025).
1. Model Structure and Design
LED-WM builds directly on DreamerV3's RSSM, which maintains both a deterministic hidden state $h_t$ and a stochastic latent $z_t$ at each time step $t$. Architectural components include (1) a sequence model $h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1})$; (2) a posterior encoder $q_\phi(z_t \mid h_t, x_t)$; (3) a transition prior $p_\phi(z_t \mid h_t)$; and (4) decoders for reward $\hat{r}_t$, continuation $\hat{c}_t$, and (optionally) observation $\hat{x}_t$.
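In code, a single RSSM step can be sketched as follows — a minimal numpy illustration with hypothetical weight matrices and shapes, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rssm_step(h, z, a, x_feat, W):
    """One RSSM step (minimal sketch; W holds hypothetical projection weights).

    h: previous deterministic state, z: previous stochastic latent,
    a: previous action (one-hot), x_feat: encoded observation features.
    Returns the new deterministic state plus prior/posterior logits over z_t.
    """
    # Sequence model: h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
    h_next = np.tanh(W["seq"] @ np.concatenate([h, z, a]))
    # Transition prior p(z_t | h_t): predicted from h_t alone
    prior_logits = W["prior"] @ h_next
    # Posterior q(z_t | h_t, x_t): additionally conditions on the observation
    post_logits = W["post"] @ np.concatenate([h_next, x_feat])
    return h_next, prior_logits, post_logits
```

The posterior sees the encoded observation while the prior must predict the same latent from $h_t$ alone; training pulls the two together, which is what lets imagination proceed without observations.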
LED-WM’s primary innovation is the replacement of the vanilla CNN encoder with a bespoke “Language-aware Encoder for Dreamer” (LED):
- Input: a grid observation (entities and agent), a set of manual sentences $\{s_1, \dots, s_N\}$, and the time index $t$.
- Sentence Embedding: each sentence is embedded via a frozen T5 encoder, $m_i = \mathrm{T5}(s_i)$, $i = 1, \dots, N$.
- Entity Embedding: each entity symbol $e$ receives a learned embedding; the agent has its own embedding.
- Movement History Feature: a feature summarizing each entity's recent grid positions, tracking spatial and temporal entity dynamics.
- Cross-modal Attention: for each entity–sentence pair, form a query $q_e$ (from the entity embedding and its movement history) and keys $k_i$ (from the sentence embeddings), compute attention weights $\alpha_{e,i} = \operatorname{softmax}_i(q_e^\top k_i / \sqrt{d})$, and aggregate an entity-specific "grounded" embedding $g_e = \sum_i \alpha_{e,i} m_i$.
- Grid Construction: replace each grid cell containing entity $e$ with $g_e$ and the agent cell with the agent embedding; empty cells are zeros or padding. The output is an embedding grid.
- Feature Extraction: the embedding grid is processed by a CNN, then flattened and concatenated with a time embedding; the resulting feature vector is consumed by the world model.
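The cross-modal grounding step can be sketched as scaled dot-product attention from entity queries onto sentence embeddings — a minimal numpy illustration; the actual LED encoder also folds in movement-history features and learned projections:

```python
import numpy as np

def ground_entities(entity_queries, sent_emb):
    """Attend each entity query over manual-sentence embeddings.

    entity_queries: (E, d) queries, one per entity symbol
    sent_emb:       (S, d) frozen sentence embeddings (used as keys and values)
    Returns:        (E, d) language-grounded entity embeddings
    """
    d = entity_queries.shape[-1]
    scores = entity_queries @ sent_emb.T / np.sqrt(d)   # (E, S) similarity
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over sentences
    return attn @ sent_emb                              # weighted mix of sentences
```

Each output row is a convex combination of sentence embeddings, so the entity representation carries whatever role and movement information the matched sentence describes.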
Notably, the observation-reconstruction decoder is removed, as empirical evidence suggests its presence impairs generalization. Multi-step rollout predictions for reward and continuation provide additional learning signals.
2. Training Regime and Objectives
World-model training follows an ELBO-style loss over trajectories $(x_{1:T}, a_{1:T}, r_{1:T}, c_{1:T})$, in DreamerV3's standard form:

$$\mathcal{L}_{\text{wm}} = \mathbb{E}_{q_\phi}\Big[\textstyle\sum_{t=1}^{T} \beta_{\text{pred}}\,\mathcal{L}_{\text{pred}}(t) + \beta_{\text{dyn}}\,\mathcal{L}_{\text{dyn}}(t) + \beta_{\text{rep}}\,\mathcal{L}_{\text{rep}}(t)\Big]$$

Where:
- Representation Loss: $\mathcal{L}_{\text{rep}}(t) = \mathrm{KL}\big[\,q_\phi(z_t \mid h_t, x_t)\;\|\;\operatorname{sg}(p_\phi(z_t \mid h_t))\,\big]$
- Dynamics Loss: $\mathcal{L}_{\text{dyn}}(t) = \mathrm{KL}\big[\,\operatorname{sg}(q_\phi(z_t \mid h_t, x_t))\;\|\;p_\phi(z_t \mid h_t)\,\big]$
- Multi-step Prediction: prediction losses for reward and continuation are summed over 1-step and $k$-step rollouts, with a discount factor for stochastic environments.
The relative weights of the prediction, dynamics, and representation loss terms are fixed scalar hyperparameters.
T5 encoder weights remain frozen; all cross-modal attention components are optimized via end-to-end backpropagation through .
Policy and value (actor $\pi_\theta$, critic $v_\psi$) are trained Dreamer-style by "imagining" latent-space rollouts, using sampled states and model transitions. The actor loss takes DreamerV3's standard form:

$$\mathcal{L}_{\pi} = -\,\mathbb{E}\big[R^{\lambda}_t\big] \;-\; \eta\,\mathbb{H}\big[\pi_\theta(a_t \mid h_t, z_t)\big]$$

Where $R^{\lambda}_t$ is the imagined cumulative ($\lambda$-)return over the imagination horizon and $\eta$ weights the entropy bonus. The critic minimizes $\big(v_\psi(h_t, z_t) - \operatorname{sg}(R^{\lambda}_t)\big)^2$. Separate learning rates are used for the world model, actor, and critic.
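The imagined cumulative return used by both actor and critic can be computed with the standard bootstrapped λ-return recursion (Dreamer-style; a plain-Python sketch over a single imagined rollout):

```python
def lambda_returns(rewards, values, continues, lam=0.95):
    """Bootstrapped lambda-returns over an imagined rollout (Dreamer-style).

    rewards, continues: length-H lists; values: length H+1 (final bootstrap).
    continues[t] folds the discount into the continue probability.
    """
    H = len(rewards)
    returns = [0.0] * H
    nxt = values[H]                       # bootstrap from the last value
    for t in reversed(range(H)):
        blended = (1 - lam) * values[t + 1] + lam * nxt
        returns[t] = rewards[t] + continues[t] * blended
        nxt = returns[t]
    return returns
```

Setting `lam=1.0` recovers a pure Monte-Carlo return bootstrapped at the horizon, while `lam=0.0` reduces to one-step TD targets; intermediate values trade bias against variance.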
3. Policy Execution and Amortization
LED-WM learns a latent generative model of transitions, enabling policy and value-function optimization without inference-time planning. Once trained, the agent outputs actions as a function of the latent state $(h_t, z_t)$, computed in constant time from the LED encoder and RSSM, with no iterative planning or dependence on expert demonstrations.
This approach yields a fully amortized policy capable of fast, constant-time execution at deployment.
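Deployment-time action selection then reduces to a single forward pass; the sketch below uses hypothetical callables for the encoder, RSSM, and actor to make the amortization explicit:

```python
def act(encoder, rssm, policy, obs, manual, state):
    """Amortized action selection: one forward pass, no search or planning.

    encoder/rssm/policy are stand-ins for the trained LED encoder, recurrent
    state-space model, and actor network (hypothetical interfaces).
    """
    feat = encoder(obs, manual)       # language-grounded observation features
    h, z = rssm(state, feat)          # posterior update of the latent state
    action = policy(h, z)             # single feed-forward policy evaluation
    return action, (h, z)
```

Contrast this with planning-based agents, which would run an inner optimization or search loop over imagined futures at every environment step.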
4. Experimental Evaluation and Performance
LED-WM is evaluated in two primary environments:
- MESSENGER: A grid world with three entities (enemy, messenger, goal) plus the agent; each entity is described by manual sentences specifying its name, movement type, and role. Four test stages probe name grounding (S1), new dynamics combinations (S2/S2-dev), and complex linguistic OOD generalization (S3, with distractors, synonyms, and overlapping names).
- MESSENGER-WM: Extends MESSENGER S2 with combinatorial splits (NewCombo, NewAttr, NewAll) along entity, movement, and role assignment axes.
Key policy generalization metrics are win-rate (MESSENGER, over 1,000 episodes) and average score (MESSENGER-WM, over 1,000 games, 60 trials each).
MESSENGER Results (win-rate %):
| Method | S1 | S2 | S2-dev | S3 |
|---|---|---|---|---|
| Dynalang | 0.03±0.02 | 0.04±0.05 | – | 0.03±0.05 |
| CRL | 88±2.5 | 76±5 | – | 32±1.9 |
| EMMA (no curriculum) | 85±1.4 | 45±12 | – | 10±0.8 |
| EMMA (curriculum) | 88±2.3 | 95±0.4 | – | 22±3.8 |
| LED-WM | 100±0 | 51.6±2.7 | 96.6±1.0 | 34.97±1.73 |
LED-WM achieves a perfect win rate on the name-grounding stage (S1), edges out the CRL baseline on the most complex OOD split (S3: 34.97 vs. 32), and outperforms all model-based alternatives.
MESSENGER-WM Results (average score):
| Method | NewCombo | NewAttr | NewAll |
|---|---|---|---|
| EMMA-LWM (Online IL) | 1.01±0.12 | 0.96±0.17 | 0.62±0.21 |
| EMMA-LWM (Filtered BC) | 1.18±0.10 | 0.75±0.20 | 0.44±0.18 |
| LED-WM | 1.31±0.05 | 1.15±0.08 | 1.16±0.02 |
World-model Finetuning: for novel test configurations, the policy can be further optimized by generating synthetic trajectories with LED-WM, with modest performance gains observed, most notably on the S2-dev split (1.4478±0.01 before finetuning vs. 1.4513±0.01 after).
5. Mechanisms for Language Grounding and Generalization
The explicit cross-modal attention mechanism of LED-WM aligns each manual sentence to the appropriate entity symbol, informed by both linguistic descriptors and observed movement patterns. This grounding is essential for resolving semantic ambiguities and enables the model to form transition dynamics conditioned on language-inferred roles. In contrast, baselines such as Dynalang (concatenated language embedding with CNN encoder) fail to perform in OOD scenarios, registering near-zero scores, while ablating attention modules in LED-WM collapses performance similarly.
Distinct latent transitions are learned conditional on grounded entity roles, which is a key factor for robust compositional generalization in MESSENGER-type environments.
6. Ablation Studies, Limitations, and Future Directions
Ablation experiments demonstrate that removing the attention module negates OOD generalization capacity in all test stages, while reinstating DreamerV3’s observation decoder impairs generalization and leave-one-out accuracy.
LED-WM’s current limitations include:
- Suboptimal performance on S2 relative to CRL, attributed to “single-movement-combo bias.” CRL’s explicit debiasing of name-to-role correlations is identified as a promising direction for LED-WM enhancement.
- Frozen language encoder; no end-to-end language fine-tuning. Incorporation of a fine-tuned LLM may improve linguistic robustness, especially under paraphrase.
- Grid-symbol abstraction: LED-WM does not operate on raw pixel images, thus translation to full vision settings remains a non-trivial extension.
- Synthetic finetuning yields only incremental performance gains; improved transfer from imagined to real trajectories is an open research area.
A plausible implication is that improved grounding and cross-modal mapping architectures may further enhance generalization in dynamic, linguistically-rich RL tasks.
7. Significance and Context
LED-WM integrates a lightweight, attention-based language grounding module with the DreamerV3 latent world model, providing a scalable method for policy generalization in compositional RL settings where agents must interpret environmental dynamics as described by language manuals. The system demonstrates state-of-the-art performance among model-based approaches and offers a foundation for future work in scaling to vision inputs and more expressive LLMs (Nguyen et al., 28 Nov 2025).