DreamerV3: Model-Based RL Algorithm

Updated 10 December 2025

DreamerV3 is a model-based reinforcement learning algorithm that uses compact latent world models to enable efficient policy optimization through imagined rollouts.
It employs an RSSM world model trained concurrently with actor and critic networks, and is designed to be robust across continuous control, visual tasks, and sparse-reward environments.
Empirical results in applications such as traffic signal control and pixel-based tasks highlight its sample efficiency and generality with minimal, domain-invariant hyperparameter tuning.

DreamerV3 is a general-purpose model-based reinforcement learning (RL) algorithm that learns compact latent world models to enable efficient policy optimization via imagined rollouts in latent space. The architecture is designed to be robust across a wide range of high-dimensional RL tasks, including continuous control, visual domains, and sparse-reward challenges such as Minecraft diamond collection, all without domain-specific hyperparameter tuning (Hafner et al., 2023).

1. Architectural Overview: World Model and Latent Imagination

DreamerV3 comprises three primary neural modules trained in parallel: a world model (typically a Recurrent State Space Model, RSSM), an actor for policy learning, and a critic for value estimation (Hafner et al., 2023). At each time step $t$, the agent encodes observation $x_t$ into a stochastic latent state $z_t$, while a deterministic recurrent hidden state $h_t$ is updated from the previous hidden state $h_{t-1}$, latent $z_{t-1}$, and action $a_{t-1}$. The model state is denoted $s_t = (h_t, z_t)$.

World model:
- RNN update: $h_t = f_{\mathrm{RNN}}(h_{t-1}, z_{t-1}, a_{t-1})$
- Encoder: $q_{\phi}(z_t \mid h_t, x_t)$
- Transition/prior: $p_{\theta}(z_t \mid h_t)$ (the previous action enters through $h_t$)
Observation, reward, and continuation heads:
- $\hat{x}_t \sim p_{\theta}(x_t \mid h_t, z_t)$
- $\hat{r}_t \sim p_{\theta}(r_t \mid h_t, z_t)$
- $\hat{c}_t \sim p_{\theta}(c_t \mid h_t, z_t)$
Actor–critic:
- Actor $\pi(a_t \mid s_t)$ and critic $v(s_t)$ operate in latent space and are trained on trajectories generated by imagination rollouts under the learned world model.

Imagination-based training enables actor–critic learning to proceed entirely on simulated experience, reducing the sample complexity compared to model-free RL.
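
As a rough illustration of imagination-based training, the sketch below rolls a toy world model forward without querying an environment. Everything in it is a hypothetical stand-in: the scalar hidden state, the functions `f_rnn`, `prior_probs`, `policy`, and `reward_head`, and the class count `K` are illustrative assumptions rather than DreamerV3's networks, and the step ordering (recurrent update first, then a sample from the prior) is assumed to follow Hafner et al. (2023).

```python
import math
import random
from dataclasses import dataclass

K = 4  # number of latent classes (toy value, an assumption)


@dataclass
class State:
    h: float  # deterministic recurrent state (a scalar here for brevity)
    z: int    # stochastic latent: one categorical variable with K classes


def f_rnn(h, z, a):
    """Hypothetical stand-in for the recurrent update f_RNN(h, z, a)."""
    return math.tanh(0.5 * h + 0.1 * z + 0.2 * a)


def prior_probs(h):
    """Hypothetical stand-in transition prior: softmax over K classes given h."""
    logits = [h * k for k in range(K)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state, rng):
    """Hypothetical stand-in actor: picks one of two discrete actions at random."""
    return rng.choice([0, 1])


def reward_head(state):
    """Hypothetical stand-in reward predictor on the model state."""
    return state.h + 0.1 * state.z


def imagine(start, horizon, rng):
    """Roll the model forward in latent space; no observations are used."""
    state, trajectory = start, []
    for _ in range(horizon):
        a = policy(state, rng)
        h_next = f_rnn(state.h, state.z, a)
        z_next = rng.choices(range(K), weights=prior_probs(h_next))[0]
        state = State(h_next, z_next)
        trajectory.append((a, state, reward_head(state)))
    return trajectory


rng = random.Random(0)
traj = imagine(State(h=0.0, z=1), horizon=15, rng=rng)
print(len(traj))
```

In a real agent the imagined states and predicted rewards would feed the actor and critic losses; here they are only collected and returned.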

2. Mathematical Formulation and Optimization Objectives

The DreamerV3 world model is trained via variational inference, minimizing a weighted sum of losses over batches of experience:

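Observation reconstruction loss:

$\mathcal{L}_{\mathrm{rec}} = -\,\mathbb{E}_{q_{\phi}}\big[\ln p_{\theta}(x_t \mid s_t)\big]$

KL divergence (latent regularization) between the encoder posterior $q_{\phi}$ and the transition prior $p_{\theta}$ over $z_t$, each conditioned as in the world-model definitions of Section 1:

$\mathcal{L}_{\mathrm{KL}} = \mathbb{E}\big[\mathrm{KL}\big(q_{\phi}(z_t \mid \cdot)\,\|\,p_{\theta}(z_t \mid \cdot)\big)\big]$

Reward and continuation prediction loss (reward $r_t$, continuation flag $c_t$):

$\mathcal{L}_{\mathrm{pred}} = -\,\mathbb{E}_{q_{\phi}}\big[\ln p_{\theta}(r_t \mid s_t) + \ln p_{\theta}(c_t \mid s_t)\big]$

Total world-model loss, with scalar weights $\beta_{\mathrm{rec}}$, $\beta_{\mathrm{KL}}$, $\beta_{\mathrm{pred}}$:

$\mathcal{L}_{\mathrm{WM}} = \beta_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} + \beta_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \beta_{\mathrm{pred}}\,\mathcal{L}_{\mathrm{pred}}$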

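Imagination rollouts start from real model states $s_t = (h_t, z_t)$ inferred from replayed experience, and the model then iteratively generates future actions and latents: sample an action $a_t \sim \pi(a_t \mid s_t)$, advance the recurrent state to $h_{t+1}$ with $f_{\mathrm{RNN}}$, and sample the next latent $z_{t+1}$ from the transition prior $p_{\theta}$. The actor and critic are optimized via value targets computed as $\lambda$-returns over imagined trajectories.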
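
The $\lambda$-return targets can be computed with a backward recursion, $R^{\lambda}_t = r_t + \gamma c_t\big((1-\lambda)\,v_{t+1} + \lambda R^{\lambda}_{t+1}\big)$ with $R^{\lambda}_H = v_H$ at the end of the imagined horizon $H$. The following is a minimal sketch of that recursion on plain Python lists; the function name, the list layout, and the default values of `gamma` and `lam` are assumptions made for illustration, not the reference implementation.

```python
def lambda_returns(rewards, continues, values, gamma=0.997, lam=0.95):
    """Backward recursion for lambda-returns over one imagined trajectory.

    rewards[t] and continues[t] cover steps t = 0..H-1; values[t] covers
    t = 0..H, so values[-1] is the bootstrap value at the horizon.
    The gamma and lam defaults are assumed values, not taken from the text.
    """
    horizon = len(rewards)
    if len(continues) != horizon or len(values) != horizon + 1:
        raise ValueError("expected H rewards, H continues, and H + 1 values")
    returns = [0.0] * horizon
    next_return = values[-1]  # R_H = v_H
    for t in reversed(range(horizon)):
        # Blend the one-step bootstrap with the longer return.
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * continues[t] * bootstrap
        returns[t] = next_return
    return returns
```

With `lam=0.0` the targets reduce to one-step bootstrapped returns, and with `lam=1.0` they become discounted sums of imagined rewards plus the final bootstrap value; a continuation flag of zero cuts the return at that step.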

3. Robustness Mechanisms and Empirical Stability

DreamerV3 employs a suite of domain-agnostic architectural and optimization strategies to ensure stability and generality across domains (Hafner et al., 2023):

Symlog Transformation:

Scalar values are mapped as $\mathrm{symlog}(x) = \mathrm{sign}(x)\,\ln(|x| + 1)$, compressing large target magnitudes and balancing gradients.
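
A minimal sketch of the transform and its inverse, $\mathrm{symexp}(y) = \mathrm{sign}(y)\,(\exp(|y|) - 1)$, is given below. The function names are chosen for illustration, and the scalar pure-Python form is an assumption; a practical implementation would operate on tensors.

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): roughly linear near zero, logarithmic for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


for value in (-1000.0, -1.0, 0.0, 0.5, 1000.0):
    print(value, symlog(value), symexp(symlog(value)))
```

Because the function is symmetric around zero and invertible, a network can predict in the compressed space and the prediction can be mapped back with `symexp`.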

Discrete two-hot value regression:

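Value and reward targets are discretized onto a fixed grid of bins and encoded as two-hot vectors (the weight is split between the two nearest bins), which keeps training stable under varying target scales.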
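
The sketch below shows one way to build such a two-hot vector and to decode it back to a scalar. The helper names `two_hot` and `decode` and the five-bin grid are illustrative assumptions; the grid actually used by DreamerV3, and the space in which it is defined, are not specified here.

```python
import bisect


def two_hot(value, bins):
    """Encode a scalar as weights on its two nearest bins (bins sorted ascending)."""
    value = min(max(value, bins[0]), bins[-1])  # clamp into the supported range
    upper = bisect.bisect_left(bins, value)     # first bin that is >= value
    encoding = [0.0] * len(bins)
    if bins[upper] == value:                    # exactly on a bin: one-hot
        encoding[upper] = 1.0
        return encoding
    lower = upper - 1
    weight_upper = (value - bins[lower]) / (bins[upper] - bins[lower])
    encoding[lower] = 1.0 - weight_upper
    encoding[upper] = weight_upper
    return encoding


def decode(encoding, bins):
    """Expected bin value under the encoding; recovers the clamped input."""
    return sum(w * b for w, b in zip(encoding, bins))


bins = [-2.0, -1.0, 0.0, 1.0, 2.0]  # toy grid, an assumption
print(two_hot(0.25, bins))
```

The encoded vector can then serve as a soft target for a categorical output head, so the loss scale no longer depends on the magnitude of the regression target.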

KL balancing and free bits:

KL-divergence regularization adapts without manual scheduling, allowing the model to retain informative representations across both simple and complex environments.
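
To make the free-bits idea concrete, the following sketch computes a clipped, two-part KL penalty for a single categorical latent. The split into a dynamics term and a representation term, the weights `beta_dyn` and `beta_rep`, and the one-nat threshold are assumptions based on commonly reported DreamerV3 settings; plain Python has no automatic differentiation, so the stop-gradient that distinguishes the two terms is only indicated in comments.

```python
import math


def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions given as probability lists.

    Assumes every entry of p is positive wherever q is positive.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def balanced_kl_loss(q, p, beta_dyn=0.5, beta_rep=0.1, free_nats=1.0):
    """Value of a free-bits KL penalty; weights and threshold are assumed."""
    kl = kl_categorical(q, p)
    # Dynamics term: in training, gradients would reach only the prior p.
    dyn = max(free_nats, kl)
    # Representation term: in training, gradients would reach only the posterior q.
    rep = max(free_nats, kl)
    return beta_dyn * dyn + beta_rep * rep


posterior = [0.7, 0.1, 0.1, 0.1]
prior = [0.25, 0.25, 0.25, 0.25]
print(balanced_kl_loss(posterior, prior))
```

Once the KL falls below the threshold, both terms are constant and contribute no gradient, which is what lets the latent keep carrying information instead of being pushed toward the prior.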

Unimix categorical distributions:

All categorical outputs mix $1\%$ uniform with $99\%$ network output, mitigating deterministic collapse in discrete distributions.
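
A minimal sketch of this mixing step follows. The function name `unimix` and the list-based representation are illustrative assumptions; the mixing weight defaults to 1% uniform.

```python
def unimix(probs, mix=0.01):
    """Blend a categorical distribution with the uniform distribution.

    Each class keeps (1 - mix) of its probability and receives mix / K,
    so no class probability can fall below mix / K.
    """
    k = len(probs)
    return [(1.0 - mix) * p + mix / k for p in probs]


collapsed = [1.0, 0.0, 0.0, 0.0]  # fully deterministic network output
print(unimix(collapsed))
```

Keeping every class probability strictly positive bounds the log-probabilities and KL terms, which is the practical reason the mixture helps against collapse.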

These mechanisms enable DreamerV3 to operate robustly without the domain-specific norm schedules and KL annealing critical to previous Dreamer variants.

4. Implementation and Hyperparameterization

DreamerV3 is designed for minimal, domain-invariant hyperparameter tuning. Key adjustable parameters are:

Model size:

Controls layer widths/hidden sizes in the encoder, recurrent cell, decoders, actor, and critic.

Training ratio:

Number of gradient updates per environment step, balancing data reuse and overfitting risk.

For example, in the traffic signal control (TSC) domain, the study found that:

- Model size “S” achieves strong stability and data efficiency.
- Training ratios from 64 to 512 are viable for this size, with 128 strongly recommended.
- Larger models (M, L) offer only modest gains and require narrower training-ratio tuning (Li et al., 4 Mar 2025).

Default settings in the DreamerV3 codebase use: replay capacity $10^6$, batch size $16$, sequence length $64$, imagination horizon $15$, RSSM latent $32 \times 32$ discrete (32 categorical variables with 32 classes each), Adam optimizer, LayerNorm+SiLU activations, no dropout or weight decay. The same configuration solves over 150 tasks without adjustment (Hafner et al., 2023).

Parameter	XS	S	M	L
Viable training ratios	64, 128, 512	64–512	128, 256	128 only
Time to stabilize	~3 h	~2.5 h	~2.2 h	~2.0 h
Best training ratio	128	128	128	128

5. Applications and Empirical Performance

DreamerV3 has demonstrated state-of-the-art results in diverse tasks and domains:

Traffic Signal Control:

DreamerV3 was trained on a corridor TSC task in SUMO, using queue lengths and signal phases as state, piecewise penalties for congestion as reward, and discrete actions for split changes. A reduction in peak queue length was observed. Sample efficiency is confirmed, particularly with medium-size models and intermediate training ratios (Li et al., 4 Mar 2025).

Pixel-based RL (e.g., Minecraft, DeepMind Control, Atari):

DreamerV3 achieved the first diamond collection from scratch in Minecraft, outperforming competitors that depend on expert data, and set a new state of the art on DMC vision tasks and Crafter (Hafner et al., 2023).

Stability and Generality:

The algorithm shows stable convergence curves post-initial exploration, fastest stabilization for larger models, and strong data-efficiency when compared with pure model-free baselines in control/multi-agent domains.

6. Comparative and Theoretical Analysis

DreamerV3 is distinguished from previous Dreamer variants by its robust, single-configuration training across domains, categorical/discrete value heads, and improved normalization/balancing. Notably, in the TSC study, claims of accelerated convergence via an increased training ratio, which hold in other environments, did not materialize; excessively high or low training ratios instead introduced instability (Li et al., 4 Mar 2025). The findings suggest that, for structured control domains, medium model/ratio choices are optimal and generalize across scenario changes.

A plausible implication is that the practical sample efficiency of DreamerV3 is problem-dependent, with configuration sweet-spots dictated by the complexity and smoothness of the task environment.

7. Significance and Future Considerations

DreamerV3 exemplifies a new class of world model-based RL agents capable of generalizing over domain boundaries without manual reconfiguration. The capacity to learn effective policy with far fewer real-environment interactions—enabled by RSSM-based latent imagination—makes it suitable for large-scale and real-time applications where sample efficiency is paramount.

Current limitations include pronounced early-episode reward fluctuations, narrow viable hyperparameter ranges for large models, and problem-dependent data-efficiency characteristics. Further work may investigate domain-adaptive scheduling for the training ratio, improved latent representation learning under distractors, and formal guarantees for convergence times across classes of environments (Li et al., 4 Mar 2025, Hafner et al., 2023).

References (2)

Mastering Diverse Domains through World Models (2023)

DreamerV3 for Traffic Signal Control: Hyperparameter Tuning and Performance (2025)
