
AstraNav-World: Unified Probabilistic Navigation Model

Updated 31 December 2025
  • AstraNav-World is a unified probabilistic world model that integrates diffusion-based video prediction with a vision-language policy for dynamic embodied navigation.
  • The model simultaneously forecasts future visual states and corresponding actions using synchronized rollouts and mutual conditioning to prevent drift.
  • Experimental evaluations on benchmarks demonstrate improved trajectory accuracy and robust real-world zero-shot transfer compared to classical baselines.

AstraNav-World is a unified probabilistic world model designed for embodied navigation in open, dynamic environments. It simultaneously predicts future visual states and corresponding action sequences conditioned on a history of visual observations and natural-language instructions. AstraNav-World integrates a diffusion-based video generator with a vision-language policy, implementing synchronized rollouts in which predicted visual scenes and planned actions are updated jointly. This bidirectional coupling ensures that future visual predictions are executable and that navigation policies remain grounded in consistent, task-relevant physical futures.

1. Joint Model Architecture

AstraNav-World's architecture comprises three tightly integrated modules conditioned on historical frames and language instructions:

  • Vision-Language Planner (τθ): Built upon Qwen-2.5-VL-3B, this component produces context embeddings $C \in \mathbb{R}^{L \times D}$ reflecting goal semantics and spatial context from the input instruction $I$ and observation history $O_{\text{hist}}$.
  • Video Generator (υθ): Utilizes the Wan-2.2-TI2V-5B conditional diffusion backbone with an ST-VAE encoder for 16× spatial and 4× temporal compression. Context embeddings are injected via 30 DiT transformer layers, enabling the prediction of future video latents $z_{i+1}, \ldots, z_{i+N}$.
  • Action Policy Head:
    • Variant A: “Action Former” employs a transformer-based architecture for trajectory generation.
    • Variant B: Diffusion-based policy leverages Multimodal Fusion Cross-Attention (MMFCA) to mutually condition video and action streams.

The key design principle is the synchronized rollout: during both training and inference, embeddings from τθ guide both video and action generation, with MMFCA ensuring mutual conditioning at each generative step. This prevents drift and decoupling between visual and behavioral foresight.
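The mutual conditioning performed by MMFCA can be sketched as a pair of cross-attention passes between the video-latent and action token streams, each stream attending to the other with a residual update. This is an illustrative NumPy sketch, not the paper's implementation; the token shapes, single-head attention, and residual form are all assumptions.

```python
import numpy as np

def cross_attention(q, k, v):
    # Scaled dot-product attention: queries from one stream, keys/values from the other.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def mmfca_step(video_tokens, action_tokens):
    # Mutual conditioning: each stream attends to the other and adds the result
    # residually, keeping visual and behavioral foresight synchronized.
    a = action_tokens + cross_attention(action_tokens, video_tokens, video_tokens)
    v = video_tokens + cross_attention(video_tokens, action_tokens, action_tokens)
    return v, a

rng = np.random.default_rng(0)
vid = rng.standard_normal((64, 128))   # hypothetical video-latent tokens
act = rng.standard_normal((16, 128))   # hypothetical action tokens
vid2, act2 = mmfca_step(vid, act)
print(vid2.shape, act2.shape)  # (64, 128) (16, 128)
```

In the actual model this fusion is applied at each generative step of the diffusion process, so the two streams cannot drift apart during a rollout.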

2. Mathematical Formulation and Loss Functions

The joint generative factorization is

$$p(z_{i+1:i+N}, A_{i+1:i+N} \mid C) = p_v(z \mid C) \cdot p_p(A \mid z, C)$$

where $p_v$ is realized via conditional diffusion and $p_p$ via either the transformer or the diffusion policy head.

  • Video Generator Loss:

    • Forward diffusion on video latents:

    $$z_t = (1-t) \cdot z^{\text{future}} + t \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

    • Denoising regression:

    $$\mathcal{L}_{\text{VG}} = \mathbb{E}_{t,\, z^{\text{future}},\, \epsilon,\, C} \left[ \left\| v_{\theta}(z_t, t, C) - (\epsilon - z^{\text{future}}) \right\|^2 \right]$$

  • Action Former Loss (for predicted actions AA):

    • Position:

    $$\mathcal{L}_{\text{pos}} = \frac{1}{N} \sum_n \left( |X_n - X_n^*| + |Y_n - Y_n^*| \right)$$

    • Angle:

    $$\mathcal{L}_{\text{angle}} = 1 - \frac{1}{N} \sum_n \left[ \cos\theta_n \cdot \cos\theta_n^* + \sin\theta_n \cdot \sin\theta_n^* \right]$$

    • Arrival flag:

    $$\mathcal{L}_{\text{arrive}} = -\frac{1}{N} \sum_n \left[ \alpha_n^* \log\sigma(\alpha_n) + (1-\alpha_n^*) \log(1-\sigma(\alpha_n)) \right]$$

    • Combined:

    $$\mathcal{L}_{\text{PH}} = \lambda_1 \mathcal{L}_{\text{pos}} + \lambda_2 \mathcal{L}_{\text{angle}} + \lambda_3 \mathcal{L}_{\text{arrive}}$$

    with $\lambda_1 = \lambda_2 = \lambda_3 = 1.0$.

  • Diffusion Policy Loss:

$$A_t = (1-t) \cdot A^{\text{future}} + t \cdot \epsilon$$

$$\mathcal{L}_{\text{PH}}^{\text{(diff)}} = \mathbb{E}_{t,\, A^{\text{future}},\, \epsilon,\, C} \left[ \left\| v_{\varphi,\theta}(A_t, t, C) - (\epsilon - A^{\text{future}}) \right\|^2 \right]$$

  • Unified Objective:

$$\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{VG}} + \lambda \cdot \mathcal{L}_{\text{PH}}, \quad \lambda = 1.0$$
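The loss formulas above can be evaluated numerically on toy data. The following NumPy sketch substitutes random tensors for the network output (`v_pred`) and for ground-truth waypoints; all shapes and stand-in values are illustrative assumptions, while the loss expressions follow the definitions in this section.

```python
import numpy as np
rng = np.random.default_rng(0)

# Video-generator loss: interpolate latents toward noise, regress the
# velocity target (eps - z_future).
z_future = rng.standard_normal((4, 16))
eps = rng.standard_normal((4, 16))
t = rng.uniform(size=(4, 1))
z_t = (1 - t) * z_future + t * eps
v_pred = rng.standard_normal((4, 16))          # stand-in for the DiT output
L_vg = np.mean(np.sum((v_pred - (eps - z_future)) ** 2, axis=-1))

# Action Former losses over N waypoints: L1 position, cosine angle, BCE arrival.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N = 8
X, Y, theta, alpha = (rng.standard_normal(N) for _ in range(4))   # predictions
Xs, Ys, thetas = (rng.standard_normal(N) for _ in range(3))       # targets
alphas = rng.integers(0, 2, N).astype(float)                      # arrival flags

L_pos = np.mean(np.abs(X - Xs) + np.abs(Y - Ys))
L_angle = 1.0 - np.mean(np.cos(theta) * np.cos(thetas) + np.sin(theta) * np.sin(thetas))
p = sigmoid(alpha)
L_arrive = -np.mean(alphas * np.log(p) + (1 - alphas) * np.log(1 - p))

L_ph = L_pos + L_angle + L_arrive              # lambda_1 = lambda_2 = lambda_3 = 1.0
L_total = L_vg + 1.0 * L_ph                    # unified objective, lambda = 1.0
print(L_total > 0)  # True
```

Note that the angle loss equals $1 - \overline{\cos(\theta_n - \theta_n^*)}$, so it is zero exactly when every predicted heading matches its target.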

3. Training Paradigm and Inference Protocols

AstraNav-World training proceeds in two stages:

  • Component-Specific Pretraining:

τθ is frozen while υθ (video) and the policy head are pretrained independently on sampled batches of instructions, observations, ground-truth future frames, and action sequences.

  • Joint Fine-Tuning:

All parameters are unfrozen. MMFCA (applied with probability $p = 0.5$ for the diffusion policy) enables fusion between video latents and actions, and the entire model is fine-tuned jointly using the unified objective.

At inference, “Sparse Foresight Scheduling” (SFS) is employed:

  • Every $K$ steps (default $K = 10$), both video and action streams are run with MMFCA for multi-step prediction.
  • Intermediate steps use only the policy head for efficiency.
  • The Action Former variant omits the video generator during inference.

The closed-loop process (at each synchronized step) is:

  1. Compute context $C$ from τθ.
  2. Generate future video latents $z_{i+1:i+N}$.
  3. Generate actions $A_{i+1:i+N}$ conditioned on the video latents.
  4. Execute the first action, shift observation window, and repeat.
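The closed-loop protocol with Sparse Foresight Scheduling can be sketched as follows. `ToyEnv` and the stand-in `planner`, `video_gen`, and `policy` callables are hypothetical placeholders for τθ, υθ, and the policy head; only the control flow (full synchronized rollouts every `K` steps, policy-only steps in between) reflects the protocol described above.

```python
def navigate(env, planner, video_gen, policy, K=10, max_steps=100):
    obs_hist = [env.reset()]                 # observation window
    instruction = env.instruction
    z_latents = None                         # cached video foresight
    for step in range(max_steps):
        C = planner(instruction, obs_hist)   # context from tau_theta
        if step % K == 0:
            # Synchronized step: predict future video latents, then actions
            # conditioned on them (MMFCA in the full model).
            z_latents = video_gen(C)
            actions = policy(C, z_latents)
        else:
            # Intermediate step: policy head only, reusing cached foresight.
            actions = policy(C, z_latents)
        obs, done = env.step(actions[0])     # execute only the first action
        obs_hist.append(obs)                 # shift the observation window
        if done:
            break
    return obs_hist

# Minimal toy stand-ins so the loop is runnable end to end.
class ToyEnv:
    instruction = "go to the door"
    def __init__(self): self.t = 0
    def reset(self): self.t = 0; return "obs0"
    def step(self, action):
        self.t += 1
        return f"obs{self.t}", self.t >= 5   # "arrive" after 5 steps

planner = lambda instr, hist: len(hist)      # stand-in context embedding
video_gen = lambda C: [0.0] * 4              # stand-in video latents
policy = lambda C, z: ["forward"]            # stand-in action chunk

traj = navigate(ToyEnv(), planner, video_gen, policy, K=3, max_steps=20)
print(len(traj))  # 6: initial observation plus 5 executed steps
```

The Action Former variant corresponds to always taking the policy-only branch at inference, since it omits the video generator entirely.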

4. Experimental Evaluation and Performance

AstraNav-World was evaluated across the R2R-CE, RxR-CE, and HM3D-OVON navigation benchmarks. Performance metrics include Success Rate (SR), Oracle Success (OS), Success weighted by Path Length (SPL), and Navigation Error (NE). SOTA and classical baselines were tested for comparison.

| Method | NE↓ (R2R-CE) | OS↑ | SR↑ | SPL↑ | NE↓ (RxR-CE) | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|
| CorrectNav | 4.24 | 67.5 | 65.1 | 62.3 | 4.09 | 69.3 | 63.3 |
| AstraNav-World (Action Former) | 3.93 | 73.1 | 67.2 | 64.2 | 3.93 | 70.4 | 59.6 |
| AstraNav-World (Diffusion) | 3.86 | 73.9 | 67.9 | 65.4 | 3.82 | 72.9 | 61.5 |

Object-goal navigation on HM3D-OVON:

| Method | SR↑ | SPL↑ |
|---|---|---|
| MTU3D | 40.8 | 12.1 |
| AstraNav-World (Action Former) | 45.1 | 28.3 |
| AstraNav-World (Diffusion) | 45.7 | 28.7 |

Key ablation results:

  • Disabling the video generator (υθ) yields a 3–5% drop in SR, confirming the necessity of vision-action coupling.
  • SFS with $K = 50$ achieves up to a 6.7× speedup with less than 1% SR loss.

Visual-Trajectory Consistency:

  • R2R 5-step: PSNR = 13.69, FVD = 670
  • R2R 1-step: PSNR = 15.54
  • RxR 5-step: PSNR = 14.50, FVD = 497
  • RxR 1-step: PSNR = 18.55

5. Real-World Zero-Shot Transfer

Testing on a physical mobile robot (RGB-D camera, onboard compute) for both instruction-goal and object-goal tasks in novel office and home scenes demonstrated robust real-world generalization without fine-tuning. AstraNav-World achieved ≈60–70% SR, outperforming previous sim-to-real methods by >10% SR. The domain gap is mitigated by the generative model’s ability to learn simulacra of physical dynamics via diffusion on video latents rather than overfitting to simulator-specific texture statistics.

A plausible implication is that AstraNav-World captures transferable spatial and physical consistency, supporting interpretable navigation in environments with unseen characteristics.

6. Limitations and Prospective Directions

Identified constraints include:

  • Generation latency: Diffusion video and policy inference pose challenges for strict real-time loops, even with SFS.
  • Generalization to complex outdoor scenes with unmodeled dynamic obstacles remains limited.
  • Failure cases are prevalent in heavily cluttered environments and subject to short-horizon diffusion drift.

Future research priorities include:

  • Accelerated sampling via model distillation or reduced diffusion steps,
  • Stronger physics priors for improved collision and dynamic obstacle modeling,
  • Hierarchical long-horizon foresight scheduling,
  • Integration of closed-loop visual feedback within latent rollouts.

7. Supplementary Details and Model Scaling

  • Hyperparameters: Learning rate $1 \times 10^{-5}$ (cosine decay), LoRA rank = 128 for the video generator, full fine-tuning for τθ.
  • Model Sizes: Qwen2.5-VL-3B (3B params, τθ), Wan2.2-TI2V-5B (5B params, υθ), policy heads: ~50M (Action Former), ~200M (Diffusion).
  • Compute: 96× NVIDIA H20 GPUs, component pretraining ≈60 hours, joint fine-tuning ≈120 hours.
  • Data: R2R, RxR (Matterport3D houses), OVON (HM3D scenes, shortest-path sampling), panoramic triples (left/front/right), discrete and continuous waypoints, multi-modal instructions.
  • Pretraining: τθ trained on large VL datasets and VLN episodes; video generator initialized from web video Wan checkpoints.

By tightly linking foresight vision (diffusion-based video prediction) and action planning (vision-language policy) in a bidirectional generative framework, AstraNav-World demonstrates high trajectory accuracy, consistent navigation, and robust zero-shot transfer, representing a significant advancement in foresight control for dynamic embodied navigation tasks (Hu et al., 25 Dec 2025).
