EVolSplat4D: 4D Gaussian Splatting for Urban Scenes
- The paper presents a hybrid architecture that factors urban scenes into static, dynamic, and far-field branches for photorealistic novel view synthesis.
- The approach unifies volume-based and per-pixel Gaussian predictions with occlusion-aware image-based rendering (IBR) and motion-adjusted IBR for dynamic actors.
- Experimental results on KITTI and Waymo datasets show improved PSNR, SSIM, and real-time performance while reducing ghosting artifacts.
EVolSplat4D is a framework for modeling, reconstructing, and rendering dynamic urban scenes using 4D Gaussian splatting. It unifies volume-based and per-pixel Gaussian predictions, providing efficient, photorealistic novel view synthesis for both static and dynamic environments. EVolSplat4D expands on foundational 4D Gaussian splatting methodologies by incorporating a hybrid architectural design that enables real-time, consistent 4D scene understanding and synthesis in complex driving scenarios (Miao et al., 22 Jan 2026). The system factors static, dynamic, and far-field scene components into distinct yet compositional branches.
1. Problem Setting and Representation
The core problem addressed by EVolSplat4D is novel view synthesis (NVS) for urban environments under both static and dynamic regimes. The task is to generate a photorealistic image at any novel camera pose and time, given a set of calibrated multi-view images, with optional depth or LiDAR priors and tracked actor bounding boxes. The modeling backbone is a 3D Gaussian Splatting radiance field parameterized per primitive $i$ as $(\mu_i, \Sigma_i, c_i, \alpha_i)$, where:
- $\mu_i \in \mathbb{R}^3$: 3D position,
- $\Sigma_i$: anisotropic covariance,
- $c_i$: view-dependent color (modeled via spherical harmonics),
- $\alpha_i$: opacity.

A primitive's spatial contribution at location $x$ is

$$G_i(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\right),$$

and rendering a ray corresponds to front-to-back alpha compositing of all projected splats:

$$C = \sum_i c_i\,\alpha_i \prod_{j<i}(1-\alpha_j).$$
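The per-primitive Gaussian weight and the front-to-back compositing rule can be sketched as follows; this is a minimal NumPy illustration of the standard 3DGS formulas, not the paper's tile-based CUDA rasterizer:

```python
import numpy as np

def gaussian_weight(x, mu, cov):
    """Evaluate the unnormalized Gaussian G(x) = exp(-1/2 (x-mu)^T Sigma^-1 (x-mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

def composite_ray(colors, alphas):
    """Front-to-back alpha compositing of splats already sorted by depth.

    colors: (N, 3) per-splat RGB; alphas: (N,) per-splat opacity in [0, 1].
    Returns the composited RGB and the accumulated opacity along the ray.
    """
    rgb = np.zeros(3)
    transmittance = 1.0  # T_i = prod_{j<i} (1 - alpha_j)
    for c, a in zip(colors, alphas):
        rgb += transmittance * a * c
        transmittance *= (1.0 - a)
    return rgb, 1.0 - transmittance  # second value is the accumulated opacity
```

For two half-opaque splats, the first contributes at full transmittance and the second at half, matching the product term in the sum.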
2. Hybrid Network Architecture
EVolSplat4D factorizes the scene into three complementary branches:
a) Volume-Based Static Region Branch:
- Multi-view RGB images, depth, and DINO semantic features are aggregated into a unified point cloud.
- A sparse 3D U-Net (TorchSparse) produces a feature volume; features queried from the volume are decoded by lightweight MLPs into position, opacity, and anisotropic covariance for each Gaussian.
- Appearance is synthesized by projecting each Gaussian into the source views, sampling local RGB and DINO features, and applying softmax-based visibility weights for occlusion-aware IBR (Miao et al., 22 Jan 2026).
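The occlusion-aware blending step can be illustrated with a simplified depth-consistency heuristic; the scoring function and the temperature `tau` below are assumptions for illustration, whereas the paper's visibility weights are learned:

```python
import numpy as np

def ibr_blend(sampled_rgb, point_depths, sampled_depths, tau=0.1):
    """Occlusion-aware IBR sketch: blend per-view colors with softmax
    visibility weights. A view whose depth-map value agrees with the
    point's projected depth is likely unoccluded and receives a higher
    weight. `tau` is an illustrative temperature hyperparameter.

    sampled_rgb:    (V, 3) RGB sampled at the point's projection per view
    point_depths:   (V,)  depth of the 3D point in each view's camera frame
    sampled_depths: (V,)  depth-map value at the projected pixel
    """
    score = -np.abs(point_depths - sampled_depths) / tau  # small mismatch -> high score
    w = np.exp(score - score.max())
    w /= w.sum()                                          # softmax visibility weights
    return w @ sampled_rgb                                # blended appearance
```

With one occluded view (large depth mismatch), the softmax pushes its weight toward zero and the blend recovers the unoccluded view's color.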
b) Object-Centric Dynamic Actor Branch:
- Utilizes tracked actors’ canonical point clouds and pose trajectories.
- Per-actor point sets are transformed framewise using 3D bounding box priors, with features aggregated temporally across neighboring frames.
- Motion-adjusted IBR facilitates robust per-object rendering under motion, using a shared dynamic MLP.
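The per-frame actor transform and temporal aggregation above can be sketched as follows; the mean-based aggregation is a simplified stand-in for the paper's learned temporal feature aggregation:

```python
import numpy as np

def actor_points_at_frame(canonical_pts, R, t):
    """Map an actor's canonical point cloud (N, 3) into the world at one
    frame using its tracked 3D bounding-box pose (rotation R, translation t)."""
    return canonical_pts @ R.T + t

def temporal_feature_aggregate(features_by_frame, center, window=1):
    """Average per-point features over neighboring frames within a temporal
    window around `center` (illustrative; the paper's aggregation is learned)."""
    lo = max(0, center - window)
    hi = min(len(features_by_frame), center + window + 1)
    return np.mean(features_by_frame[lo:hi], axis=0)
```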
c) Per-Pixel Far-Field Branch:
- Applies a cross-view attention 2D U-Net to per-view 10D encodings, outputting per-pixel Gaussians for the distant static background.
- Far-field rays are embedded via Plücker coordinates and learned depth, then merged with the other branches using minimal occlusion logic.
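The Plücker ray embedding mentioned above is a standard 6D encoding of an oriented line; a minimal version:

```python
import numpy as np

def pluecker_ray(origin, direction):
    """6D Pluecker embedding (d, o x d) of a camera ray. The direction is
    normalized first, so the encoding is invariant to its scale and depends
    only on the line the ray lies on and its orientation."""
    d = direction / np.linalg.norm(direction)
    return np.concatenate([d, np.cross(origin, d)])
```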
Branches are merged at the rendering stage to yield consistent, full-scene coverage (Miao et al., 22 Jan 2026).
3. Rendering and Compositing
Rendering proceeds by aggregating Gaussians from the close-range static, dynamic actor, and far-field branches. After tile-based differentiable splatting, the output image is composed as

$$I = I_{\text{fg}} + (1 - A)\, I_{\text{far}},$$

with $A$ the cumulative opacity accumulated by the foreground (static and dynamic) splats (Miao et al., 22 Jan 2026). This architecture enables differentiable, real-time splatting and accurate visibility resolution.
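Merging the branches per pixel with the foreground's cumulative opacity can be sketched as a standard over-composite (an illustration under that assumption, not the paper's exact formulation):

```python
import numpy as np

def merge_branches(fg_rgb, fg_acc_opacity, far_rgb):
    """Composite the foreground (static + dynamic Gaussians) over the
    far-field branch: I = I_fg + (1 - A) * I_far, where A is the
    foreground's accumulated opacity per pixel.

    fg_rgb, far_rgb: (H, W, 3) images; fg_acc_opacity: (H, W) in [0, 1].
    """
    return fg_rgb + (1.0 - fg_acc_opacity)[..., None] * far_rgb
```

Where the foreground is fully opaque ($A = 1$) the far-field contributes nothing; where it is empty ($A = 0$) the far-field shows through unchanged.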
4. Learning Objectives and Optimization
EVolSplat4D is trained with a composite loss:
- Photometric loss combining L1 and SSIM terms:
  $$\mathcal{L}_{\text{photo}} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{SSIM}}.$$
- Decomposition loss enforcing mask agreement for the close-range volume.
- Regularization: the decomposition term is weighted against the photometric loss; the SH degree for appearance is 1.
Optimization uses Adam with single-image random sampling per iteration. Optional fine-tuning via standard 3DGS prune/grow for 1,000 steps provides additional quality gains (Miao et al., 22 Jan 2026).
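A toy version of the photometric term, using a global SSIM instead of the windowed variant and $\lambda = 0.2$ as an assumed 3DGS-style default:

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM over whole images: a simplified
    stand-in for the windowed D-SSIM typically used in 3DGS training."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_loss(pred, target, lam=0.2):
    """L_photo = (1 - lam) * L1 + lam * (1 - SSIM); lam = 0.2 follows the
    common 3DGS default and is an assumption here, not a paper value."""
    l1 = np.abs(pred - target).mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(pred, target))
```

Identical prediction and target give zero loss (L1 vanishes and SSIM equals one); any photometric discrepancy makes the loss positive.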
5. Experimental Results and Ablation Studies
Comprehensive evaluations on KITTI-360, KITTI, Waymo, and PandaSet datasets confirm EVolSplat4D’s superiority:
- Static scenes: On KITTI-360, outperforms MVSNeRF, MuRF, EDUS, PixelSplat, MVSplat, DepthSplat, and AnySplat (23.36 dB PSNR, SSIM 0.798, LPIPS 0.177); reaches 24.43 dB on Waymo.
- Dynamic scenes: Achieves 26.32 dB (Waymo), 26.27 dB (PandaSet), exceeding DrivingRecon, STORM, AnySplat (Miao et al., 22 Jan 2026).
- Extrapolation: Under a 1 m lane shift, Kernel Inception Distance (KID) is 0.062–0.080, lower than STORM (0.080–0.102).
- Inference and rendering: Feed-forward inference in ~1.3 s; real-time rendering at >80 FPS with ∼11 GB memory.
Ablations demonstrate:
- Removing the volume branch: –0.76 dB PSNR and increased ghosting.
- No motion-adjusted IBR: –1 dB PSNR on dynamic actors.
- Disabling occlusion-awareness: –0.45 dB, increased blur.
- The chosen temporal window size for IBR provides the best quality/cost trade-off (Miao et al., 22 Jan 2026).
6. Architectural Advantages and Limitations
Advantages:
- Volume branch ensures globally consistent geometry for close-range static regions using multi-view data.
- Dynamic actor branch enables robust 4D tracking and reconstruction under noisy priors, crucial for articulated or partially observed actors.
- Far-field branch economizes memory and computation on distant scenery.
- Full hybrid compositionality allows efficient, feed-forward, real-time operation with per-component editing (replace/shift/delete actors).
- Modular design supports scene decomposition and semantic editing.
Limitations:
- Dependency on LiDAR for 3D bounding boxes, though the system is architecturally decoupled from the detection pipeline and can accept future monocular detectors.
- Far-field geometry may degrade for very large baseline extrapolations due to lack of explicit geometric priors.
- Rigid-motion assumption in the dynamic branch leads to blurring artifacts for highly non-rigid actors (e.g., pedestrians), which could be remedied by integrating non-rigid dynamic fields (Miao et al., 22 Jan 2026).
7. Relation to Foundational 4DGS and Future Directions
EVolSplat4D builds directly on the principles of 4D Gaussian Splatting with native 4D primitives (Yang et al., 2024, Yang et al., 2023), and continuous-time dynamical systems (Asiimwe et al., 22 Dec 2025). By factorizing the learning problem and decoupling spatial, temporal, and semantic scene elements, EVolSplat4D addresses prior limitations in efficiency and scalability for urban scene synthesis. Future evolution may incorporate continuous-time neural ODE fields, diffusion priors for rich dynamic content, and non-rigid actor modeling for further gains in fidelity and controllability (Asiimwe et al., 22 Dec 2025, Xiao et al., 4 Aug 2025).
EVolSplat4D thus represents the current state-of-the-art in hybrid feed-forward Gaussian scene modeling, balancing accuracy, efficiency, and extensibility for real-world 4D scene understanding and editing in complex environments (Miao et al., 22 Jan 2026).