WKR: 4D Gaussian Splatting for Dynamic Scenes
- WKR is a comprehensive repository aggregating advanced techniques such as EVolSplat4D, which enables 4D Gaussian splatting for modeling dynamic scenes in space and time.
- It employs a hybrid network architecture combining static, dynamic actor, and far-field branches to achieve real-time rendering and consistent scene reconstruction.
- The framework integrates neural ODE-driven temporal dynamics and optimized rendering pipelines, delivering practical benefits for urban simulation and autonomous driving applications.
EVolSplat4D is a 4D Gaussian Splatting framework that models dynamic scenes in space and time, supporting efficient, high-fidelity, and photorealistic novel view synthesis for complex static and dynamic environments. It integrates hybrid geometric representations, advanced feed-forward inference, and temporally continuous modeling to achieve state-of-the-art scene consistency, real-time rendering, and robust dynamic actor synthesis, particularly for applications in urban simulation and autonomous driving (Miao et al., 22 Jan 2026). EVolSplat4D draws from a lineage of dynamic splatting methodologies, extending explicit spatiotemporal Gaussian volumetrics (Yang et al., 2024, Yang et al., 2023) with recent advances in learned dynamical systems (Asiimwe et al., 22 Dec 2025) and robust feed-forward architectures.
1. Formal Mathematical Framework
EVolSplat4D represents a dynamic scene as a collection of Gaussian primitives parameterized over both spatial and temporal dimensions. Each primitive encodes:
- Mean $\boldsymbol{\mu} \in \mathbb{R}^3$ (center position)
- Covariance $\Sigma = R\,S\,S^\top R^\top$ (anisotropy via scale $S$ and rotation $R$)
- Opacity $\alpha$ (controls blending)
- View- and time-dependent color $c(\mathbf{d}, t)$ (expanded in spherical harmonics)
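As a concrete illustration, the per-primitive parameters above can be sketched as a small data structure. This is a minimal sketch, not the paper's implementation; the field names and the degree-3 SH layout are illustrative assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian4D:
    """One spatiotemporal Gaussian primitive (illustrative layout)."""
    mean: np.ndarray       # (3,) center position
    scale: np.ndarray      # (3,) per-axis standard deviations
    rotation: np.ndarray   # (3, 3) rotation matrix R
    opacity: float         # blending weight in [0, 1]
    sh_coeffs: np.ndarray  # spherical-harmonic color coefficients

    def covariance(self) -> np.ndarray:
        # Anisotropic covariance factored as R S S^T R^T, as in 3DGS.
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

g = Gaussian4D(
    mean=np.zeros(3),
    scale=np.array([0.1, 0.2, 0.3]),
    rotation=np.eye(3),
    opacity=0.9,
    sh_coeffs=np.zeros((16, 3)),  # degree-3 SH: 16 coefficients per channel
)
cov = g.covariance()
```

The factored covariance guarantees symmetry and positive semi-definiteness by construction, which is why 3DGS-style systems optimize scale and rotation rather than the covariance matrix directly.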
The per-primitive radiance field contribution at a point $\mathbf{x}$ is

$$G_i(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^\top \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right).$$

Rendering aggregates depth-sorted splats via front-to-back alpha compositing:

$$C = \sum_i c_i\,\alpha_i \prod_{j<i}\left(1-\alpha_j\right).$$

Dynamic scenes (space + time) use hybrid representations: volume-based close-range 3D Gaussians, canonical object-centric dynamic actors, and efficient far-field pixel-wise Gaussians. For learned dynamical extensions, per-primitive state vectors $\mathbf{s}_i(t)$ (including position, covariance, color, opacity) evolve according to a neural ODE

$$\frac{d\mathbf{s}_i(t)}{dt} = f_\theta\!\left(\mathbf{s}_i(t), t\right).$$

Integration via Dormand–Prince or RK4 enables sample-efficient motion law learning, temporal extrapolation, and localized controllable dynamics (Asiimwe et al., 22 Dec 2025).
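The alpha-compositing rule can be sketched directly as a minimal per-pixel reference, assuming splats already sorted front-to-back; this is not the tile-based rasterizer itself:

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    C = np.zeros(3)
    T = 1.0  # accumulated transmittance
    for c, a in zip(colors, alphas):
        C += T * a * np.asarray(c, dtype=float)
        T *= 1.0 - a  # remaining light after this splat
    return C

# Two half-opaque splats: the second is attenuated by the first.
out = composite([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.5, 0.5])
```

The running transmittance `T` is exactly the product term in the sum, so each splat's contribution is computed in a single forward pass.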
2. Hybrid Network Architecture
The EVolSplat4D framework employs a tripartite branching architecture for comprehensive urban scene coverage:
- Volume-based Static Region Branch: Dense multi-view RGB and depth inputs are elevated to a 3D feature volume (sparse U-Net). From the semantic point cloud, 3D Gaussians are decoded per spatial location, using trilinear feature queries and MLPs for positional refinement, appearance estimation (occlusion-aware IBR), and covariance prediction. Appearance MLPs are conditioned via DINO-based semantic similarity weights.
- Object-Centric Dynamic Actor Branch: Tracks temporal motion of actors via canonical point clouds, transformed through tracked 3D bounding box poses. Motion-adjusted IBR projects each actor's geometry into temporally adjacent frames, aggregating features to reconstruct stable 4D Gaussians despite noisy or partial tracking data.
- Per-Pixel Far-Field Branch: For distant background, a cross-view-attentive 2D U-Net predicts Gaussian parameters per pixel, integrating Plücker ray embeddings for geometric consistency. Outputs are merged to fill the far-field and preclude holes in out-of-domain settings.
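The Plücker ray embedding used by the far-field branch can be sketched as follows. The 6D encoding $(\mathbf{d},\, \mathbf{o}\times\mathbf{d})$ is the standard construction; the function name is illustrative:

```python
import numpy as np

def plucker_embedding(origin, direction):
    """6D Plücker coordinates (d, o x d) of a camera ray.

    The moment o x d is invariant to sliding the origin along the ray,
    so the embedding identifies the geometric line itself."""
    d = direction / np.linalg.norm(direction)
    m = np.cross(origin, d)
    return np.concatenate([d, m])

o = np.array([1.0, 2.0, 3.0])
d = np.array([0.0, 0.0, 2.0])
e1 = plucker_embedding(o, d)
e2 = plucker_embedding(o + 5.0 * np.array([0.0, 0.0, 1.0]), d)  # shifted along the ray
```

Because the embedding depends only on the line, not the sample point, it gives the 2D U-Net a consistent geometric signal across views.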
The algorithmic pipeline proceeds from multi-modal data extraction to semantic pruning, branch-wise Gaussian decoding, and aggregated differentiable splatting rendering (Miao et al., 22 Jan 2026).
3. Training Losses and Optimization Procedures
EVolSplat4D is supervised primarily by photometric fidelity to ground-truth images under multi-view, multi-frame contexts. A mask decomposition loss aligns the rendered occupancy map with geometric projections, enforcing branch separation and reducing ghosting artifacts. End-to-end optimization uses Adam with random frame and pixel sampling, with view-dependent color expanded to a fixed spherical-harmonics degree. For temporally continuous extensions (EvoGS), additional regularization terms encourage smooth state evolution.
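A minimal sketch of the two supervision signals, assuming an L1 photometric term and a binary cross-entropy form for the mask loss; the paper's exact terms and weightings are not reproduced here:

```python
import numpy as np

def photometric_loss(rendered, target):
    """L1 photometric fidelity between rendered and ground-truth images
    (a standard choice; illustrative, not the paper's exact formulation)."""
    return np.abs(rendered - target).mean()

def mask_decomposition_loss(rendered_occ, projected_occ):
    """Aligns the rendered occupancy map with geometric projections.
    Sketched as binary cross-entropy over per-pixel occupancy."""
    eps = 1e-6
    p = np.clip(rendered_occ, eps, 1.0 - eps)
    return -(projected_occ * np.log(p)
             + (1.0 - projected_occ) * np.log(1.0 - p)).mean()

img = np.ones((4, 4, 3))
photo = photometric_loss(img, img)           # identical images -> zero loss
occ = np.full((4, 4), 0.9)
mask = mask_decomposition_loss(occ, np.ones((4, 4)))  # near-match -> small loss
```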
Optional fine-tuning via standard 3DGS optimization (point pruning/growing) can yield several dB PSNR improvements after 1000 steps (Miao et al., 22 Jan 2026).
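The pruning half of the standard 3DGS point pruning/growing heuristic can be sketched as below; the opacity threshold is an illustrative value, not taken from the paper:

```python
def prune_gaussians(opacities, threshold=0.005):
    """Return indices of Gaussians worth keeping: those whose opacity
    exceeds a small threshold (standard 3DGS-style pruning heuristic)."""
    return [i for i, a in enumerate(opacities) if a > threshold]

# Near-transparent primitives (indices 0 and 3) are discarded.
kept = prune_gaussians([0.001, 0.5, 0.2, 0.0001])
```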
4. Rendering Pipeline and Temporal Dynamics
Rendering merges close-range statics, dynamic actors, and far-field background Gaussians using tile-based, front-to-back rasterization. Appearance is assigned via view-dependent spherical harmonics, with occlusion- and semantic-aware blending. Gaussian attributes for dynamic actors are modulated via temporally interpolated features and motion priors to maintain coherence across motion sequences.
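View-dependent appearance via low-degree spherical harmonics can be sketched as below, using the real-SH constants and sign convention common in open 3DGS implementations (an assumption about convention, not the paper's code):

```python
import numpy as np

# Real spherical-harmonics constants for degrees 0 and 1.
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def eval_sh_deg1(sh, view_dir):
    """View-dependent color from degree-1 SH.
    sh: (4, 3) coefficients per RGB channel; view_dir: unit vector."""
    x, y, z = view_dir
    return C0 * sh[0] - C1 * y * sh[1] + C1 * z * sh[2] - C1 * x * sh[3]

# With only the DC coefficient set, color is direction-independent.
sh = np.zeros((4, 3))
sh[0] = 1.0
c_a = eval_sh_deg1(sh, np.array([0.0, 0.0, 1.0]))
c_b = eval_sh_deg1(sh, np.array([1.0, 0.0, 0.0]))
```

Higher SH degrees add direction-dependent lobes on top of this DC term, which is what lets splats reproduce specularities and subtle view-dependent shading.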
For continuous-time dynamical integration (Asiimwe et al., 22 Dec 2025), forward and backward temporal extrapolation is enabled by numerically integrating the neural ODE beyond training intervals, predicting plausible future or past states. Localized dynamics injection allows user control by blending external velocity fields with the learned field using spatial-temporal masks, facilitating compositional animation (e.g., selectively moving vehicles).
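A minimal sketch of the continuous-time machinery, assuming a classical RK4 step and a mask-blended velocity field; function names are illustrative:

```python
def rk4_step(f, s, t, dt):
    """One classical RK4 step for ds/dt = f(s, t)."""
    k1 = f(s, t)
    k2 = f(s + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(s + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(s + dt * k3, t + dt)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def blended_field(f_learned, v_external, mask):
    """Blend an external velocity field into the learned one inside a
    spatial-temporal mask (1 = fully user-controlled, 0 = learned)."""
    def f(s, t):
        m = mask(s, t)
        return (1.0 - m) * f_learned(s, t) + m * v_external(s, t)
    return f

# Extrapolate ds/dt = -s from s(0) = 1; exact solution is exp(-t).
s, t, dt = 1.0, 0.0, 0.01
for _ in range(100):
    s = rk4_step(lambda s, t: -s, s, t, dt)
    t += dt

# A mask of 1 everywhere hands control entirely to the external field.
frozen = blended_field(lambda s, t: -s, lambda s, t: 0.0, lambda s, t: 1.0)
```

Because the integrator is just a function of the learned field, stepping past the training interval (forward or backward) is what yields the temporal extrapolation described above.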
5. Experimental Results and Quantitative Evaluations
Performance evaluation spans urban NVS, complex dynamic scenes, and scene editing:
| Setting | PSNR (dB) | SSIM | LPIPS | Speed | Notes |
|---|---|---|---|---|---|
| KITTI-360 (static, FF) | 23.36 | 0.798 | 0.177 | >80 FPS | Outperforms MVSNeRF |
| Waymo (static, OutDomain) | 24.43 | — | — | — | Robust to domain gap |
| Dynamic (KITTI/Waymo) | ~20.76–26.32 | — | — | — | Surpasses STORM, DrivingRecon |
| Fine-tuned (1000 steps) | 28.29 | — | — | ~1.3 s/scene | Outperforms SUDS, EmerNeRF |
| Extrapolation | 0.062–0.080 (KID) | — | — | — | KID, lower is better; best among baselines |
| Real-time efficiency | — | — | — | >80 FPS | 11 GB GPU, batch render |
Ablation studies demonstrate the necessity of the volume, pixel, and motion-adjusted branches: removing any of them yields significant PSNR drops, increased ghosting, or geometric artifacts. The chosen window size for occlusion-aware IBR proves optimal in ablation, and predicted actor boxes match ground-truth boxes with high fidelity (Miao et al., 22 Jan 2026).
6. Context, Limitations, and Prospective Directions
EVolSplat4D addresses the bottleneck of per-scene optimization by enabling real-time, feed-forward inference that decouples geometry and appearance over spatial and temporal scales. It offers compositional decomposition for scene editing (replace/shift/delete actors) and full environment coverage via its hybrid branches.
Limitations include the reliance on LiDAR-based actor tracking (although architecture supports future monocular detectors), degraded far-field geometry under large motion extrapolation due to sparse depth priors, and blurring of non-rigid actors (e.g., pedestrians) arising from rigid-motion modeling. Prospective extensions involve integrating non-rigid dynamic fields, deeper uncertainty modeling, and richer temporal conditioning mechanisms, as suggested by VDEGaussian’s integration of video diffusion models and uncertainty-weighted alignment (Xiao et al., 4 Aug 2025).
In summary, EVolSplat4D unifies real-time 4D Gaussian splatting for dynamic urban scenes by advancing geometric, appearance, and temporal modeling, validated through rigorous quantitative and qualitative evaluations across large-scale autonomous driving datasets and beyond (Miao et al., 22 Jan 2026, Asiimwe et al., 22 Dec 2025, Yang et al., 2024, Yang et al., 2023, Xiao et al., 4 Aug 2025).