DrivingForward 3D Scene Reconstruction
- DrivingForward is a feed-forward, self-supervised model that reconstructs 3D driving scenes from sparse, multi-camera data using 3D Gaussian splatting.
- It jointly trains Pose, Depth, and Gaussian networks with a differentiable renderer, achieving fast real-time inference with metric consistency.
- Its design mitigates challenges of limited view overlap and uncertain camera extrinsics, enabling robust automotive scene reconstruction.
DrivingForward is a feed-forward model designed for real-time driving scene reconstruction using 3D Gaussian splatting from flexible surround-view camera input. It addresses the challenges inherent to vehicle-mounted camera imagery, which is typically sparse with limited view overlap, and where vehicle motion makes reliable camera extrinsics difficult to obtain. DrivingForward reconstructs 3D scenes by jointly training three neural networks—Pose, Depth, and Gaussian—along with a differentiable 3D Gaussian-splatting renderer, all operating in a self-supervised regime without ground-truth depth or extrinsics, and is capable of fast, feed-forward inference from arbitrary subsets of multi-frame, multi-camera data (Tian et al., 2024).
1. Model Architecture
DrivingForward comprises three main components—Pose Network (P), Depth Network (D), and Gaussian Network (G)—co-trained with a differentiable renderer.
- Pose Network (P): Receives pairs of images from temporal (same camera, different times), spatial (different cameras, same time), or spatio-temporal contexts. Outputs a relative transformation estimated using a ResNet-18 backbone encoder followed by a two-layer MLP producing a 6D vector (axis-angle rotation + translation). Photometric reprojection losses supervise this transformation.
- Depth Network (D): Consumes a single image I_i and outputs a dense depth map D_i along with intermediate latent features F_i. Its backbone consists of a ResNet-18 encoder, a volumetric feature-fusion module, and a U-Net decoder supporting multi-scale depth prediction.
- Gaussian Network (G): For each input pixel, predicts the parameters of a Gaussian primitive g = (μ, Σ, α, c).
- The center μ is formed by unprojecting the depth: μ(u, v) = E · K⁻¹ · D(u, v) · [u, v, 1]ᵀ, where K and E are the camera intrinsics and extrinsics, all in vehicle coordinates.
- Covariance Σ = R(q) · diag(s)² · R(q)ᵀ, with s ∈ ℝ³ (axis scales) and q a unit quaternion.
- Opacity α ∈ (0, 1) and color coefficients c in a spherical-harmonics basis.
- The network has a depth encoder (a U-Net on D_i), a fusion decoder (merging depth and image features F_i), and four output "heads" for s (softplus), q (ℓ₂ normalization), α (sigmoid), and c (linear).
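To make the per-pixel parameterization concrete, here is a minimal NumPy sketch of the depth unprojection and the four activation heads. All function and variable names here are hypothetical; the real networks are learned encoders/decoders, not these closed-form stubs.

```python
import numpy as np

def unproject(depth, K, E):
    """Lift a dense depth map to 3D Gaussian centers in vehicle coordinates.

    depth: (H, W) metric depth; K: (3, 3) intrinsics;
    E: (4, 4) camera-to-vehicle extrinsics. Names are hypothetical.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)        # back-project to camera frame
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (E @ cam_h.T).T[:, :3]                                    # transform to vehicle frame

def gaussian_heads(raw_s, raw_q, raw_alpha, raw_c):
    """Map raw network head outputs to valid Gaussian parameters."""
    s = np.log1p(np.exp(raw_s))                                # softplus -> positive scales
    q = raw_q / np.linalg.norm(raw_q, axis=-1, keepdims=True)  # l2-normalize -> unit quaternion
    alpha = 1.0 / (1.0 + np.exp(-raw_alpha))                   # sigmoid -> opacity in (0, 1)
    return s, q, alpha, raw_c                                  # color coefficients stay linear
```

With identity intrinsics and extrinsics, a pixel (u, v) at depth d maps to the vehicle-frame point (u·d, v·d, d), which is the metric-scale behavior the self-supervised training relies on.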
2. 3D Gaussian Splatting Formulation
Splatting is the core rendering mechanism whereby 3D Gaussian primitives, predicted from input images, are rendered into novel target views:
- Projection: Each primitive's mean μ and covariance Σ (in vehicle coordinates) are projected into 2D using the target camera intrinsics K and extrinsics W. The image-plane Gaussian parameters are:
- μ′ = π(K · W · μ), with Σ′ = J · W · Σ · Wᵀ · Jᵀ, where π denotes perspective division and J is the Jacobian of the projective transform.
- Per-pixel Splatting Weight: wₖ(x) = αₖ · exp(−½ (x − μ′ₖ)ᵀ (Σ′ₖ)⁻¹ (x − μ′ₖ))
Each primitive contributes weighted color to pixels in the target view.
- Compositing: Primitives are sorted by increasing depth (front to back). At each pixel x:
Final color is accumulated via: C(x) = Σₖ cₖ · wₖ(x) · Πⱼ₌₁…ₖ₋₁ (1 − wⱼ(x))
All splatting and compositing are differentiable on GPU, following the 3DGS paradigm.
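The projection, weighting, and compositing steps above can be sketched in NumPy as follows. This is a simplified per-primitive illustration with hypothetical helper names and a pinhole Jacobian, not the tile-based GPU rasterizer used in practice by 3DGS.

```python
import numpy as np

def project_gaussian(mu, cov, K, W):
    """Project one 3D Gaussian (vehicle frame) into the target image plane."""
    R, t = W[:3, :3], W[:3, 3]               # vehicle -> camera extrinsics
    x, y, z = R @ mu + t                     # center in the camera frame
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    mu2d = np.array([fx * x / z + cx, fy * y / z + cy])
    J = np.array([[fx / z, 0.0, -fx * x / z**2],     # Jacobian of the
                  [0.0, fy / z, -fy * y / z**2]])    # perspective projection
    cov2d = J @ R @ cov @ R.T @ J.T
    return mu2d, cov2d

def splat_weight(px, mu2d, cov2d, alpha):
    """Per-pixel contribution of one projected Gaussian."""
    d = px - mu2d
    return alpha * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite(colors, weights):
    """Front-to-back alpha compositing over depth-sorted primitives."""
    out, transmittance = np.zeros(3), 1.0
    for c, w in zip(colors, weights):
        out += transmittance * w * c          # accumulate weighted color
        transmittance *= (1.0 - w)            # attenuate what lies behind
    return out
```

Note that a primitive's weight peaks at exactly α at its projected center, and the running transmittance term implements the Π(1 − wⱼ) product from the compositing equation.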
3. Self-supervision and Loss Functions
Training is fully self-supervised using only RGB surround-camera images (nuScenes dataset; no LiDAR or ground-truth extrinsics). The overall loss combines the pose/depth self-supervision and rendering terms: L = L_self + L_render.
- Pose/Depth self-supervision (L_self):
- Reprojection loss: pe(Iₐ, I_b) = (β/2) · (1 − SSIM(Iₐ, I_b)) + (1 − β) · |Iₐ − I_b|, the standard weighted SSIM-plus-L1 photometric error between a target image and a context image warped via the predicted depth and pose.
- Targeted reprojection: the per-pixel minimum of pe over the available context views, which suppresses occluded and out-of-view pixels.
- Multiple contexts: temporal (same camera, adjacent frames), spatial (adjacent cameras, same frame), and spatio-temporal (adjacent cameras, adjacent frames).
- Depth smoothness: L_s = |∂ₓ d*| · e^{−|∂ₓ I|} + |∂ᵧ d*| · e^{−|∂ᵧ I|}, an edge-aware first-order penalty on the mean-normalized depth d*.
- Loss weights balance the reprojection, smoothness, and rendering terms.
Rendering supervision (L_render):
- L_render = λ₁ · |Î − I| + λ₂ · (1 − SSIM(Î, I)), comparing rendered target views Î against the corresponding ground-truth images I.
- Regularization: Enforced via quaternion normalization (for Σ), softplus/sigmoid activations (for s and α), and the depth-smoothness term.
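As one concrete example, the edge-aware depth-smoothness term can be sketched in NumPy as below. The exact normalization and weighting in the paper may differ, so treat this as an assumption-laden illustration rather than the reference implementation.

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware first-order smoothness on mean-normalized depth.

    depth: (H, W); image: (H, W, 3). Depth gradients are penalized
    except where the image itself has strong gradients (likely edges).
    """
    d = depth / (depth.mean() + 1e-7)                      # mean-normalize
    dx = np.abs(d[:, 1:] - d[:, :-1])                      # horizontal depth gradient
    dy = np.abs(d[1:, :] - d[:-1, :])                      # vertical depth gradient
    ix = np.abs(image[:, 1:] - image[:, :-1]).mean(-1)     # image gradients,
    iy = np.abs(image[1:, :] - image[:-1, :]).mean(-1)     # averaged over channels
    return (dx * np.exp(-ix)).mean() + (dy * np.exp(-iy)).mean()
```

A perfectly flat depth map incurs zero penalty; depth gradients that are not explained by image edges are what the term discourages.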
4. Inference Workflow
Inference is strictly feed-forward, requiring only the Depth and Gaussian networks (and the differentiable renderer). The sequence proceeds as follows for a set of images {I_i}:
- For each input image I_i:
  - Depth network infers D_i and F_i.
  - Depth unprojection yields μ(u, v) for each pixel (u, v).
  - Gaussian network predicts per-pixel s, q, α, c.
- Pools all pixel-Gaussians into one set G.
- Renderer produces the novel view Î via splatting/compositing.
Feed-forward operation ensures outputs remain metrically consistent for arbitrary frame/camera combinations, without per-scene optimization. Inference on six surround cameras executes in approximately 0.3–0.6 seconds on a single A6000 GPU.
Below is a summary table of inference flow:
| Step | Operation | Output |
|---|---|---|
| Per image | DNet: I_i → D_i, F_i | Depth map, latent features |
| Per pixel | Unproject D_i with K_i, E_i | 3D primitive center μ |
| Per pixel | GNet: fusion, outputs Gaussian parameters | Σ (from s, q), α, c |
| All images/pixels | Aggregate | Set of all Gaussian primitives G |
| All primitives | Render view | Splatting-composited RGB image |
5. Parameterizations and Differentiable Rendering
DrivingForward assigns a 3D Gaussian primitive to every input pixel:
- Center (μ): By depth unprojection in vehicle coordinates.
- Covariance (Σ): Parameterized by axis-aligned scales s and a rotation quaternion q, converted via Σ = R(q) · diag(s)² · R(q)ᵀ.
- Opacity (α): Squashed to (0, 1) by a sigmoid.
- Color coefficients (c): Spherical-harmonics basis up to a fixed degree, for view-dependent effects.
The differentiable renderer implements:
- 3D→2D Gaussian projection (using the projective transform and its Jacobian J),
- Per-pixel Gaussian evaluation and weighted summation across sorted depths,
- Composite image assembly.
This enables fully end-to-end differentiable optimization during training, as well as fast, optimization-free inference.
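The covariance parameterization yields a valid (symmetric, positive semi-definite) Σ by construction. A minimal NumPy sketch, assuming a (w, x, y, z) quaternion convention:

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion in (w, x, y, z) convention."""
    w, x, y, z = q / np.linalg.norm(q)       # renormalize defensively
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(s, q):
    """Sigma = R diag(s)^2 R^T: symmetric positive semi-definite by construction."""
    R = quat_to_rot(np.asarray(q, dtype=float))
    return R @ np.diag(np.asarray(s, dtype=float) ** 2) @ R.T
```

Because the scales enter squared and are wrapped between a rotation and its transpose, any network output maps to a legal ellipsoidal covariance, which is what makes this parameterization safe to optimize end to end.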
6. Application Domain and Significance
DrivingForward is tailored to the automotive context, leveraging sparse, surround-view camera configurations on moving vehicles. Real-world data, such as the nuScenes dataset, is used for training and evaluation—no LiDAR or ground-truth extrinsics are required.
The architecture robustly handles the low overlap and extrinsic uncertainty typical of automotive multi-camera systems. Comparative experiments demonstrate superior scene reconstruction quality over existing feed-forward and scene-optimized methods, particularly in real-scale depth recovery and real-time novel-view synthesis (Tian et al., 2024). The model’s flexibility to accept arbitrary combinations of frames and views, while maintaining metric consistency and operational speed (<1 s), is notable for practical deployment.
7. Implementation Notes and Inference Pseudocode
The inference process is purely feed-forward, with no test-time optimization. The following pseudocode summarizes the steps:
function INFERENCE({I_i}):
    G_list ← []
    for each image I_i in input_set:
        D_i, F_i ← DEPTH_NETWORK(I_i)
        μ_i ← UNPROJECT(D_i, K_i, E_i)
        s_i, q_i, α_i, c_i ← GAUSSIAN_NETWORK(D_i, F_i)
        Σ_i ← quat2rot(q_i) · diag(s_i)² · quat2rot(q_i)ᵀ
        append (μ_i, Σ_i, α_i, c_i) to G_list
    end for
    I_render ← GAUSSIAN_SPLAT_RENDER(G_list, K_tgt, E_tgt)
    return I_render
Block diagram representation:
Iᵢ ──► D ──► {Dᵢ, Fᵢ}
              │
              └─► Unproject(Dᵢ) ⇒ μᵢ
                        │
                        ▼
              G(μᵢ, Dᵢ, Fᵢ) ⇒ {Σᵢ, αᵢ, cᵢ}
                        │
                        ▼
              Aggregate all gₖ = {μₖ, Σₖ, αₖ, cₖ}
                        │
                        ▼
              Differentiable Renderer ⇒ I_render
All architecture, loss, and inference pipeline details are fully specified in the original source (Tian et al., 2024).