DrivingForward 3D Scene Reconstruction
- DrivingForward is a feed-forward, self-supervised model that reconstructs 3D driving scenes from sparse, multi-camera data using 3D Gaussian splatting.
- It jointly trains Pose, Depth, and Gaussian networks with a differentiable renderer, achieving fast real-time inference with metric consistency.
- Its design mitigates challenges of limited view overlap and uncertain camera extrinsics, enabling robust automotive scene reconstruction.
DrivingForward is a feed-forward model designed for real-time driving scene reconstruction using 3D Gaussian splatting from flexible surround-view camera input. It addresses the challenges inherent to vehicle-mounted camera imagery, which is typically sparse with limited view overlap, and where vehicle motion makes reliable camera extrinsics difficult to obtain. DrivingForward reconstructs 3D scenes by jointly training three neural networks—Pose, Depth, and Gaussian—along with a differentiable 3D Gaussian-splatting renderer, all operating in a self-supervised regime without ground-truth depth or extrinsics, and is capable of fast, feed-forward inference from arbitrary subsets of multi-frame, multi-camera data (Tian et al., 2024).
1. Model Architecture
DrivingForward comprises three main components—Pose Network (P), Depth Network (D), and Gaussian Network (G)—co-trained with a differentiable renderer.
- Pose Network (P): Receives pairs of images from temporal (same camera, different times), spatial (different cameras, same time), or spatio-temporal contexts. Outputs a relative transformation estimated using a ResNet-18 backbone encoder followed by a two-layer MLP producing a 6D vector (axis-angle rotation + translation). Photometric reprojection losses supervise this transformation.
- Depth Network (D): Consumes a single image I_i and outputs a dense depth map D_i along with intermediate latent features F_i. Its backbone consists of a ResNet-18 encoder, a volumetric feature-fusion module, and a U-Net decoder supporting multi-scale depth prediction.
- Gaussian Network (G): For each input pixel, predicts the parameters of a Gaussian primitive g = (μ, Σ, α, c).
- The center μ is formed by unprojecting the depth: μ(u, v) = E · K⁻¹ · D(u, v) · [u, v, 1]ᵀ, where K and E are the camera intrinsics and extrinsics, all in vehicle coordinates.
- Covariance Σ = R(q) · diag(s)² · R(q)ᵀ, with s ∈ ℝ³ (axis scales) and q a unit quaternion.
- Opacity α ∈ (0, 1) and color coefficients c in a spherical-harmonics basis.
- The network has a depth encoder (a U-Net on D_i), a fusion decoder (merging depth and image features F_i), and four output "heads" for s (softplus), q (ℓ₂ normalization), α (sigmoid), and c (linear).
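To make the per-pixel parameterization concrete, here is a minimal NumPy sketch of the depth unprojection and the four activation heads. All function and variable names here are hypothetical; the real networks are learned encoders/decoders, not these closed-form stubs.

```python
import numpy as np

def unproject(depth, K, E):
    """Lift a dense depth map to 3D Gaussian centers in vehicle coordinates.

    depth: (H, W) metric depth; K: (3, 3) intrinsics;
    E: (4, 4) camera-to-vehicle extrinsics. Names are hypothetical.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)        # back-project to camera frame
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (E @ cam_h.T).T[:, :3]                                    # transform to vehicle frame

def gaussian_heads(raw_s, raw_q, raw_alpha, raw_c):
    """Map raw network head outputs to valid Gaussian parameters."""
    s = np.log1p(np.exp(raw_s))                                # softplus -> positive scales
    q = raw_q / np.linalg.norm(raw_q, axis=-1, keepdims=True)  # l2-normalize -> unit quaternion
    alpha = 1.0 / (1.0 + np.exp(-raw_alpha))                   # sigmoid -> opacity in (0, 1)
    return s, q, alpha, raw_c                                  # color coefficients stay linear
```

With identity intrinsics and extrinsics, a pixel (u, v) at depth d maps to the vehicle-frame point (u·d, v·d, d), which is the metric-scale behavior the self-supervised training relies on.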
2. 3D Gaussian Splatting Formulation
Splatting is the core rendering mechanism whereby 3D Gaussian primitives, predicted from input images, are rendered into novel target views:
- Projection: Each primitive's mean μ and covariance Σ (in vehicle coordinates) are projected into 2D using the target camera intrinsics K and extrinsics W. The image-plane Gaussian parameters are:
- μ′ = π(K · W · μ), with Σ′ = J · W · Σ · Wᵀ · Jᵀ, where π denotes perspective division and J is the Jacobian of the projective transform.
- Per-pixel Splatting Weight: wₖ(x) = αₖ · exp(−½ (x − μ′ₖ)ᵀ (Σ′ₖ)⁻¹ (x − μ′ₖ))
Each primitive contributes weighted color to pixels in the target view.
- Compositing: Primitives are sorted by increasing depth (front to back). At each pixel x:
Final color is accumulated via: C(x) = Σₖ cₖ · wₖ(x) · Πⱼ₌₁…ₖ₋₁ (1 − wⱼ(x))
All splatting and compositing are differentiable on GPU, following the 3DGS paradigm.
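The projection, weighting, and compositing steps above can be sketched in NumPy as follows. This is a simplified per-primitive illustration with hypothetical helper names and a pinhole Jacobian, not the tile-based GPU rasterizer used in practice by 3DGS.

```python
import numpy as np

def project_gaussian(mu, cov, K, W):
    """Project one 3D Gaussian (vehicle frame) into the target image plane."""
    R, t = W[:3, :3], W[:3, 3]               # vehicle -> camera extrinsics
    x, y, z = R @ mu + t                     # center in the camera frame
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    mu2d = np.array([fx * x / z + cx, fy * y / z + cy])
    J = np.array([[fx / z, 0.0, -fx * x / z**2],     # Jacobian of the
                  [0.0, fy / z, -fy * y / z**2]])    # perspective projection
    cov2d = J @ R @ cov @ R.T @ J.T
    return mu2d, cov2d

def splat_weight(px, mu2d, cov2d, alpha):
    """Per-pixel contribution of one projected Gaussian."""
    d = px - mu2d
    return alpha * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite(colors, weights):
    """Front-to-back alpha compositing over depth-sorted primitives."""
    out, transmittance = np.zeros(3), 1.0
    for c, w in zip(colors, weights):
        out += transmittance * w * c          # accumulate weighted color
        transmittance *= (1.0 - w)            # attenuate what lies behind
    return out
```

Note that a primitive's weight peaks at exactly α at its projected center, and the running transmittance term implements the Π(1 − wⱼ) product from the compositing equation.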
3. Self-supervision and Loss Functions
Training is fully self-supervised using only RGB surround-camera images (nuScenes dataset; no LiDAR or ground-truth extrinsics). The overall loss combines the pose/depth self-supervision and rendering terms: L = L_self + L_render.
- Pose/Depth self-supervision (L_self):
- Reprojection loss: pe(Iₐ, I_b) = (β/2) · (1 − SSIM(Iₐ, I_b)) + (1 − β) · |Iₐ − I_b|, the standard weighted SSIM-plus-L1 photometric error between a target image and a context image warped via the predicted depth and pose.
- Targeted reprojection: the per-pixel minimum of pe over the available context views, which suppresses occluded and out-of-view pixels.
- Multiple contexts: temporal (same camera, adjacent frames), spatial (adjacent cameras, same frame), and spatio-temporal (adjacent cameras, adjacent frames).
- Depth smoothness: L_s = |∂ₓ d*| · e^{−|∂ₓ I|} + |∂ᵧ d*| · e^{−|∂ᵧ I|}, an edge-aware first-order penalty on the mean-normalized depth d*.
- Loss weights balance the reprojection, smoothness, and rendering terms.
Rendering supervision (L_render):
- L_render = λ₁ · |Î − I| + λ₂ · (1 − SSIM(Î, I)), comparing rendered target views Î against the corresponding ground-truth images I.
- Regularization: Enforced via quaternion normalization (for Σ), softplus/sigmoid activations (for s and α), and the depth-smoothness term.
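As one concrete example, the edge-aware depth-smoothness term can be sketched in NumPy as below. The exact normalization and weighting in the paper may differ, so treat this as an assumption-laden illustration rather than the reference implementation.

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware first-order smoothness on mean-normalized depth.

    depth: (H, W); image: (H, W, 3). Depth gradients are penalized
    except where the image itself has strong gradients (likely edges).
    """
    d = depth / (depth.mean() + 1e-7)                      # mean-normalize
    dx = np.abs(d[:, 1:] - d[:, :-1])                      # horizontal depth gradient
    dy = np.abs(d[1:, :] - d[:-1, :])                      # vertical depth gradient
    ix = np.abs(image[:, 1:] - image[:, :-1]).mean(-1)     # image gradients,
    iy = np.abs(image[1:, :] - image[:-1, :]).mean(-1)     # averaged over channels
    return (dx * np.exp(-ix)).mean() + (dy * np.exp(-iy)).mean()
```

A perfectly flat depth map incurs zero penalty; depth gradients that are not explained by image edges are what the term discourages.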
4. Inference Workflow
Inference is strictly feed-forward, requiring only the Depth and Gaussian networks (and the differentiable renderer). The sequence proceeds as follows for a set of images {I_i}:
- For each input image I_i:
  - Depth network infers D_i and F_i.
  - Depth unprojection yields μ(u, v) for each pixel (u, v).
  - Gaussian network predicts per-pixel s, q, α, c.
- Pools all pixel-Gaussians into one set G.
- Renderer produces the novel view Î via splatting/compositing.
Feed-forward operation ensures outputs remain metrically consistent for arbitrary frame/camera combinations, without per-scene optimization. Inference on six surround cameras executes in approximately 0.3–0.6 seconds on a single A6000 GPU.
Below is a summary table of inference flow:
| Step | Operation | Output |
|---|---|---|
| Per image | DNet: I_i → D_i, F_i | Depth map, latent features |
| Per pixel | Unproject D_i with K_i, E_i | 3D primitive center μ |
| Per pixel | GNet: fusion, outputs Gaussian parameters | Σ (from s, q), α, c |
| All images/pixels | Aggregate | Set of all Gaussian primitives G |
| All primitives | Render view | Splatting-composited RGB image |
5. Parameterizations and Differentiable Rendering
DrivingForward assigns a 3D Gaussian primitive to every input pixel:
- Center (μ): By depth unprojection in vehicle coordinates.
- Covariance (Σ): Parameterized by axis-aligned scales s and a rotation quaternion q, converted via Σ = R(q) · diag(s)² · R(q)ᵀ.
- Opacity (α): Squashed to (0, 1) by a sigmoid.
- Color coefficients (c): Spherical-harmonics basis up to a fixed degree, for view-dependent effects.
The differentiable renderer implements:
- 3D→2D Gaussian projection (using the projective transform and its Jacobian J),
- Per-pixel Gaussian evaluation and weighted summation across sorted depths,
- Composite image assembly.
This enables fully end-to-end differentiable optimization during training, as well as fast, optimization-free inference.
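The covariance parameterization yields a valid (symmetric, positive semi-definite) Σ by construction. A minimal NumPy sketch, assuming a (w, x, y, z) quaternion convention:

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion in (w, x, y, z) convention."""
    w, x, y, z = q / np.linalg.norm(q)       # renormalize defensively
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(s, q):
    """Sigma = R diag(s)^2 R^T: symmetric positive semi-definite by construction."""
    R = quat_to_rot(np.asarray(q, dtype=float))
    return R @ np.diag(np.asarray(s, dtype=float) ** 2) @ R.T
```

Because the scales enter squared and are wrapped between a rotation and its transpose, any network output maps to a legal ellipsoidal covariance, which is what makes this parameterization safe to optimize end to end.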
6. Application Domain and Significance
DrivingForward is tailored to the automotive context, leveraging sparse, surround-view camera configurations on moving vehicles. Real-world data, such as the nuScenes dataset, is used for training and evaluation—no LiDAR or ground-truth extrinsics are required.
The architecture robustly handles the low overlap and extrinsic uncertainty typical of automotive multi-camera systems. Comparative experiments demonstrate superior scene reconstruction quality over existing feed-forward and scene-optimized methods, particularly in real-scale depth recovery and real-time novel-view synthesis (Tian et al., 2024). The model’s flexibility to accept arbitrary combinations of frames and views, while maintaining metric consistency and operational speed (<1 s), is notable for practical deployment.
7. Implementation Notes and Inference Pseudocode
The inference process is purely feed-forward, with no test-time optimization. The following pseudocode summarizes the steps:
function INFERENCE({I_i}):
    G_list ← []
    for each image I_i in input_set:
        D_i, F_i ← DEPTH_NETWORK(I_i)
        μ_i ← UNPROJECT(D_i, K_i, E_i)
        s_i, q_i, α_i, c_i ← GAUSSIAN_NETWORK(D_i, F_i)
        Σ_i ← quat2rot(q_i) · diag(s_i)² · quat2rot(q_i)ᵀ
        append (μ_i, Σ_i, α_i, c_i) to G_list
    end for
    I_render ← GAUSSIAN_SPLAT_RENDER(G_list, K_tgt, E_tgt)
    return I_render
Block diagram representation:
Iᵢ ──► D ──► {Dᵢ, Fᵢ}
              │
              └─► Unproject(Dᵢ) ⇒ μᵢ
                        │
                        ▼
              G(μᵢ, Dᵢ, Fᵢ) ⇒ {Σᵢ, αᵢ, cᵢ}
                        │
                        ▼
              Aggregate all gₖ = {μₖ, Σₖ, αₖ, cₖ}
                        │
                        ▼
              Differentiable Renderer ⇒ I_render
All architecture, loss, and inference pipeline details are fully specified in the original source (Tian et al., 2024).