
Feed-Forward Novel View Synthesis

Updated 5 February 2026
  • Feed-Forward Novel View Synthesis is a neural rendering method that predicts novel views in a single pass using sparse reference images and known camera poses.
  • Recent approaches integrate explicit 3D geometric representations and advanced feature fusion (e.g., 3D Gaussian Splatting) to ensure high view fidelity and geometric consistency.
  • Generative refinement stages, such as diffusion-based modules, further enhance high-frequency details, enabling near real-time performance for diverse scenes.

Feed-Forward Novel View Synthesis (NVS) is a class of methods in computer vision that predict images of a scene from unseen viewpoints in a single forward pass of a neural model, given a sparse set of reference images with known camera poses. Unlike optimization-based neural rendering approaches—which require per-scene gradient descent to fit scene representations—feed-forward NVS frameworks train a generalizable model that directly maps observed views and camera metadata to the target view, enabling near real-time inference and generalization to novel scenes (Wu et al., 8 Jan 2026, Dong et al., 20 Jan 2026). Recent advances in this field have led to notable improvements in view fidelity, geometric consistency, and efficiency, with architectures increasingly leveraging explicit 3D structure, advanced feature fusion, and generative refinement modules.

1. Foundational Principles and Problem Scope

Feed-forward NVS seeks a single-pass mapping

$$\hat{I}^t = M\big(\{(I_i, K_i, R_i, t_i)\}_{i=1}^{N},\; K^t, R^t, t^t\big)$$

that predicts the image $\hat{I}^t$ of a scene from a novel target viewpoint, conditioned on $N$ context RGB images $I_i$ and their camera intrinsics/extrinsics $(K_i, R_i, t_i)$. This setting contrasts with scene-optimized neural rendering pipelines (e.g., NeRF, per-scene 3D Gaussian Splatting) which perform per-scene fitting; feed-forward methods instead emphasize universal models that can regress $\hat{I}^t$ for arbitrary scenes and camera configurations, trading some reconstruction accuracy for drastic speed gains and broad applicability (Wu et al., 8 Jan 2026, Wang et al., 29 May 2025, Min et al., 2024).
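The interface this mapping implies can be sketched as follows. The blend-by-pose-proximity "model" below is a deliberately trivial stand-in (a hypothetical baseline, not any of the cited methods); real systems replace the blend with a learned network, but the single-pass signature (context images and poses in, target image out) is the point:

```python
import numpy as np

def feed_forward_nvs(context_images, context_poses, target_pose):
    """Toy stand-in for the single-pass mapping M.

    context_images: list of (H, W, 3) arrays
    context_poses, target_pose: 4x4 world-to-camera matrices [R|t]

    Weights each context view by proximity of its camera center
    (-R^T t) to the target's, then blends. Illustrative only.
    """
    centers = np.stack([-p[:3, :3].T @ p[:3, 3] for p in context_poses])
    target_center = -target_pose[:3, :3].T @ target_pose[:3, 3]
    dists = np.linalg.norm(centers - target_center, axis=1)
    weights = np.exp(-dists)
    weights /= weights.sum()
    # Weighted blend over the view axis: (N,) x (N, H, W, 3) -> (H, W, 3)
    return np.tensordot(weights, np.stack(context_images), axes=1)
```

A learned model would additionally exploit geometry (depth, epipolar structure) rather than blending in image space, which is exactly what the architectures in the following sections add.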

Key objectives:

  • High view fidelity and geometric consistency across rendered viewpoints.
  • Near real-time inference without per-scene optimization.
  • Generalization to novel scenes and camera configurations from sparse inputs.

2. Geometric Backbones and Feature Encoding

Current feed-forward NVS pipelines incorporate explicit or implicit geometric representations to enhance view consistency:

2.1 3D Gaussian Splatting (3DGS) Backbones

  • 3DGS Pipeline: Inputs are encoded with shared CNNs or ViT encoders and combined across views (via cross-view Transformers or ViTs), producing a global feature volume. Prediction heads estimate per-pixel or per-patch 3D Gaussian parameters $\{\mu_j, \Sigma_j, o_j, c_j, f_j\}$, where $\mu_j$ is the center, $\Sigma_j$ the projected covariance, $o_j$ the opacity, $c_j$ the SH color, and $f_j$ a high-frequency detail feature (Dong et al., 20 Jan 2026, Chen et al., 2024).
    • Rendering employs rasterization/splatting: for pixel $u$, the rendered value is a weighted sum of projected Gaussians with kernel density $\mathcal{N}(u; u_j, \Sigma_j)$ (Dong et al., 20 Jan 2026).
  • Depth & Alignment: Some models (eFreeSplat) forego epipolar constraints, instead using ViTs for dense cross-view matching and iterative alignment methods to ensure scale-consistent depths across all input views (Min et al., 2024). Others (MVSplat360) utilize plane sweep volumes and cost volumes to inform depth, allowing prediction even under very sparse input coverage (Chen et al., 2024).
  • Information Bottlenecks: To scale to many input views, ZPressor inserts a cross-attention compression module between image encoding and Gaussian prediction, reducing redundancy in multi-view features and preventing memory/computation explosion (Wang et al., 29 May 2025).
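The per-pixel weighted sum over projected Gaussians that splatting performs can be sketched in a few lines. This is a minimal, unoptimized compositor under stated assumptions (Gaussians pre-sorted by depth; a real 3DGS rasterizer also culls and tiles), and all names and shapes are illustrative:

```python
import numpy as np

def splat_pixel(u, centers, covs, opacities, colors):
    """Front-to-back alpha compositing of projected 2D Gaussians at
    pixel u, following the weighted-sum formulation with kernel
    density N(u; u_j, Sigma_j).

    u: (2,) pixel location; centers: (J, 2) projected means;
    covs: (J, 2, 2) projected covariances; opacities: (J,);
    colors: (J, 3). Gaussians assumed sorted near-to-far.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for mu, cov, o, c in zip(centers, covs, opacities, colors):
        d = u - mu
        # Unnormalized Gaussian kernel density modulated by opacity
        alpha = o * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
        out += transmittance * alpha * c
        transmittance *= (1.0 - alpha)
    return out
```

Production rasterizers evaluate this per tile on the GPU with precomputed inverse covariances; the per-pixel math is the same.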

2.2 Alternative Geometric and Feature Representations

  • Depth-guided skip connections: Encoder-decoder architectures extract and warp multi-scale features from the source image to the target via predicted depth, enabling correct alignment and sharp structures without explicit 3D representations (Hou et al., 2021).
  • Plane-Sweep Volume (PSV) and Layered Approaches: Methods like fMPI build explicit PSVs by homographic reprojection and process these in grouped or super-sampled slabs for efficiency. Downstream networks then produce MPI or MLI representations rendered via over-operation and differentiable rasterization (Kohler et al., 2023, Solovev et al., 2022).
  • Projective Conditioning: Instead of ray-based or absolute camera encodings, projective cues are built by projecting reference views into the target image, stabilizing the input space against gauge ambiguities and facilitating self-supervised, masked-autoencoding pretraining (Wu et al., 8 Jan 2026).
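The "over" operation used to render MPI/MLI representations reduces to back-to-front alpha blending of the planes. A minimal sketch, assuming the planes are ordered near-to-far and have already been warped into the target view (the homographic reprojection step is omitted here):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front 'over' compositing of a multiplane image.

    colors: (D, H, W, 3) per-plane RGB, ordered near-to-far
    alphas: (D, H, W) per-plane opacity in [0, 1]

    Iterates far-to-near so each nearer plane is composited
    over everything behind it.
    """
    out = np.zeros(colors.shape[1:])
    for c, a in zip(colors[::-1], alphas[::-1]):
        out = c * a[..., None] + out * (1.0 - a[..., None])
    return out
```

Because the operation is differentiable in both colors and alphas, the plane contents can be trained end-to-end, which is what makes the layered pipelines above feed-forward trainable.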

3. High-Fidelity Refinement and Generative Enhancements

Despite explicit 3D reasoning, feed-forward Gaussians or MPIs alone often produce over-smoothed results and lose high-frequency texture, especially in sparsely observed or disoccluded regions. Contemporary pipelines employ additional generative stages to restore realism:

  • Dual-Domain Feature Extraction and Diffusion Refinement: The One-Shot Refiner framework introduces a dual-branch, detail-aware feature extractor (CNN spatial + FFT frequency) fused with geometric ViT features, passing detail-rich features to Gaussian heads (Dong et al., 20 Jan 2026). A single-step feature-guided diffusion module (Stable Diffusion UNet) refines the coarse splatted image latent, guided by both rendered features and source references.
  • Video and 4D Diffusion for Temporal Consistency: For dynamic or 360° scenes, models like WorldSplat and MVSplat360 train on temporally aggregated inputs and use 4D-aware transformers with cross-view and temporal attention to generate Gaussians and refine the output via video diffusion. Enhanced variants encode BEV sketches, depth, semantics, and can be conditioned on text or trajectory (Zhu et al., 27 Sep 2025, Chen et al., 2024).
  • Iterative Feed-Forward Correction: SIMPLI and fMPI apply learned feed-forward update rules over initial multiplane representations or PSVs, correcting discrepancies and collapsing planes into a small set of deformable, textured layers (Solovev et al., 2022, Kohler et al., 2023). This enables real-time inference with compact scene proxies.

4. Training Regimes, Losses, and Scalability

Unified and multi-stage training strategies are common:

  • Losses: Reconstruction (MSE/$\ell_2$), perceptual (LPIPS, VGG), adversarial (GAN), depth-consistency, and pixel/feature-level error feedback are standard (Dong et al., 20 Jan 2026, Chen et al., 2024).
  • Joint Optimization: Example—One-Shot Refiner jointly optimizes the ViT 3DGS backbone and the diffusion module using a sum of reconstruction, perceptual, and GAN losses, with staged freezing/ramping to stabilize training (Dong et al., 20 Jan 2026).
  • Pretraining and Self-Supervision: Masked autoencoding with projective cues allows large-scale uncalibrated data to be leveraged before fine-tuning on paired NVS datasets (Wu et al., 8 Jan 2026).
  • Scalable Fusion: As the input view count increases, ZPressor compresses features into a small anchor set that cross-attends to the rest, preserving performance beyond dozens of views without extreme memory or compute growth (Wang et al., 29 May 2025).
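The bottlenecked-fusion idea can be illustrated with single-head cross-attention in NumPy. This is a simplified sketch in the spirit of ZPressor, not its actual module: the anchor initialization below is an assumption (the real module learns its anchors and uses multi-head attention inside the encoder):

```python
import numpy as np

def compress_views(tokens, num_anchors=4):
    """Compress N per-view feature tokens into num_anchors tokens.

    tokens: (N, C) features pooled from N input views.
    Anchors (here: the first num_anchors tokens, an illustrative
    choice) cross-attend to all tokens, so downstream stages see
    O(num_anchors) features instead of O(N).
    """
    n, c = tokens.shape
    anchors = tokens[:num_anchors]                 # (A, C) queries
    scores = anchors @ tokens.T / np.sqrt(c)       # (A, N) scaled dot-product
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ tokens                           # (A, C) compressed set
```

Downstream Gaussian-prediction heads then consume the compressed set, which is why performance no longer degrades with dozens of input views.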

5. Empirical Performance and Benchmarks

Feed-forward NVS models are benchmarked on datasets such as RealEstate10K, DL3DV-10K, nuScenes, and custom large-scale 360° or urban scene datasets (Dong et al., 20 Jan 2026, Zhu et al., 27 Sep 2025, Chen et al., 2024, Rauniyar et al., 11 Jan 2025). Key findings:

Method/Setting | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | Notes
One-Shot Refiner | 22.67 | 0.69 | 0.16 | 40.46 | DL3DV, 512²; vs. 18.27 PSNR for AnySplat (Dong et al., 20 Jan 2026)
DepthSplat + ZPressor | 23.88 | 0.815 | 0.150 | — | DL3DV-10K, 36 views; baseline 19.23/0.666/0.286 (Wang et al., 29 May 2025)
eFreeSplat | 26.45 | 0.865 | 0.126 | — | RE10K, 2-view setting (Min et al., 2024)
MVSplat360 | 16.81 | 0.514 | 0.418 | 17.01 | DL3DV-10K, n=300 setting (Chen et al., 2024)

Empirical trends:

  • Generative/diffusion-enhanced frameworks consistently outperform Gaussians-only or PSV-only baselines in high-frequency detail and FID/LPIPS perceptual metrics (Dong et al., 20 Jan 2026, Zhu et al., 27 Sep 2025).
  • Bottleneck-aware models (ZPressor) recover performance at high view-count regimes where vanilla feed-forward approaches saturate (Wang et al., 29 May 2025).
  • Qualitative improvements include greater consistency in thin structures, less "blurring" in unseen regions, and plausible inpainting of occluded areas.

6. Architectures for Dynamic, Outdoor, and Large-Scale Scenes

Specialized strategies address challenges in extreme settings:

  • Dynamic Scenes: CogNVS and WildRayZer segment co-visible and dynamic pixels, using sparse SLAM followed by feed-forward video inpainting (CogNVS: self-supervised diffusion on masked regions) or explicit motion-masked scene encoding for transient objects (WildRayZer: pseudo-mask from residuals, motion head masks gradients/input patches) (Chen et al., 16 Jul 2025, Chen et al., 15 Jan 2026).
  • Large-Scale Outdoor: Aug3D couples feed-forward PixelNeRF with geometry-based synthetic view augmentation, raising PSNR by constructing SfM-derived image clusters and domes over real/semantic building locales for robust training (Rauniyar et al., 11 Jan 2025).
  • 360° NVS: MVSplat360 and similar architectures combine multi-view feature fusion, cost volumes, and video diffusion to handle wide or near-complete sweep input gaps, achieving state-of-the-art in challenging, sparse settings (Chen et al., 2024).
  • Real-Time Constraints: fMPI and SIMPLI demonstrate that input grouping, super-sampling, and lightweight differentiable layer aggregation permit 50–100× speedups over prior PSV/IBR methods with state-of-the-art fidelity (Kohler et al., 2023, Solovev et al., 2022).

7. Limitations, Open Problems, and Future Directions

Despite rapid advances, several critical challenges persist:

  • Data Regime Sensitivity: Most architectures remain data-hungry, with limited robustness under wide baseline, occlusion, or non-overlap. Domain drift can degrade generalization (Min et al., 2024).
  • Sparse/Extreme Input Cases: Fidelity decreases for regions never seen in any context view, especially when learning-based hallucination of unseen geometry or appearance is required (Chen et al., 2024, Rauniyar et al., 11 Jan 2025).
  • Scalability: Linear growth in observed Gaussians or PSV size with view count is addressed only partially by modules like ZPressor; ultra-dense or kilometer-scale scenes remain challenging (Wang et al., 29 May 2025).
  • Dynamic/Transient Handling: Accurate masking and completion for dynamic content, particularly in cases of fine object motion or non-rigid deformation, is an open topic—as are improved unsupervised dynamic segmentation and temporal coherence (Chen et al., 16 Jul 2025, Chen et al., 15 Jan 2026).
  • Ultra-high Resolution and Speed: Advances are needed in memory-efficient rendering, mixed-precision kernels, and diffusion model acceleration for real-time deployment at resolutions beyond 480P or 800 px width (Wang et al., 29 May 2025, Chen et al., 2024).
  • Integration of Multimodal/Auxiliary Cues: Emerging work incorporates BEV semantics, text, trajectory, or sensor data (LiDAR) into feed-forward capacities for controllable, robust scene synthesis beyond static RGB (Zhu et al., 27 Sep 2025).

Across paradigms, the consensus is that coupling explicit 3D geometric reasoning, bottlenecked or projective input encoding, and generative refinement architectures enables feed-forward NVS systems to achieve geometric fidelity, spatial/temporal consistency, and practical inference speed without per-scene optimization. Nevertheless, robustness to out-of-distribution settings, dynamic regions, and extremely large-scale or sparse-view inputs remains an active and essential area for research evolution (Dong et al., 20 Jan 2026, Wu et al., 8 Jan 2026, Min et al., 2024, Zhu et al., 27 Sep 2025).
