From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

Published 8 Jan 2026 in cs.CV | (2601.05116v1)

Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.


Summary

The paper introduces projective conditioning as a robust alternative to Plücker ray conditioning, improving feed-forward view synthesis by encoding inputs as projection images.
It leverages a masked auto-encoding pretraining strategy combined with DINOv3 features to boost data efficiency and perceptual quality.
Experiments show significant PSNR gains and enhanced geometric consistency under camera perturbations compared to traditional methods like LVSM.

Projective Conditioning for Robust Feed-Forward View Synthesis

Introduction

"From Rays to Projections: Better Inputs for Feed-Forward View Synthesis" (2601.05116) presents a significant reevaluation of input conditioning for feed-forward novel view synthesis models. The paper identifies critical instability issues with absolute Plücker ray conditioning—a representation underpinning recent scalable view synthesis transformers such as LVSM. The authors propose projective conditioning, wherein models are supplied with point cloud projection images derived from context views and depths, shifting the paradigm from brittle geometric regression in high-dimensional ray space to a stable, visually coherent 2D image-to-image translation. A dedicated masked auto-encoding pretraining strategy leverages this 2D structure, enabling scalable self-supervised training on uncalibrated data.

Analysis of Ray Conditioning Instabilities

Conventional large view synthesis models encode pixel-wise camera rays using Plücker coordinates, treating the synthesis task as absolute regression in 6D ray space. This framework is theoretically problematic: Plücker coordinates are not invariant under global SE(3) or Sim(3) transformations of the world frame, and even small camera movements perturb every input token rather than a local subset. The neural backbone thus overfits to arbitrary world coordinate gauges, which manifests as severe degradation under basic camera operations:

Figure 1: Under a random global SE(3) transformation, ray-conditioned models fail while projective conditioning remains robust.

Empirical results demonstrate pronounced artifacts and collapse in LVSM and related ray-conditioned approaches under coordinate changes, as well as common field-of-view, aspect-ratio, or roll modifications. These findings reveal Plücker rays as an over-parameterized, gauge-dependent representation ill-suited for consistent 3D rendering.
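The gauge dependence described above can be illustrated with a small NumPy sketch. It builds a per-pixel Plücker ray map (direction `d` and moment `o x d`) and compares it before and after a global rigid transform of the world frame. The function names, the pinhole intrinsics `K`, and the camera-to-world pose convention are assumptions made for illustration; they are not taken from the paper or from LVSM.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(o, d.shape), d)            # moment o x d
    return np.concatenate([d, m], axis=-1)

def random_se3(rng):
    """Sample a random rigid transform as a 4x4 matrix (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1                                       # force a proper rotation
    G = np.eye(4)
    G[:3, :3], G[:3, 3] = q, rng.normal(size=3)
    return G

rng = np.random.default_rng(0)
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)                    # camera placed at the world origin
G = random_se3(rng)                 # arbitrary change of world frame

rays = plucker_ray_map(K, pose, 32, 32)
rays_moved = plucker_ray_map(K, G @ pose, 32, 32)

# The scene and the camera are physically unchanged, yet the network input differs.
gauge_change = np.abs(rays - rays_moved).mean()
print("mean absolute change of the ray map:", gauge_change)
```

Every pixel of the ray map changes under `G`, even though the rendered image should be identical, which is the sense in which the representation is gauge-dependent.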

Projective Conditioning: Methodology

Projective conditioning eschews explicit camera parameter embeddings. Instead, depth maps are estimated for the context views using off-the-shelf perception models, unprojected into a unified point cloud, and rasterized as observed from the target frustum. The resulting point cloud projection image encodes geometric relationships as a stable 2D buffer that does not depend on the choice of world coordinate frame. The model thus receives both context RGB images and the projection cue, and completes the target view via image-to-image mapping.
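A minimal sketch of how such a projection image could be built is given below, assuming pinhole intrinsics, camera-to-world poses, and a simple one-pixel z-buffered point splat. The helpers `unproject` and `project_points` are hypothetical names, and the paper's actual rasterizer may differ. The last lines re-run the pipeline after a global rigid change of world frame to show that the resulting cue is unchanged.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    pts_cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]

def project_points(points, colors, K, cam_to_world, height, width):
    """Z-buffered point splat into the target view; empty pixels stay 0."""
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6                      # keep points in front of the camera
    pts_cam, colors = pts_cam[front], colors[front]
    uvw = pts_cam @ K.T
    px = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    py = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (px >= 0) & (px < width) & (py >= 0) & (py < height)
    flat = py[inside] * width + px[inside]
    z, colors = pts_cam[inside, 2], colors[inside]
    # Sort by pixel index, then by depth, and keep the nearest point per pixel.
    order = np.lexsort((z, flat))
    first = np.ones(len(order), dtype=bool)
    first[1:] = flat[order][1:] != flat[order][:-1]
    nearest = order[first]
    image = np.zeros((height * width, 3))
    image[flat[nearest]] = colors[nearest]
    return image.reshape(height, width, 3)

# Toy scene (illustrative values): one context view of a fronto-parallel plane.
rng = np.random.default_rng(0)
H = W = 16
K = np.array([[50.0, 0.0, 8.0], [0.0, 50.0, 8.0], [0.0, 0.0, 1.0]])
ctx_rgb = rng.uniform(0.1, 1.0, size=(H, W, 3))
ctx_depth = np.full((H, W), 2.0)
ctx_pose = np.eye(4)
tgt_pose = np.eye(4)
tgt_pose[0, 3] = 0.05               # target camera shifted sideways

points = unproject(ctx_depth, K, ctx_pose)
novel = project_points(points, ctx_rgb.reshape(-1, 3), K, tgt_pose, H, W)

# Apply the same rigid transform G to every pose: the projection cue is unchanged.
c, s = np.cos(0.7), np.sin(0.7)
G = np.eye(4)
G[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
G[:3, 3] = [1.0, -2.0, 0.5]
points_g = unproject(ctx_depth, K, G @ ctx_pose)
novel_gauge = project_points(points_g, ctx_rgb.reshape(-1, 3), K, G @ tgt_pose, H, W)
print("max difference after gauge change:", np.abs(novel - novel_gauge).max())
```

Pixels that no point lands on stay at zero. These holes are the regions the network has to complete from the context images.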

Figure 2: An overview of the two-stage training pipeline. Pretraining is self-supervised and fine-tuning uses projected point clouds providing geometric cues.

A decoder-only ViT backbone processes patch-wise tokens from context images, DINOv3 features, and the point cloud projection. To ensure token uniqueness, rotary positional embeddings (RoPE) are applied, which resolves the ambiguity that arises when empty regions of the projection image are patchified into identical tokens.
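The token-uniqueness issue can be sketched with a simplified 1D rotary embedding. Identical tokens stand in for the embeddings of empty projection patches; without positional information their dot-product scores are all equal, whereas after the rotation the scores depend on relative position. The 1D formulation, the base of 10000, and the toy feature values are assumptions for illustration only; the paper's positional scheme for 2D patch grids is not reproduced here.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (N, D) by position-dependent angles."""
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)        # (D/2,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]     # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# Four identical tokens, standing in for embeddings of empty projection patches.
tokens = np.tile(np.linspace(0.1, 0.8, 8), (4, 1))
positions = np.arange(4, dtype=float)

plain_scores = tokens @ tokens.T                     # every entry is the same
rotated = rope_1d(tokens, positions)
rope_scores = rotated @ rotated.T                    # entries depend on i - j

print(np.round(plain_scores, 3))
print(np.round(rope_scores, 3))
```

In the rotated case the score matrix is constant along each diagonal, so attention can tell apart two otherwise identical empty patches by their offset.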

The authors provide a formal quotient-space interpretation: projective conditioning maps the configuration space of context images, depths, and camera poses into the quotient space modulo global SE(3) transformation. Thus, the input is invariant by construction, eliminating the need for network-learned coordinate invariance.
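The invariance claim can be checked with a short derivation. The notation below is an assumption introduced for illustration and is not the paper's: $T_i, T_t \in SE(3)$ are camera-to-world poses of context view $i$ and the target view, $K_i, K_t$ are intrinsics, $D_i(u)$ is the depth at pixel $u$ with homogeneous coordinates $\tilde{u}$, and $\pi$ denotes perspective division.

```latex
% Pixel u of context view i, lifted to the world frame (homogeneous coordinates):
X = T_i \begin{pmatrix} D_i(u)\, K_i^{-1} \tilde{u} \\ 1 \end{pmatrix}

% Its location in the target image depends on the poses only through T_t^{-1} T_i:
u' = \pi\!\left( K_t \left[ T_t^{-1} T_i
      \begin{pmatrix} D_i(u)\, K_i^{-1} \tilde{u} \\ 1 \end{pmatrix} \right]_{1:3} \right)

% A global change of world frame G \in SE(3) maps T_i \mapsto G T_i and T_t \mapsto G T_t, so
(G T_t)^{-1} (G T_i) = T_t^{-1} G^{-1} G\, T_i = T_t^{-1} T_i
```

Since $u'$ depends on the poses only through the relative transform $T_t^{-1} T_i$, the projection image is identical for every representative of the same equivalence class under global SE(3) transformations.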

Masked Auto-Encoding Pretraining

Masked auto-encoding (MAE) is adopted for pretraining, exploiting the structural similarity between sparsified, randomly masked target images and point cloud projection images. The network—conditioned on context views and corrupted targets—learns powerful cross-view completion priors in a fully self-supervised regime. Fine-tuning then proceeds with projective cues, drastically reducing dependence on labeled RGB-D data and improving efficiency.

Figure 3: Pretraining reconstructs the target view from a masked version; fine-tuning uses warped projections into the target frustum.
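The pretraining corruption can be sketched as random patch masking of the target image, which yields a sparse input that resembles a point cloud projection with holes. The patch size, the keep ratio, and the choice of patch-wise (rather than pixel-wise) masking are assumptions for illustration; the summary does not specify the paper's exact masking scheme.

```python
import numpy as np

def random_patch_mask(image, patch=8, keep_ratio=0.25, rng=None):
    """Zero out a random subset of patches; return the masked image and keep-mask.

    Assumes the image height and width are divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    n_keep = int(round(keep_ratio * gh * gw))
    keep_flat = np.zeros(gh * gw, dtype=bool)
    keep_flat[rng.choice(gh * gw, size=n_keep, replace=False)] = True
    # Expand the patch-level mask to pixel resolution.
    keep = np.repeat(np.repeat(keep_flat.reshape(gh, gw), patch, axis=0), patch, axis=1)
    return image * keep[..., None], keep

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(64, 64, 3))   # stand-in for a target RGB frame
masked, keep = random_patch_mask(target, patch=8, keep_ratio=0.25, rng=rng)

# During pretraining the model would see the context views plus `masked`
# and be trained to reconstruct `target`; no poses or depths are required.
print("fraction of pixels kept:", keep.mean())
```

At fine-tuning time, `masked` would be replaced by the point cloud projection into the target frustum, so the network input keeps the same layout across both stages.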

Experimental Results

The proposed framework is evaluated on a new consistency benchmark simulating out-of-distribution camera transformations, as well as sparse-view synthesis benchmarks derived from RealEstate10K and DL3DV. Projective conditioning shows marked improvements over both ray-conditioned ViT models and feed-forward 3D Gaussian approaches.

On consistency (robustness to camera perturbations), the model achieves up to 25.43 PSNR under World Scale transformations versus LVSM's 14.56, with significant gains across all tested axes (aspect ratio, FOV, roll).
On RealEstate10K sparse-view synthesis, the 24-layer projective model reaches 28.60 PSNR on large overlap, 26.88 PSNR on medium overlap, and 24.98 PSNR on small overlap. It outperforms comparable LVSM variants, especially under challenging overlaps and disocclusion scenarios.

Qualitative results further indicate improved geometric consistency and reduction of artifacts compared to AnySplat, WorldMirror, LVSM, and RayZer baselines.

Figure 4: Consistency benchmark demonstrating superior geometric coherency. LVSM, RayZer, and AnySplat exhibit inconsistencies and errors under camera alteration.

Figure 5: Qualitative comparisons on RealEstate10K. Orange and blue boxes highlight failure modes in other models, with projective conditioning maintaining structural fidelity.

Efficiency analyses show reduced processing time over 3D Gaussian approaches and competitive rendering throughput.

Ablation Studies

Ablations confirm the merit of each design choice:

Pretraining on large-scale, diverse datasets (DL3DV) boosts adaptation speed and quality even with minimal fine-tuning.
DINOv3 feature priors further enhance perceptual quality.
The combination of pretraining, DINO features, and projective conditioning yields the highest composite scores in PSNR/SSIM/LPIPS.
A breakdown of "seen" versus "unseen" pixels in small-overlap settings shows that projective conditioning generalizes spatially, hallucinating plausible content in occluded regions.

Figure 6: Ablation illustrating the mitigation of rendering artifacts from pretraining design choices.
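A seen/unseen breakdown of this kind can be computed with a masked PSNR. The sketch below assumes images in the range [0, 1] and a boolean `seen` mask marking target pixels covered by at least one projected context point. That definition of "seen" is an assumption for illustration, and the numbers are synthetic, not results from the paper.

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB over all pixels, or only over pixels where mask is True."""
    err = (pred - target) ** 2
    if mask is not None:
        err = err[mask]                      # (H, W) mask selects whole pixels
    mse = err.mean()
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Synthetic example: small error on seen pixels, larger error on unseen ones.
target = np.full((16, 16, 3), 0.5)
seen = np.zeros((16, 16), dtype=bool)
seen[:, :8] = True                           # left half is covered by the projection

pred = target.copy()
pred[seen] += 0.01
pred[~seen] += 0.1

psnr_seen = psnr(pred, target, seen)
psnr_unseen = psnr(pred, target, ~seen)
print(f"seen: {psnr_seen:.2f} dB, unseen: {psnr_unseen:.2f} dB")
```

Reporting the two values separately distinguishes how well a model reproduces observed content from how plausibly it fills disoccluded regions.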

Implications and Prospects

The reframing of feed-forward view synthesis as a quotient-space structured image-to-image problem demonstrates that removing over-parameterization from the input is critical for generalization, consistency, and controllability. Projective conditioning sidesteps the "train–test gap" in coordinate gauges and enables camera-robust, high-fidelity synthesis with practical runtime.

Practically, the approach greatly enhances the scalability and deployment of view synthesis models in unconstrained, dynamic environments, with direct utility for robotics, AR/VR, and content creation where on-the-fly camera motion and scene interaction are prevalent.

Theoretically, this quotient-based input design opens new research avenues into generalized scene representations, dynamic scene adaptation, and further extension to temporal and deformable entities. The masked auto-encoding pretraining regime signals a promising direction for harnessing vast, unlabeled 2D/3D data in visual LLMs and universal scene synthesis.

Conclusion

This work demonstrates that projective conditioning fundamentally improves feed-forward view synthesis, addressing the brittleness of coordinate-dependent ray embeddings and enabling robust, consistent, and scalable novel view synthesis. Strong empirical results substantiate the approach's effectiveness, with pretraining and architectural choices further boosting data efficiency and quality. Future work may extend projective conditioning to dynamic and temporal scenarios, enriching the compositional and interactive capabilities of 3D scene synthesis.
