Learning unified 3D representations from unposed multi-view images

Establish methods that can learn robust, unified 3D scene representations directly from unposed multi-view images, integrating geometry, appearance, and semantics without requiring known camera poses or per-scene optimization, and that remain effective in sparse-view settings.

Background

The paper argues that spatial intelligence requires representations that jointly capture scene geometry, appearance, and semantics. Many supervised and pose-dependent methods exist, but they often need ground-truth camera calibration or per-scene optimization, and they treat the three aspects in isolation.

Self-supervised approaches reduce annotation needs but frequently assume known camera poses or dense video input, and their quality can degrade in sparse-view regimes. The paper therefore identifies learning unified 3D representations directly from unposed multi-view images as an open challenge for the field, which the authors aim to address with UniSplat.

References

However, deriving such effective 3D representations directly from unposed multi-view images remains an open challenge.

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images (2604.10573 - Zhou et al., 12 Apr 2026) in Section 1 (Introduction)