Depth Any Panoramas (DAP): 360° Depth Estimation
- Depth Any Panoramas (DAP) is a framework for generating all-angle, per-pixel depth maps from 360° images, handling geometric distortions with specialized models.
- DAP pipelines leverage diverse synthetic and real-world datasets alongside innovative pseudo-label curation to ensure robust, metric and scale-invariant depth recovery.
- Advanced architectures—including spherical-aware CNNs, fusion networks, and vision transformers—enable high-fidelity 3D structure capture for applications in AR/VR, robotics, and autonomous navigation.
Depth Any Panoramas (DAP) refers to methodologies, foundational models, and complete pipelines for estimating dense, metric, and/or scale-invariant depth from 360-degree (panoramic or omnidirectional) images. These approaches address the unique geometric, data, and deployment challenges of panoramic imagery, enabling applications that require high-fidelity global 3D structure capture, such as robotics, AR/VR, and autonomous navigation.
1. Defining DAP: Scope, Motivation, and Distinctions
Depth Any Panoramas (DAP) encompasses a family of methods designed to produce all-angle, per-pixel depth maps from full-sphere images, typically in equirectangular projection. Unlike conventional depth-from-perspective pipelines—where models are trained on rectilinear images and output depth for a limited field of view—DAP must contend with greater geometric distortion (particularly near the poles), boundary wraparound, and wide domain shifts between synthetic and real-world data. The field has matured from early handcrafted fusion approaches to fully end-to-end vision transformer (ViT) pipelines that exhibit strong zero-shot generalization across diverse scene types, domains, and capture setups (Lin et al., 18 Dec 2025, Jiang et al., 28 Dec 2025).
DAP is distinguished by (i) its target of complete 360° × 180° coverage, (ii) explicit geometric handling of equirectangular, spherical, and cylindrical projections, (iii) a focus on both metric and scale-invariant depth recovery, and (iv) pipeline and dataset scalability favoring foundational model paradigms.
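The equirectangular conventions used throughout this article can be made concrete with a short sketch (pure Python; function names and pixel-center/axis conventions are illustrative and vary across papers): each ERP pixel corresponds to a unit ray on the sphere, and a metric depth value lifts that ray to a 3D point.

```python
import math

def erp_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray on the sphere.

    Longitude spans [-pi, pi) across the width, latitude [pi/2, -pi/2]
    top-to-bottom; pixel centers are sampled at half-integer offsets.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # [-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi        # [+pi/2, -pi/2]
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)

def ray_to_point(ray, depth):
    """Lift a unit ray to a 3D point given metric (radial) depth."""
    return tuple(depth * c for c in ray)
```

Under these conventions the image center maps to the forward direction (0, 0, 1), and a dense depth map plus this mapping yields a full spherical point cloud.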
2. Datasets and Synthetic Data Curation for Panoramic Depth
DAP model success is predicated on access to large-scale, diverse, and high-quality panoramic RGB–depth datasets. Early efforts used synthetic environments such as Structured3D and PanoSUNCG or multi-view reconstructions on cube-mapped or equirectangular imagery, but these frequently suffered scale-ambiguity and limited real-world coverage.
Recent contributions have greatly expanded the available training resources:
- 360° in the Wild (Park et al., 2024): 25,000 real-world panoramas spanning indoor, outdoor, and mannequin-frozen scenes, annotated with depth via fused multi-view stereo (COLMAP) on cube-map faces. This resource supports both single-image depth estimation and view synthesis, with depth ranges 0.5 m–30+ m, and standardized pose conventions.
- DAP-2M (Lin et al., 18 Dec 2025): Unifies 18k Structured3D (synthetic indoor), 90k AirSim360 (high-fidelity synthetic outdoor), 200k DiT360 (text-to-image), and 1.7M unlabeled real panoramas, using a three-stage pseudo-label curation pipeline—comprising geometry-grounded initialization, realism-driven pseudo-labeling, and progressive fine-tuning—to generate robust, geometry-consistent depth supervision across 2 million panoramas.
- DA² curation engine (Li et al., 30 Sep 2025): Converts perspective RGB–D datasets to full ERP panoramas via a perspective-to-equirectangular projection, filling missing sides/top/bottom via spherical out-painting models, resulting in over 543k high-quality RGB–D panorama pairs, bringing the scale to 607k when combined with existing data.
All recent DAP pipelines leverage such heterogeneous sources, with sophisticated domain adaptation (e.g., realism-invariant labelers, confidence-driven curation), enabling high generalization even under strong cross-domain and scale shifts.
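The perspective-to-equirectangular step behind curation engines like DA² can be illustrated with a minimal coverage computation (a sketch under assumed conventions, not the paper's implementation): it marks which ERP pixels a single forward-facing pinhole view can supply, and the uncovered remainder is exactly what spherical out-painting must fill.

```python
import math

def perspective_coverage_mask(width, height, hfov_deg, vfov_deg):
    """Mark which ERP pixels a forward-facing pinhole camera covers.

    Pixels outside the camera frustum (False in the mask) are the region
    a curation engine must complete, e.g. via spherical out-painting.
    """
    half_h = math.tan(math.radians(hfov_deg) / 2.0)
    half_v = math.tan(math.radians(vfov_deg) / 2.0)
    mask = []
    for v in range(height):
        lat = (0.5 - (v + 0.5) / height) * math.pi
        row = []
        for u in range(width):
            lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
            # Unit ray in camera coordinates (z forward).
            x = math.cos(lat) * math.sin(lon)
            y = math.sin(lat)
            z = math.cos(lat) * math.cos(lon)
            # Behind the camera, or outside the frustum -> not covered.
            covered = z > 0 and abs(x / z) <= half_h and abs(y / z) <= half_v
            row.append(covered)
        mask.append(row)
    return mask
```

In a full pipeline the covered pixels would be filled by resampling the perspective RGB–D image along the same rays; here only the coverage logic is shown.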
3. Core Model Architectures and Geometric Formulations
DAP architectures bifurcate into several specialized classes, each encoding panoramic geometry and distortion in unique ways:
A. Spherical- and Distortion-Aware Convolutional Approaches
- DAMO (Distortion-Aware Monocular Omnidirectional) leverages a ResNet-50 backbone, inserting deformable convolution and strip pooling modules to address ERP-induced stretching and distortion—especially near the poles. Spherical-aware weight matrices rebalance supervision according to equirectangular surface area (Chen et al., 2020).
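The spherical-aware reweighting idea is simple to sketch: the area an ERP pixel covers on the sphere shrinks as cos(latitude), so its loss weight should too (illustrative code; DAMO's actual weight matrix may differ in detail).

```python
import math

def erp_row_weights(height):
    """Per-row loss weights proportional to true spherical surface area.

    ERP oversamples the poles, so an unweighted pixel loss over-counts
    polar regions; weighting each row by cos(latitude) restores a
    uniform-on-the-sphere contribution (normalized to mean 1).
    """
    weights = []
    for v in range(height):
        lat = (0.5 - (v + 0.5) / height) * math.pi
        weights.append(math.cos(lat))
    mean = sum(weights) / height
    return [w / mean for w in weights]
```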
B. Multi-Projection and Fusion-Based Solutions
- SphereFusion (Yan et al., 9 Feb 2025) processes each panorama through parallel equirectangular (2D CNN) and spherical-mesh ResNet branches, projecting features to a shared spherical mesh and fusing them with gated attention at each resolution. The final prediction is performed in the spherical domain, balancing the texture detail captured in 2D with geometric fidelity from the mesh branch.
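The gated fusion step can be sketched per vertex as a learned convex combination of the two branches (the scalar gate weights below are illustrative, not SphereFusion's actual parameterization):

```python
import math

def gated_fuse(erp_feat, mesh_feat, gate_w_erp, gate_w_mesh, gate_b):
    """Fuse two aligned feature vectors with a learned sigmoid gate.

    A minimal, per-vertex sketch of gated fusion: the gate looks at both
    branches and decides, per channel, how much ERP-branch texture detail
    versus mesh-branch geometry to keep.
    """
    fused = []
    for a, b in zip(erp_feat, mesh_feat):
        g = 1.0 / (1.0 + math.exp(-(gate_w_erp * a + gate_w_mesh * b + gate_b)))
        fused.append(g * a + (1.0 - g) * b)
    return fused
```

A gate saturated toward 1 passes the ERP branch through unchanged; toward 0, the mesh branch dominates.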
C. Vision Transformer (ViT) DAP Foundations
- DAP Foundation Model (Lin et al., 18 Dec 2025) and DA² (SphereViT) (Li et al., 30 Sep 2025) employ ViTs (e.g., DINOv3-L, DINOv2-ViT-L) to learn global panoramic representations. DAP couples a metric depth head with a plug-and-play range mask, while SphereViT introduces fixed spherical coordinate embeddings through cross-attention, enhancing geometric consistency across latitude/longitude. Both models achieve end-to-end inference, high throughput, and strong zero-shot generalization.
D. Scale-Invariant Transformations
- DA360 (Jiang et al., 28 Dec 2025) adapts Depth Anything V2, learning a global shift parameter from the ViT class token to convert affine-invariant log-disparity to scale-invariant depth. Circular padding in the DPT decoder enforces spherical continuity, mitigating seam artifacts at ERP boundaries.
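Circular padding is a small but important detail: wrapping columns before convolution makes the left and right ERP edges neighbors, as they are on the sphere. A minimal sketch over a row-major feature map:

```python
def circular_pad_rows(feature_map, pad):
    """Pad an ERP feature map horizontally by wrapping columns.

    The left edge is padded with columns from the right edge and vice
    versa, so a kernel sliding across the seam sees the same neighborhood
    it would see on the sphere -- the mechanism behind seam-artifact
    suppression in ERP decoders.
    """
    return [row[-pad:] + row + row[:pad] for row in feature_map]
```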
E. Stereo and LiDAR Fusion Pipelines
- MCPDepth (Qiao et al., 2024) maps ERP frames to cylindrical panoramas for pairwise stereo matching using standard convolutional stereo networks extended with circular attention, then fuses depth maps in the ERP domain via lightweight U-Net architectures. This approach is compatible with embedded deployment due to its avoidance of custom kernels.
- LiDAR–Fisheye Fusion (Ma et al., 2020) utilizes multi-camera rigs and LiDAR, projecting 3D points onto wide-FOV images, upsampling sparse depths through local adaptive least squares, and seamless blending via graph cuts and multi-band blending prior to mapping to spherical coordinates.
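The cylindrical reprojection that MCPDepth relies on amounts to a per-row resampling of the ERP image: longitude stays linear, while latitude is warped through tan(lat), which removes the polar stretching that breaks standard stereo matching. A sketch (the 75° latitude cutoff is an illustrative choice, not MCPDepth's actual setting; tan diverges at the poles, so some cutoff is required):

```python
import math

def erp_to_cylindrical_rows(height_erp, height_cyl, max_lat_deg=75.0):
    """For each cylindrical output row, find the ERP source row to sample."""
    y_max = math.tan(math.radians(max_lat_deg))
    rows = []
    for vc in range(height_cyl):
        # Cylinder y runs [y_max, -y_max], top to bottom.
        y = y_max * (1.0 - 2.0 * (vc + 0.5) / height_cyl)
        lat = math.atan(y)
        # ERP row index for this latitude (lat in [pi/2, -pi/2] top to bottom).
        ve = (0.5 - lat / math.pi) * height_erp - 0.5
        rows.append(min(height_erp - 1, max(0, int(round(ve)))))
    return rows
```

A full implementation would bilinearly sample each output pixel rather than snap to the nearest row; the nearest-row lookup keeps the geometry visible.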
| Model/Class | Projection Domain(s) | Key Innovations |
|---|---|---|
| DAMO | Equirectangular | Deformable conv, strip pooling, spherical loss |
| SphereFusion | ERP & Spherical mesh | Gated fusion, mesh cache, dual-branch arch |
| DAP, DA², DA360 | ERP, Spherical (ViT) | Spherical embedding, SI/metric loss, circular pad |
| MCPDepth | Cylindrical, ERP | Stereo+fusion, circular attn, deployment ready |
4. Loss Functions, Optimization, and Geometric Consistency
Depth estimation from 360-degree imagery faces unique challenges: non-uniform ERP sampling, ambiguity in metric scale, and inconsistency in high-frequency structure. Representative DAP loss formulations include:
- SILog Loss (Scale-Invariant Log, as in MiDaS) (Lin et al., 18 Dec 2025, Jiang et al., 28 Dec 2025): Enforces correct relative structure under unknown scale/shifts.
- Dense-Fidelity Loss: Measures Gram-matrix similarity on tangent-plane projections to penalize global geometric fidelity (Lin et al., 18 Dec 2025).
- Gradient and Normal Losses: Focused on preserving sharp depth edges and surface normals for fine spatial accuracy.
- Point-Cloud Losses: Penalize 3D Cartesian errors after spherical unwrapping (Lin et al., 18 Dec 2025).
- Spherical-Aware Pixel Weighting: E.g., the latitude-dependent weight map in DAMO balances loss contributions over latitude (Chen et al., 2020).
- Plug-and-Play Range Masks: Binary masks for specified distance bands facilitate metric masking and loss restriction (Lin et al., 18 Dec 2025).
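The SILog loss in the list above admits a compact reference implementation (a sketch; the λ value and valid-pixel masking conventions vary across papers):

```python
import math

def silog_loss(pred, gt, lam=0.5):
    """Scale-invariant log loss over paired valid depths.

    With d_i = log(pred_i) - log(gt_i), the loss is
    mean(d^2) - lam * mean(d)^2. At lam = 1, a global log-scale offset
    between pred and gt incurs zero penalty -- the property that makes
    the loss scale-invariant.
    """
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean
```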
Optimization incorporates data-in-the-loop feedback, curriculum learning across depth ranges, and robust pseudo-label selection. Recent models achieve stable, generalizable metric depth with per-benchmark AbsRel reductions of 25–50% over earlier state-of-the-art (Jiang et al., 28 Dec 2025, Lin et al., 18 Dec 2025).
5. Evaluation Protocols and Comparative Results
Panoramic depth models are evaluated across several large-scale benchmarks with standardized metrics. Key datasets and metrics include:
- Stanford2D3D: Indoor, real-world panorama suite.
- Matterport3D: Large diverse indoor scenes.
- Deep360, Metropolis: Synthetic and real outdoor panoramic sets (Metropolis includes 3,000 panoramic frames with LiDAR/MVS ground truth).
- 360° in the Wild: 11k frame-depth pairs for training; held-out splits for evaluation (Park et al., 2024).
Metrics routinely reported are AbsRel, RMSE, δ accuracy (the percentage of pixels whose error ratio max(pred/gt, gt/pred) falls below a threshold, typically 1.25), and their variants. For example, DAP Foundation Model reports Stanford2D3D AbsRel = 0.0921 (↓36% versus prior SOTA), RMSE = 0.3820, and Deep360 AbsRel = 0.0659 for zero-shot predictions (Lin et al., 18 Dec 2025). DA360 ViT-L achieves AbsRel = 0.0793 (Matterport3D), 0.0710 (Stanford2D3D), 0.2011 (Metropolis), consistently outperforming PanDA and previous transformer-based models (Jiang et al., 28 Dec 2025).
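These metrics are straightforward to compute over valid pixel pairs; a minimal sketch, with flattened lists standing in for masked depth maps:

```python
import math

def depth_metrics(pred, gt, delta_thresh=1.25):
    """Standard panoramic-depth metrics over paired valid pixels.

    AbsRel = mean(|pred - gt| / gt)
    RMSE   = sqrt(mean((pred - gt)^2))
    delta1 = fraction of pixels with max(pred/gt, gt/pred) < delta_thresh
    """
    n = len(gt)
    absrel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    delta1 = sum(max(p / g, g / p) < delta_thresh for p, g in zip(pred, gt)) / n
    return absrel, rmse, delta1
```

In spherical-aware evaluation protocols, each term would additionally be weighted by the pixel's cos(latitude) area factor.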
Trade-offs are documented for real-time efficiency (SphereFusion: 17 ms per 512×1024 panorama) versus transformer-based approaches (60+ ms), as well as ablations demonstrating the necessity of global shift parameters, circular padding, and cross-attention spherical embeddings for peak accuracy.
6. Applications and Deployment Considerations
DAP methods have enabled advances in multiple real-world avenues:
- Localization and Mapping: Robust metric depth allows for 3D model construction, SLAM, and robot navigation in environments where monocular cues alone are insufficient (Lin et al., 18 Dec 2025).
- AR/VR: Accurate panoramic depth supports immersive environment reconstruction, real-time scene understanding, and synthetic view generation.
- Autonomous Vehicles: Efficient depth pipelines with low-latency inference (e.g., SphereFusion, MCPDepth) are suitable for embedded, resource-constrained systems (Yan et al., 9 Feb 2025, Qiao et al., 2024).
- Indoor Scene Completion and Outpainting: Diffusion-based pipelines such as PanoDiffusion perform joint RGB-D outpainting for reconstruction and semantic SLAM when wide-FOV data is incomplete (Wu et al., 2023).
Practical deployment is facilitated by models designed for efficiency (ONNX/TensorRT compatibility; Qiao et al., 2024), end-to-end operation (SphereViT, DAP), and explicit distortion handling, enabling hardware-agnostic rollout in fielded systems.
7. Limitations, Challenges, and Research Frontiers
Despite progress, prominent limitations are noted:
- Domain Discrepancy and Pseudo-Label Quality: Generalization across synthetic/real, indoor/outdoor domains depends on pseudo-label curation quality and coverage (Lin et al., 18 Dec 2025). Out-of-distribution scenes still present accuracy challenges.
- Seam Artifacts and Resolution: ERP wraparound remains a source of visible seams; current training resolutions (e.g., 512×1024) may lose fine detail or introduce boundary errors (Li et al., 30 Sep 2025, Jiang et al., 28 Dec 2025).
- Scale Ambiguity and Metricity: Achieving true metric depth remains contingent upon sufficient training diversity, reliable ground truth, and proper loss design. Some models remain only scale-invariant in practice outside robustly labeled domains.
- Computational Footprint and Resource Demands: Transformer-based foundations incur significant compute cost for training, though inference has become tractable for most architectures.
Areas for future research include integration of temporal/multi-view constraints, self-supervised adaptation, spherical convolution for seamless geometry, and joint semantic–depth representation learning (Lin et al., 18 Dec 2025, Li et al., 30 Sep 2025).
References:
- "Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation" (Lin et al., 18 Dec 2025)
- "Depth Anything in : Towards Scale Invariance in the Wild" (Jiang et al., 28 Dec 2025)
- "DA: Depth Anything in Any Direction" (Li et al., 30 Sep 2025)
- "SphereFusion: Efficient Panorama Depth Estimation via Gated Fusion" (Yan et al., 9 Feb 2025)
- "MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas" (Qiao et al., 2024)
- "360 in the Wild: Dataset for Depth Prediction and View Synthesis" (Park et al., 2024)
- "Distortion-aware Monocular Depth Estimation for Omnidirectional Images" (Chen et al., 2020)
- "PanoDiffusion: 360-degree Panorama Outpainting via Diffusion" (Wu et al., 2023)
- "A Method of Generating Measurable Panoramic Image for Indoor Mobile Measurement System" (Ma et al., 2020)