ColonDepth: Synthetic Endoscopy RGB-Depth Data
- The ColonDepth dataset is a comprehensive synthetic RGB–Depth corpus simulating monocular colonoscopy with realistic anatomical structures and imaging artifacts.
- It leverages CT-derived 3D meshes and Blender’s path-tracing to generate precisely aligned RGB images and accurate per-pixel depth maps.
- The dataset supports robust evaluation through disjoint texture–lighting folds, enabling advanced benchmarking of depth estimation models in endoscopic imaging.
The ColonDepth dataset is a comprehensive synthetic RGB–Depth corpus designed to simulate monocular colonoscopy under realistic anatomical and imaging conditions. Engineered to facilitate precise spatial perception and support 3D navigation in minimally invasive procedures, ColonDepth enables reproducible benchmarking of depth estimation and geometric reasoning for deep learning models in endoscopic imagery (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
1. Data Acquisition and Simulation Workflow
ColonDepth is constructed from patient-derived CT scans of the large intestine. Segmented CT volumes (voxel size ≈ 0.5 mm) are processed into detailed triangular surface meshes representing colon anatomy. These meshes are traversed by a virtual endoscope within Blender, which simulates the imaging hardware via a pinhole camera (60° horizontal FOV, 256×256 px resolution) and rigidly attached point light sources. Rendering utilizes Blender’s path-tracing engine (Cycles) to replicate specular reflections and mucosal subsurface scattering observed in vivo.
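Given the stated 60° horizontal FOV at 256 px width, the pinhole focal length follows from $f = (W/2)/\tan(\mathrm{FOV}/2)$. The sketch below derives the resulting intrinsic matrix; the principal point at the image centre and square pixels are assumptions, not stated in the source.

```python
import math

def focal_from_hfov(width_px: int, hfov_deg: float) -> float:
    """Focal length in pixels for a pinhole camera, from horizontal FOV."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

# ColonDepth's stated virtual camera: 256 px wide, 60 degree horizontal FOV.
fx = focal_from_hfov(256, 60.0)
cx = cy = 256 / 2.0  # principal point at the image centre (assumed)

# 3x3 intrinsic matrix K (square pixels assumed, so fy = fx)
K = [[fx, 0.0, cx],
     [0.0, fx, cy],
     [0.0, 0.0, 1.0]]
```

For this configuration the focal length works out to roughly 221.7 px, which fixes the metric scale of the rendered depth maps relative to the CT-derived mesh.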
Three primary material–texture families (T1: healthy mucosa, T2: inflamed mucosa, T3: artificial dye patterns) are paired with three lighting configurations (L1: bright diffuse, L2: directional grazing, L3: spectrally colored) to form the test folds “T1–L1,” “T2–L2,” and “T3–L3.” The camera trajectory is randomized in translation (±5 cm) and rotation (±30°) relative to the colon lumen centerline, producing a diverse set of viewpoints. No physical phantom is used; the mesh directly reflects CT-derived anatomy (Yang et al., 2023).
2. Dataset Composition and Structure
ColonDepth comprises over 16,000 RGB images paired with pixelwise floating-point ground-truth depth maps. Each image is 256×256 px, 8-bit RGB, and each depth map records camera-to-surface distance (in mm) at sub-millimeter precision, with depths ranging from 0.5 mm to 200 mm. Depth maps are stored as single-channel 32-bit float arrays (TIFF or NumPy format) and are aligned such that no further registration is required. All frames are temporally unsequenced; each viewpoint is rendered independently (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
A typical train/validation split follows 80/20 proportions, yielding 12,813 training pairs and 3,203 validation pairs for model development. No held-out ColonDepth test set is reported in (Vazquez et al., 24 Jan 2026); performance validation is carried out using independent datasets.
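As a quick arithmetic check, the reported pair counts are consistent with the stated 80/20 proportions:

```python
# Reported ColonDepth split sizes
train_pairs, val_pairs = 12813, 3203
total = train_pairs + val_pairs          # 16,016 pairs ("over 16,000")
train_frac = train_pairs / total         # ~0.80, matching the 80/20 split
```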
3. Ground-Truth Depth Generation and Annotation
Per-pixel depth maps are produced directly by Blender’s Z-buffer during path-traced rendering, corresponding precisely to the physical CT scale. Camera intrinsics are calibrated to standard clinical endoscopes, ensuring anatomical proportions are preserved. During preprocessing, depth values are clamped to [0.5, 200] mm and normalized linearly for storage; min–max normalization to [0,1] is performed for input to learning algorithms. All RGB–depth pairs remain perfectly aligned spatially; no multi-view fusion is necessary (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
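The clamping and min–max normalization described above can be sketched as follows; the exact storage convention is not specified in the source, so this is an illustration of the stated [0.5, 200] mm range rather than a reference implementation:

```python
import numpy as np

D_MIN, D_MAX = 0.5, 200.0  # valid depth range in mm, as stated for ColonDepth

def preprocess_depth(depth_mm: np.ndarray) -> np.ndarray:
    """Clamp raw Z-buffer depths to the valid range, then min-max
    normalize to [0, 1] for use as a learning target."""
    clamped = np.clip(depth_mm.astype(np.float32), D_MIN, D_MAX)
    return (clamped - D_MIN) / (D_MAX - D_MIN)

def denormalize_depth(norm: np.ndarray) -> np.ndarray:
    """Invert the normalization to recover metric depth in mm."""
    return norm * (D_MAX - D_MIN) + D_MIN

raw = np.array([[0.1, 50.0], [120.0, 500.0]])  # toy 2x2 depth map
norm = preprocess_depth(raw)                   # corners clamp to 0.0 and 1.0
```

Because the normalization is linear and the range endpoints are fixed, metric depth is exactly recoverable for any in-range pixel.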
4. Dataset Splitting and Cross-Validation Protocols
The dataset is partitioned into three disjoint folds by unique texture–lighting combinations, supporting robust cross-validation. For three-fold evaluation:
| Fold | Test Condition | Train Folds | #Test Frames |
|---|---|---|---|
| Fold 1 | T1–L1 | T2–L2 ∪ T3–L3 | 364 |
| Fold 2 | T2–L2 | T1–L1 ∪ T3–L3 | 364 |
| Fold 3 | T3–L3 | T1–L1 ∪ T2–L2 | 364 |
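The leave-one-condition-out protocol in the table can be sketched as below; fold names and condition labels are taken from the table, while the split logic itself is an illustration:

```python
# The three disjoint texture-lighting test conditions reported for ColonDepth.
FOLDS = {
    "Fold 1": "T1-L1",
    "Fold 2": "T2-L2",
    "Fold 3": "T3-L3",
}

def make_splits(folds: dict) -> list:
    """For each fold, test on its condition and train on the union of the
    other two, so no texture-lighting pair leaks into its own training set."""
    splits = []
    for name, test_cond in folds.items():
        train_conds = [c for c in folds.values() if c != test_cond]
        splits.append({"fold": name, "test": test_cond, "train": train_conds})
    return splits

splits = make_splits(FOLDS)
```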
This protocol is designed to probe generalization across unseen combinations of material textures and illumination conditions (Yang et al., 2023).
5. Statistical, Anatomical, and Imaging Properties
The pooled depth distribution extends from 0.5 mm to 200 mm (mean ≈ 40 mm; std ≈ 35 mm; positive skew). Each fold contains 5–15% pixels with intense specular highlights (intensity >0.9) and 10–20% in deep shadow (<0.1), reflecting realistic illumination variability. Anatomical detail includes haustral folds (ridge-shaped, low curvature), mucosal pits and crypt orifices (crater-like), variable lumen diameter (25–65 mm), and simulated polyps (spherical; ⌀5–15 mm). These features mirror those encountered in clinical imaging of the colon (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
6. Benchmark Metrics and Evaluation Procedures
ColonDepth supports quantitative comparison of depth estimation algorithms via standard metrics. Let $N$ be the number of valid pixels, $\hat{y}_i$ the predicted depth, and $y_i$ the ground truth:
- Root-Mean-Square Error (RMSE): $\sqrt{\frac{1}{N}\sum_i (\hat{y}_i - y_i)^2}$
- Root-Mean-Square Log Error (RMSE_log): $\sqrt{\frac{1}{N}\sum_i (\log \hat{y}_i - \log y_i)^2}$
- Absolute Relative Error (Abs Rel): $\frac{1}{N}\sum_i \frac{|\hat{y}_i - y_i|}{y_i}$
- Squared Relative Error (Sq Rel): $\frac{1}{N}\sum_i \frac{(\hat{y}_i - y_i)^2}{y_i}$
- Accuracy under threshold $\delta$: fraction of pixels with $\max\!\left(\frac{\hat{y}_i}{y_i}, \frac{y_i}{\hat{y}_i}\right) < \mathrm{thr}$, for $\mathrm{thr} \in \{1.25, 1.25^2, 1.25^3\}$
Median scaling is performed before evaluation: $\hat{y}_i \leftarrow \hat{y}_i \cdot \mathrm{median}(y) / \mathrm{median}(\hat{y})$ (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
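A minimal NumPy sketch of these metrics, including the median-scaling step, written from the standard definitions rather than from any released evaluation script:

```python
import numpy as np

def median_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align the prediction's global scale to ground truth via median scaling."""
    return pred * (np.median(gt) / np.median(pred))

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    valid = gt > 0
    p, g = median_scale(pred, gt)[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "sq_rel": float(np.mean((p - g) ** 2 / g)),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

gt = np.array([10.0, 20.0, 40.0])
pred = 2.0 * gt  # off by a pure global scale; median scaling removes it
m = depth_metrics(pred, gt)
```

Note that a prediction differing from ground truth only by a global scale factor scores perfectly after median scaling, which is exactly why this alignment is applied before comparing monocular methods.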
7. Modeling, Training Protocols, and Downstream Integration
Models such as the Visual Geometry Grounded Transformer (VGGT) are fine-tuned on ColonDepth using per-pixel regression combined with a scale-invariant loss. Example optimization:
- $L_{SI} = \frac{1}{N}\sum_i d_i^2 - \frac{\lambda}{N^2}\left(\sum_i d_i\right)^2$, where $d_i = \log \hat{y}_i - \log y_i$ and typically $\lambda \approx 0.5$.
Training hyper-parameters include AdamW optimization with weight decay, a constant learning rate, batch size 8, and standard RGB/depth pre-processing at 256×256 px with min–max normalization to $[0, 1]$. Augmentation comprises random flips, color jitter, and blur. Performance on the Scenario dataset demonstrates that fine-tuned VGGT achieves RMSE 3.81 mm and threshold accuracy of 0.863, outperforming prior baselines (Vazquez et al., 24 Jan 2026).
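The scale-invariant component can be sketched with the standard Eigen-style log formulation; the papers' exact loss coefficients are not reproduced here, and note that $\lambda = 1$ makes the loss fully invariant to a global depth scale:

```python
import numpy as np

def scale_invariant_loss(pred: np.ndarray, gt: np.ndarray,
                         lam: float = 0.5) -> float:
    """Scale-invariant log loss: penalizes per-pixel log-depth error while
    discounting a shared global scale shift via the (sum d)^2 term."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return float(np.mean(d ** 2) - lam * (np.sum(d) ** 2) / n ** 2)

gt = np.array([10.0, 20.0, 40.0])
loss_scaled = scale_invariant_loss(3.0 * gt, gt, lam=1.0)  # pure scale error
loss_exact = scale_invariant_loss(gt, gt)                  # perfect prediction
```

With `lam=1.0` a prediction that is wrong only by a constant factor incurs zero loss, mirroring the median-scaling convention used at evaluation time.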
Depth maps inferred by VGGT from ColonDepth are then integrated into U-Net variants via the Geometric Prior-guided Module (GPM), which uses spatial and channel attention to modulate encoder features. GPM incorporates self-update and cross-update blocks employing CBAM attention mechanisms on skip connections at four encoder levels. This approach yields sharper boundary localization, improved handling of highlights and texture-less regions, and enhanced segmentation metrics (e.g., +3.2% DSC on Kvasir) across multiple benchmarks (Vazquez et al., 24 Jan 2026).
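The GPM implementation itself is not reproduced in the source; as a heavily simplified, hypothetical illustration of CBAM-style channel-then-spatial gating on an encoder feature map (the `mix` matrix stands in for CBAM's shared MLP, and the 7×7 convolution of the spatial branch is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_style_gate(feat: np.ndarray, mix: np.ndarray) -> np.ndarray:
    """Toy CBAM-style gating on a (C, H, W) feature map: a channel gate
    computed from pooled spatial statistics, followed by a spatial gate
    computed from per-pixel channel statistics."""
    # channel attention: avg + max pooling over space, mixed, squashed
    avg = feat.mean(axis=(1, 2))                 # (C,)
    mx = feat.max(axis=(1, 2))                   # (C,)
    ch_gate = sigmoid(mix @ avg + mix @ mx)      # (C,)
    feat = feat * ch_gate[:, None, None]
    # spatial attention: avg + max over channels, squashed (conv omitted)
    sp_gate = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)
    return feat * sp_gate[None, :, :]

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8, 8))
out = cbam_style_gate(f, np.eye(4))
```

In the actual GPM, gates of this kind modulate skip-connection features at four encoder levels, letting depth-derived geometric cues re-weight where and what the segmentation encoder attends to.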
8. Unique Challenges and Testbed Realism
ColonDepth is designed to challenge monocular depth estimation models with realistic imaging artifacts and geometric complexity:
- Specular highlights and deep shadows undermine photometric consistency.
- Smooth mucosal regions provide few local texture cues.
- Complex geometry, including haustral folds and polyps, introduces steep and ambiguous depth discontinuities (“stepped edges”).
- Simulated lens distortion introduces non-linear spatial warping resembling clinical endoscopic optics.
The plausible implication is that ColonDepth, by reproducing these in vivo adversities, provides a rigorous evaluation setting for both local edge sensitivity and global geometric consistency, progressively revealing critical failure modes in depth networks (Yang et al., 2023, Vazquez et al., 24 Jan 2026).