ColonDepth: Synthetic Endoscopy RGB-Depth Data
- The ColonDepth dataset is a comprehensive synthetic RGB–Depth corpus simulating monocular colonoscopy with realistic anatomical structures and imaging artifacts.
- It leverages CT-derived 3D meshes and Blender’s path-tracing to generate precisely aligned RGB images and accurate per-pixel depth maps.
- The dataset supports robust evaluation through disjoint texture–lighting folds, enabling advanced benchmarking of depth estimation models in endoscopic imaging.
The ColonDepth dataset is a comprehensive synthetic RGB–Depth corpus designed to simulate monocular colonoscopy under realistic anatomical and imaging conditions. Engineered to facilitate precise spatial perception and support 3D navigation in minimally invasive procedures, ColonDepth enables reproducible benchmarking of depth estimation and geometric reasoning for deep learning models in endoscopic imagery (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
1. Data Acquisition and Simulation Workflow
ColonDepth is constructed from patient-derived CT scans of the large intestine. Segmented CT volumes (voxel size ≈ 0.5 mm) are processed into detailed triangular surface meshes representing colon anatomy. These meshes are traversed by a virtual endoscope within Blender, which simulates the imaging hardware via a pinhole camera (60° horizontal FOV, 256×256 px resolution) and rigidly attached point light sources. Rendering utilizes Blender’s path-tracing engine (Cycles) to replicate specular reflections and mucosal subsurface scattering observed in vivo.
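Given the stated 60° horizontal FOV at 256 px width, the pinhole focal length follows from $f = (W/2)/\tan(\mathrm{FOV}/2)$. The sketch below derives the resulting intrinsic matrix; the principal point at the image centre and square pixels are assumptions, not stated in the source.

```python
import math

def focal_from_hfov(width_px: int, hfov_deg: float) -> float:
    """Focal length in pixels for a pinhole camera, from horizontal FOV."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

# ColonDepth's stated virtual camera: 256 px wide, 60 degree horizontal FOV.
fx = focal_from_hfov(256, 60.0)
cx = cy = 256 / 2.0  # principal point at the image centre (assumed)

# 3x3 intrinsic matrix K (square pixels assumed, so fy = fx)
K = [[fx, 0.0, cx],
     [0.0, fx, cy],
     [0.0, 0.0, 1.0]]
```

For this configuration the focal length works out to roughly 221.7 px, which fixes the metric scale of the rendered depth maps relative to the CT-derived mesh.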
Three primary material–texture families (T1: healthy mucosa, T2: inflamed mucosa, T3: artificial dye patterns) are paired with three lighting configurations (L1: bright diffuse, L2: directional grazing, L3: spectrally colored) to form the test folds “T1–L1,” “T2–L2,” and “T3–L3.” The camera trajectory is randomized in translation (±5 cm) and rotation (±30°) relative to the colon lumen centerline, producing a diverse set of viewpoints. No physical phantom is used; the mesh directly reflects CT-derived anatomy (Yang et al., 2023).
2. Dataset Composition and Structure
ColonDepth comprises over 16,000 RGB images paired with pixelwise floating-point ground-truth depth maps. Each image is 256×256 px, 8-bit RGB, and each depth map records camera-to-surface distance (in mm) at sub-millimeter precision, with depths ranging from 0.5 mm to 200 mm. Depth maps are stored as single-channel 32-bit float arrays (TIFF or NumPy format) and are aligned such that no further registration is required. All frames are temporally unsequenced; each viewpoint is rendered independently (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
A typical train/validation split follows 80/20 proportions, yielding 12,813 training pairs and 3,203 validation pairs for model development. No held-out ColonDepth test set is reported in (Vazquez et al., 24 Jan 2026); performance validation is carried out using independent datasets.
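As a quick arithmetic check, the reported pair counts are consistent with the stated 80/20 proportions:

```python
# Reported ColonDepth split sizes
train_pairs, val_pairs = 12813, 3203
total = train_pairs + val_pairs          # 16,016 pairs ("over 16,000")
train_frac = train_pairs / total         # ~0.80, matching the 80/20 split
```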
3. Ground-Truth Depth Generation and Annotation
Per-pixel depth maps are produced directly by Blender’s Z-buffer during path-traced rendering, corresponding precisely to the physical CT scale. Camera intrinsics are calibrated to standard clinical endoscopes, ensuring anatomical proportions are preserved. During preprocessing, depth values are clamped to [0.5, 200] mm and normalized linearly for storage; min–max normalization to [0,1] is performed for input to learning algorithms. All RGB–depth pairs remain perfectly aligned spatially; no multi-view fusion is necessary (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
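The clamping and min–max normalization described above can be sketched as follows; the exact storage convention is not specified in the source, so this is an illustration of the stated [0.5, 200] mm range rather than a reference implementation:

```python
import numpy as np

D_MIN, D_MAX = 0.5, 200.0  # valid depth range in mm, as stated for ColonDepth

def preprocess_depth(depth_mm: np.ndarray) -> np.ndarray:
    """Clamp raw Z-buffer depths to the valid range, then min-max
    normalize to [0, 1] for use as a learning target."""
    clamped = np.clip(depth_mm.astype(np.float32), D_MIN, D_MAX)
    return (clamped - D_MIN) / (D_MAX - D_MIN)

def denormalize_depth(norm: np.ndarray) -> np.ndarray:
    """Invert the normalization to recover metric depth in mm."""
    return norm * (D_MAX - D_MIN) + D_MIN

raw = np.array([[0.1, 50.0], [120.0, 500.0]])  # toy 2x2 depth map
norm = preprocess_depth(raw)                   # corners clamp to 0.0 and 1.0
```

Because the normalization is linear and the range endpoints are fixed, metric depth is exactly recoverable for any in-range pixel.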
4. Dataset Splitting and Cross-Validation Protocols
The dataset is partitioned into three disjoint folds by unique texture–lighting combinations, supporting robust cross-validation. For three-fold evaluation:
| Fold | Test Condition | Train Folds | #Test Frames |
|---|---|---|---|
| Fold 1 | T1–L1 | T2–L2 ∪ T3–L3 | 364 |
| Fold 2 | T2–L2 | T1–L1 ∪ T3–L3 | 364 |
| Fold 3 | T3–L3 | T1–L1 ∪ T2–L2 | 364 |
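The leave-one-condition-out protocol in the table can be sketched as below; fold names and condition labels are taken from the table, while the split logic itself is an illustration:

```python
# The three disjoint texture-lighting test conditions reported for ColonDepth.
FOLDS = {
    "Fold 1": "T1-L1",
    "Fold 2": "T2-L2",
    "Fold 3": "T3-L3",
}

def make_splits(folds: dict) -> list:
    """For each fold, test on its condition and train on the union of the
    other two, so no texture-lighting pair leaks into its own training set."""
    splits = []
    for name, test_cond in folds.items():
        train_conds = [c for c in folds.values() if c != test_cond]
        splits.append({"fold": name, "test": test_cond, "train": train_conds})
    return splits

splits = make_splits(FOLDS)
```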
This protocol is designed to probe generalization across unseen combinations of material textures and illumination conditions (Yang et al., 2023).
5. Statistical, Anatomical, and Imaging Properties
The pooled depth distribution extends from 0.5 mm to 200 mm (mean ≈ 40 mm; std ≈ 35 mm; positive skew). Each fold contains 5–15% pixels with intense specular highlights (intensity >0.9) and 10–20% in deep shadow (<0.1), reflecting realistic illumination variability. Anatomical detail includes haustral folds (ridge-shaped, low curvature), mucosal pits and crypt orifices (crater-like), variable lumen diameter (25–65 mm), and simulated polyps (spherical; ⌀5–15 mm). These features mirror those encountered in clinical imaging of the colon (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
6. Benchmark Metrics and Evaluation Procedures
ColonDepth supports quantitative comparison of depth estimation algorithms via standard metrics. Let $N$ be the number of valid pixels, $\hat{y}_i$ the predicted depth, and $y_i$ the ground truth:
- Root-Mean-Square Error (RMSE): $\sqrt{\frac{1}{N}\sum_i (\hat{y}_i - y_i)^2}$
- Root-Mean-Square Log Error (RMSE_log): $\sqrt{\frac{1}{N}\sum_i (\log \hat{y}_i - \log y_i)^2}$
- Absolute Relative Error (Abs Rel): $\frac{1}{N}\sum_i \frac{|\hat{y}_i - y_i|}{y_i}$
- Squared Relative Error (Sq Rel): $\frac{1}{N}\sum_i \frac{(\hat{y}_i - y_i)^2}{y_i}$
- Accuracy under threshold $\delta$: fraction of pixels with $\max\!\left(\frac{\hat{y}_i}{y_i}, \frac{y_i}{\hat{y}_i}\right) < \mathrm{thr}$, for $\mathrm{thr} \in \{1.25, 1.25^2, 1.25^3\}$
Median scaling is performed before evaluation: $\hat{y}_i \leftarrow \hat{y}_i \cdot \mathrm{median}(y) / \mathrm{median}(\hat{y})$ (Yang et al., 2023, Vazquez et al., 24 Jan 2026).
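A minimal NumPy sketch of these metrics, including the median-scaling step, written from the standard definitions rather than from any released evaluation script:

```python
import numpy as np

def median_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align the prediction's global scale to ground truth via median scaling."""
    return pred * (np.median(gt) / np.median(pred))

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    valid = gt > 0
    p, g = median_scale(pred, gt)[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "sq_rel": float(np.mean((p - g) ** 2 / g)),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

gt = np.array([10.0, 20.0, 40.0])
pred = 2.0 * gt  # off by a pure global scale; median scaling removes it
m = depth_metrics(pred, gt)
```

Note that a prediction differing from ground truth only by a global scale factor scores perfectly after median scaling, which is exactly why this alignment is applied before comparing monocular methods.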
7. Modeling, Training Protocols, and Downstream Integration
Models such as the Visual Geometry Grounded Transformer (VGGT) are fine-tuned on ColonDepth using per-pixel regression combined with a scale-invariant loss. Example optimization:
- $L_{SI} = \frac{1}{N}\sum_i d_i^2 - \frac{\lambda}{N^2}\left(\sum_i d_i\right)^2$, where $d_i = \log \hat{y}_i - \log y_i$ and typically $\lambda \approx 0.5$.
Training hyper-parameters include AdamW optimization with weight decay, a constant learning rate, batch size 8, and standard RGB/depth pre-processing at 256×256 px with min–max normalization to $[0, 1]$. Augmentation comprises random flips, color jitter, and blur. Performance on the Scenario dataset demonstrates that fine-tuned VGGT achieves RMSE 3.81 mm and threshold accuracy of 0.863, outperforming prior baselines (Vazquez et al., 24 Jan 2026).
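The scale-invariant component can be sketched with the standard Eigen-style log formulation; the papers' exact loss coefficients are not reproduced here, and note that $\lambda = 1$ makes the loss fully invariant to a global depth scale:

```python
import numpy as np

def scale_invariant_loss(pred: np.ndarray, gt: np.ndarray,
                         lam: float = 0.5) -> float:
    """Scale-invariant log loss: penalizes per-pixel log-depth error while
    discounting a shared global scale shift via the (sum d)^2 term."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return float(np.mean(d ** 2) - lam * (np.sum(d) ** 2) / n ** 2)

gt = np.array([10.0, 20.0, 40.0])
loss_scaled = scale_invariant_loss(3.0 * gt, gt, lam=1.0)  # pure scale error
loss_exact = scale_invariant_loss(gt, gt)                  # perfect prediction
```

With `lam=1.0` a prediction that is wrong only by a constant factor incurs zero loss, mirroring the median-scaling convention used at evaluation time.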
Depth maps inferred by VGGT from ColonDepth are then integrated into U-Net variants via the Geometric Prior-guided Module (GPM), which uses spatial and channel attention to modulate encoder features. GPM incorporates self-update and cross-update blocks employing CBAM attention mechanisms on skip connections at four encoder levels. This approach yields sharper boundary localization, improved handling of highlights and texture-less regions, and enhanced segmentation metrics (e.g., +3.2% DSC on Kvasir) across multiple benchmarks (Vazquez et al., 24 Jan 2026).
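The GPM implementation itself is not reproduced in the source; as a heavily simplified, hypothetical illustration of CBAM-style channel-then-spatial gating on an encoder feature map (the `mix` matrix stands in for CBAM's shared MLP, and the 7×7 convolution of the spatial branch is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_style_gate(feat: np.ndarray, mix: np.ndarray) -> np.ndarray:
    """Toy CBAM-style gating on a (C, H, W) feature map: a channel gate
    computed from pooled spatial statistics, followed by a spatial gate
    computed from per-pixel channel statistics."""
    # channel attention: avg + max pooling over space, mixed, squashed
    avg = feat.mean(axis=(1, 2))                 # (C,)
    mx = feat.max(axis=(1, 2))                   # (C,)
    ch_gate = sigmoid(mix @ avg + mix @ mx)      # (C,)
    feat = feat * ch_gate[:, None, None]
    # spatial attention: avg + max over channels, squashed (conv omitted)
    sp_gate = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)
    return feat * sp_gate[None, :, :]

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8, 8))
out = cbam_style_gate(f, np.eye(4))
```

In the actual GPM, gates of this kind modulate skip-connection features at four encoder levels, letting depth-derived geometric cues re-weight where and what the segmentation encoder attends to.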
8. Unique Challenges and Testbed Realism
ColonDepth is designed to challenge monocular depth estimation models with realistic imaging artifacts and geometric complexity:
- Specular highlights and deep shadows undermine photometric consistency.
- Smooth mucosal regions provide few local texture cues.
- Complex geometry, including haustral folds and polyps, introduces steep and ambiguous depth discontinuities (“stepped edges”).
- Simulated lens distortion introduces non-linear spatial warping resembling clinical endoscopic optics.
The plausible implication is that ColonDepth, by reproducing these in vivo adversities, provides a rigorous evaluation setting for both local edge sensitivity and global geometric consistency, progressively revealing critical failure modes in depth networks (Yang et al., 2023, Vazquez et al., 24 Jan 2026).