DepthCropSeg++: Depth Image Segmentation
- The article covers two related approaches: a foundation crop segmentation model and a seed-driven instance segmentation method, the former achieving 93.11% mIoU on agricultural data.
- It leverages depth information with specialized FADE upsampling and contour-oriented loss functions to effectively reduce manual annotation effort.
- The methodology is applicable in precision agriculture and robotic picking, outperforming established baselines in both real and synthetic environments.
DepthCropSeg++ denotes two distinct but methodologically related research approaches addressing segmentation in depth images: (1) a cross-species, cross-environment crop segmentation foundation model for in-field agricultural tasks (Zhang et al., 18 Jan 2026), and (2) a seed-driven, edge-guided instance segmentation method primarily for robotic picking in industrial scenes, trained exclusively on synthetic depth data (Grard et al., 2018). Both paradigms leverage depth information for object delineation and employ specialized architectural and training strategies to circumvent manual pixel-level annotation bottlenecks.
1. Model Architectures
Crop Segmentation Foundation Model
DepthCropSeg++ in agricultural contexts utilizes the BEiT-Adapter-Large backbone, which pairs a ViT encoder (patch size 16, 12 transformer blocks) with locally-biased adapters. Each adapter processes multi-scale ViT features via a convolution and local window attention to produce refined features. Decoder upsampling departs from standard bilinear interpolation by introducing FADE (Feature Adaptive Dynamic Upsampling), which generates a position-specific kernel at every spatial position through weighted fusion of shallow encoder features $\mathcal{E}$ and global transformer decoder features $\mathcal{D}$:

$$Y(p) = \sum_{q \in \Omega(p)} W_p(q)\, X(q)$$

where $\Omega(p)$ is a local window around position $p$, $X$ is the feature map being upsampled, and the kernel $W_p$ is predicted from the fused $\mathcal{E}$ and $\mathcal{D}$ features. Edge-awareness is enhanced via dynamic gating: higher gating values are allocated near object boundaries, preserving fine detail while maintaining semantic coherence within interiors (Zhang et al., 18 Jan 2026).
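The boundary-aware gating between encoder detail and decoder semantics can be sketched in a few lines of NumPy. This is an illustrative fusion under assumed tensor shapes, not the paper's implementation; the function name and the sigmoid gate are assumptions:

```python
import numpy as np

def gated_fusion_upsample(decoder_feat, encoder_feat, gate_logits):
    """Illustrative sketch of FADE-style gated fusion (not the paper's code).

    decoder_feat: (C, H, W) low-res semantic features, already upsampled 2x
    encoder_feat: (C, H, W) shallow high-res features carrying fine edges
    gate_logits:  (H, W) per-pixel gate scores; higher near object boundaries
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid -> [0, 1]
    # Near boundaries (gate -> 1) trust encoder detail;
    # in object interiors (gate -> 0) trust decoder semantics.
    return gate[None] * encoder_feat + (1.0 - gate[None]) * decoder_feat
```

At gate extremes the output reduces to a pure encoder or pure decoder feature, which is the intuition behind edge-aware upsampling: detail where it matters, coherence elsewhere.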
Seed-Based Instance Segmentation
In the robotic bulk extraction setting, DepthCropSeg++ is implemented as a VGG-16–based fully convolutional encoder–decoder with skip connections (Grard et al., 2018). The input stacks the normalized depth map with a user-provided binary seed-location map, enabling per-object segmentation. The network jointly predicts (1) inter-instance boundary logits and (2) seed-specific mask logits. The final mask is extracted by thresholding the mask logits and running connected-component analysis anchored at the user click. Skip connections preserve high-frequency boundary information, and the receptive field at the deepest layer spans a large portion of the input.
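The click-anchored post-processing step described above can be sketched with SciPy's connected-component labeling. This is an assumed reconstruction of the procedure, not the authors' code; the threshold of 0.0 on the logits is an assumption:

```python
import numpy as np
from scipy import ndimage

def extract_seed_mask(mask_logits, click_yx, threshold=0.0):
    """Sketch of click-anchored mask extraction: threshold the mask logits,
    then keep only the connected component containing the user's seed click.
    (Illustrative post-processing; the 0.0 threshold is an assumption.)"""
    binary = mask_logits > threshold
    labels, _ = ndimage.label(binary)   # 4-connected components by default
    seed_label = labels[click_yx]       # component id under the click
    if seed_label == 0:                 # click fell on background
        return np.zeros_like(binary)
    return labels == seed_label
```

Anchoring on the component under the click is what makes the output a single per-object mask even when several objects exceed the threshold.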
2. Leveraging Depth Information and Edge-Mask Duality
Both approaches exploit the duality between edge localization and mask connectivity in depth imagery. In the seed-driven setting, a boundary pixel is one whose neighborhood straddles two instances; writing $M$ for the instance label map and $\mathcal{N}(p)$ for the 4-neighborhood of pixel $p$, boundaries and masks are related by:

$$B(p) = \mathbb{1}\big[\, \exists\, q \in \mathcal{N}(p) : M(q) \neq M(p) \,\big]$$
Explicit supervision of boundaries guides the network toward sharper instance separation, mitigating label leakage between adjacent objects or crop rows.
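The edge-mask duality can be made concrete: given an instance label map, boundary ground truth is recoverable by checking 4-neighborhood label disagreement. An illustrative NumPy version (the paper supervises boundaries directly; this only shows how the two representations interconvert):

```python
import numpy as np

def instance_boundaries(label_map):
    """Mark pixels whose 4-neighborhood contains a different instance label.

    label_map: (H, W) integer instance labels; returns an (H, W) boolean map.
    """
    b = np.zeros(label_map.shape, dtype=bool)
    b[:-1, :] |= label_map[:-1, :] != label_map[1:, :]   # down neighbor differs
    b[1:, :]  |= label_map[1:, :]  != label_map[:-1, :]  # up neighbor differs
    b[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]   # right neighbor differs
    b[:, 1:]  |= label_map[:, 1:]  != label_map[:, :-1]  # left neighbor differs
    return b
```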
Agricultural DepthCropSeg++ incorporates depth-informed pseudo-label generation. Monocular depth maps (from Depth Anything V2) undergo gradient-guided histogram thresholding to yield binary masks, which, after filtering, constitute high-quality pseudo-labels that supplement sparse annotation.
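A minimal sketch of depth-based pseudo-labeling under stated assumptions: the percentile cutoff and the choice of thresholding at the mean depth of high-gradient pixels are illustrative stand-ins for the paper's gradient-guided histogram scheme, whose exact form is not given here:

```python
import numpy as np

def depth_pseudo_label(depth, grad_pct=90):
    """Sketch: foreground = pixels nearer than the depth level of the
    strongest depth discontinuities (assumed variant, not the paper's).

    depth: (H, W) monocular depth map; returns an (H, W) boolean mask.
    """
    gy, gx = np.gradient(depth.astype(np.float64))
    grad = np.hypot(gx, gy)
    # Depth discontinuities (plant/soil transitions) carry large gradients;
    # take the mean depth over the strongest-gradient pixels as the threshold.
    edge_pixels = grad >= np.percentile(grad, grad_pct)
    t = depth[edge_pixels].mean()
    return depth < t
```

For a scene with near foreground and far background separated by a sharp depth step, the high-gradient pixels straddle the step, so the threshold lands between the two depth levels.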
3. Training Strategies and Dataset Scale
Crop Segmentation
A two-stage self-training curriculum is employed. Stage 1 (coarse supervision) trains on a mix of manually annotated and depth-pseudo-labeled images using pixel-wise cross-entropy:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|P|} \sum_{p \in P} \sum_{c} y_{p,c} \log \hat{y}_{p,c}$$

where $P$ is the set of labeled pixels, $y_{p,c}$ the one-hot label, and $\hat{y}_{p,c}$ the predicted class probability.
Stage 2 (refined supervision) utilizes model predictions to construct a trimap mask $T$ marking confident pixels, focusing optimization on them via masked cross-entropy:

$$\mathcal{L}_{\mathrm{mCE}} = -\frac{\sum_{p \in P} T_p \sum_{c} y_{p,c} \log \hat{y}_{p,c}}{\sum_{p \in P} T_p}$$
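The masked cross-entropy amounts to an ordinary per-pixel loss averaged only over confident pixels. An illustrative NumPy version (a sketch, not the paper's implementation):

```python
import numpy as np

def masked_cross_entropy(probs, labels, confident):
    """Stage-2-style masked cross-entropy sketch.

    probs:     (P, C) predicted class probabilities per pixel
    labels:    (P,)   integer class labels (possibly pseudo-labels)
    confident: (P,)   boolean trimap mask; loss counts only True pixels
    """
    p_true = probs[np.arange(len(labels)), labels]   # prob of labeled class
    nll = -np.log(np.clip(p_true, 1e-12, 1.0))       # per-pixel neg log-lik
    return (nll * confident).sum() / max(confident.sum(), 1)
```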
Training on 28,601 images across over 30 crop species and 15 environmental conditions encompasses pixel-annotated samples, pseudo-labeled depth images, and full-coverage foreground masks (Zhang et al., 18 Jan 2026).
Synthetic Data Pipeline
In the waste sorting scenario, synthetic depth maps are rendered via Blender with precise bin and stereo-projection modeling. Physics-driven object scattering and domain randomization ensure variability. No post-hoc noise augmentation is performed; stereo-matching artifacts emerge naturally. Annotation is automatic given full pose control, enabling perfect boundary and mask ground-truth derivation for the 400 multi-object, 200 validation, 400 test, and 1,500 mono-object images.
4. Loss Functions and Optimization Protocols
The edge-driven regime introduces a contour-oriented loss in which boundary pixels are upweighted:

$$\mathcal{L}_{\mathrm{contour}} = -\sum_{p} w_p \big[\, y_p \log \hat{y}_p + (1 - y_p) \log(1 - \hat{y}_p) \,\big], \qquad w_p > 1 \ \text{for boundary pixels}$$

In the crop model, the AdamW optimizer is used (weight decay 0.05, layer-wise learning-rate decay 0.9) with 30 epochs per training stage and batch size 8 per GPU (Zhang et al., 18 Jan 2026). The synthetic-data FCN models use SGD (momentum 0.9, batch size 1, 120 epochs).
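The contour-oriented weighting can be sketched as a weighted binary cross-entropy; the `boundary_weight` value below is an assumption for illustration, not a figure from the paper:

```python
import numpy as np

def contour_weighted_bce(pred, target, boundary, boundary_weight=5.0):
    """Contour-oriented loss sketch: BCE with boundary pixels upweighted.

    pred:     (H, W) predicted foreground probabilities
    target:   (H, W) binary ground-truth mask
    boundary: (H, W) boolean map of inter-instance boundary pixels
    """
    w = np.where(boundary, boundary_weight, 1.0)     # upweight boundaries
    p = np.clip(pred, 1e-12, 1 - 1e-12)              # numerical safety
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return (w * bce).mean()
```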
5. Performance Metrics and Evaluation
The primary evaluation metric is mean intersection-over-union (mIoU):

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$$

computed over the $C$ segmentation classes.
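A straightforward reference implementation of this metric (standard mIoU, not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU averaged over classes present in either map.

    pred, target: integer label arrays of the same shape
    """
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom:                    # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious))
```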
DepthCropSeg++ in agricultural vision achieves 93.11% mIoU on a 6,760-image test set, outperforming supervised baselines, the Segment Anything Model, HQ-SAM, and GWFSS. Robustness is demonstrated in night-time (86.90%), unseen-soybean (90.09%), and full-coverage canopy (99.86%) scenarios (Zhang et al., 18 Jan 2026).
For synthetic depth-based segmentation, contour detection F-score and boundary precision vastly exceed previous patch-based and DeepMask baselines. On multi-object synthetic test sets, average IoU reaches 0.89 and boundary precision 0.90; transfer from synthetic to real data introduces an F-score gap, attributed partly to annotation noise.
| Method | Avg IoU (multi-object) | Boundary precision (multi-object) |
|---|---|---|
| DeepMask | 0.83 | 0.51 |
| FCN-DOL (ours) | 0.89 | 0.90 |
6. Practical Considerations and Deployment
DepthCropSeg++ models run efficiently on modern hardware. The agricultural segmentation experiments use Ubuntu 20.04, Python 3.8, PyTorch 1.9.0, four NVIDIA RTX A6000 GPUs, and Xeon Silver CPUs. Inference runs at 1.44 FPS and 2,473 GFLOPs, requiring 7.4 GB of GPU memory (Zhang et al., 18 Jan 2026). The edge-driven robotic picking model fits on an 8 GB GPU with a forward-pass latency of 30 ms; user interaction dominates cycle time (4 s per object). Limitations include the large parameter count and sensitivity to seed placement near boundaries, which can be mitigated via multi-seed strategies.
7. Significance and Applications
DepthCropSeg++ enables high-precision object or crop delineation in scenarios where manual labeling is prohibitive. The foundation model's generalization across species and environments establishes new performance benchmarks for downstream agricultural tasks (phenotyping, density estimation, weed control). The edge-mask duality approach, leveraging synthetic data, provides robust, real-time interactive segmentation for robotic waste sorting and object picking applications. Both show that depth and dynamic edge-focused architectures can overcome significant data annotation and generalization challenges (Zhang et al., 18 Jan 2026, Grard et al., 2018).