DepthCropSeg++: Depth Image Segmentation
- The article covers two related approaches: a foundation crop segmentation model and a seed-driven instance segmentation method, the former achieving 93.11% mIoU on agricultural data.
- It leverages depth information with specialized FADE upsampling and contour-oriented loss functions to effectively reduce manual annotation effort.
- The methodology is applicable in precision agriculture and robotic picking, outperforming established baselines in both real and synthetic environments.
DepthCropSeg++ denotes two distinct but methodologically related research approaches addressing segmentation in depth images: (1) a cross-species, cross-environment crop segmentation foundation model for in-field agricultural tasks (Zhang et al., 18 Jan 2026), and (2) a seed-driven, edge-guided instance segmentation method primarily for robotic picking in industrial scenes, trained exclusively on synthetic depth data (Grard et al., 2018). Both paradigms leverage depth information for object delineation and employ specialized architectural and training strategies to circumvent manual pixel-level annotation bottlenecks.
1. Model Architectures
Crop Segmentation Foundation Model
DepthCropSeg++ in agricultural contexts utilizes the BEiT-Adapter-Large backbone, which pairs a ViT encoder (patch size 16, 12 transformer blocks) with locally-biased adapters. Each adapter processes multi-scale ViT features via a convolution and local window attention to produce refined features. Decoder upsampling departs from standard bilinear interpolation by introducing FADE (Feature Adaptive Dynamic Upsampling), which generates a position-specific kernel at every spatial position through weighted fusion of shallow encoder features $\mathcal{E}$ and global transformer decoder features $\mathcal{D}$:

$$Y(p) = \sum_{q \in \Omega(p)} W_p(q)\, X(q)$$

where $\Omega(p)$ is a local window around position $p$, $X$ is the feature map being upsampled, and the kernel $W_p$ is predicted from the fused $\mathcal{E}$ and $\mathcal{D}$ features. Edge-awareness is enhanced via dynamic gating: higher gating values are allocated near object boundaries, preserving fine detail while maintaining semantic coherence within interiors (Zhang et al., 18 Jan 2026).
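The boundary-aware gating between encoder detail and decoder semantics can be sketched in a few lines of NumPy. This is an illustrative fusion under assumed tensor shapes, not the paper's implementation; the function name and the sigmoid gate are assumptions:

```python
import numpy as np

def gated_fusion_upsample(decoder_feat, encoder_feat, gate_logits):
    """Illustrative sketch of FADE-style gated fusion (not the paper's code).

    decoder_feat: (C, H, W) low-res semantic features, already upsampled 2x
    encoder_feat: (C, H, W) shallow high-res features carrying fine edges
    gate_logits:  (H, W) per-pixel gate scores; higher near object boundaries
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid -> [0, 1]
    # Near boundaries (gate -> 1) trust encoder detail;
    # in object interiors (gate -> 0) trust decoder semantics.
    return gate[None] * encoder_feat + (1.0 - gate[None]) * decoder_feat
```

At gate extremes the output reduces to a pure encoder or pure decoder feature, which is the intuition behind edge-aware upsampling: detail where it matters, coherence elsewhere.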
Seed-Based Instance Segmentation
In the robotic bulk extraction setting, DepthCropSeg++ is implemented as a VGG-16–based fully convolutional encoder–decoder with skip connections (Grard et al., 2018). The input stacks the normalized depth map with a user-provided binary seed-location map, enabling per-object segmentation. The network jointly predicts (1) inter-instance boundary logits and (2) seed-specific mask logits. The final mask is extracted by thresholding the mask logits and running connected-component analysis anchored at the user click. Skip connections preserve high-frequency boundary information, and the receptive field at the deepest layer spans a large portion of the input.
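The click-anchored post-processing step described above can be sketched with SciPy's connected-component labeling. This is an assumed reconstruction of the procedure, not the authors' code; the threshold of 0.0 on the logits is an assumption:

```python
import numpy as np
from scipy import ndimage

def extract_seed_mask(mask_logits, click_yx, threshold=0.0):
    """Sketch of click-anchored mask extraction: threshold the mask logits,
    then keep only the connected component containing the user's seed click.
    (Illustrative post-processing; the 0.0 threshold is an assumption.)"""
    binary = mask_logits > threshold
    labels, _ = ndimage.label(binary)   # 4-connected components by default
    seed_label = labels[click_yx]       # component id under the click
    if seed_label == 0:                 # click fell on background
        return np.zeros_like(binary)
    return labels == seed_label
```

Anchoring on the component under the click is what makes the output a single per-object mask even when several objects exceed the threshold.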
2. Leveraging Depth Information and Edge-Mask Duality
Both approaches exploit the duality between edge localization and mask connectivity in depth imagery. In the seed-driven setting, a boundary pixel is one whose neighborhood straddles two instances; writing $M$ for the instance label map and $\mathcal{N}(p)$ for the 4-neighborhood of pixel $p$, boundaries and masks are related by:

$$B(p) = \mathbb{1}\big[\, \exists\, q \in \mathcal{N}(p) : M(q) \neq M(p) \,\big]$$
Explicit supervision of boundaries guides the network toward sharper instance separation, mitigating label leakage between adjacent objects or crop rows.
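The edge-mask duality can be made concrete: given an instance label map, boundary ground truth is recoverable by checking 4-neighborhood label disagreement. An illustrative NumPy version (the paper supervises boundaries directly; this only shows how the two representations interconvert):

```python
import numpy as np

def instance_boundaries(label_map):
    """Mark pixels whose 4-neighborhood contains a different instance label.

    label_map: (H, W) integer instance labels; returns an (H, W) boolean map.
    """
    b = np.zeros(label_map.shape, dtype=bool)
    b[:-1, :] |= label_map[:-1, :] != label_map[1:, :]   # down neighbor differs
    b[1:, :]  |= label_map[1:, :]  != label_map[:-1, :]  # up neighbor differs
    b[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]   # right neighbor differs
    b[:, 1:]  |= label_map[:, 1:]  != label_map[:, :-1]  # left neighbor differs
    return b
```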
Agricultural DepthCropSeg++ incorporates depth-informed pseudo-label generation. Monocular depth maps (from Depth Anything V2) undergo gradient-guided histogram thresholding to yield binary masks, which, after filtering, constitute high-quality pseudo-labels that supplement sparse annotation.
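A minimal sketch of depth-based pseudo-labeling under stated assumptions: the percentile cutoff and the choice of thresholding at the mean depth of high-gradient pixels are illustrative stand-ins for the paper's gradient-guided histogram scheme, whose exact form is not given here:

```python
import numpy as np

def depth_pseudo_label(depth, grad_pct=90):
    """Sketch: foreground = pixels nearer than the depth level of the
    strongest depth discontinuities (assumed variant, not the paper's).

    depth: (H, W) monocular depth map; returns an (H, W) boolean mask.
    """
    gy, gx = np.gradient(depth.astype(np.float64))
    grad = np.hypot(gx, gy)
    # Depth discontinuities (plant/soil transitions) carry large gradients;
    # take the mean depth over the strongest-gradient pixels as the threshold.
    edge_pixels = grad >= np.percentile(grad, grad_pct)
    t = depth[edge_pixels].mean()
    return depth < t
```

For a scene with near foreground and far background separated by a sharp depth step, the high-gradient pixels straddle the step, so the threshold lands between the two depth levels.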
3. Training Strategies and Dataset Scale
Crop Segmentation
A two-stage self-training curriculum is employed. Stage 1 (coarse supervision) trains on a mix of manually annotated and depth-pseudo-labeled images using pixel-wise cross-entropy:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|P|} \sum_{p \in P} \sum_{c} y_{p,c} \log \hat{y}_{p,c}$$

where $P$ is the set of labeled pixels, $y_{p,c}$ the one-hot label, and $\hat{y}_{p,c}$ the predicted class probability.
Stage 2 (refined supervision) utilizes model predictions to construct a trimap mask $T$ marking confident pixels, focusing optimization on them via masked cross-entropy:

$$\mathcal{L}_{\mathrm{mCE}} = -\frac{\sum_{p \in P} T_p \sum_{c} y_{p,c} \log \hat{y}_{p,c}}{\sum_{p \in P} T_p}$$
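The masked cross-entropy amounts to an ordinary per-pixel loss averaged only over confident pixels. An illustrative NumPy version (a sketch, not the paper's implementation):

```python
import numpy as np

def masked_cross_entropy(probs, labels, confident):
    """Stage-2-style masked cross-entropy sketch.

    probs:     (P, C) predicted class probabilities per pixel
    labels:    (P,)   integer class labels (possibly pseudo-labels)
    confident: (P,)   boolean trimap mask; loss counts only True pixels
    """
    p_true = probs[np.arange(len(labels)), labels]   # prob of labeled class
    nll = -np.log(np.clip(p_true, 1e-12, 1.0))       # per-pixel neg log-lik
    return (nll * confident).sum() / max(confident.sum(), 1)
```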
Training on 28,601 images across over 30 crop species and 15 environmental conditions encompasses pixel-annotated samples, pseudo-labeled depth images, and full-coverage foreground masks (Zhang et al., 18 Jan 2026).
Synthetic Data Pipeline
In the waste sorting scenario, synthetic depth maps are rendered via Blender with precise bin and stereo-projection modeling. Physics-driven object scattering and domain randomization ensure variability. No post-hoc noise augmentation is performed; stereo-matching artifacts emerge naturally. Annotation is automatic given full pose control, enabling perfect boundary and mask ground-truth derivation for the 400 multi-object, 200 validation, 400 test, and 1,500 mono-object images.
4. Loss Functions and Optimization Protocols
The edge-driven regime introduces a contour-oriented loss in which boundary pixels are upweighted:

$$\mathcal{L}_{\mathrm{contour}} = -\sum_{p} w_p \big[\, y_p \log \hat{y}_p + (1 - y_p) \log(1 - \hat{y}_p) \,\big], \qquad w_p > 1 \ \text{for boundary pixels}$$

In the crop model, the AdamW optimizer is used (weight decay 0.05, layer-wise learning-rate decay 0.9) with 30 epochs per training stage and batch size 8 per GPU (Zhang et al., 18 Jan 2026). The synthetic-data FCN models use SGD (momentum 0.9, batch size 1, 120 epochs).
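The contour-oriented weighting can be sketched as a weighted binary cross-entropy; the `boundary_weight` value below is an assumption for illustration, not a figure from the paper:

```python
import numpy as np

def contour_weighted_bce(pred, target, boundary, boundary_weight=5.0):
    """Contour-oriented loss sketch: BCE with boundary pixels upweighted.

    pred:     (H, W) predicted foreground probabilities
    target:   (H, W) binary ground-truth mask
    boundary: (H, W) boolean map of inter-instance boundary pixels
    """
    w = np.where(boundary, boundary_weight, 1.0)     # upweight boundaries
    p = np.clip(pred, 1e-12, 1 - 1e-12)              # numerical safety
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return (w * bce).mean()
```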
5. Performance Metrics and Evaluation
The primary evaluation metric is mean intersection-over-union (mIoU):

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}$$

computed over the $C$ segmentation classes.
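A straightforward reference implementation of this metric (standard mIoU, not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU averaged over classes present in either map.

    pred, target: integer label arrays of the same shape
    """
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom:                    # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious))
```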
DepthCropSeg++ in agricultural vision achieves 93.11% mIoU on a 6,760-image test set, outperforming supervised baselines, the Segment Anything Model, HQ-SAM, and GWFSS. Robustness is demonstrated in night-time (86.90%), unseen-soybean (90.09%), and full-coverage canopy (99.86%) scenarios (Zhang et al., 18 Jan 2026).
For synthetic depth-based segmentation, contour detection F-score and boundary precision vastly exceed previous patch-based and DeepMask baselines. On multi-object synthetic test sets, average IoU reaches 0.89 and boundary precision 0.90; transfer from synthetic to real data introduces an F-score gap, attributed partly to annotation noise.
| Method | Avg IoU (multi-object) | Boundary precision (multi-object) |
|---|---|---|
| DeepMask | 0.83 | 0.51 |
| FCN-DOL (ours) | 0.89 | 0.90 |
6. Practical Considerations and Deployment
DepthCropSeg++ models run efficiently on modern hardware. The agricultural segmentation experiments use Ubuntu 20.04, Python 3.8, PyTorch 1.9.0, four NVIDIA RTX A6000 GPUs, and Xeon Silver CPUs. Inference runs at 1.44 FPS and 2,473 GFLOPs, requiring 7.4 GB of GPU memory (Zhang et al., 18 Jan 2026). The edge-driven robotic picking model fits on an 8 GB GPU with a forward-pass latency of 30 ms; user interaction dominates cycle time (4 s per object). Limitations include the large parameter count and sensitivity to seed placement near boundaries, which can be mitigated via multi-seed strategies.
7. Significance and Applications
DepthCropSeg++ enables high-precision object or crop delineation in scenarios where manual labeling is prohibitive. The foundation model's generalization across species and environments establishes new performance benchmarks for downstream agricultural tasks (phenotyping, density estimation, weed control). The edge-mask duality approach, leveraging synthetic data, provides robust, real-time interactive segmentation for robotic waste sorting and object picking applications. Both show that depth and dynamic edge-focused architectures can overcome significant data annotation and generalization challenges (Zhang et al., 18 Jan 2026, Grard et al., 2018).