D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

Published 26 Feb 2026 in cs.CV | (2602.23043v1)

Abstract: Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head; segmentation-aware training, including box-cropped BCE and dice mask losses with auxiliary and denoising mask supervision; and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. Our second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. The framework is released as open source under the Apache-2.0 license. GitHub repository - https://github.com/ArgoHA/D-FINE-seg.

Summary

  • The paper introduces a transformer-based instance segmentation framework that improves F1-score over YOLO26-seg on TACO with roughly 10% latency overhead.
  • It leverages a lightweight mask head and specialized training protocols integrating multi-scale features and auxiliary decoder supervision for robust performance.
  • The framework streamlines practical deployment through multi-backend support, enabling efficient inference on diverse platforms including edge devices.

D-FINE-seg: A Modular Transformer-Based Framework for Real-Time Instance Segmentation and Multi-Backend Deployment

Introduction and Motivation

D-FINE-seg extends the D-FINE object detection architecture to real-time instance segmentation by introducing a lightweight mask head, segmentation-specific training protocols, and streamlined deployment across major inference backends. The motivation originates from the scarcity of practical transformer-based segmentation models optimized for both runtime and cross-platform inference. D-FINE-seg confronts the challenge of delivering high segmentation accuracy with minimal latency inflation and addresses deployment friction by supporting ONNX, TensorRT, and OpenVINO from export to inference.

Architecture and Training Protocol

The detection backbone in D-FINE-seg inherits Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD) from D-FINE, both critical for improved localization and learning dynamics. The network follows a backbone (CNN) – HybridEncoder (multi-scale PAN) – Transformer Decoder structure. The mask head draws inspiration from Mask DINO (Li et al., 2022), utilizing decoder queries projected into per-instance mask embeddings. The mask computation is simplified: rather than ingesting high-resolution backbone features, the mask head exclusively consumes multi-scale PAN outputs and achieves the desired mask resolution via bilinear upsampling and stride-preserving convolutions.
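The query-embedding-to-mask computation described above can be sketched as follows. This is an illustrative reconstruction of the Mask-DINO-style idea, not the authors' implementation: the function name, tensor shapes, and the nearest-neighbor resize (standing in for the paper's bilinear upsampling) are all assumptions.

```python
import numpy as np

def mask_head(query_embeds, pan_features, out_size):
    """Sketch of a Mask-DINO-style mask head (hypothetical shapes).

    query_embeds : (num_queries, d) per-instance mask embeddings
                   projected from the decoder queries.
    pan_features : (d, h, w) fused multi-scale PAN feature map.
    out_size     : (H, W) target mask resolution.
    """
    d, h, w = pan_features.shape
    # Per-instance mask logits: dot product of each query embedding
    # with every pixel's feature vector.
    logits = query_embeds @ pan_features.reshape(d, h * w)  # (Q, h*w)
    logits = logits.reshape(-1, h, w)
    # Upsample to the desired mask resolution (the paper uses bilinear
    # interpolation; nearest-neighbor keeps this sketch dependency-free).
    H, W = out_size
    ys = (np.arange(H) * h / H).astype(int)
    xs = (np.arange(W) * w / W).astype(int)
    return logits[:, ys][:, :, xs]  # (Q, H, W)
```

The key simplification the paper describes is visible here: the mask head only reads the fused PAN features, never the stride-4 backbone features, so the extra compute scales with the (small) encoder feature map.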

Segmentation training applies box-cropped BCE and dice losses only within object ROIs, using ground-truth masks resized to the mask output scale. Auxiliary supervision at intermediate decoder layers, as well as denoising mask supervision, improves accuracy without runtime cost. The Hungarian matcher is adapted to include mask dice overlap and sigmoid focal mask cost terms in the matching cost, yielding more robust instance association.
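The box-cropped loss can be sketched as below, assuming a single instance with integer box coordinates already mapped to the mask resolution; the function names and the exact loss weighting are illustrative, not taken from the paper.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss over a binary mask (pred in [0, 1])."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def box_cropped_mask_loss(mask_logits, gt_mask, box):
    """BCE + dice computed only inside the matched box (ROI).

    mask_logits : (H, W) predicted mask logits for one instance
    gt_mask     : (H, W) ground-truth mask at the same resolution
    box         : (x1, y1, x2, y2) integer box in mask coordinates
    """
    x1, y1, x2, y2 = box
    # Restrict supervision to the ROI so background far from the
    # object does not dominate the loss.
    p = 1.0 / (1.0 + np.exp(-mask_logits[y1:y2, x1:x2]))  # sigmoid
    t = gt_mask[y1:y2, x1:x2]
    bce = -(t * np.log(p + 1e-9) + (1 - t) * np.log(1 - p + 1e-9)).mean()
    return bce + dice_loss(p, t)
```

In the matcher, the same dice term (plus a sigmoid focal mask cost) would be evaluated between every query and every ground-truth instance to form the mask portion of the assignment cost matrix.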

Implementation and Practical Deployment

D-FINE-seg is developed as an open-source framework with end-to-end reproducibility and extensibility as primary objectives. The system features a unified configuration schema, grouped learning rate control, advanced augmentations, and systematic benchmarking across quantization levels and export formats. Mask evaluations are memory-efficient due to run-length encoding and batched computation. The software stack ensures compatibility with edge devices via OpenVINO INT8 quantization and multi-format export, addressing a key practical bottleneck for deployment in resource-constrained environments.
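Run-length encoding keeps mask evaluation memory-efficient because a binary mask collapses to a short list of run lengths. A minimal sketch of the idea is below; this illustrates plain RLE, not the framework's actual (likely COCO-style) encoding.

```python
def rle_encode(mask_flat):
    """Run-length encode a flattened binary mask as alternating
    run lengths, starting with the length of the initial 0-run
    (an illustrative sketch, not the COCO RLE format)."""
    runs, count, current = [], 0, 0
    for v in mask_flat:
        if v == current:
            count += 1
        else:
            runs.append(count)
            current, count = v, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    """Invert rle_encode back to the flat binary mask."""
    out, val = [], 0
    for r in runs:
        out.extend([val] * r)
        val = 1 - val
    return out
```

A mask like `[0, 0, 1, 1, 1, 0]` becomes the three runs `[2, 3, 1]`, so evaluation can operate on a few integers instead of a full-resolution boolean array.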

Experimental Validation and Numerical Results

The experimental protocol involves a head-to-head comparison with Ultralytics YOLO26(-seg) on the TACO dataset, with both models fine-tuned from COCO-pretrained weights. The evaluation specifically emphasizes F1-score at fixed thresholds, latency measured on TensorRT FP16 export, and additional reporting of mean Intersection-over-Union (IoU), Precision, and Recall. The models are benchmarked under controlled resource allocation (NVIDIA RTX 5070 Ti, Intel Core i5-12400F).
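For reference, the headline F1-score is the harmonic mean of precision and recall, computed from true-positive, false-positive, and false-negative counts at the fixed confidence/IoU threshold:

```python
def f1_score(tp, fp, fn):
    """F1 from detection counts at a fixed threshold.

    F1 is the harmonic mean of precision (tp / (tp + fp))
    and recall (tp / (tp + fn)).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```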

Key findings include:

  • D-FINE-seg achieves a mean relative F1-score improvement of approximately 65% over YOLO26-seg on instance segmentation, with an average latency overhead of ~10%.
  • In object detection, D-FINE exhibits about 70% higher F1-score than YOLO26, at roughly 1% additional latency.
  • COCO-style mask mAP is improved by ~41%, and box mAP by ~49% on average. Though YOLO26-seg-M marginally surpasses D-FINE-seg-M in mask mAP, D-FINE-seg dominates on all other model sizes (N, S, L, X).
  • On edge hardware (Intel N150, OpenVINO INT8), D-FINE-seg S delivers nearly double the F1-score of YOLO26-seg S, though with the latency tradeoff expected of transformer-based architectures.

All latency measurements exclude disk I/O but include preprocessing, model execution, and postprocessing.

Theoretical and Practical Implications

The design demonstrates that real-time instance segmentation via transformers can be achieved without compromising latency or export flexibility. By forgoing the high-resolution stride-4 backbone features in the mask head and instead leveraging well-fused encoder representations, D-FINE-seg delivers superior accuracy-latency trade-offs vis-à-vis YOLO26-based segmentation. The framework’s extensible and backend-agnostic structure streamlines deployment into production settings, especially on edge and heterogeneous computing devices where platform lock-in is a concern.

The mask supervision protocol—relying on ROI-cropped BCE/dice losses and auxiliary denoising mask supervision—improves mask quality without inference cost, and auxiliary decoder supervision further stabilizes optimization. The addition of mask-aware matching in the Hungarian assignment explicitly enhances the training signal for the segmentation head.

Limitations and Future Directions

While D-FINE-seg is validated on TACO with thorough methodology, its generalizability to other datasets and real-world scenarios requires further study. Pretraining the mask head on large-scale segmentation datasets (e.g., COCO) is expected to further benefit mask boundary precision and overall performance. Exploration of faster mask decoding or dynamic input scaling could further reduce latency, making transformer-based segmentation even more attractive for embedded workloads.

Conclusion

D-FINE-seg represents a systematic advance in transformer-based object detection and instance segmentation, demonstrating that strong accuracy-latency trade-offs and multi-backend support are achievable within a unified framework. Its modular design, reproducible pipelines, and superior empirical results position it as a credible alternative to iterative anchor-based systems, with substantial implications for both academic research and real-world deployment scenarios. Future enhancements in training protocols and broader validation will elucidate its full potential and applicability.
