YOLOv10: Real-Time Object Detection
- YOLOv10 is an advanced one-stage object detection framework employing NMS-free inference and dual label assignment to achieve low latency and competitive accuracy.
- It leverages architectural innovations such as a CSPNet-inspired backbone, rank-guided blocks, and pyramid spatial attention to optimize efficiency and performance on small-object detection.
- Benchmark analyses show YOLOv10 achieves significant reductions in FLOPs, parameter count, and latency while maintaining or improving mAP compared to prior YOLO models and real-time DETR counterparts.
YOLOv10 is an advanced one-stage object detection framework developed to deliver end-to-end, real-time inference with improved speed, efficiency, and competitive accuracy across varied application domains. As the tenth major installment in the YOLO (“You Only Look Once”) family, YOLOv10 introduces several architectural and methodological innovations, with emphasis on NMS-free inference, dual label assignment, and holistic component-level optimizations. Benchmarks demonstrate significant reductions in FLOPs, parameter count, and latency, alongside improved or maintained average precision compared to previous YOLOs and real-time DETR models (Wang et al., 2024). This approach strategically targets deployment on resource-constrained devices while maintaining high performance in complex scenarios, notably small-object detection and multitask vision tasks.
1. Architectural Innovations and Module Design
YOLOv10 restructures the backbone, neck, and head to minimize computational overhead and facilitate direct, NMS-free output. The backbone employs CSPNet-inspired or rank-guided blocks, replacing heavier prior stages:
- Backbone:
- CSPNet-based, with partial dense connections for efficient gradient flow and reduced redundancy (Choudhary et al., 2024).
- Rank-guided blocks split channels by an importance metric, combining 3x3 spatial convolutions with 1x1 channel projections (Alif et al., 2024, Hussain, 2024).
- Spatial-Channel Decoupled Downsampling utilizes 1x1 conv to adjust channel width and depthwise 3x3 conv to reduce spatial resolution—recombining feature pathways to avoid information bottlenecks (Wang et al., 2024, Hussain, 2024).
- Neck:
- PANet-based multi-scale feature aggregation, with lightweight bottleneck convolutions and selective large-kernel modules for expanded receptive field (Fergus et al., 2024, Hussain, 2024).
- Advanced attention: Pyramid Spatial Attention (PSA) block for multi-level spatial emphasis, particularly valuable in cluttered environments such as underwater scenes or medical images (Wuntu et al., 22 Sep 2025, Choudhary et al., 2024).
- Adaptive fusion strategies such as CSPStage blocks and enhanced connectivity for improved information flow (Tian et al., 2024).
- Head:
- Decoupled dual-branch output: one-to-many for robust training, one-to-one for precise inference (Wang et al., 2024, Geetha et al., 2024, Tan et al., 2024).
- Lightweight classification head (depthwise separable convs) and standard regression blocks to balance localization accuracy and inference speed (Wang et al., 2024, Hussain, 2024).
- Attention and Scaling:
- PSA and partial self-attention modules boost long-range dependency modeling with minimal parameter increase (Fergus et al., 2024, Ahmed et al., 2024).
- Large-kernel convolutions selectively placed in backbone/neck for greater receptive field (Fergus et al., 2024, Wang et al., 2024).
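The spatial-channel decoupled downsampling described above can be sketched in a few lines: a 1x1 pointwise convolution first adjusts channel width, then a stride-2 depthwise 3x3 convolution halves spatial resolution per channel. This is a minimal NumPy sketch with naive loops and illustrative shapes, not YOLOv10's actual implementation:

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution: mixes channels only. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))  # -> (C_out, H, W)

def depthwise_conv3x3_s2(x, k):
    """Depthwise 3x3 conv, stride 2, padding 1: halves H and W per channel.
    x: (C, H, W), k: (C, 3, 3)."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out_h, out_w = (h + 1) // 2, (w + 1) // 2
    out = np.zeros((c, out_h, out_w))
    for ci in range(c):
        for i in range(out_h):
            for j in range(out_w):
                patch = xp[ci, 2 * i:2 * i + 3, 2 * j:2 * j + 3]
                out[ci, i, j] = np.sum(patch * k[ci])
    return out

def decoupled_downsample(x, w_pw, k_dw):
    """Channel modulation first (cheap 1x1), then spatial reduction (depthwise)."""
    x = pointwise_conv(x, w_pw)
    return depthwise_conv3x3_s2(x, k_dw)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16, 16))       # (C=32, H=16, W=16) feature map
w_pw = rng.standard_normal((64, 32)) * 0.1  # expand channels 32 -> 64
k_dw = rng.standard_normal((64, 3, 3)) * 0.1
y = decoupled_downsample(x, w_pw, k_dw)
print(y.shape)  # (64, 8, 8): doubled channels, halved resolution
```

Compared with a standard stride-2 3x3 convolution that mixes channels and reduces resolution in one step, this factorization performs the two jobs separately, which is what cuts parameters and FLOPs while avoiding an information bottleneck.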
2. NMS-Free Dual Label Assignment and Loss Formulation
YOLOv10’s defining methodological advancement is its consistent dual label assignment, which enables NMS-free inference:
- Dual Assignment Scheme: Training leverages both one-to-many (maximizing recall) and one-to-one (maximizing precision) matching metrics, typically using a task-aligned label assignment. During inference, only the one-to-one assignment is used, obviating post-processing NMS (Wang et al., 2024, Alif et al., 2024, Hussain, 2024, Geetha et al., 2024, Ahmed et al., 2024).
- Matching Metric: For an anchor with classification score $p$ and predicted box $\hat{b}$ matched against ground-truth box $b$, both branches use the unified metric $m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \text{IoU}(\hat{b}, b)^{\beta}$, with hyperparameters $\alpha$ and $\beta$, a spatial prior $s$ (whether the anchor point falls inside the instance), and the class probability $p$ (Wang et al., 2024).
- Loss Function: The standard formulation combines a localization term (CIoU or DFL), a classification term (BCE or CE), and an objectness penalty, $\mathcal{L} = \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{obj}} \mathcal{L}_{\text{obj}}$, with configuration-dependent weights $\lambda$ (Choudhary et al., 2024, Alif et al., 2024). Distribution Focal Loss (DFL) is sometimes used to discretize box offsets (Choudhary et al., 2024).
- CIoU Box Regression: $\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$, where $\rho$ measures the distance between predicted and ground-truth box centers, $c$ is the diagonal of the smallest enclosing box, $v$ captures aspect-ratio consistency, and $\alpha$ is a scale factor (Hung et al., 16 Sep 2025).
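The dual assignment scheme can be illustrated with a small NumPy sketch: both branches rank candidates by the same metric $m = s \cdot p^{\alpha} \cdot \text{IoU}^{\beta}$, but the one-to-many branch keeps the top-k candidates for training while the one-to-one branch keeps only the best one, which is what removes NMS at inference. The boxes, scores, and the choices $\alpha = 0.5$, $\beta = 6.0$ below are illustrative, not the paper's settings:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one gt box and N predicted boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_metric(p, ious, s, alpha=0.5, beta=6.0):
    """Unified metric m = s * p^alpha * IoU^beta shared by both branches."""
    return s * p**alpha * ious**beta

# Four candidate predictions for one ground-truth box.
gt = np.array([0.0, 0.0, 10.0, 10.0])
preds = np.array([[0.0, 0.0, 10.0, 10.0],     # perfect overlap
                  [1.0, 1.0, 11.0, 11.0],     # good overlap
                  [5.0, 5.0, 15.0, 15.0],     # partial overlap
                  [20.0, 20.0, 30.0, 30.0]])  # no overlap
p = np.array([0.7, 0.9, 0.8, 0.95])   # classification scores
s = np.array([1.0, 1.0, 1.0, 0.0])    # spatial prior: anchor inside instance?

m = match_metric(p, iou(gt, preds), s)
one_to_many = np.argsort(m)[::-1][:2]  # top-k positives: training-only branch
one_to_one = int(np.argmax(m))         # single positive: inference path, no NMS
print(one_to_many, one_to_one)
```

Because both branches share the metric, the one-to-one branch learns to agree with the richer one-to-many supervision, so dropping the one-to-many branch at inference costs little accuracy.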
3. Quantitative Performance and Ablation Analysis
YOLOv10 achieves state-of-the-art efficiency and competitive accuracy in benchmarks against prior YOLOs and real-time DETRs:
- COCO Test-Dev (640x640):

| Model | Params (M) | FLOPs (G) | [email protected]:0.95 | FPS (GPU) | Latency (ms) |
|---|---|---|---|---|---|
| YOLOv10-S | 7.2 | 21.6 | 46.8 | ~400 | 2.49 |
| YOLOv10-M | 15.4 | 59.1 | 51.3 | ~211 | 4.74 |
| YOLOv10-L | 24.4 | 120.3 | 53.4 | ~137 | 7.28 |
| YOLOv10-X | 29.5 | 160.4 | 54.4 | ~93 | 10.70 |
| YOLOv8-s | 11.2 | 28.6 | 44.9 | ~128 | 1.20 |
| YOLOv5s | 7.2 | 16.5 | 37.4 | -- | 98 (CPU) |

(Wang et al., 2024, Hussain, 2024)
- Ablation Insights:
- Removing NMS and relying on dual label assignment yields ~20% speed improvement with negligible AP loss (Alif et al., 2024).
- Spatial-channel decoupling cuts FLOPs by 15%, confers +0.3% AP on small objects (Alif et al., 2024).
- Rank-guided blocks improve robustness to occlusion and further reduce parameters (Hussain, 2024, Alif et al., 2024).
- Splitting the head yields a small mAP gain with minor latency increase (Alif et al., 2024).
- Adding PSA, SPPF, and CSPStage modules boosts performance in small-object and cluttered scenarios, as shown in aquaculture, remote sensing, and medical applications (Tian et al., 2024, Wuntu et al., 22 Sep 2025, Choudhary et al., 2024).
4. Domain Applications and Transfer Learning
YOLOv10 is applied across marine ecology, medical imaging, retail automation, agricultural robotics, and safety monitoring:
- Marine Biodiversity: YOLOv10-n achieves mAP@50=0.966 (DeepFish), 2.7M params, 8.4 GFLOPs, 29.3 FPS on CPU, suitable for real-time edge deployment (Wuntu et al., 22 Sep 2025).
- Medical Imaging: mAP@50-95 of 51.9% (GRAZPEDWRI-DX) for pediatric wrist fracture detection, surpassing YOLOv9-E by 8.6 pp. YOLOv10n detects blood cells at mAP@50 ≈ 0.99, 72 FPS (T4 GPU), 6.0M params (Choudhary et al., 2024, Ahmed et al., 2024).
- Agriculture: YOLOv10n achieves mAP@50=0.921, 5.5ms inference (fruitlet detection), with specialized modules such as C2fCIB and C2PSA (Sapkota et al., 2024).
- Retail Automation: MidState-YOLO-ED (YOLOv10 with YOLOv8 head) performs [email protected]=0.890, 110 FPS, ~3.3M params — effective for real-time product checkout (Tan et al., 2024).
- Kitchen Safety: YOLOv10 provides >300 FPS at 640px with a model size of ~18MB, though it underperforms YOLOv5/YOLOv8 on fine-grained hazards in the tested dataset (Geetha et al., 2024).
Transfer learning with layer freezing allows accuracy–efficiency trade-off: freezing the backbone preserves general-purpose features and saves up to 28% GPU memory with minimal mAP drop; aggressive freezing only suits simple or low-variation tasks (Dobrzycki et al., 5 Sep 2025).
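The layer-freezing trade-off above boils down to excluding a prefix of the network from optimizer updates. The toy, framework-free sketch below illustrates the mechanism; in a real training setup this corresponds to setting `requires_grad=False` on backbone modules, and the layer names here are purely illustrative:

```python
import numpy as np

# Toy parameter store: two backbone stages, a neck, and a detection head.
rng = np.random.default_rng(0)
params = {name: rng.standard_normal(4) for name in
          ["backbone.stage1", "backbone.stage2", "neck.fpn", "head.cls"]}
grads = {name: np.ones(4) for name in params}  # pretend gradients

def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Update only parameters whose name does not start with a frozen prefix.
    Frozen layers keep their pretrained weights and need no optimizer state,
    which is where the GPU-memory savings come from."""
    for name in params:
        if not any(name.startswith(pre) for pre in frozen_prefixes):
            params[name] = params[name] - lr * grads[name]

before = {k: v.copy() for k, v in params.items()}
sgd_step(params, grads, frozen_prefixes=["backbone."])  # freeze whole backbone

changed = [k for k in params if not np.allclose(params[k], before[k])]
print(changed)  # only neck/head parameters moved
```

Freezing only `backbone.*` matches the reported sweet spot: general-purpose features stay intact while the task-specific neck and head still adapt; freezing more than that only suits simple, low-variation tasks.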
5. Comparative Analysis versus Prior YOLOs and DETR Models
YOLOv10 delivers substantial improvements in model compactness and inference speed:
| Model | Params (M) | FLOPs (G) | [email protected]:0.95 | Inference (ms) | NMS-free |
|---|---|---|---|---|---|
| YOLOv10-S | 7.2 | 21.6 | 46.8 | 2.49 | Yes |
| YOLOv9-C | 25.3 | 102 | 52.5 | 10.57 | No |
| YOLOv8-X | 68.0 | 258 | 53.9 | 16.86 | No |
| RT-DETR-R101 | 76.0 | 259 | 54.3 | 13.71 | Yes |
YOLOv10 maintains or exceeds mAP with substantially reduced parameters and FLOPs relative to YOLOv9 and YOLOv8. NMS-free inference is a major contributor to reduced latency in real-world deployments (Wang et al., 2024, Hussain, 2024).
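The efficiency gap can be quantified directly from the figures quoted above. A short worked calculation (numbers copied from the table; accuracy differences between rows are set aside here):

```python
# Relative reductions of YOLOv10-S vs. prior models, figures from the table.
models = {  # name: (params in M, FLOPs in G, latency in ms)
    "YOLOv10-S":    (7.2, 21.6, 2.49),
    "YOLOv9-C":     (25.3, 102.0, 10.57),
    "YOLOv8-X":     (68.0, 258.0, 16.86),
    "RT-DETR-R101": (76.0, 259.0, 13.71),
}

def reduction(new, old):
    """Percentage reduction of `new` relative to `old`."""
    return 100.0 * (1.0 - new / old)

p10, f10, l10 = models["YOLOv10-S"]
for name in ("YOLOv9-C", "YOLOv8-X", "RT-DETR-R101"):
    p, f, l = models[name]
    print(f"vs {name}: params -{reduction(p10, p):.0f}%, "
          f"FLOPs -{reduction(f10, f):.0f}%, latency -{reduction(l10, l):.0f}%")
```

For example, against YOLOv9-C the S variant uses about 72% fewer parameters and 79% fewer FLOPs at roughly a quarter of the latency, albeit at a lower AP point in the table.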
6. Strengths, Limitations, and Future Directions
Strengths:
- End-to-end, real-time detection without NMS, yielding improved latency and throughput.
- Compelling accuracy–efficiency trade-off in low-power and small-object domains (embedded AUVs, agriculture, clinical devices) (Hung et al., 16 Sep 2025, Sapkota et al., 2024).
- Architectural advances enable fast deployment with minimal hardware and energy requirements.
Limitations:
- In some domain tests (e.g., fine-grained kitchen hazards, fine localization in agriculture) YOLOv10 lags behind specialized heads or higher-capacity variants (Geetha et al., 2024, Sapkota et al., 2024).
- The highest mAP margins are still achieved by YOLOv9-C and YOLOv11 in some scenarios (Sapkota et al., 2024).
- Data scarcity and segmentation ambiguity can reduce recall/precision, especially without task-adaptive heads (Geetha et al., 2024).
Directions for Research:
- Multi-modal fusion and self-supervised pretraining (Alif et al., 2024).
- Dynamic quantization and layer freezing for edge device deployment (Dobrzycki et al., 5 Sep 2025).
- Explainable detection via visualization and feature attribution (Dobrzycki et al., 5 Sep 2025).
- Task-aware adaptation including curriculum learning, domain-augmented label assignment, and head fusion modules for specialized tasks.
7. Concluding Perspective
YOLOv10 synthesizes advances in architectural streamlining, attention augmentation, and NMS-free inference to set new benchmarks in speed and resource-constrained deployment, particularly where real-time, small-object performance is critical. Its innovations—spatial–channel decoupling, partial self-attention, large-kernel modules, and consistent dual assignment—enable practitioners to achieve competitive or superior detection accuracy at greatly reduced latency and computational cost. The model’s flexibility for transfer learning, ablation-informed customization, and cross-domain applicability make it a foundational object detector for the current real-time vision landscape (Wang et al., 2024, Hussain, 2024, Dobrzycki et al., 5 Sep 2025).
Major Reference Works:
- Wang et al., “YOLOv10: Real-Time End-to-End Object Detection” (Wang et al., 2024)
- Hasan et al., “YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain” (Alif et al., 2024)
- Xu et al., “YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision” (Hussain, 2024)
- Lodhi et al., “Pediatric Wrist Fracture Detection in X-rays via YOLOv10 Algorithm and Dual Label Assignment System” (Ahmed et al., 2024)
- Fergus et al., “Towards Context-Rich Automated Biodiversity Assessments: Deriving AI-Powered Insights from Camera Trap Data” (Fergus et al., 2024)