
YOLOv10: Real-Time Object Detection

Updated 5 January 2026
  • YOLOv10 is an advanced one-stage object detection framework employing NMS-free inference and dual label assignment to achieve low latency and competitive accuracy.
  • It leverages architectural innovations such as a CSPNet-inspired backbone, rank-guided blocks, and pyramid spatial attention to optimize efficiency and performance on small-object detection.
  • Benchmark analyses show YOLOv10 achieves significant reductions in FLOPs, parameter count, and latency while maintaining or improving mAP compared to prior YOLO models and real-time DETR counterparts.

YOLOv10 is an advanced one-stage object detection framework developed to deliver end-to-end, real-time inference with improved speed, efficiency, and competitive accuracy across varied application domains. As the tenth major installment in the YOLO (“You Only Look Once”) family, YOLOv10 introduces several architectural and methodological innovations, with emphasis on NMS-free inference, dual label assignment, and holistic component-level optimizations. Benchmarks demonstrate significant reductions in FLOPs, parameter count, and latency, alongside improved or maintained average precision compared to previous YOLOs and real-time DETR models (Wang et al., 2024). This approach strategically targets deployment on resource-constrained devices while maintaining high performance in complex scenarios, notably small-object detection and multitask vision tasks.

1. Architectural Innovations and Module Design

YOLOv10 restructures the backbone, neck, and head to minimize computational overhead and facilitate direct, NMS-free output. The backbone employs CSPNet-inspired or rank-guided blocks, replacing heavier prior stages:

  • Backbone:
    • CSPNet-based, with partial dense connections for efficient gradient flow and reduced redundancy (Choudhary et al., 2024).
    • Rank-Guided Blocks split channels by an importance metric; spatial convolutions (3x3) combine with channel-projection (1x1) (Alif et al., 2024, Hussain, 2024).
    • Spatial-Channel Decoupled Downsampling utilizes 1x1 conv to adjust channel width and depthwise 3x3 conv to reduce spatial resolution—recombining feature pathways to avoid information bottlenecks (Wang et al., 2024, Hussain, 2024).
  • Neck:
    • PANet-based multi-scale feature aggregation, with lightweight bottleneck convolutions and selective large-kernel modules for expanded receptive field (Fergus et al., 2024, Hussain, 2024).
    • Advanced attention: Partial Self-Attention (PSA) block for multi-level spatial emphasis, particularly valuable in cluttered environments such as underwater scenes or medical images (Wuntu et al., 22 Sep 2025, Choudhary et al., 2024).
    • Adaptive fusion strategies such as CSPStage blocks and enhanced connectivity for improved information flow (Tian et al., 2024).
  • Head:
    • Lightweight classification head built from depthwise separable convolutions, paired with the regression branch to emit direct one-to-one predictions that need no NMS post-processing (Wang et al., 2024).
  • Attention and Scaling:
    • Partial self-attention after the deepest stage and large-kernel convolutions in the smaller variants enlarge the receptive field at modest cost; the design scales across N/S/M/B/L/X model sizes (Wang et al., 2024, Hussain, 2024).
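The spatial-channel decoupled downsampling described above can be sketched in a few lines of PyTorch. The module name `SCDown` and the channel widths below are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Sketch of spatial-channel decoupled downsampling: a 1x1 pointwise
    conv adjusts channel width, then a stride-2 depthwise 3x3 conv halves
    the spatial resolution (module and argument names assumed)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # channel projection (no spatial change)
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        # depthwise spatial reduction (no channel mixing)
        self.dw = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2,
                            padding=1, groups=c_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(self.pw(x))

x = torch.randn(1, 64, 32, 32)
y = SCDown(64, 128)(x)  # -> shape (1, 128, 16, 16)
```

Separating the two roles avoids a single dense stride-2 convolution, which is where the claimed parameter and FLOP savings come from.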

2. NMS-Free Dual Label Assignment and Loss Formulation

YOLOv10’s defining methodological advancement is its consistent dual label assignment, which enables NMS-free inference:

  • Dual Assignment Scheme: Training leverages both one-to-many (maximizing recall) and one-to-one (maximizing precision) matching metrics, typically using a task-aligned label assignment. During inference, only the one-to-one assignment is used, obviating post-processing NMS (Wang et al., 2024, Alif et al., 2024, Hussain, 2024, Geetha et al., 2024, Ahmed et al., 2024).
  • Matching Metric: For anchors $a_i$ and ground-truth $g_j$,

m(\alpha, \beta) = s \cdot p^\alpha \cdot \mathrm{IoU}(a_i, g_j)^\beta

with hyperparameters $(\alpha, \beta)$, $s$ a spatial prior, and $p$ the class probability (Wang et al., 2024).
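As a concrete sketch, the metric can be evaluated per anchor–ground-truth pair. The exponents below (α = 0.5, β = 6.0, common defaults in task-aligned assignment implementations) and the unit spatial prior are illustrative assumptions:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_metric(p, anchor, gt, s=1.0, alpha=0.5, beta=6.0):
    """m(alpha, beta) = s * p^alpha * IoU(a_i, g_j)^beta."""
    return s * (p ** alpha) * (iou(anchor, gt) ** beta)

# A confident, perfectly aligned prediction scores the maximum 1.0
m = match_metric(1.0, (0, 0, 10, 10), (0, 0, 10, 10))  # -> 1.0
```

Because both the one-to-many and one-to-one branches rank candidates with this same metric, the one-to-one branch converges toward the same best anchors the richer supervision prefers.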

  • Loss Function: The standard formulation combines localization (CIoU or DFL), classification (BCE or CE), and objectness penalties,

\mathcal{L} = \lambda_{\text{loc}}\,\mathcal{L}_{\text{loc}} + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}} + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}

with typical weights $\lambda_{\text{loc}} = 5.0$, $\lambda_{\text{obj}} = 1.0$, $\lambda_{\text{cls}} = 1.0$ (Choudhary et al., 2024, Alif et al., 2024). Distribution Focal Loss (DFL) is sometimes used to discretize box offsets (Choudhary et al., 2024).

  • CIoU Box Regression:

L_{\mathrm{box}} = 1 - \mathrm{CIoU}(b, b^\star) = 1 - \left( \mathrm{IoU} - \frac{\rho^2(b, b^\star)}{c^2} - \alpha v \right)

where $\rho^2$ is the squared distance between box centers, $c$ the diagonal of the smallest enclosing box, $v$ an aspect-ratio consistency term, and $\alpha$ a trade-off factor (Hung et al., 16 Sep 2025).
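A minimal, self-contained CIoU loss for axis-aligned boxes, following the formula above (the small epsilon guard is an added numerical-stability assumption):

```python
import math

def ciou_loss(b, b_star):
    """1 - CIoU between predicted box b and target b_star, (x1, y1, x2, y2)."""
    # IoU term
    ix1, iy1 = max(b[0], b_star[0]), max(b[1], b_star[1])
    ix2, iy2 = min(b[2], b_star[2]), min(b[3], b_star[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (b[2] - b[0]) * (b[3] - b[1])
    area_s = (b_star[2] - b_star[0]) * (b_star[3] - b_star[1])
    iou = inter / (area + area_s - inter)
    # squared center distance rho^2
    cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    sx, sy = (b_star[0] + b_star[2]) / 2, (b_star[1] + b_star[3]) / 2
    rho2 = (cx - sx) ** 2 + (cy - sy) ** 2
    # squared diagonal c^2 of the smallest enclosing box
    ex1, ey1 = min(b[0], b_star[0]), min(b[1], b_star[1])
    ex2, ey2 = max(b[2], b_star[2]), max(b[3], b_star[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency v and trade-off factor alpha
    w, h = b[2] - b[0], b[3] - b[1]
    ws, hs = b_star[2] - b_star[0], b_star[3] - b_star[1]
    v = (4 / math.pi ** 2) * (math.atan(ws / hs) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - (iou - rho2 / c2 - alpha * v)
```

For a perfect prediction all three penalty terms vanish and the loss is zero; any center offset or aspect-ratio mismatch raises it even when IoU is unchanged.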

3. Quantitative Performance and Ablation Analysis

YOLOv10 achieves state-of-the-art efficiency and competitive accuracy in benchmarks against prior YOLOs and real-time DETRs:

  • COCO test-dev (640×640):

| Model | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | FPS (GPU) | Latency (ms) |
|-----------|------------|-----------|---------|--------------|-----------|--------------|
| YOLOv10-S | 7.2 | 21.6 | 46.8 | -- | ~400 | 2.49 |
| YOLOv10-M | 15.4 | 59.1 | 51.3 | -- | ~211 | 4.74 |
| YOLOv10-L | 24.4 | 120.3 | 53.4 | -- | ~137 | 7.28 |
| YOLOv10-X | 29.5 | 160.4 | 54.4 | -- | ~93 | 10.70 |
| YOLOv8-s | 11.2 | 28.6 | 44.9 | -- | ~128 | 1.20 |
| YOLOv5s | 7.2 | 16.5 | 37.4 | -- | -- | 98 (CPU) |

(Wang et al., 2024, Hussain, 2024)

4. Domain Applications and Transfer Learning

YOLOv10 is applied across marine ecology, medical imaging, retail automation, agricultural robotics, and safety monitoring:

  • Marine Biodiversity: YOLOv10-n achieves mAP@50=0.966 (DeepFish), 2.7M params, 8.4 GFLOPs, 29.3 FPS on CPU, suitable for real-time edge deployment (Wuntu et al., 22 Sep 2025).
  • Medical Imaging: mAP@50-95 of 51.9% (GRAZPEDWRI-DX) for pediatric wrist fracture detection, surpassing YOLOv9-E by 8.6 pp. YOLOv10n detects blood cells at mAP@50 ≈ 0.99, 72 FPS (T4 GPU), 6.0M params (Choudhary et al., 2024, Ahmed et al., 2024).
  • Agriculture: YOLOv10n achieves mAP@50=0.921, 5.5ms inference (fruitlet detection), with specialized modules such as C2fCIB and C2PSA (Sapkota et al., 2024).
  • Retail Automation: MidState-YOLO-ED (YOLOv10 with YOLOv8 head) performs mAP@0.5=0.890, 110 FPS, ~3.3M params — effective for real-time product checkout (Tan et al., 2024).
  • Kitchen Safety: YOLOv10 provides >300 FPS at 640px and model size ~18MB, though underperforms on fine-grained hazards compared to YOLOv5/YOLOv8 in the tested dataset (Geetha et al., 2024).

Transfer learning with layer freezing allows accuracy–efficiency trade-off: freezing the backbone preserves general-purpose features and saves up to 28% GPU memory with minimal mAP drop; aggressive freezing only suits simple or low-variation tasks (Dobrzycki et al., 5 Sep 2025).
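A backbone-freezing step of this kind can be sketched in plain PyTorch. The `backbone`/`head` attribute names and the tiny stand-in model are illustrative; the actual parameter prefixes depend on the model definition:

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, prefix: str = "backbone") -> int:
    """Disable gradients for all parameters whose names start with
    `prefix`; returns the number of parameter tensors frozen."""
    n = 0
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad_(False)
            n += 1
    return n

class TinyDetector(nn.Module):           # stand-in for a YOLO-style model
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)  # pretrained, general-purpose features
        self.head = nn.Linear(8, 4)      # task-specific, keeps training

model = TinyDetector()
frozen = freeze_by_prefix(model)         # -> 2 (backbone weight + bias)
```

Frozen parameters need no gradient buffers or optimizer state, which is where the reported GPU-memory savings come from; the head continues to adapt to the target domain.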

5. Comparative Analysis versus Prior YOLOs and DETR Models

YOLOv10 delivers substantial improvements in model compactness and inference speed:

| Model | Params (M) | FLOPs (G) | mAP@0.5 | Inference (ms) | NMS-free |
|--------------|------------|-----------|---------|----------------|----------|
| YOLOv10-S | 7.2 | 21.6 | 46.8 | 2.49 | Yes |
| YOLOv9-C | 25.3 | 102 | 52.5 | 10.57 | No |
| YOLOv8-X | 68.0 | 258 | 53.9 | 16.86 | No |
| RT-DETR-R101 | 76.0 | 259 | 54.3 | 13.71 | Yes |

(Wang et al., 2024)

YOLOv10 maintains or exceeds mAP with substantially reduced parameters and FLOPs relative to YOLOv9 and YOLOv8. NMS-free inference is a major contributor to reduced latency in real-world deployments (Wang et al., 2024, Hussain, 2024).
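The deployment-side consequence of NMS-free inference can be sketched simply: with one-to-one assignment each object yields a single top prediction, so decoding reduces to confidence thresholding plus top-k selection instead of a pairwise IoU suppression loop. The threshold and detection cap below are illustrative defaults, not values from the paper:

```python
import numpy as np

def nms_free_decode(boxes: np.ndarray, scores: np.ndarray,
                    conf_thres: float = 0.25, max_det: int = 300):
    """Keep predictions above a confidence threshold, sorted by score and
    capped at max_det — no pairwise IoU suppression pass required."""
    keep = np.flatnonzero(scores >= conf_thres)
    keep = keep[np.argsort(-scores[keep])][:max_det]
    return boxes[keep], scores[keep]

boxes = np.array([[0, 0, 10, 10], [50, 50, 60, 60], [5, 5, 15, 15]], float)
scores = np.array([0.9, 0.8, 0.1])
kept_boxes, kept_scores = nms_free_decode(boxes, scores)  # drops the 0.1 box
```

This O(n log n) sort replaces the quadratic suppression step, and removes the NMS hyperparameters whose tuning otherwise affects deployed latency and recall.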

6. Strengths, Limitations, and Future Directions

Strengths:

  • End-to-end, real-time detection without NMS, with improved latency and throughput.
  • Compelling accuracy–efficiency trade-off in low-power and small-object domains (embedded AUVs, agriculture, clinical devices) (Hung et al., 16 Sep 2025, Sapkota et al., 2024).
  • Architectural advances enable fast deployment with minimal hardware and energy requirements.

Limitations:

  • In some domain tests (e.g., fine-grained kitchen hazards, fine localization in agriculture) YOLOv10 lags behind specialized heads or higher-capacity variants (Geetha et al., 2024, Sapkota et al., 2024).
  • The highest mAP margins are still achieved by YOLOv9-C and YOLOv11 in some scenarios (Sapkota et al., 2024).
  • Data scarcity and segmentation ambiguity can reduce recall/precision, especially without task-adaptive heads (Geetha et al., 2024).

Directions for Research:

  • Task-adaptive detection heads and data-efficient training to close remaining gaps on fine-grained and small-object categories (Geetha et al., 2024).
  • Principled layer-freezing and transfer-learning recipes for accuracy–efficiency trade-offs under resource constraints (Dobrzycki et al., 5 Sep 2025).
  • Extending NMS-free dual label assignment to multitask vision settings beyond detection.

7. Concluding Perspective

YOLOv10 synthesizes advances in architectural streamlining, attention augmentation, and NMS-free inference to set new benchmarks in speed and resource-constrained deployment, particularly where real-time, small-object performance is critical. Its innovations—spatial–channel decoupling, partial self-attention, large-kernel modules, and consistent dual assignment—enable practitioners to achieve competitive or superior detection accuracy at greatly reduced latency and computational cost. The model’s flexibility for transfer learning, ablation-informed customization, and cross-domain applicability make it a foundational object detector for the current real-time vision landscape (Wang et al., 2024, Hussain, 2024, Dobrzycki et al., 5 Sep 2025).

Major Reference Works:

  • Wang et al., “YOLOv10: Real-Time End-to-End Object Detection” (Wang et al., 2024)
  • Hasan et al., “YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain” (Alif et al., 2024)
  • Xu et al., “YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision” (Hussain, 2024)
  • Lodhi et al., “Pediatric Wrist Fracture Detection in X-rays via YOLOv10 Algorithm and Dual Label Assignment System” (Ahmed et al., 2024)
  • Fergus et al., “Towards Context-Rich Automated Biodiversity Assessments: Deriving AI-Powered Insights from Camera Trap Data” (Fergus et al., 2024)
