
YOLOv10: Advanced Object Detection

Updated 14 February 2026
  • YOLOv10 is a one-stage detection framework that eliminates NMS with a dual-assignment strategy for faster and more accurate predictions.
  • It employs a CSP-style backbone with spatial–channel decoupled downsampling and large-kernel attention modules to enhance representation and small-object detection.
  • Optimized loss functions, robust data augmentation, and hardware-friendly quantization enable efficient deployment across embedded and high-throughput environments.

YOLOv10 is a one-stage convolutional object detection framework that advances the YOLO series in both architecture and optimization, targeting real-time inference without non-maximum suppression (NMS) and improved accuracy–efficiency trade-offs. The model adopts a CSP-style backbone with spatial–channel decoupled downsampling, large-kernel convolution, and partial self-attention, and restructures prediction by replacing NMS with a consistent dual-assignment training paradigm. Together with a suite of design optimizations, this positions YOLOv10 as state-of-the-art across edge, embedded, and high-throughput detection scenarios; it excels particularly in small-object and crowded-scene detection and has been applied in domains such as visual arts, precision agriculture, and marine biology (Wang et al., 2024, Hussain, 2024, Alif et al., 2024, Bruegger et al., 21 Jan 2025, Tariq et al., 14 Apr 2025, Wuntu et al., 22 Sep 2025).

1. Architectural Innovations and Design Principles

YOLOv10 introduces several targeted modifications over its predecessors (v8/v9), focusing on computational efficiency, scalable deployment, and end-to-end differentiability (Wang et al., 2024, Hussain, 2024, Alif et al., 2024, Wuntu et al., 22 Sep 2025).

1.1 NMS-Free Dual-Assignment Strategy

Training employs both one-to-many (O2M) and one-to-one (O2O) assignment heads. Each ground-truth box is matched densely to multiple anchors (O2M) and, in parallel, uniquely to its best anchor (O2O); both heads share the backbone/neck and are optimized jointly, but only the O2O head is used during inference, obviating the need for NMS and enabling true end-to-end detection (Wang et al., 2024, Hussain, 2024, Alif et al., 2024).
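The assignment logic can be sketched in a few lines (an illustrative simplification, not the released implementation; candidates here are ranked by plain IoU rather than the full matching metric defined in Section 2.1):

```python
import numpy as np

def iou_one_to_many(gt, cands):
    """IoU between one ground-truth box and N candidate boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(gt[0], cands[:, 0]); y1 = np.maximum(gt[1], cands[:, 1])
    x2 = np.minimum(gt[2], cands[:, 2]); y2 = np.minimum(gt[3], cands[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_c = (cands[:, 2] - cands[:, 0]) * (cands[:, 3] - cands[:, 1])
    return inter / (area_g + area_c - inter + 1e-9)

def dual_assign(gt, cands, k=3):
    """O2M: top-k candidates per GT (dense training gradients);
    O2O: the single best candidate (the only head used at inference)."""
    scores = iou_one_to_many(gt, cands)
    o2m = np.argsort(-scores)[:k]
    o2o = int(np.argmax(scores))
    return o2m, o2o
```

Because both heads rank candidates by the same score, the O2O pick is always contained in the O2M top-k, which is what lets the O2O head alone serve predictions without NMS at inference.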

1.2 Spatial–Channel Decoupled Downsampling

Downsampling is split into spatial and channel operations. The spatial branch applies depthwise strided convolution, while the channel branch applies a 1×1 conv followed by pooling. Their outputs are fused, resulting in a ~20% reduction in FLOPs and improved representation of spatial detail (Wang et al., 2024, Hussain, 2024, Alif et al., 2024).
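A back-of-envelope count of multiply-accumulates illustrates why decoupling helps. The shapes below are illustrative, and the resulting ~54% figure is for this single toy stage, not the ~20% network-wide reduction cited above:

```python
# Rough MAC counts for one downsampling stage (C -> 2C channels, stride 2).
H = W = 64; C = 64; k = 3

# Coupled baseline: one 3x3 strided conv changes channels and resolution at once.
coupled = (H // 2) * (W // 2) * (2 * C) * C * k * k

# Decoupled: a cheap 1x1 conv handles channels; a depthwise 3x3 strided conv
# handles the spatial reduction; their outputs are fused.
channel_branch = H * W * C * (2 * C)                     # pointwise 1x1, C -> 2C
spatial_branch = (H // 2) * (W // 2) * (2 * C) * k * k   # depthwise 3x3, stride 2
decoupled = channel_branch + spatial_branch

reduction = 1 - decoupled / coupled   # ~0.54 for this toy configuration
```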

1.3 Large-Kernel and Attention Modules

Large-kernel convolution (e.g., 7×7–21×21) replaces or augments standard convs in selected stages to expand effective receptive field at low additional cost, boosting performance for small, sparse, and overlapping objects (Hussain, 2024, Wang et al., 2024). Partial Self-Attention (PSA) modules, applied after the final backbone stage, further enhance context aggregation (+0.5–1 AP) with minimal latency increase (Wang et al., 2024, Hussain, 2024, Wuntu et al., 22 Sep 2025).

1.4 Backbone, Neck, and Head

The backbone is a pruned, bottlenecked CSPNet variant; the neck is a feature-pyramid variant (typically FPN/PANet with PSA or CR-Attention), and the detection head is fully decoupled: classification and regression/objectness branches are separate, each using lightweight stacks of 1×1 and large-kernel convs (Wuntu et al., 22 Sep 2025, Wang et al., 2024).

2. Label Assignment, Training, and Optimization

2.1. Consistent Dual Assignments

A unified matching metric

m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \mathrm{IoU}(\hat{b}, b)^{\beta}

is adopted for both O2M and O2O assignments, where s is the spatial prior (indicating whether the anchor point falls within the instance), p the predicted classification score, and IoU(b̂, b) the overlap between the predicted and ground-truth boxes. Using identical exponents α and β in both heads aligns candidate ranking and minimizes the supervision gap between them (Wang et al., 2024). Targets are distributed accordingly: O2O targets are strictly one-hot, while O2M yields denser gradients.
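A numpy sketch of the metric and of the consistency property (toy scores; s is treated as a binary in/out indicator, and the exponent values are illustrative, not the paper's tuned settings):

```python
import numpy as np

def match_metric(s, p, iou, alpha=0.5, beta=6.0):
    """Unified metric m = s * p**alpha * IoU**beta, shared by both heads."""
    return s * p**alpha * iou**beta

# Toy candidates: spatial prior s, class score p, IoU with the GT box.
s   = np.array([1.0, 1.0, 1.0, 0.0])    # 0 = anchor point outside the instance
p   = np.array([0.9, 0.6, 0.8, 0.95])
iou = np.array([0.85, 0.7, 0.6, 0.9])

m = match_metric(s, p, iou)
o2m = np.argsort(-m)[:2]    # one-to-many: top-k positives
o2o = int(np.argmax(m))     # one-to-one: unique positive
assert o2o in o2m           # same metric => consistent supervision across heads
```

Note how the shared metric zeroes out candidate 3 despite its high class score: the spatial prior s gates assignment identically for both heads, so their rankings cannot diverge.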

2.2. Anchor-Free/Hybrid Strategy

YOLOv10 typically initializes anchor priors via offline K-means, but since O2O assignment is anchor-agnostic, inference is entirely anchor-free, with box deltas predicted via sigmoid (center offsets) and exponential (width/height) transforms (Alif et al., 2024, Wang et al., 2024, Wuntu et al., 22 Sep 2025).
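A minimal decoding sketch under this parameterization (grid indices, stride, and prior sizes are illustrative; the exact transforms vary between releases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, grid_x, grid_y, stride, prior_w, prior_h):
    """Map raw head outputs to a (cx, cy, w, h) box in image pixels."""
    cx = (grid_x + sigmoid(tx)) * stride   # sigmoid keeps the center in its cell
    cy = (grid_y + sigmoid(ty)) * stride
    w = prior_w * np.exp(tw)               # exp keeps width/height positive
    h = prior_h * np.exp(th)
    return cx, cy, w, h
```

With all-zero logits the decoded center sits at the cell midpoint and the box takes the prior's size, which is why stable priors mainly help early training rather than inference.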

2.3. Loss Functions

The overall loss is a composite of three principal terms:

L = \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{dfl}} \mathcal{L}_{\text{DFL}}

where \mathcal{L}_{\text{cls}} is a binary cross-entropy classification loss, \mathcal{L}_{\text{box}} an IoU-based box-regression loss (typically CIoU), and \mathcal{L}_{\text{DFL}} the distribution focal loss over discretized box-edge offsets; the \lambda weights balance the three terms.
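A toy rendering of the composite and of the distribution focal loss (DFL) term, which spreads cross-entropy over the two integer bins bracketing a continuous regression target (the λ values below mirror common open-source defaults and are not taken from the cited papers):

```python
import numpy as np

def dfl(dist_logits, target):
    """DFL for one box edge: cross-entropy split between the two discrete
    bins that bracket the continuous target distance."""
    probs = np.exp(dist_logits) / np.exp(dist_logits).sum()
    lo = int(np.floor(target)); hi = lo + 1
    w_hi = target - lo
    return -((1 - w_hi) * np.log(probs[lo]) + w_hi * np.log(probs[hi]))

def total_loss(l_cls, l_box, l_dfl, lam_cls=0.5, lam_box=7.5, lam_dfl=1.5):
    """Weighted composite; the lambda defaults here are illustrative."""
    return lam_cls * l_cls + lam_box * l_box + lam_dfl * l_dfl
```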

2.4. Data Augmentation and Optimization

Training employs a “bag of freebies” common to modern detectors: 4-image mosaic, mixup, HSV/color jitter, random scaling/translation, and Copy-Paste augmentation. SGD with cosine annealing or AdamW, large batch sizes (16–64), 100–500 epochs, and transfer learning (COCO, DOTA, or domain-specific datasets) are standard (Hussain, 2024, Wang et al., 2024, Wuntu et al., 22 Sep 2025, Bruegger et al., 21 Jan 2025).
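The mosaic transform above can be reduced to a few lines (a stripped-down version with fixed quadrants; real pipelines also jitter the joint point, rescale each tile, and remap the box labels):

```python
import numpy as np

def mosaic4(imgs, size=64):
    """Paste crops of four equally sized images into the quadrants of one canvas."""
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    h = w = size // 2
    canvas[:h, :w] = imgs[0][:h, :w]   # top-left
    canvas[:h, w:] = imgs[1][:h, :w]   # top-right
    canvas[h:, :w] = imgs[2][:h, :w]   # bottom-left
    canvas[h:, w:] = imgs[3][:h, :w]   # bottom-right
    return canvas
```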

3. Quantitative Performance and Comparative Results

Model complexity, accuracy, and throughput comparisons against state-of-the-art detectors are summarized below.

| Model     | Params (M) | FLOPs (G) | mAP@[.5:.95]     | Latency (ms)† | FPS† | Dataset                               |
|-----------|------------|-----------|------------------|---------------|------|---------------------------------------|
| YOLOv10-N | 2.3        | 6.7       | 38.5             | 1.79          | 559  | COCO val (Hussain, 2024)              |
| YOLOv10-S | 7.2        | 21.6      | 46.3             | 2.39          | 418  | COCO val (Hussain, 2024)              |
| YOLOv10-N | 2.7        | 8.4       | 0.966 (mAP@.50)  | 32.4††        | 29†† | DeepFish (Wuntu et al., 22 Sep 2025)  |
| YOLOv8-S  | 11.2       | 28.6      | 44.9             | 1.20          | 833  | COCO val (Hussain, 2024)              |
| YOLOv9-S  | 7.1        | 26.4      | 46.7             | –             | –    | COCO val (Wang et al., 2024)          |

† A100, TensorRT unless otherwise noted. †† Intel i7-12700 (CPU), DeepFish data.

Key findings:

  • YOLOv10-S reaches near-identical accuracy to YOLOv9-S (46.3 vs 46.7 mAP) with fewer FLOPs (21.6 vs 26.4 G), and outperforms YOLOv8-S (44.9 mAP) with roughly 36% fewer parameters.
  • YOLOv10-N delivers 38.5 mAP at only 2.3 M parameters and 1.79 ms latency, suiting edge deployment.
  • On domain data (DeepFish), the nano variant sustains real-time CPU inference (29 FPS) with high mAP@.50.

4. Domain Applications and Specialized Pipelines

YOLOv10’s architectural efficiency and strong small-object recall enable adoption across specialized detection domains:

  • Marine Biology: On DeepFish and OpenImages V7-Fish, YOLOv10-nano achieves mAP@50 = 0.966, mAP@[.5:.95] = 0.606, operating at 2.7 M params and 8.4 GFLOPs, with 29 FPS CPU inference (Wuntu et al., 22 Sep 2025).
  • Cultural Heritage (Fine-grained Art): A large-image punch-detection pipeline uses sliding-window tiling and adjusts anchors and post-processing (IoM-NMS), reaching near-95% precision and 90% F1 at ultra-high resolution (Bruegger et al., 21 Jan 2025).
  • Agriculture: Precision crop and livestock monitoring benefit from small-object and crowded-scene capabilities, with model compression making field deployment feasible (Alif et al., 2024).
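The Intersection-over-Minimum criterion used in the punch-detection pipeline can be sketched as follows (a generic greedy implementation, not the authors' code). Dividing the intersection by the smaller box's area suppresses boxes nested inside larger ones, which plain IoU-NMS would keep:

```python
import numpy as np

def iom(a, b):
    """Intersection over the minimum of the two box areas (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(min(area_a, area_b), 1e-9)

def iom_nms(boxes, scores, thresh=0.7):
    """Greedy suppression by score; drop boxes whose IoM with a kept box is high."""
    order = np.argsort(-np.asarray(scores))
    keep = []
    for i in order:
        if all(iom(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(int(i))
    return keep
```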

5. Hardware Deployment, Quantization, and Efficiency

YOLOv10 is optimized for both edge and high-throughput environments (Tariq et al., 14 Apr 2025, Hussain, 2024):

  • CPU Inference: OpenVINO yields highest throughput; YOLOv10-nano achieves 60–75 FPS at 640×640 on contemporary x86 CPUs (Tariq et al., 14 Apr 2025).
  • GPU Inference: TensorRT execution achieves >100 FPS; ONNX-TensorRT (FP16) is a compromise for ease and throughput.
  • Quantization: The nano variant is quantizable to 8-bit integer with <1 mAP degradation; channel pruning further reduces the memory footprint (Hussain, 2024).
  • Flexible Post-Processing: For extremely large images, custom Intersection-over-Minimum NMS may be retained outside YOLOv10 in the post-pipeline if required by specific application formats (Bruegger et al., 21 Jan 2025).
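The 8-bit quantization mentioned above can be illustrated with a minimal symmetric per-tensor scheme (a generic sketch, not a specific toolkit's API; real INT8 pipelines also calibrate activations, not just weights):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Worst-case round-trip error is half a quantization step.
w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
assert err <= s / 2 + 1e-6
```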

6. Comparison with Prior and Successor YOLO Variants

  • YOLOv8: Introduced an anchor-free decoupled head, but incurs a higher computational load and still relies on NMS post-processing (Hussain, 2024, Wang et al., 2024).
  • YOLOv9: Added dynamic prediction assignment and GELAN backbone, but preserves NMS and higher memory use in some settings (Wang et al., 2024, Alif et al., 2024).
  • YOLOv10: Removes NMS, achieves better parameter and FLOP efficiency, lowers inference latency, and improves AP metrics by 0.5–2 points across scales (Alif et al., 2024, Hussain, 2024).

YOLOv10’s consistent dual assignment and architectural pruning (rank-guided block allocation) constitute its primary advances, making it applicable for both low-power edge and demanding high-throughput tasks.

7. Limitations and Observed Trade-Offs

  • Marginal Small-Object Recall Loss: Rank-guided block pruning trades ~0.5% small-object accuracy for significant FLOP reduction (Alif et al., 2024).
  • Training Complexity: Dual-head optimization increases training cost and code complexity.
  • Very Large Image/Tile Handling: Sliding-window tiling and custom NMS may be necessary for very large (multi-kilopixel) images, as the model is optimized for standard (≤1024 px) inputs (Bruegger et al., 21 Jan 2025).
  • Anchor Sensitivity: Although largely anchor-free at inference, initialization and training stability may vary with anchor settings in non-standard domains.

YOLOv10 thus represents a cohesive redesign of the YOLO paradigm, achieving simultaneous gains in efficiency, accuracy, and deployability, with broad utility from embedded edge devices to demanding research applications (Wang et al., 2024, Hussain, 2024, Alif et al., 2024, Tariq et al., 14 Apr 2025, Wuntu et al., 22 Sep 2025, Bruegger et al., 21 Jan 2025).
