YOLO-11 Human Detector

Updated 27 January 2026
  • The paper presents YOLO-11 enhancements, incorporating C3k2 and C2PSA modules to significantly boost detection accuracy (up to 55% mAP) for human targets.
  • Efficient multi-scale feature fusion via SPPF and a decoupled head design enables robust inference and high FPS across model variants.
  • Optimized training with advanced augmentation techniques and tailored anchor assignment ensures reliable performance under occlusion and scale variations.

A YOLO-11-based human detector is a specialized instantiation of the YOLOv11 object detection architecture trained exclusively for identifying humans in images or video streams. Leveraging the architectural advancements of YOLOv11—including the C3k2 and C2PSA modules as well as an optimized loss and augmentation regimen—it delivers state-of-the-art real-time detection accuracy and efficiency for human-centric and surveillance applications across diverse computational platforms (Khanam et al., 2024, Jegham et al., 2024, Jiang et al., 20 Feb 2025).

1. Architectural Principles of YOLOv11 for Human Detection

YOLOv11 maintains the canonical YOLO three-stage detection pipeline: Backbone, Neck, and Head. Key innovations for enhanced accuracy and efficiency are as follows:

  • Backbone: Utilizes a Conv-BN-SiLU (CBS) stem and stacked C3k2 (Cross Stage Partial with kernel size 2) blocks, replacing earlier bottlenecks (C2f/CSP) to accelerate inference and improve gradient propagation (Khanam et al., 2024). The backbone extracts hierarchical features at strides 8, 16, and 32, preparing multi-scale feature maps.
  • SPPF Layer: The Spatial Pyramid Pooling – Fast (SPPF) module at the deepest layer aggregates contextual information by applying three sequential 5×5 max-pooling operations (effective receptive fields of 5, 9, and 13) and concatenating the results by channel.
  • C2PSA Module: Introduced immediately after SPPF and optionally at intermediate scales in the Neck, the C2PSA (Cross Stage Partial with Parallel Spatial Attention) block re-weights spatial features. It employs a CSP split followed by parallel 1×1 convolutions and softmax-based spatial attention (Khanam et al., 2024, Jiang et al., 20 Feb 2025).
  • Neck: Fuses multi-scale features via upsample/concat operations, with C3k2 for efficient channel mixing and C2PSA to enhance small/occluded human detection (Jegham et al., 2024).
  • Head: Provides three detection branches (small, medium, large) with C3k2/CBS processing, outputting per-spatial-cell bounding box coordinates (x, y, w, h), objectness score, and class logits.

This data flow can be summarized as: Input → CBS Stem → [C3k2]×n → SPPF → C2PSA → Neck(Upsample + Concat + C3k2) → Head(C3k2 + CBS) → Detect (Khanam et al., 2024).
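
As a quick check on the stride arithmetic above, a 640×640 input yields 80×80, 40×40, and 20×20 feature maps at strides 8, 16, and 32, one per detection branch:

```python
# Spatial sizes of the three multi-scale feature maps for a square
# input (a sketch of the stride arithmetic, not the actual network).
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    return {s: input_size // s for s in strides}

sizes = feature_map_sizes(640)
# stride 8 -> 80x80 (small humans), 16 -> 40x40, 32 -> 20x20
print(sizes)  # {8: 80, 16: 40, 32: 20}
```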

2. Key Modules and Mathematical Formulation

C3k2 Block

  • Function: Efficiently splits input channels, processes half via two 3×3 convolutions, merges with the untouched half by concatenation, and then applies a 1×1 convolution.
  • Pseudocode:

# X: input tensor
X1, X2 = split(X, axis=channel)
Y = Conv3x3(SiLU(BN(Conv3x3(SiLU(BN(X1))))))
Z = concat(Y, X2, axis=channel)
output = Conv1x1(SiLU(BN(Z)))

  • Mathematics:

\text{output} = \text{Conv1x1}(\text{SiLU}(\text{BN}(\text{Concat}(\text{Conv3x3}(\text{SiLU}(\text{BN}(\text{Conv3x3}(\text{SiLU}(\text{BN}(X_1)))))), X_2))))

(Khanam et al., 2024, Jegham et al., 2024).

SPPF Layer

  • Function: Enhances receptive field without computational bottleneck.
  • Mathematics:

\mathrm{SPPF}(X) = \mathrm{Concat}\left(X, P_5, P_9, P_{13}\right), \quad P_k = \mathrm{MaxPool}_{k\times k}(X)

(Khanam et al., 2024).
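
The pooling cascade can be sketched in NumPy. Note that SPPF realizes the 5/9/13 kernels as three sequential 5×5 max pools, whose intermediate outputs have effective receptive fields of 5, 9, and 13; the sketch below (function names are illustrative, not from any cited implementation) verifies the channel-concatenation shape:

```python
import numpy as np

def maxpool2d_same(x, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) array."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return win.max(axis=(-1, -2))

def sppf(x, k=5):
    """SPPF: three sequential k x k max pools; concatenating the
    intermediates is equivalent to parallel pools with effective
    kernels k, 2k-1, 3k-2 (5, 9, 13 for k=5)."""
    p1 = maxpool2d_same(x, k)
    p2 = maxpool2d_same(p1, k)
    p3 = maxpool2d_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=0)  # channel concat

x = np.random.rand(8, 20, 20)  # toy (C, H, W) feature map
y = sppf(x)
print(y.shape)  # (32, 20, 20): channels quadrupled, spatial size kept
```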

C2PSA Block

  • Purpose: Applies parallel spatial attention after CSP splitting, improving recall and precision for occluded or small humans.
  • Formulation:

A = \mathrm{softmax}(KQ^T), \quad \text{output} = X + \alpha \left(W_v \cdot (A \odot W_q X)\right)

where $Q, K, V$ are 1×1 convolutional projections, $\odot$ denotes element-wise multiplication, and $\alpha$ is a learnable scalar (Khanam et al., 2024, Jiang et al., 20 Feb 2025).
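
The formula above is terse about shapes; a minimal NumPy sketch under one reasonable reading (attention over spatial positions, 1×1 convolutions modeled as channel-mixing matrices, the products realized as matrix multiplications) illustrates the residual re-weighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

C, H, W = 8, 4, 4
N = H * W
X = rng.standard_normal((C, N))            # flattened (C, H*W) feature map
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
alpha = 0.5                                # learnable scalar in the real block

Q, K = Wq @ X, Wk @ X                      # 1x1 convs == per-pixel channel mixes
A = softmax(K.T @ Q, axis=0)               # (N, N) spatial attention map
out = X + alpha * (Wv @ (Q @ A))           # residual re-weighting, same shape as X

print(out.shape)  # (8, 16)
```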

3. Model Scaling, Variants, and Inference Trade-offs

YOLOv11 is released in five standard sizes to allow trade-offs between compute and accuracy:

| Variant | Parameters | FPS (A100, 640×640) | Typical mAP⟨50⟩ (COCO/person) |
|---------|------------|---------------------|-------------------------------|
| nano    | 1.6M       | 110                 | 45%                           |
| small   | 7M         | 80                  | 45%                           |
| medium  | 25M        | 45                  | 50%                           |
| large   | 50M        | 25                  | 52–53%                        |
| x-large | 120M       | 12                  | 54–55%                        |

For dedicated, real-time human detection, the medium variant (YOLOv11m) is the standard compromise: approximately 50% COCO mAP⟨50⟩ at ~45 FPS, precision ≈0.89, recall ≈0.83, and mAP50–95 ≈0.795 for medium/large persons (Khanam et al., 2024, Jegham et al., 2024).
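
Given the table, variant selection for a target throughput reduces to a lookup; the helper below (illustrative, using the approximate figures above) picks the most accurate variant that still meets an FPS budget:

```python
# Pick the most accurate YOLOv11 variant that meets an FPS budget,
# using the approximate throughput figures from the table (A100, 640x640).
VARIANTS = [  # (name, params, fps, approx mAP@50 on COCO/person)
    ("nano", "1.6M", 110, 0.45),
    ("small", "7M", 80, 0.45),
    ("medium", "25M", 45, 0.50),
    ("large", "50M", 25, 0.525),
    ("x-large", "120M", 12, 0.545),
]

def pick_variant(min_fps):
    eligible = [v for v in VARIANTS if v[2] >= min_fps]
    if not eligible:
        raise ValueError(f"no variant reaches {min_fps} FPS")
    return max(eligible, key=lambda v: v[3])[0]

print(pick_variant(30))  # 'medium': best accuracy at >= 30 FPS
```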

4. Loss Functions, Anchor Assignment, and Augmentation

  • Localization: Complete-IoU loss (CIoU):

L_{loc} = 1 - \mathrm{CIoU}(b, g)

with

\mathrm{CIoU} = \mathrm{IoU}(b, g) - \frac{\rho^2(b, g)}{c^2} - \alpha v

where $\rho$ is the distance between box centers, $c$ is the diagonal length of the smallest enclosing box, $v$ is the aspect-ratio penalty, and $\alpha$ is its trade-off weight (Khanam et al., 2024, Jiang et al., 20 Feb 2025).

  • Objectness and Class: Binary cross-entropy (BCE) for object probability and class label.
  • Total Loss:

L = \lambda_{loc} L_{loc} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}

with typical weights $\lambda_{loc} = 5.0$, $\lambda_{obj} = 1.0$, $\lambda_{cls} = 0.5$ (Khanam et al., 2024).
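
A minimal NumPy implementation of the CIoU localization term (following the standard CIoU definition, in which both the center-distance and aspect-ratio penalties are subtracted from the IoU) can look like this; box coordinates are (x1, y1, x2, y2):

```python
import numpy as np

def ciou(b, g):
    """Complete-IoU between two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    iou = inter / (area_b + area_g - inter)
    # squared center distance over squared enclosing-box diagonal
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    cg = ((g[0] + g[2]) / 2, (g[1] + g[3]) / 2)
    rho2 = (cb[0] - cg[0]) ** 2 + (cb[1] - cg[1]) ** 2
    cw = max(b[2], g[2]) - min(b[0], g[0])
    ch = max(b[3], g[3]) - min(b[1], g[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency penalty v and its trade-off weight alpha
    v = (4 / np.pi ** 2) * (np.arctan((g[2] - g[0]) / (g[3] - g[1]))
                            - np.arctan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return iou - rho2 / c2 - alpha * v

box, gt = (10, 10, 50, 90), (12, 8, 55, 95)
l_loc = 1 - ciou(box, gt)
# total = 5.0 * l_loc + 1.0 * l_obj + 0.5 * l_cls  (weights from the text)
```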

Anchor Assignment:

  • Default: Three anchors per scale, e.g., for 640×640: small ([10×14, 23×27, 37×58]), medium ([81×82, 135×169, 344×319]), large ([350×168, 322×319, 606×707]).
  • Optionally, human-specific anchors can be computed via k-means on (w,h) of person boxes; assignment is by maximum IoU with positive threshold at IoU>0.5 (Khanam et al., 2024, Jegham et al., 2024).
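
The human-specific anchor re-clustering mentioned above can be sketched with a plain k-means on (w, h) pairs; the toy data below stands in for person-box dimensions:

```python
import numpy as np

rng = np.random.default_rng(42)

def kmeans_anchors(wh, k=3, iters=50):
    """k-means on (w, h) box dimensions, as suggested for
    human-specific anchor re-clustering."""
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(wh[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[centers[:, 0].argsort()]

# toy person boxes: tall aspect ratios typical of standing humans
wh = np.abs(rng.normal([30, 80], [10, 25], size=(200, 2)))
anchors = kmeans_anchors(wh, k=3)
print(anchors.round(1))  # three (w, h) anchors, sorted by width
```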

Augmentation: Mosaic and MixUp are primary, with random scale, HSV, horizontal flip, random cutout, affine transforms, and cutmix recommended for improved robustness to occlusion and scale variation (Khanam et al., 2024, Jegham et al., 2024, Jiang et al., 20 Feb 2025).
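
A minimal Mosaic sketch (image tiling only; a real pipeline also remaps the box labels and jitters scale) illustrates the core idea of composing four training images into one canvas:

```python
import numpy as np

rng = np.random.default_rng(0)

def mosaic4(images, out_size=640):
    """Minimal Mosaic sketch: tile four images into one canvas around
    a random center point."""
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    cx, cy = rng.integers(out_size // 4, 3 * out_size // 4, size=2)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y1, y2, x1, x2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # naive nearest-neighbor resize via index sampling
        ys = np.arange(h) * img.shape[0] // h
        xs = np.arange(w) * img.shape[1] // w
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas

imgs = [rng.integers(0, 255, (100 + 20 * i, 120, 3), dtype=np.uint8)
        for i in range(4)]
m = mosaic4(imgs)
print(m.shape)  # (640, 640, 3)
```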

5. Training Regimen and Hyperparameters

  • Dataset: COCO 2017 filtered for "person" (~64k images), with potential augmentation from CrowdHuman or a custom set for dense, occluded scenes (Khanam et al., 2024).
  • Batch Size: 16 or 32 (per GPU, FP16), input image size 640×640 (scalable up to 800×800) (Jegham et al., 2024, Jiang et al., 20 Feb 2025).
  • Learning Rate and Schedule: AdamW optimizer with weight decay $5 \times 10^{-4}$, initial learning rate $0.01$ scaled by batch size, linear warmup (3 epochs), then cosine decay to $10^{-4}$.
  • Epochs: 100–200 as baseline, 300–500 for full convergence (early stopping if no mAP⟨50⟩ improvement over 30 epochs) (Jegham et al., 2024, Jiang et al., 20 Feb 2025).
  • Model Export: Trained weights exportable to ONNX, TensorRT, and other frameworks with batch normalization fusion and optional INT8 quantization for deployment acceleration (Jiang et al., 20 Feb 2025).
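
The warmup-plus-cosine schedule described above can be written directly; the defaults below are the values stated in the text (initial rate 0.01, 3 warmup epochs, decay to $10^{-4}$):

```python
import math

def lr_at(epoch, total=300, warmup=3, lr0=0.01, lr_min=1e-4):
    """Linear warmup for `warmup` epochs, then cosine decay from
    lr0 down to lr_min over the remaining epochs."""
    if epoch < warmup:
        return lr0 * (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t))

print(round(lr_at(2), 5))    # 0.01: full rate at end of warmup
print(round(lr_at(299), 5))  # ~0.0001 at the final epoch
```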

6. Performance, Ablations, and Best Practices

Quantitative evaluation on ODverse33's Security domain ("human"-style targets):

| Metric                  | YOLOv11 | YOLOv8 | YOLOv5 |
|-------------------------|---------|--------|--------|
| mAP@0.5:0.95 (mean)     | 0.5909  | 0.5849 | 0.5790 |
| Small object mAP50–95   | 0.5025  | 0.4613 | 0.4724 |
| Medium object mAP50–95  | 0.5306  | 0.5203 | 0.5113 |
| Large object mAP50–95   | 0.6373  | 0.6333 | 0.6334 |

C2PSA modules provide a ≈6% recall boost for small objects. C3k2 blocks yield ≈3% higher mAP⟨50⟩ on medium/large objects relative to YOLOv8. Decoupled head architecture further stabilizes the classification-localization balance (Jiang et al., 20 Feb 2025). INT8 quantization delivers a ≈2× speedup with <1% mAP⟨50⟩ loss (Jiang et al., 20 Feb 2025).

Best practices include multi-threaded prefetch, FP16/INT8 for inference, disabling unused detection heads, and continuous monitoring of mAP⟨50⟩ and mAP⟨50:95⟩ during training to maintain performance on small or occluded humans (Khanam et al., 2024, Jegham et al., 2024, Jiang et al., 20 Feb 2025).

7. Workflow for Implementation and Deployment

  1. Instantiate YOLOv11 via the Ultralytics package; select C3k2/C2PSA backbone with decoupled head (Khanam et al., 2024).
  2. Preprocess datasets to a standard 640×640 square aspect; augment with Mosaic, MixUp, color jitter (Khanam et al., 2024, Jiang et al., 20 Feb 2025).
  3. Conduct training with AdamW, batch=16–32, for up to 300 epochs or until early stopping (Jegham et al., 2024).
  4. Optionally perform anchor re-clustering on training set box dimensions for optimal recall of atypical human aspect ratios (Khanam et al., 2024, Jegham et al., 2024).
  5. Export weights to ONNX or TensorRT, enable FP16/INT8 precision, and remove non-detection heads, optimizing for edge or server deployment (Jiang et al., 20 Feb 2025).
  6. For real-time applications, aim for end-to-end latency <5 ms per frame for >200 FPS on suitable hardware (Jegham et al., 2024).

By following these architectural, training, and deployment guidelines, practitioners obtain a detector with robust accuracy (mAP@0.5:0.95 > 0.59; mAP@0.5 > 0.86 on human targets), low-latency real-time inference, and enhanced reliability under diverse environmental and occlusion scenarios (Jiang et al., 20 Feb 2025, Khanam et al., 2024, Jegham et al., 2024).
