Thermal Underground Human Detection
- Thermal UHD is the automated detection of human presence using thermal imaging in underground settings, crucial for safety and emergency response.
- Key research involves dedicated thermal video datasets and deep learning benchmarks that improve mAP and F1-scores, highlighting methodical evaluation.
- Embedded solutions with ultra-low-resolution sensors and compact CNNs offer cost-effective, privacy-preserving detection in resource-constrained subterranean scenarios.
Thermal Underground Human Detection (Thermal UHD) encompasses the automated detection of human presence and postures in underground environments using thermal imaging as the primary sensory input. This approach is highly relevant for safety-critical applications in subterranean mining, where unreliable visibility, hazardous conditions, and emergency response constraints necessitate robust, privacy-preserving, and resilient human detection systems. Two principal research directions have dominated recent literature: (1) comprehensive, high-quality thermal video datasets and benchmarks for deep learning–based detection (Addy et al., 26 Jun 2025), and (2) ultra-low-resolution, resource-constrained embedded solutions for cost-sensitive deployments (Vandersteegen et al., 2022).
1. Thermal UHD: Dataset Acquisition and Configuration
Recent advancements have been catalyzed by dedicated datasets such as the "Thermal UHD" corpus, collected in the Missouri S&T Experimental Mine with authentic mining conditions (Addy et al., 26 Jun 2025). This dataset comprises 7,049 thermal images, each resized to 640 × 640 pixels, acquired from continuous video using a Spot CAM+IR thermal camera (mounted on Boston Dynamics’ Spot robot) featuring PTZ and 30× optical zoom. The camera enabled effective thermal imaging across –40 °C to +550 °C, with a native output resolution of 720 × 575 pixels.
Three acquisition phases intentionally mirror the operational and emergency diversity in underground mines:
- Phase 1 (Emergency Simulation): Introduction of portable heat sources and artificial smoke to replicate equipment fires and low visibility, challenging thermal signatures and introducing occlusions.
- Phase 2 (Normal-Work): Miners performing representative manual tasks under nominal ambient mine conditions.
- Phase 3 (Rest): Miners stationary, either resting or seated, to simulate typical tunnel occupancy.
Environmental variability (temperature, humidity) was monitored, and scenario realism enhanced with PPE, representational postures (standing, bending, sitting, squatting, lying), and tool presence.
2. Data Annotation and Preprocessing Protocols
Annotation employed the makesense.ai platform, with bounding boxes in normalized (x₁,y₁,x₂,y₂) coordinates tightly enclosing thermal silhouettes. Each miner was independently annotated, including instances with partial or total overlap, while background objects such as machines and tools remained unlabelled. Inter-annotator consistency was not reported.
Frames were extracted via bespoke OpenCV/Python scripts and uniformly resized for model compatibility, with normalization (presumed 0–255 to 0–1 scaling) applied. The corpus did not feature explicit thermal-specific augmentations such as synthetic noise, occlusion, or dynamic contrast scaling.
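Under the stated conventions (uniform resize to 640 × 640, presumed 0–255 to 0–1 scaling), the preprocessing step can be sketched as follows. This is an illustrative reconstruction, not the authors' script: `preprocess_frame` is a hypothetical helper, and the nearest-neighbour resize stands in for whatever OpenCV interpolation the original pipeline used.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize a raw thermal frame to size x size (nearest-neighbour)
    and scale 8-bit intensities from [0, 255] to [0.0, 1.0]."""
    h, w = frame.shape[:2]
    # Nearest-neighbour index maps for the target grid.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols]
    return resized.astype(np.float32) / 255.0

# A dummy frame at the camera's native 720 x 575 output resolution.
raw = np.random.randint(0, 256, (575, 720), dtype=np.uint8)
out = preprocess_frame(raw)
```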
Class distributions across the 7,049 frames reveal notable imbalance:
| Posture | Proportion (%) |
|---|---|
| Standing | 44.6 |
| Bending | 24.2 |
| Lying | 17.4 |
| Sitting | 8.3 |
| Squatting | 5.5 |
No class- or scenario-balancing techniques were reported.
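Since no balancing was applied, one standard mitigation would be inverse-frequency class weighting derived from the reported proportions. The sketch below illustrates that technique under the paper's class statistics; it is not something the authors did, and the weighting scheme is a generic assumption.

```python
# Reported posture proportions (percent of the 7,049 frames).
proportions = {"standing": 44.6, "bending": 24.2, "lying": 17.4,
               "sitting": 8.3, "squatting": 5.5}

# Inverse-frequency weights: each class contributes equally in
# expectation when its loss is scaled by weights[k].
n_classes = len(proportions)
weights = {k: 100.0 / (n_classes * p) for k, p in proportions.items()}
```

Rare postures such as squatting receive proportionally larger weights than the dominant standing class, counteracting the imbalance during loss computation.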
3. Model Architectures, Training, and Evaluation Methodology
A suite of object detectors was benchmarked:
- YOLOv8, YOLOv10, YOLOv11: each evaluated in five standard variants (n, s, m, l, x) at input size 640 × 640 with default detection-head configurations.
- RT-DETR: Real-Time Detection Transformer, L and X variants.
The common detection loss incorporated standard YOLO objectness, classification, and bounding-box regression terms, with the Complete IoU (CIoU) loss defined as

$$\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v,$$

where $\rho(b, b^{gt})$ denotes the center-point distance between the predicted and ground-truth boxes, $c$ is the diagonal length of the minimal enclosing box, and $\alpha v$ penalizes aspect-ratio deviations, with $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$.
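The CIoU term can be made concrete with a minimal sketch; `ciou_loss` below is a hypothetical reference implementation of the standard CIoU formulation, not the benchmarked training code.

```python
import math

def ciou_loss(box_p, box_g):
    """Complete-IoU loss between two boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_p) + area(box_g) - inter)
    # Squared center distance over squared enclosing-box diagonal.
    cx_p, cy_p = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cx_g, cy_g = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio penalty and its trade-off weight.
    w_p, h_p = box_p[2] - box_p[0], box_p[3] - box_p[1]
    w_g, h_g = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

For perfectly overlapping boxes the loss is zero; for disjoint boxes the center-distance term keeps pushing predictions toward the target even when IoU is zero.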
Key hyperparameters included a fixed learning rate of 0.01, weight decay of 0.0005, and momentum of 0.937. Training proceeded for up to approximately 80 epochs (judging from the reported validation curves), with SGD for YOLOv11 and automatic optimizer selection for the other variants. The dataset was split 65%/35% into training (4,584 samples) and validation (2,465 samples) sets, with no held-out test partition.
Assessment adhered to IoU ≥ 0.5 for a correct detection (no higher-tier thresholds were employed), with evaluation metrics comprising mAP@0.5 (mAP50), Precision, Recall, and F1-score:
- $\mathrm{AP} = \int_0^1 p(r)\,dr$, with $p(r)$ the precision–recall curve, averaged over classes to give mAP50.
- $F1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
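The threshold-based counting behind these metrics can be sketched as follows; `detection_metrics` is an illustrative helper that assumes true/false-positive and false-negative counts have already been collected at the IoU ≥ 0.5 matching threshold.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from raw detection counts, where a
    prediction counts as a true positive when it matches a ground-truth
    box at IoU >= 0.5."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 80 matched detections, 20 spurious boxes, 20 missed miners.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=20)
```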
4. Performance Results and Analysis
Transfer learning produced substantial gains for the largest (x/X) variants, as shown below: mAP50 improved by roughly 16–24 percentage points (pp) and F1 by roughly 17–24 pp:
| Model | mAP50 (w/o TL) | mAP50 (w/ TL) | F1 (w/o TL) | F1 (w/ TL) |
|---|---|---|---|---|
| YOLOv8-x | 61.3% | 78.8% | 64.6% | 85.4% |
| YOLOv10-x | 59.4% | 75.8% | 63.4% | 85.1% |
| YOLOv11-x | 57.1% | 79.0% | 61.1% | 85.1% |
| RT-DETR-X | 60.7% | 84.8% | 66.0% | 82.6% |
Best overall performance (by model variant):
- YOLOv11-l: mAP50 80.2%
- YOLOv11-n: mAP50 80.1%, Precision 80.5%, F1 75.9%
- YOLOv8-n: Recall 73.2%
- RT-DETR: Trailed YOLO variants in mAP50.
Qualitative results highlighted reliable detection of common postures (standing, bending, squatting, lying); confusions occurred predominantly for sitting and bending, both of which were occasionally misclassified as standing or as background.
5. Challenges, Limitations, and Recommendations
Core obstacles identified:
- Class imbalance: dominance of "standing" samples (~45%) resulted in poorer recognition of rare postures (squatting and sitting in particular).
- Thermal-specific noise: Intrinsic to the modality, including low contrast for distal subjects, signature variability due to ambient and body temperature fluctuations, and reduced structural detail.
- Occlusions: Artificial smoke and dust caused detection confidence degradation and increased false negatives.
Recommended research directions include:
- Dataset enrichment: Acquisition of more samples for rare postures and increased environmental heterogeneity.
- Thermal-specific augmentation: Synthetically inject noise, random occlusions, and contrast manipulations to improve detector robustness.
- Backbone development: Explore thermal-aware architectures or domain adaptation strategies.
- Sensor fusion: Combine thermal with LiDAR or RGB signals (where available) to alleviate single-modality weaknesses.
This suggests the next phase in dataset and algorithm design should systematically target class balance and environment diversity, as well as architectural specialization for thermal inputs.
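The recommended thermal-specific augmentations could be prototyped as below. This is a hedged sketch of the general idea: `augment_thermal`, the noise level, the patch size, and the contrast range are all illustrative assumptions, not settings from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_thermal(img: np.ndarray) -> np.ndarray:
    """Thermal-specific augmentations: additive sensor noise, a random
    rectangular occlusion (a crude smoke/dust proxy) and a global
    contrast rescale. Input is a float image in [0, 1]."""
    out = img.astype(np.float32).copy()
    # 1. Additive Gaussian sensor noise.
    out += rng.normal(0.0, 0.02, out.shape).astype(np.float32)
    # 2. Random occlusion patch blended toward the background level.
    h, w = out.shape[:2]
    ph, pw = h // 4, w // 4
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out[y:y + ph, x:x + pw] = 0.5 * out[y:y + ph, x:x + pw] + 0.5 * out.mean()
    # 3. Random contrast scaling about the mean intensity.
    gain = rng.uniform(0.7, 1.3)
    out = (out - out.mean()) * gain + out.mean()
    return np.clip(out, 0.0, 1.0)

demo = augment_thermal(np.full((64, 64), 0.5, dtype=np.float32))
```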
6. Ultra-Low-Resolution Thermal UHD on Embedded MCUs
An alternate paradigm is ultra-low-resolution, low-cost embedded thermal UHD, exemplified by person detection using a 32 × 24-pixel Melexis MLX90640 imager and sub-$20 Cortex-M microcontrollers (Vandersteegen et al., 2022). For this resource-constrained setting, a custom compact CNN (depthwise separable, ReLU6, YOLOv2-style detection head) was pruned and quantized to sub-10k parameters (≈9.2 kB) for efficient memory utilization.
Preprocessing included temperature normalization, periodic exponential moving average background subtraction, optional frame differencing for motion, and median filtering for noise suppression. The model achieved F1-scores up to 91.62% (90° view, pruned+quantized), with inference times of 46 ms (STM32F746, ≈21 FPS) and peak SRAM usage below 31 kB.
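The periodic exponential-moving-average background subtraction described above admits a compact sketch; `EMABackground` and its update rate are illustrative assumptions rather than the authors' embedded implementation.

```python
import numpy as np

class EMABackground:
    """Exponential-moving-average background model with frame
    differencing: the foreground is the per-pixel deviation of the
    current frame from a slowly updated background estimate."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha   # background update rate (illustrative)
        self.bg = None       # running background estimate

    def step(self, frame: np.ndarray) -> np.ndarray:
        f = frame.astype(np.float32)
        if self.bg is None:
            self.bg = f.copy()
        fg = np.abs(f - self.bg)                      # foreground map
        self.bg = (1 - self.alpha) * self.bg + self.alpha * f
        return fg

model = EMABackground(alpha=0.5)
_ = model.step(np.zeros((4, 4)))   # cold, empty scene seeds the model
fg = model.step(np.ones((4, 4)))   # a warm body appears everywhere
```

A person entering the scene shows up as a high-magnitude region in `fg`, while slow ambient temperature drift is absorbed into the background estimate.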
Key recommendations for subterranean deployment involve recalibration for ambient temperature shifts, adaptive background modeling, relaxing objectness thresholds to maximize recall in occlusion-prone environments, and selecting operational thresholds based on application-specific false-alarm trade-offs.
7. Impact and Future Directions
Thermal UHD provides a high-fidelity benchmark for future deep learning models intended for underground safety and emergency response, with robust baseline performance metrics and deeply annotated, scenario-specific thermal imagery (Addy et al., 26 Jun 2025). Meanwhile, demonstrated feasibility of embedded systems using ultra-low-resolution sensors and compact CNN architectures yields privacy-preserving, cost-effective alternatives (Vandersteegen et al., 2022).
Future research priorities include class-balanced corpus expansion, integration of synthetic and real-world augmentation strategies, multi-modal sensor fusion, and adaptation of network architectures to the unique spectral and environmental properties of thermal underground imaging. These steps are required to close current performance gaps for under-represented scenarios and optimize real-world deployment under adverse, variable underground conditions.