FlyPose is a lightweight, real-time top-down pipeline for robust 2D human pose estimation from aerial viewpoints, designed for unmanned aerial vehicles (UAVs) operating in human-populated environments. Addressing the unique challenges of aerial perception—including low resolution, steep viewing angles, and self-occlusion—FlyPose achieves substantial improvements in both person detection and pose estimation compared to previous baselines, with efficient deployment suitable for edge computing on on-board UAV hardware. Complemented by the release of the FlyPose-104 benchmark dataset, FlyPose establishes a new standard for aerial pose estimation research and deployment scenarios (Farooq et al., 9 Jan 2026).
1. Pipeline Architecture
FlyPose implements a two-stage, top-down architecture tailored for aerial imagery:
- Input: Full-HD (1920×1080) RGB or thermal frames are captured by an on-board UAV gimbal camera.
- Stage 1—Person Detection: RT-DETRv2-S, a single-stage object detector with a ResNet-18 backbone pretrained on both COCO and Objects365, produces person bounding boxes $\{b_i\}$, each defined by $(c_x, c_y, w, h)$. The detection head is optimized using a Normalized Wasserstein Distance Loss (NWDL), with the objective:

$$\mathcal{L}_{\mathrm{NWD}} = 1 - \exp\!\left(-\frac{\sqrt{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}}{C}\right),$$

where $\mathcal{N}_p$ and $\mathcal{N}_g$ are 2-D Gaussians modeling the predicted and ground-truth boxes and $C$ is a normalizing constant,
yielding an inference time of approximately 13 ms per frame on a Jetson Orin AGX with TensorRT-FP32.
- Stage 2—Pose Estimation: Detected regions are re-scaled (long edge to 256 px, short edge to 192 px, zero-padded to preserve aspect ratio) and passed to ViTPose-S, a Vision Transformer with a heatmap head. The output comprises $K = 17$ heatmaps $H_k \in \mathbb{R}^{64 \times 48}$, each encoding a COCO keypoint. The pose head uses mean-squared error:

$$\mathcal{L}_{\mathrm{pose}} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert H_k - \hat{H}_k \right\rVert_2^2,$$
and inference requires approximately 6.5 ms per crop on the same device.
The combined end-to-end pipeline achieves an effective latency of 19.5–20 ms per frame, equivalent to 50 frames per second, leaving computational margin for downstream on-board analytics.
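The Stage 2 pre-processing step (long edge to 256 px, short edge to 192 px, zero-padded) can be sketched in a few lines. The helper names `nn_resize` and `letterbox_crop` are illustrative, and nearest-neighbour resampling stands in for whatever interpolation the actual pipeline uses:

```python
import numpy as np

def nn_resize(img, nh, nw):
    """Nearest-neighbour resize (keeps the sketch dependency-free)."""
    h, w = img.shape[:2]
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    return img[ys][:, xs]

def letterbox_crop(crop, out_h=256, out_w=192):
    """Fit a detected person crop into a 256x192 canvas: scale so the long
    edge reaches at most 256 px and the short edge at most 192 px, then
    zero-pad the remainder (assumes portrait-oriented person crops)."""
    h, w = crop.shape[:2]
    s = min(out_h / h, out_w / w)
    nh, nw = max(1, round(h * s)), max(1, round(w * s))
    resized = nn_resize(crop, nh, nw)
    canvas = np.zeros((out_h, out_w) + crop.shape[2:], dtype=crop.dtype)
    canvas[:nh, :nw] = resized
    return canvas
```

A production pipeline would typically use bilinear interpolation and keep the scale factor around to map predicted keypoints back into full-frame coordinates.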
2. Network Design
The two-stage design employs highly optimized network components:
- Person Detector (RT-DETRv2-S): Utilizes an 18-layer ResNet backbone generating feature maps at strides 4–32, with lightweight (6×6 layer) transformer encoder–decoder heads. Input resolution for training and evaluation is standardized to 1280 px on the shorter side. The Normalized Wasserstein Distance Loss replaces the generalized IoU loss for bounding box regression.
- Pose Estimator (ViTPose variants): Built upon a Vision Transformer backbone, the S variant uses 12 transformer layers with 12-head attention and a hidden dimension of 384; the patch size is 16×16, yielding 16×12 token maps for the 256×192 input crops, which are decoded into 64×48 heatmaps via deconvolutional upsampling. Additional variants (B/L/H) with deeper/larger architectures are evaluated for ablation. These choices enable real-time, high-recall pose inference while maintaining significant model compactness for on-board deployment.
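The shape arithmetic above (16×16 patches on 256×192 crops, deconvolutional upsampling to 64×48 heatmaps) can be verified directly; the helper name is ours, and the assumption of two stride-2 deconvolution stages is inferred from the 4× upsampling factor:

```python
def vit_geometry(in_h=256, in_w=192, patch=16, deconv_stages=2):
    """Token-grid and heatmap sizes for a ViT pose head whose decoder
    applies `deconv_stages` stride-2 deconvolutions to the token map."""
    th, tw = in_h // patch, in_w // patch   # 256/16 x 192/16 = 16 x 12 tokens
    up = 2 ** deconv_stages                 # two stride-2 deconvs -> 4x
    return (th, tw), (th * up, tw * up)     # heatmaps: 64 x 48

tokens, heatmap = vit_geometry()
```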
3. Training Regimen
FlyPose relies on multi-stage, multi-dataset training and targeted augmentations to maximize aerial robustness:
Person Detector
- Pretraining on COCO + Objects365.
- Fine-tuning: 60 epochs on VisDrone2019-DET (single "person" class).
- Multi-dataset expansion: 60 further epochs on a composite of eight aerial datasets (including SeaDronesSee, HERIDAL SAR, VTSAR, DroneRGBT, VTUAV-Det, HIT-UAV (train), Manipal-UAV, TinyPerson), totaling 66,849 training and 21,164 validation images.
- COCO re-introduction: 50 epochs to restore frontal-view detection performance.
- Loss modification: 50 final epochs substituting NWDL for GIoU.
Optimization uses AdamW with batch sizes of 16–32.
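The NWDL used in the final training stage can be sketched compactly. Following the standard Normalized Wasserstein Distance formulation, each box $(c_x, c_y, w, h)$ is modeled as a 2-D Gaussian with mean $(c_x, c_y)$ and covariance $\mathrm{diag}(w^2/4, h^2/4)$; the constant `c = 12.8` is the value commonly used for tiny-object detection and is an assumption here, as are the function names:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes,
    each modeled as a 2-D Gaussian N([cx, cy], diag(w^2/4, h^2/4)).
    The squared 2-Wasserstein distance between such Gaussians has a
    closed form; `c` is a dataset-dependent normalizing constant."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    w2_sq = ((xa - xb) ** 2 + (ya - yb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def nwd_loss(box_a, box_b, c=12.8):
    """Regression loss: 1 - NWD, zero for a perfect match."""
    return 1.0 - nwd(box_a, box_b, c)
```

Unlike IoU-based losses, this stays smooth and informative even when a tiny predicted box has no overlap with the ground truth, which is the common case for distant aerial persons.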
Pose Estimator
- Pretraining: COCO-Keypoints.
- Fine-tuning: On UAV-Human v1 train split (170–210 epochs).
- Augmentations: Half-body, rotation (30°), scale (30%), and newly introduced down-scaling (5–20%) to synthesize the appearance of small/distant subjects, crucial for aerial use.
- Optimization: AdamW with step learning-rate decay; batch size ≈64.
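The down-scaling augmentation can be sketched as follows, under one plausible reading: shrink the crop by 5–20% and resample it back to the original resolution, so the subject keeps its geometry but loses detail like a distant person. The function name is ours, and nearest-neighbour resampling keeps the sketch dependency-free:

```python
import numpy as np

def downscale_augment(crop, factor):
    """Simulate a smaller/more distant subject: shrink by `factor`
    (0.05-0.20 per the paper), then resample back to the input size."""
    h, w = crop.shape[:2]
    nh = max(1, int(h * (1 - factor)))
    nw = max(1, int(w * (1 - factor)))
    ys = np.arange(nh) * h // nh            # shrink ...
    xs = np.arange(nw) * w // nw
    small = crop[ys][:, xs]
    ys2 = np.arange(h) * nh // h            # ... and blow back up to (h, w)
    xs2 = np.arange(w) * nw // w
    return small[ys2][:, xs2]
```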
4. Empirical Evaluation
FlyPose demonstrates significant quantitative improvements on challenging aerial pose benchmarks:
| Model (Training Regime) | mAP (Test Sets Avg.) | AR (Avg.) | UAV-Human mAP | Jetson Latency (ms) |
|---|---|---|---|---|
| RT-DETRv2-S COCO only | 14.33 | 26.76 | — | 13 |
| + VisDrone only | 21.43 | 32.61 | — | — |
| + Multi-Dataset | 28.21 | 38.20 | — | — |
| + COCO re-introduced | 28.07 | 39.21 | — | — |
| + NWD Loss | 27.96 | 39.14 | — | — |

| ViTPose Variant | COCO mAP (pre) | UAV-Human mAP (finetuned) | A6000 Latency (ms) | Jetson Latency (ms) |
|---|---|---|---|---|
| AlphaPose (baseline) | — | 56.9 | — | — |
| ViTPose-S | 61.09 | 65.76 | 110.23 | 6.54 |
| ViTPose-H | 67.52 | 73.18 | 322.55 | n/a |
Relative to the AlphaPose baseline, FlyPose's ViTPose-H variant increases keypoint mAP by 16.3 points (73.18 vs. 56.9) on the UAV-Human dataset. Multi-dataset fine-tuning of RT-DETRv2-S yields a +6.8 mAP gain over VisDrone-only fine-tuning (21.43→28.21), demonstrating the importance of aerial-specific data in detector domain adaptation.
5. Real-Time UAV Deployment and System Integration
FlyPose is validated in operational conditions:
- End-to-end inference latency is ≈20 ms/frame (detection 13 ms + pose 6.5 ms + pre/postprocessing 0.5 ms).
- Deployment is on a quadrotor UAV carrying a Jetson Orin AGX Developer Kit (payload ≈4 kg), with gimbal-mounted camera. The system supports sustained 50 fps inference, with a one-time RTSP camera initialization delay of ≈300 ms.
- Implication: This design ensures adequate computational margin for on-board gesture or action recognition tasks and aligns with real-world, human-proximate UAV requirements (Farooq et al., 9 Jan 2026).
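The latency budget behind the computational-margin claim is simple arithmetic; the 30 fps camera rate used for the margin calculation is an assumption (the pipeline itself peaks at 50 fps):

```python
def latency_budget(det_ms=13.0, pose_ms=6.5, prepost_ms=0.5, camera_fps=30):
    """End-to-end per-frame cost and the slack left for downstream
    analytics when the camera delivers frames at `camera_fps`."""
    total_ms = det_ms + pose_ms + prepost_ms   # ~20 ms end-to-end
    peak_fps = 1000.0 / total_ms               # ~50 fps peak throughput
    margin_ms = 1000.0 / camera_fps - total_ms # per-frame slack vs. camera
    return total_ms, peak_fps, margin_ms
```

At a 30 fps camera rate this leaves roughly 13 ms per frame for gesture or action recognition; at the full 50 fps the pipeline saturates the budget.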
6. The FlyPose-104 Dataset
FlyPose introduces a novel, publicly available evaluation benchmark:
- Composition: 104 manually selected aerial images (from both FlyPose authors and public sources), containing 193 annotated person instances.
- Annotations: COCO-format bounding box, 17 keypoints per person, plus visibility flags.
- Characteristics: Frames span varied altitudes (5–50 m), top-down and steep camera angles, complex backgrounds (urban, snow, water, dirt), and feature heavy self-occlusion and small-scale persons, often with occluded facial landmarks.
- Partition: Test set only; leveraged for cross-dataset generalization studies and to highlight method limitations.
- Significance: FlyPose-104 exposes failure modes (notably on facial keypoints) and enables benchmarking under the most difficult aerial-keypoint scenarios (Farooq et al., 9 Jan 2026).
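The annotation scheme can be illustrated with a minimal COCO-style keypoint record. The field names follow the standard COCO keypoint convention; the coordinate values are invented for illustration:

```python
# One annotated person instance, COCO keypoint format.
ann = {
    "bbox": [412.0, 230.0, 38.0, 91.0],  # x, y, w, h
    "num_keypoints": 17,
    # 17 (x, y, v) triplets; v: 0 = not labelled,
    # 1 = labelled but occluded, 2 = labelled and visible
    "keypoints": [431, 238, 2, 0, 0, 0] + [0, 0, 0] * 15,
}

def visible_keypoints(ann):
    """Count keypoints with visibility flag v == 2."""
    kps = ann["keypoints"]
    return sum(1 for i in range(0, len(kps), 3) if kps[i + 2] == 2)
```

Counting the `v == 2` flags per instance is also how occlusion-heavy subsets (e.g. faces hidden by top-down angles) can be filtered out of FlyPose-104 for targeted evaluation.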
7. Context within Broader Landscape and Future Directions
FlyPose complements distributed, multi-view 3D pose systems such as AirPose, which fuses SMPL-X parameters across multiple UAVs by exchanging minimal viewpoint-independent latent representations (Saini et al., 2022). Unlike AirPose, which tackles 3D pose/shape fusion and cross-UAV calibration for markerless motion capture, FlyPose addresses 2D pose with stringent low-latency requirements in single-view, on-board settings.
A plausible implication is that FlyPose’s edge-efficiency and dataset resources can act as enablers for aerial multi-modal analytics, and its architectural strategies—multi-dataset aerial training, transformer-based pose heads, and loss function selection—will inform the design of higher-order aerial perception frameworks, including those seeking to extend single-view pipelines to 3D or multi-UAV systems.
Persistent challenges include domain adaptation to extreme aerial conditions, robustness to small/occluded instances, and integration with gesture/action recognition pipelines. The publication and analysis of failure cases via FlyPose-104 are specifically intended to catalyze methodological advances for robust top-down, small-scale aerial pose estimation (Farooq et al., 9 Jan 2026).