FlyPose is a lightweight, real-time top-down pipeline for robust 2D human pose estimation from aerial viewpoints, designed for unmanned aerial vehicles (UAVs) operating in human-populated environments. Addressing the unique challenges of aerial perception—including low resolution, steep viewing angles, and self-occlusion—FlyPose achieves substantial improvements in both person detection and pose estimation compared to previous baselines, with efficient deployment suitable for edge computing on on-board UAV hardware. Complemented by the release of the FlyPose-104 benchmark dataset, FlyPose establishes a new standard for aerial pose estimation research and deployment scenarios (Farooq et al., 9 Jan 2026).
1. Pipeline Architecture
FlyPose implements a two-stage, top-down architecture tailored for aerial imagery:
- Input: Full-HD (1920×1080) RGB or thermal frames are captured by an on-board UAV gimbal camera.
- Stage 1—Person Detection: RT-DETRv2-S, a single-stage object detector with a ResNet-18 backbone pretrained on both COCO and Objects365, produces person bounding boxes $\{b_i\}$, each defined by $(c_x, c_y, w, h)$. The detection head is optimized using a Normalized Wasserstein Distance Loss (NWDL), with the objective:

$$\mathcal{L}_{\mathrm{NWD}} = 1 - \exp\!\left(-\frac{\sqrt{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}}{C}\right),$$

where $\mathcal{N}_p$ and $\mathcal{N}_g$ are 2-D Gaussians modeling the predicted and ground-truth boxes and $C$ is a normalizing constant,
yielding an inference time of approximately 13 ms per frame on a Jetson Orin AGX with TensorRT-FP32.
- Stage 2—Pose Estimation: Detected regions are re-scaled (long edge to 256 px, short edge to 192 px, zero-padded to preserve aspect ratio) and passed to ViTPose-S, a Vision Transformer with a heatmap head. The output comprises $K = 17$ heatmaps $H_k \in \mathbb{R}^{64 \times 48}$, each encoding a COCO keypoint. The pose head uses mean-squared error:

$$\mathcal{L}_{\mathrm{pose}} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert H_k - \hat{H}_k \right\rVert_2^2,$$
and inference requires approximately 6.5 ms per crop on the same device.
The combined end-to-end pipeline achieves an effective latency of 19.5–20 ms per frame, equivalent to 50 frames per second, leaving computational margin for downstream on-board analytics.
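The Stage 2 pre-processing step (long edge to 256 px, short edge to 192 px, zero-padded) can be sketched in a few lines. The helper names `nn_resize` and `letterbox_crop` are illustrative, and nearest-neighbour resampling stands in for whatever interpolation the actual pipeline uses:

```python
import numpy as np

def nn_resize(img, nh, nw):
    """Nearest-neighbour resize (keeps the sketch dependency-free)."""
    h, w = img.shape[:2]
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    return img[ys][:, xs]

def letterbox_crop(crop, out_h=256, out_w=192):
    """Fit a detected person crop into a 256x192 canvas: scale so the long
    edge reaches at most 256 px and the short edge at most 192 px, then
    zero-pad the remainder (assumes portrait-oriented person crops)."""
    h, w = crop.shape[:2]
    s = min(out_h / h, out_w / w)
    nh, nw = max(1, round(h * s)), max(1, round(w * s))
    resized = nn_resize(crop, nh, nw)
    canvas = np.zeros((out_h, out_w) + crop.shape[2:], dtype=crop.dtype)
    canvas[:nh, :nw] = resized
    return canvas
```

A production pipeline would typically use bilinear interpolation and keep the scale factor around to map predicted keypoints back into full-frame coordinates.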
2. Network Design
The two-stage design employs highly optimized network components:
- Person Detector (RT-DETRv2-S): Utilizes an 18-layer ResNet backbone generating feature maps at strides 4–32, with lightweight (6×6 layer) transformer encoder–decoder heads. Input resolution for training and evaluation is standardized to 1280 px on the shorter side. The Normalized Wasserstein Distance Loss replaces the generalized IoU loss for bounding box regression.
- Pose Estimator (ViTPose variants): Built upon a Vision Transformer backbone, the S variant uses 12 transformer layers with 12-head attention and a hidden dimension of 384; the patch size is 16×16, yielding 16×12 token maps for the 256×192 input crops, which are decoded into 64×48 heatmaps via deconvolutional upsampling. Additional variants (B/L/H) with deeper/larger architectures are evaluated for ablation. These choices enable real-time, high-recall pose inference while maintaining significant model compactness for on-board deployment.
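The shape arithmetic above (16×16 patches on 256×192 crops, deconvolutional upsampling to 64×48 heatmaps) can be verified directly; the helper name is ours, and the assumption of two stride-2 deconvolution stages is inferred from the 4× upsampling factor:

```python
def vit_geometry(in_h=256, in_w=192, patch=16, deconv_stages=2):
    """Token-grid and heatmap sizes for a ViT pose head whose decoder
    applies `deconv_stages` stride-2 deconvolutions to the token map."""
    th, tw = in_h // patch, in_w // patch   # 256/16 x 192/16 = 16 x 12 tokens
    up = 2 ** deconv_stages                 # two stride-2 deconvs -> 4x
    return (th, tw), (th * up, tw * up)     # heatmaps: 64 x 48

tokens, heatmap = vit_geometry()
```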
3. Training Regimen
FlyPose relies on multi-stage, multi-dataset training and targeted augmentations to maximize aerial robustness:
Person Detector
- Pretraining on COCO + Objects365.
- Fine-tuning: 60 epochs on VisDrone2019-DET (single "person" class).
- Multi-dataset expansion: 60 further epochs on a composite of eight aerial datasets (including SeaDronesSee, HERIDAL SAR, VTSAR, DroneRGBT, VTUAV-Det, HIT-UAV (train), Manipal-UAV, TinyPerson), totaling 66,849 training and 21,164 validation images.
- COCO re-introduction: 50 epochs to restore frontal-view detection performance.
- Loss modification: 50 final epochs substituting NWDL for GIoU.
Optimization uses AdamW with batch sizes of 16–32.
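The NWDL used in the final training stage can be sketched compactly. Following the standard Normalized Wasserstein Distance formulation, each box $(c_x, c_y, w, h)$ is modeled as a 2-D Gaussian with mean $(c_x, c_y)$ and covariance $\mathrm{diag}(w^2/4, h^2/4)$; the constant `c = 12.8` is the value commonly used for tiny-object detection and is an assumption here, as are the function names:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes,
    each modeled as a 2-D Gaussian N([cx, cy], diag(w^2/4, h^2/4)).
    The squared 2-Wasserstein distance between such Gaussians has a
    closed form; `c` is a dataset-dependent normalizing constant."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    w2_sq = ((xa - xb) ** 2 + (ya - yb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def nwd_loss(box_a, box_b, c=12.8):
    """Regression loss: 1 - NWD, zero for a perfect match."""
    return 1.0 - nwd(box_a, box_b, c)
```

Unlike IoU-based losses, this stays smooth and informative even when a tiny predicted box has no overlap with the ground truth, which is the common case for distant aerial persons.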
Pose Estimator
- Pretraining: COCO-Keypoints.
- Fine-tuning: On UAV-Human v1 train split (170–210 epochs).
- Augmentations: Half-body, rotation (30°), scale (30%), and newly introduced down-scaling (5–20%) to synthesize the appearance of small/distant subjects, crucial for aerial use.
- Optimization: AdamW with step learning-rate decay; batch size ≈64.
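The down-scaling augmentation can be sketched as follows, under one plausible reading: shrink the crop by 5–20% and resample it back to the original resolution, so the subject keeps its geometry but loses detail like a distant person. The function name is ours, and nearest-neighbour resampling keeps the sketch dependency-free:

```python
import numpy as np

def downscale_augment(crop, factor):
    """Simulate a smaller/more distant subject: shrink by `factor`
    (0.05-0.20 per the paper), then resample back to the input size."""
    h, w = crop.shape[:2]
    nh = max(1, int(h * (1 - factor)))
    nw = max(1, int(w * (1 - factor)))
    ys = np.arange(nh) * h // nh            # shrink ...
    xs = np.arange(nw) * w // nw
    small = crop[ys][:, xs]
    ys2 = np.arange(h) * nh // h            # ... and blow back up to (h, w)
    xs2 = np.arange(w) * nw // w
    return small[ys2][:, xs2]
```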
4. Empirical Evaluation
FlyPose demonstrates significant quantitative improvements on challenging aerial pose benchmarks:
| Model (Training Regime) | mAP (Test Sets Avg.) | AR (Avg.) | UAV-Human mAP | Jetson Latency (ms) |
|---|---|---|---|---|
| RT-DETRv2-S COCO only | 14.33 | 26.76 | — | 13 |
| + VisDrone only | 21.43 | 32.61 | — | — |
| + Multi-Dataset | 28.21 | 38.20 | — | — |
| + COCO re-introduced | 28.07 | 39.21 | — | — |
| + NWD Loss | 27.96 | 39.14 | — | — |

| ViTPose Variant | COCO mAP (pre) | UAV-Human mAP (finetuned) | A6000 Latency (ms) | Jetson Latency (ms) |
|---|---|---|---|---|
| AlphaPose (baseline) | — | 56.9 | — | — |
| ViTPose-S | 61.09 | 65.76 | 110.23 | 6.54 |
| ViTPose-H | 67.52 | 73.18 | 322.55 | n/a |
Relative to the AlphaPose baseline, FlyPose's ViTPose-H variant increases keypoint mAP by 16.3 points (73.18 vs. 56.9) on the UAV-Human dataset. Multi-dataset fine-tuning of RT-DETRv2-S yields a +6.8 mAP gain over VisDrone-only fine-tuning (21.43→28.21), demonstrating the importance of aerial-specific data in detector domain adaptation.
5. Real-Time UAV Deployment and System Integration
FlyPose is validated in operational conditions:
- End-to-end inference latency is ≈20 ms/frame (detection 13 ms + pose 6.5 ms + pre/postprocessing 0.5 ms).
- Deployment is on a quadrotor UAV carrying a Jetson Orin AGX Developer Kit (payload ≈4 kg), with gimbal-mounted camera. The system supports sustained 50 fps inference, with a one-time RTSP camera initialization delay of ≈300 ms.
- Implication: This design ensures adequate computational margin for on-board gesture or action recognition tasks and aligns with real-world, human-proximate UAV requirements (Farooq et al., 9 Jan 2026).
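The latency budget behind the computational-margin claim is simple arithmetic; the 30 fps camera rate used for the margin calculation is an assumption (the pipeline itself peaks at 50 fps):

```python
def latency_budget(det_ms=13.0, pose_ms=6.5, prepost_ms=0.5, camera_fps=30):
    """End-to-end per-frame cost and the slack left for downstream
    analytics when the camera delivers frames at `camera_fps`."""
    total_ms = det_ms + pose_ms + prepost_ms   # ~20 ms end-to-end
    peak_fps = 1000.0 / total_ms               # ~50 fps peak throughput
    margin_ms = 1000.0 / camera_fps - total_ms # per-frame slack vs. camera
    return total_ms, peak_fps, margin_ms
```

At a 30 fps camera rate this leaves roughly 13 ms per frame for gesture or action recognition; at the full 50 fps the pipeline saturates the budget.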
6. The FlyPose-104 Dataset
FlyPose introduces a novel, publicly available evaluation benchmark:
- Composition: 104 manually selected aerial images (from both FlyPose authors and public sources), containing 193 annotated person instances.
- Annotations: COCO-format bounding box, 17 keypoints per person, plus visibility flags.
- Characteristics: Frames span varied altitudes (5–50 m), top-down and steep camera angles, complex backgrounds (urban, snow, water, dirt), and feature heavy self-occlusion and small-scale persons, often with occluded facial landmarks.
- Partition: Test set only; leveraged for cross-dataset generalization studies and to highlight method limitations.
- Significance: FlyPose-104 exposes failure modes (notably on facial keypoints) and enables benchmarking under the most difficult aerial-keypoint scenarios (Farooq et al., 9 Jan 2026).
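The annotation scheme can be illustrated with a minimal COCO-style keypoint record. The field names follow the standard COCO keypoint convention; the coordinate values are invented for illustration:

```python
# One annotated person instance, COCO keypoint format.
ann = {
    "bbox": [412.0, 230.0, 38.0, 91.0],  # x, y, w, h
    "num_keypoints": 17,
    # 17 (x, y, v) triplets; v: 0 = not labelled,
    # 1 = labelled but occluded, 2 = labelled and visible
    "keypoints": [431, 238, 2, 0, 0, 0] + [0, 0, 0] * 15,
}

def visible_keypoints(ann):
    """Count keypoints with visibility flag v == 2."""
    kps = ann["keypoints"]
    return sum(1 for i in range(0, len(kps), 3) if kps[i + 2] == 2)
```

Counting the `v == 2` flags per instance is also how occlusion-heavy subsets (e.g. faces hidden by top-down angles) can be filtered out of FlyPose-104 for targeted evaluation.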
7. Context within Broader Landscape and Future Directions
FlyPose complements distributed, multi-view 3D pose systems such as AirPose, which fuses SMPL-X parameters across multiple UAVs by exchanging minimal viewpoint-independent latent representations (Saini et al., 2022). Unlike AirPose, which tackles 3D pose/shape fusion and cross-UAV calibration for markerless motion capture, FlyPose addresses 2D pose with stringent low-latency requirements in single-view, on-board settings.
A plausible implication is that FlyPose’s edge-efficiency and dataset resources can act as enablers for aerial multi-modal analytics, and its architectural strategies—multi-dataset aerial training, transformer-based pose heads, and loss function selection—will inform the design of higher-order aerial perception frameworks, including those seeking to extend single-view pipelines to 3D or multi-UAV systems.
Persistent challenges include domain adaptation to extreme aerial conditions, robustness to small/occluded instances, and integration with gesture/action recognition pipelines. The publication and analysis of failure cases via FlyPose-104 are specifically intended to catalyze methodological advances for robust top-down, small-scale aerial pose estimation (Farooq et al., 9 Jan 2026).