Jetson Orin AGX Developer Kit

Updated 16 January 2026
  • Jetson Orin AGX Developer Kit is a high-performance embedded system designed for real-time AI on edge devices, ideal for UAV analytics.
  • It supports advanced pipelines like FlyPose, delivering low inference latency (≈20 ms per frame) for tasks such as person detection and 2D pose estimation.
  • Optimized with frameworks like TensorRT-FP32, the kit enables efficient deployment of complex neural networks for reliable aerial perception in diverse environments.

FlyPose is a lightweight, real-time top-down pipeline for robust 2D human pose estimation from aerial viewpoints, designed for unmanned aerial vehicles (UAVs) operating in human-populated environments. Addressing the unique challenges of aerial perception—including low resolution, steep viewing angles, and self-occlusion—FlyPose achieves substantial improvements in both person detection and pose estimation compared to previous baselines, with efficient deployment suitable for edge computing on on-board UAV hardware. Complemented by the release of the FlyPose-104 benchmark dataset, FlyPose establishes a new standard for aerial pose estimation research and deployment scenarios (Farooq et al., 9 Jan 2026).

1. Pipeline Architecture

FlyPose implements a two-stage, top-down architecture tailored for aerial imagery:

  • Input: Full-HD (1920×1080) RGB or thermal frames are captured by an on-board UAV gimbal camera.
  • Stage 1—Person Detection: RT-DETRv2-S, a single-stage object detector with a ResNet-18 backbone pretrained on both COCO and Objects365, produces $N$ person bounding boxes $B = \{b_1, \dots, b_N\}$, each defined by $(x, y, w, h)$. The detection head is optimized using a Normalized Wasserstein Distance Loss (NWDL), with the objective:

$$\mathcal{L}_{\text{NWD}} = 1 - \exp\left(-\frac{W(b, b^*)}{\sigma^2}\right)$$

yielding an inference time of approximately 13 ms per frame on a Jetson Orin AGX with TensorRT-FP32.
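The NWDL above can be sketched in a few lines. Here each box is modeled as a 2D Gaussian (mean at the box center, diagonal standard deviation from the half-extents), so the second-order Wasserstein distance reduces to a Euclidean distance in $(c_x, c_y, w/2, h/2)$ space; this Gaussian modeling and the default $\sigma$ are illustrative assumptions, not necessarily the paper's exact implementation:

```python
import math

def box_to_gaussian(box):
    """Model an axis-aligned box (x, y, w, h) as a 2D Gaussian:
    mean at the box center, diagonal std of (w/2, h/2)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0, w / 2.0, h / 2.0)

def nwd_loss(pred, target, sigma=1.0):
    """L_NWD = 1 - exp(-W(b, b*) / sigma^2), where W is the second-order
    Wasserstein distance between the two box Gaussians."""
    ga = box_to_gaussian(pred)
    gb = box_to_gaussian(target)
    w = math.sqrt(sum((a - b) ** 2 for a, b in zip(ga, gb)))
    return 1.0 - math.exp(-w / sigma ** 2)
```

Unlike IoU-based losses, this stays informative even when small boxes barely overlap, which is why it suits tiny aerial persons.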

  • Stage 2—Pose Estimation: Detected regions are re-scaled (long edge to 256 px, short edge to 192 px, zero-padded to preserve aspect ratio) and passed to ViTPose-S, a Vision Transformer with a heatmap head. The output comprises $K = 17$ heatmaps $\hat{H}_k \in \mathbb{R}^{256 \times 192}$, each encoding one COCO keypoint. The pose head is trained with mean-squared error:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{K} \sum_k \| H_k - \hat{H}_k \|_2^2$$

and inference requires approximately 6.5 ms per crop on the same device.

The combined end-to-end pipeline achieves an effective latency of 19.5–20 ms per frame, equivalent to 50 frames per second, leaving computational margin for downstream on-board analytics.
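Stage 2's heatmaps are typically converted to keypoint coordinates with a per-channel argmax, as in standard top-down pose pipelines; a minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode K keypoint heatmaps of shape (K, H, W) into (x, y, score)
    triples by taking each channel's argmax."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)
    scores = flat.max(axis=1)
    ys, xs = np.unravel_index(idx, (H, W))
    return np.stack([xs, ys, scores], axis=1)  # (K, 3)
```

In practice the decoded coordinates are then mapped back from the 256×192 crop frame to full-image coordinates using the crop's scale and offset.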

2. Network Design

The two-stage design employs highly optimized network components:

  • Person Detector (RT-DETRv2-S): Utilizes an 18-layer ResNet backbone generating feature maps at strides 4–32, with lightweight (6×6-layer) transformer encoder–decoder heads. Input resolution for training and evaluation is standardized to 1280 px on the shorter side. The Normalized Wasserstein Distance Loss replaces the generalized IoU loss for bounding box regression.
  • Pose Estimator (ViTPose variants): Built upon a Vision Transformer backbone, the S variant uses 12 transformer layers with 12-head attention and a hidden dimension of 384; patch size is 16×16, yielding token maps for subsequent heatmap decoding via deconvolutional upsampling (64×48→256×192). Additional variants (B/L/H) with deeper/larger architectures are evaluated for ablation. These choices enable real-time, high-recall pose inference while maintaining significant model compactness for on-board deployment.
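The crop preprocessing described above (long edge to 256 px, short edge to 192 px, zero-padded to preserve aspect ratio) can be sketched as follows; the nearest-neighbour resize and corner-anchored padding are simplifications for illustration, not necessarily the paper's exact choices:

```python
import numpy as np

def letterbox_crop(crop, out_h=256, out_w=192):
    """Resize a person crop to fit inside out_h x out_w while preserving
    aspect ratio, then zero-pad the remainder (nearest-neighbour resize)."""
    h, w = crop.shape[:2]
    scale = min(out_h / h, out_w / w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    ys = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    xs = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = crop[ys][:, xs]
    out = np.zeros((out_h, out_w) + crop.shape[2:], dtype=crop.dtype)
    out[:new_h, :new_w] = resized
    return out
```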

3. Training Regimen

FlyPose relies on multi-stage, multi-dataset training and targeted augmentations to maximize aerial robustness:

Person Detector

  1. Pretraining on COCO + Objects365.
  2. Fine-tuning: 60 epochs on VisDrone2019-DET (single "person" class).
  3. Multi-dataset expansion: 60 further epochs on a composite of eight aerial datasets (including SeaDronesSee, HERIDAL SAR, VTSAR, DroneRGBT, VTUAV-Det, HIT-UAV-train, Manipal-UAV, TinyPerson), totaling 66,849 training and 21,164 validation images.
  4. COCO re-introduction: 50 epochs to restore frontal-view detection performance.
  5. Loss modification: 50 final epochs substituting NWDL for GIoU.

Optimization uses AdamW (learning rate ≈1e-4, weight decay ≈1e-4, batch sizes 16–32).
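The detector's five-stage curriculum can be summarized as a plain-data schedule (a sketch grounded in the list above; pretraining epochs and the stage-5 dataset are not specified in the source, so they are marked as unspecified):

```python
# Staged fine-tuning schedule for the RT-DETRv2-S detector.
# epochs=None means the source does not state a value.
DETECTOR_SCHEDULE = [
    {"stage": "pretrain",          "data": "COCO + Objects365",         "epochs": None, "box_loss": "GIoU"},
    {"stage": "visdrone_finetune", "data": "VisDrone2019-DET (person)", "epochs": 60,   "box_loss": "GIoU"},
    {"stage": "multi_dataset",     "data": "8 aerial datasets",         "epochs": 60,   "box_loss": "GIoU"},
    {"stage": "coco_reintro",      "data": "COCO",                      "epochs": 50,   "box_loss": "GIoU"},
    {"stage": "nwd_substitution",  "data": "unspecified",               "epochs": 50,   "box_loss": "NWD"},
]
```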

Pose Estimator

  • Pretraining: COCO-Keypoints.
  • Fine-tuning: On UAV-Human v1 train split (170–210 epochs).
  • Augmentations: Half-body, rotation (±30°), scale (±30%), and newly introduced down-scaling (5–20%) to synthesize the appearance of small/distant subjects, crucial for aerial use.
  • Optimization: AdamW (lr 5e-4 with step decay, batch size ≈64).
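The down-scaling augmentation can be sketched as shrinking the crop by a random 5–20% and zero-padding back to the original size; the centered placement and nearest-neighbour sampling here are illustrative choices, not necessarily the paper's exact implementation:

```python
import numpy as np

def downscale_augment(crop, min_frac=0.05, max_frac=0.20, rng=None):
    """Shrink a person crop by a random 5-20% and zero-pad back to the
    original size, mimicking smaller/more distant aerial subjects."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = crop.shape[:2]
    keep = 1.0 - rng.uniform(min_frac, max_frac)
    new_h = max(1, int(h * keep))
    new_w = max(1, int(w * keep))
    ys = np.linspace(0, h - 1, new_h).astype(int)
    xs = np.linspace(0, w - 1, new_w).astype(int)
    small = crop[ys][:, xs]                # nearest-neighbour shrink
    out = np.zeros_like(crop)
    oy, ox = (h - new_h) // 2, (w - new_w) // 2
    out[oy:oy + new_h, ox:ox + new_w] = small  # center in zero canvas
    return out
```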

4. Empirical Evaluation

FlyPose demonstrates significant quantitative improvements on challenging aerial pose benchmarks:

Person detector ablation (RT-DETRv2-S):

| Training Regime | mAP (Test Sets Avg.) | AR (Avg.) | Jetson Latency (ms) |
|---|---|---|---|
| COCO only | 14.33 | 26.76 | 13 (det) |
| + VisDrone only | 21.43 | 32.61 | — |
| + Multi-Dataset | 28.21 | 38.20 | — |
| + COCO re-introduced | 28.07 | 39.21 | — |
| + NWD Loss | 27.96 | 39.14 | — |

Pose estimator variants:

| Model | COCO mAP (pre) | UAV-Human mAP (finetuned) | A6000 Latency (ms) | Jetson Latency (ms) |
|---|---|---|---|---|
| AlphaPose (baseline) | — | 56.9 | — | — |
| ViTPose-S | 61.09 | 65.76 | 110.23 | 6.54 |
| ViTPose-H | 67.52 | 73.18 | 322.55 | n/a |

Relative to the AlphaPose baseline, FlyPose increases keypoint mAP by 16.3 points (73.18 vs. 56.9) on the UAV-Human dataset. Multi-dataset fine-tuning of RT-DETRv2-S yields a +6.8 mAP gain (COCO→aerial), demonstrating the importance of aerial-specific data in detector domain adaptation.

5. Real-Time UAV Deployment and System Integration

FlyPose is validated in operational conditions:

  • End-to-end inference latency is ≈20 ms/frame (detection 13 ms + pose 6.5 ms + pre/postprocessing 0.5 ms).
  • Deployment is on a quadrotor UAV carrying a Jetson Orin AGX Developer Kit (payload ≈4 kg), with gimbal-mounted camera. The system supports sustained 50 fps inference, with a one-time RTSP camera initialization delay of ≈300 ms.
  • Implication: This design ensures adequate computational margin for on-board gesture or action recognition tasks and aligns with real-world, human-proximate UAV requirements (Farooq et al., 9 Jan 2026).
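The stated latency budget can be checked with simple arithmetic; note that Stage 2 cost is per crop, so the headline 20 ms figure corresponds to roughly one detected person per frame:

```python
def pipeline_latency(det_ms=13.0, pose_ms_per_crop=6.5,
                     n_people=1, prepost_ms=0.5):
    """Per-frame latency (ms) and implied throughput (fps) for the
    two-stage pipeline, with pose cost scaling in the number of crops."""
    total = det_ms + pose_ms_per_crop * n_people + prepost_ms
    return total, 1000.0 / total

total, fps = pipeline_latency()
# -> 20.0 ms per frame, 50.0 fps
```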

6. The FlyPose-104 Dataset

FlyPose introduces a novel, publicly available evaluation benchmark:

  • Composition: 104 manually selected aerial images (from both FlyPose authors and public sources), containing 193 annotated person instances.
  • Annotations: COCO-format bounding box, 17 keypoints per person, plus visibility flags.
  • Characteristics: Frames span varied altitudes (5–50 m), top-down and steep camera angles, complex backgrounds (urban, snow, water, dirt), and feature heavy self-occlusion and small-scale persons, often with occluded facial landmarks.
  • Partition: Test set only; leveraged for cross-dataset generalization studies and to highlight method limitations.
  • Significance: FlyPose-104 exposes failure modes (notably on facial keypoints) and enables benchmarking under the most difficult aerial-keypoint scenarios (Farooq et al., 9 Jan 2026).
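A FlyPose-104 person instance in COCO keypoint format looks like the following; all numeric values are hypothetical, and the occluded face flags mirror the dataset's noted difficulty with facial landmarks (v = 0 not labeled, 1 labeled but occluded, 2 labeled and visible):

```python
# One person instance in COCO keypoint format (hypothetical values).
# Keypoints are flattened (x, y, v) triples in COCO order; the first
# five channels (nose, eyes, ears) are marked occluded here.
keypoints = []
for i in range(17):
    keypoints += [430.0 + 2 * i, 270.0 + 4 * i, 1 if i < 5 else 2]

annotation = {
    "image_id": 7,
    "category_id": 1,                    # "person"
    "bbox": [412.0, 260.0, 48.0, 96.0], # x, y, w, h in pixels
    "num_keypoints": 17,
    "keypoints": keypoints,
}
```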

7. Context within Broader Landscape and Future Directions

FlyPose complements distributed, multi-view 3D pose systems such as AirPose, which fuses SMPL-X parameters across multiple UAVs by exchanging minimal viewpoint-independent latent representations (Saini et al., 2022). Unlike AirPose, which tackles 3D pose/shape fusion and cross-UAV calibration for markerless motion capture, FlyPose addresses 2D pose with stringent low-latency requirements in single-view, on-board settings.

A plausible implication is that FlyPose’s edge-efficiency and dataset resources can act as enablers for aerial multi-modal analytics, and its architectural strategies—multi-dataset aerial training, transformer-based pose heads, and loss function selection—will inform the design of higher-order aerial perception frameworks, including those seeking to extend single-view pipelines to 3D or multi-UAV systems.

Persistent challenges include domain adaptation to extreme aerial conditions, robustness to small/occluded instances, and integration with gesture/action recognition pipelines. The publication and analysis of failure cases via FlyPose-104 is specifically intended to catalyze methodological advances for robust top-down, small-scale aerial pose estimation (Farooq et al., 9 Jan 2026).
