Aerial-Ground Person Re-Identification

Updated 6 February 2026

Aerial-Ground Person Re-Identification is a cross-domain person retrieval task that matches UAV and ground camera images despite significant viewpoint and resolution differences.
Recent innovations integrate geometry-aware attention, attribute-guided prompting, and generative feature synthesis to overcome occlusion, scale variability, and environmental challenges.
Benchmark results for methods such as VDT, GSAlign, and GeoReID demonstrate rapid progress and highlight practical applications in security and surveillance.

Aerial-Ground Person Re-Identification (AG-ReID) is a cross-domain person retrieval task focused on matching pedestrian instances observed by aerial (UAV) cameras to those captured by ground-based (CCTV, wearable) cameras. AG-ReID is uniquely characterized by its extreme cross-viewpoint geometric distortions, resolution and scale variability, and environmental challenges such as occlusion, illumination, and clutter. This field has evolved rapidly, driven by security, surveillance, and multi-platform analytics needs in large-scale and public outdoor environments.

1. Problem Definition and Challenges

AG-ReID extends classical ReID from homogeneous ground-ground settings to heterogeneous platforms. Formally, it seeks an embedding $f: \mathcal{X} \rightarrow \mathbb{R}^d$ such that, for an aerial-ground image pair $(x_i^A, x_j^G)$ with the same identity ($y_i = y_j$), the feature distance $\|f(x_i^A) - f(x_j^G)\|$ is small, while features of different identities lie far apart. The gap between platforms is dominated by:

Viewpoint Disparity: UAVs capture severe top-down or oblique perspectives, leading to foreshortening, unnatural aspect ratios, and body-part displacement not present in ground images.
Resolution Loss: At high altitudes, person bounding boxes may be under 30 px tall, severely limiting fine appearance details.
Scale and Occlusion Variability: Large altitude and pose variations, dynamic backgrounds, and partial visibility further corrupt matching signals.
Attribute and Appearance Shifts: Clothing, pose, and soft-biometric attribute distributions may differ between platform contexts.

Standard ReID baselines exhibit severe performance degradation as these challenges intensify, motivating dedicated architectural and data-centric innovations (Nguyen et al., 2023, Zhang et al., 2024, Wang et al., 10 Mar 2025, Nguyen et al., 28 Jun 2025, Hambarde et al., 4 Jan 2026).

2. Datasets and Benchmark Protocols

Aerial-ground ReID research relies on a suite of purpose-built benchmarks, each reflecting a distinct operational setting:

Dataset	#IDs	#Images/Tracklets	Platforms	Protocols	Attribute Annotations
AG-ReID.v1	388	21,983	UAV, CCTV	A→G, G→A	15 soft attributes
AG-ReID.v2	1,615	100,502	UAV, CCTV, wearable	A→C, C→A, A→W, W→A	15 soft attributes
CARGO	5,000	108,563	5 UAV, 8 ground	ALL, A↔G, G↔G, A↔A	-
LAGPeR	4,231	63,841	7 UAV, 14 ground	A→G, G→A, G→G	Yes
AG-VPReID	6,632	32,321 tracklets	UAV (15–120 m), CCTV, wearable	A→G, G→A	15 attributes
VReID-XFD/DetReIDX	371	11,288 tracklets	UAV (5.8–120 m), ground	A→G, G→A, A→A	16 soft-biometrics
AG-VPReID.VIR	1,837	4,861 tracklets	UAV, CCTV, wearable (RGB, IR)	G→A, A→G, G→G, A→A	Yes

These datasets are characterized by extensive cross-viewpoint annotation, with AG-ReID.v2 and AG-VPReID featuring multi-session, multi-altitude captures and soft attribute labeling for explainability and attribute-guided learning (Nguyen et al., 2024, Nguyen et al., 11 Mar 2025, Nguyen et al., 24 Jul 2025, Li et al., 25 Oct 2025).

Benchmark protocols span image-based, video-based, and visible-IR modalities. Metrics include Cumulative Matching Characteristic (CMC) at various ranks and mean Average Precision (mAP).
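
As a minimal sketch of how these two metrics are commonly computed, the following assumes each query has already been turned into a list of booleans in rank order (True where the gallery item shares the query identity). The function names are illustrative, and dataset-specific filtering rules (for example excluding same-camera gallery items) are deliberately left out.

```python
def average_precision(ranked_matches):
    """Average precision for one query; ranked_matches holds booleans in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def cmc_at_k(ranked_matches, k):
    """1.0 if a correct identity appears within the top-k results, else 0.0."""
    return 1.0 if any(ranked_matches[:k]) else 0.0


def evaluate(all_ranked_matches, ranks=(1, 5, 10)):
    """Return (mAP, {k: CMC@k}) averaged over queries with at least one true match."""
    valid = [m for m in all_ranked_matches if any(m)]
    if not valid:
        raise ValueError("no query has a true match in the gallery")
    n = len(valid)
    mean_ap = sum(average_precision(m) for m in valid) / n
    cmc = {k: sum(cmc_at_k(m, k) for m in valid) / n for k in ranks}
    return mean_ap, cmc


# Two toy queries against a three-item gallery.
mean_ap, cmc = evaluate([[True, False, True], [False, True, False]], ranks=(1, 2))
```

Rank-1 rewards only the top result, whereas mAP also accounts for how early every remaining true match is retrieved, which is why the two numbers can diverge for the same method.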

3. Core Methodological Advances

Recent AG-ReID solutions integrate cross-view decoupling, attribute awareness, geometric rectification, dynamic selection, and generative modeling:

View Decoupling and Prompting

View-Decoupled Transformer (VDT): Learns identity (meta) and view-specific tokens via hierarchical subtractive separation, enforcing view-invariant representations through an orthogonality loss (Zhang et al., 2024). VDT is foundational to later prompt-based models, including SeCap (Wang et al., 10 Mar 2025) and DTST (Wang et al., 2024).
Prompt-Tuning & Attribute Integration (LATex, SeCap): LATex utilizes prompt-tuned CLIP backbones to inject attribute-based text knowledge, harnessing an Attribute-aware Image Encoder (AIE), Prompted Attribute Classifier Group (PACG), and Coupled Prompt Template (CPT) for attribute-aligned retrieval (Hu et al., 31 Mar 2025). SeCap introduces input-adaptive PRM and local LFRM modules over the VDT backbone, dynamically generating prompts conditioned on the view-invariant token (Wang et al., 10 Mar 2025).
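
The idea of decoupling identity and view tokens with an orthogonality constraint can be illustrated with a small sketch. This article does not give the exact form of the VDT loss, so the penalty below (mean absolute cosine similarity between paired tokens) is an assumption chosen for clarity, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def orthogonality_penalty(identity_tokens, view_tokens):
    """Mean |cos| between each sample's identity token and its view token.

    The value is near 0 when the two tokens are orthogonal and near 1 when
    they are collinear, so minimizing it pushes the two subspaces apart.
    """
    pairs = list(zip(identity_tokens, view_tokens))
    return sum(abs(cosine(i, v)) for i, v in pairs) / len(pairs)


# Orthogonal tokens incur almost no penalty; collinear tokens are penalized.
low = orthogonality_penalty([[1.0, 0.0]], [[0.0, 1.0]])
high = orthogonality_penalty([[1.0, 0.0]], [[2.0, 0.0]])
```

In a real model this term would be added to the identity losses and back-propagated through the token embeddings; the pure-Python version only shows what quantity is being driven toward zero.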

Geometric and Semantic Alignment

GSAlign: Combines Learnable Thin Plate Spline (LTPS) warping, which corrects geometric deformation via adaptive control points and progressive feature-level spatial transforms, with a Dynamic Alignment Module (DAM) that generates visibility-aware masks for semantic occlusion handling (Li et al., 25 Oct 2025). This approach yields state-of-the-art improvements, especially in extreme A↔G protocols.
Geometry-Conditioned Attention (GeoReID): Addresses the physical limitations of standard dot-product attention under substantial geometric disparity by introducing the Geometry-Induced Query-Key Transformation (GIQT), which directly conditions the attention similarity metric on camera altitude and orientation via low-rank residuals. A geometry-conditioned prompt generator further injects global priors per view, resulting in robust cross-domain alignment (Hambarde et al., 29 Jan 2026).
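
To make the notion of a geometry-conditioned similarity concrete, here is a toy sketch of one possible reading: the usual dot product between a query and a key is augmented by a low-rank residual whose strength depends on camera geometry. The exact GIQT formulation is not given in this article, so the gating function, the factor vectors `U` and `V`, and the two-element geometry encoding are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def geometry_gates(geometry, gate_weights):
    """Map a geometry vector (e.g. normalized altitude and pitch) to one gate per rank.

    Assumption: a tanh of a linear projection, so zero geometry gives zero gates.
    """
    return [math.tanh(dot(w, geometry)) for w in gate_weights]


def conditioned_score(q, k, U, V, gates):
    """Attention logit q.k plus a rank-r residual sum_r gate_r * (q.U_r) * (V_r.k).

    U and V are lists of r vectors with the same dimension as q and k.
    """
    base = dot(q, k)
    residual = sum(g * dot(q, u) * dot(v, k) for g, u, v in zip(gates, U, V))
    return base + residual


# Rank-1 example: the residual lets an otherwise orthogonal query and key interact.
q, k = [1.0, 0.0], [0.0, 1.0]
U, V = [[1.0, 0.0]], [[0.0, 1.0]]
gates = geometry_gates([0.5, 0.0], gate_weights=[[1.0, 0.0]])
score = conditioned_score(q, k, U, V, gates)
```

When the gates are zero the score reduces to standard dot-product attention, so a residual of this kind can only add a geometry-dependent correction on top of the pretrained similarity.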

Generative and Cross-View Representations

SD-ReID: Leverages a two-stage framework combining discriminative ViT feature extraction with a view-aware latent Stable Diffusion module. The generative SD module, conditioned on both identity and cross-view features, synthesizes view-specific representations that bridge large semantic gaps. The View-Refine Decoder aligns features when inference lacks direct cross-view observations (Hu et al., 13 Apr 2025).
Dynamic Token Selection (DTST): Selects salient tokens in a ViT backbone using a differentiable Top-K scheme (Gumbel-Softmax relaxation), concentrating the representation on the most identity-informative regions while discarding distractors. Integrated with VDT-style decoupling, this yields improved efficiency and accuracy (Wang et al., 2024).
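
The token-selection step can be sketched as a forward pass: token scores are perturbed with Gumbel noise, the top-k perturbed scores give the hard selection, and a temperature-scaled softmax gives the relaxed weights that a training framework would use for gradients. This is a generic illustration of Gumbel-based Top-K selection, not DTST's actual implementation; the function names and the default temperature are assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample; the clamp avoids log(0)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_topk(scores, k, tau=1.0, rng=None):
    """Return (sorted indices of the k selected tokens, relaxed selection weights)."""
    if rng is None:
        rng = random.Random()
    perturbed = [s + sample_gumbel(rng) for s in scores]
    soft = softmax([p / tau for p in perturbed])
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return sorted(order[:k]), soft


# Keep 2 of 5 patch tokens; a seeded generator makes the draw repeatable.
kept, weights = gumbel_topk([2.0, 0.1, 1.5, -0.3, 0.7], k=2, rng=random.Random(0))
```

The hard indices alone are not differentiable, so in practice the relaxed weights (or a straight-through estimator built on them) carry the gradient back to the scoring network.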

Temporal and Multi-Stream Integration

AG-VPReID-Net: Employs a three-stream fusion of (1) Adapted Temporal-Spatial modeling (integrating temporal 3D body shape/gait through SMPL regressor and GRU), (2) Normalized Appearance (UV-texture aggregation and multi-scale DGC), and (3) Multi-Scale Attention via frozen CLIP ViT, with adaptive weighting (Nguyen et al., 11 Mar 2025).
MTF-CVReID and X-TFCLIP: Introduce lightweight adapter-based modules addressing altitude, modality, and view biases. Modules harmonize color/contrast shifts (CSFN), multi-resolution scaling (MRFH), inter-view feature alignment (IVFA), and encode temporal patterns (HTPL, TDM), achieving real-time efficiency with high accuracy (Rashidunnabi et al., 4 Nov 2025, Nguyen et al., 28 Jun 2025).

4. Training Strategies, Losses, and Optimization

Most AG-ReID frameworks utilize composite loss functions combining classification (cross-entropy, ID), triplet ranking, attribute/auxiliary losses, and specialized regularizers:

Prompt/Token Losses: Prompted architectures add contrastive CLIP losses, attribute classification, and prompt regularization terms (e.g., LATex CPT loss, PRM/LFRM in SeCap).
Geometric Regularization: Geometric methods (GSAlign, GeoReID) penalize excessive deformation or prompt magnitude, enforce alignment of semantic masks through a dedicated term ($L_{\text{align}}$), and balance local and global losses when training the prompt and query-key corrections.
Memory/Temporal Losses: Multi-stream and video-based models introduce video-to-memory contrastive losses (V2M), hierarchical temporal consistency, and memory-based center/contrastive terms.
Optimization Schedules: Strategies include staged or joint training, frozen-backbone adapters (prompt/layer freezing), AdamW/SGD with cosine decay, and in some cases, two-phase optimization for generative-SR-ReID hybrids (Endrei et al., 13 Jan 2026).
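
The common core of these objectives, an identity classification term plus a triplet ranking term, can be written down in a few lines. The sketch below is a per-sample illustration under stated assumptions: the weights `w_id` and `w_tri` and the margin of 0.3 are hypothetical defaults, and the method-specific attribute, prompt, and geometric terms listed above are omitted.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def composite_loss(logits, label, anchor, positive, negative,
                   w_id=1.0, w_tri=1.0, margin=0.3):
    """Weighted sum of the identity term and the triplet ranking term."""
    id_term = cross_entropy(logits, label)
    tri_term = triplet_loss(anchor, positive, negative, margin)
    return w_id * id_term + w_tri * tri_term


# Aerial anchor, ground positive of the same identity, ground negative of another.
loss = composite_loss(
    logits=[2.0, 0.5, -1.0], label=0,
    anchor=[0.9, 0.1], positive=[1.0, 0.0], negative=[0.0, 1.0],
)
```

For aerial-ground training, drawing the anchor and positive from different platforms is one natural way to make the triplet term act directly on the cross-view gap.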

Batch sizes typically range from 32 to 128 images per iteration. Inference is performed via nearest-neighbor retrieval in the embedding space, often after feature concatenation or adaptive pooling.
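
A minimal sketch of this inference step, assuming features have already been extracted: each feature part (for example a global and a local descriptor) is L2-normalized, the parts are concatenated, and gallery items are ranked by Euclidean distance to the query. The helper names are illustrative and not taken from any of the cited methods.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def concat_features(*parts):
    """Normalize each feature part separately, then concatenate them."""
    out = []
    for part in parts:
        out.extend(l2_normalize(part))
    return out


def rank_gallery(query, gallery):
    """Return gallery indices ordered from nearest to farthest from the query."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(range(len(gallery)), key=lambda i: dist(query, gallery[i]))


# One aerial query against three ground gallery entries, two feature parts each.
query = concat_features([1.0, 0.0], [0.0, 2.0])
gallery = [
    concat_features([0.0, 1.0], [1.0, 0.0]),
    concat_features([2.0, 0.0], [0.0, 1.0]),
    concat_features([1.0, 1.0], [0.0, 1.0]),
]
ranking = rank_gallery(query, gallery)
```

Normalizing each part before concatenation keeps one descriptor from dominating the distance simply because its raw magnitude is larger.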

5. Empirical Performance and Comparative Results

Quantitative results from major AG-ReID benchmarks attest to the rapid progress and remaining difficulties:

Method	CARGO (A↔G) R1/mAP	AG-ReID v1 (A→G) R1/mAP	AG-ReID v2 (A→W) R1/mAP
VDT (Zhang et al., 2024)	48.12 / 42.76	82.91 / 74.44	-
GSAlign (Li et al., 25 Oct 2025)	64.89 / 61.55	86.74 / 84.00	91.47 / 89.78
LATex (Hu et al., 31 Mar 2025)	66.99 / 58.59	84.41 / 75.85	90.09 / 83.50
SD-ReID (Hu et al., 13 Apr 2025)	53.12 / 46.44	85.16 / 75.40	-
GeoReID (Hambarde et al., 29 Jan 2026)	72.02 / 64.61	87.02 / 79.46	93.21 / 88.03
DTST (Wang et al., 2024)	47.50 / 42.96	-	-
SeCap (Wang et al., 10 Mar 2025)	-	84.03 / 76.16	-

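Among the listed methods, GeoReID reports the highest Rank-1 in all three columns and the highest mAP on CARGO, while GSAlign reports the highest mAP on AG-ReID v1 and v2; LATex is competitive on CARGO and AG-ReID v2. GSAlign and GeoReID perform especially well under high-distortion regimes, which is attributed to their explicit spatial and similarity rectification. Generative (SD-ReID) and token-selective (DTST) solutions target settings with pronounced viewpoint or scale change and distractors, although their reported CARGO scores remain below those of the geometry-aware methods.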

Video-based and far-distance (XFD) settings present severe challenges: the best systems (e.g., SAS-VPReID, MTF-CVReID) reach only 43.93% mAP in A→G even on curated test sets, highlighting intrinsic modality gaps. Multi-modal (visible-IR), video, and extremely high-altitude evaluations underscore the need for continued algorithmic and data advances (Hambarde et al., 4 Jan 2026, Rashidunnabi et al., 4 Nov 2025, Nguyen et al., 24 Jul 2025).

6. Current Limitations and Directions

Despite significant progress, AG-ReID models remain fundamentally challenged by:

Extreme Geometric and Resolution Distortion: No approach fully closes the performance gap at altitudes above 80 m or under nadir (top-down) views, where both appearance and motion cues collapse.
Cross-Modal and Zero-Shot Domain Shift: Infrared, wearable, and synthetic-to-real generalization is only partly addressed. Robustness to dynamic environments, clothing change, and background clutter is still limited.
Attribute and Prompt Supervision: Attribute-aware and prompt-compositional methods depend on annotation quality; errors in predicted attributes or geometric predictions propagate to downstream retrieval performance.
Compute-Efficiency Trade-offs: While adapter, prompt, and token selection modules greatly reduce the trainable parameter count compared to full fine-tuning, state-of-the-art systems still incur substantial computational overhead, especially in multi-stream video pipelines.

Ongoing directions include robust attribute detectors, geometry- and context-adaptive prompting, self-supervised and unsupervised domain adaptation, GAN-based data augmentation, and explicit multi-sensor fusion (RGB-IR-LiDAR).

7. Synthesis and Outlook

Aerial-Ground Person Re-Identification is a rapidly progressing subfield at the intersection of vision, geometry, and cross-modal understanding. Methodological innovations such as geometric alignment and geometry-conditioned attention (GSAlign, GeoReID with GIQT), attribute-based prompting (LATex, SeCap), deep generative feature augmentation (SD-ReID), and multi-stream/video integration (AG-VPReID-Net, MTF-CVReID) have steadily advanced the empirical frontier on diverse and realistic benchmarks (Nguyen et al., 11 Mar 2025, Hu et al., 31 Mar 2025, Rashidunnabi et al., 4 Nov 2025, Li et al., 25 Oct 2025, Hu et al., 13 Apr 2025, Hambarde et al., 29 Jan 2026, Nguyen et al., 28 Jun 2025, Hambarde et al., 4 Jan 2026, Nguyen et al., 2024, Wang et al., 2024, Wang et al., 10 Mar 2025).

AG-ReID benchmarks illustrate persistent gaps between aerial-ground recognition and traditional ReID regimes, revealing fundamental limitations imposed by physical, geometric, and perceptual constraints. The field is trending toward specialized, factorized approaches that exploit both domain-specific priors and modality-robust architectures. Future innovations will likely emerge at the intersection of geometry-aware learning, cross-modal data fusion, generative feature synthesis, and prompt-based adaptation, with implications for both theoretical understanding and large-scale real-world deployment.