Real-Time Oriented Object Detection Transformer in Remote Sensing Images

Published 16 Mar 2026 in cs.CV | (2603.15497v1)

Abstract: Recent real-time detection transformers have gained popularity due to their simplicity and efficiency. However, these detectors do not explicitly model object rotation, especially in remote sensing imagery where objects appear at arbitrary angles, leading to challenges in angle representation, matching cost, and training stability. In this paper, we propose a real-time oriented object detection transformer, the first real-time end-to-end oriented object detector to the best of our knowledge, that addresses the above issues. Specifically, angle distribution refinement is proposed to reformulate angle regression as an iterative refinement of probability distributions, thereby capturing the uncertainty of object rotation and providing a more fine-grained angle representation. Then, we incorporate a Chamfer distance cost into bipartite matching, measuring box distance via vertex sets, enabling more accurate geometric alignment and eliminating ambiguous matches. Moreover, we propose oriented contrastive denoising to stabilize training and analyze four noise modes. We observe that a ground truth can be assigned to different index queries across different decoder layers, and analyze this issue using the proposed instability metric. We design a series of model variants and experiments to validate the proposed method. Notably, our O2-DFINE-L, O2-RTDETR-R50 and O2-DEIM-R50 achieve 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on the 2080ti GPU. Code is available at https://github.com/wokaikaixinxin/ai4rs.


Summary

The paper introduces a novel detection framework using Angle Distribution Refinement (ADR) to iteratively refine angle predictions for oriented objects.
It employs a Chamfer distance cost for precise geometric matching, outperforming traditional metrics in aligning oriented bounding boxes.
The paper presents Oriented Contrastive Denoising (OCD) to stabilize training; the resulting models reach up to 80.15% AP$_{50}$ on DOTA-v1.0 and up to 297 FPS on remote sensing benchmarks.

Real-Time Oriented Object Detection Transformer for Remote Sensing Images: An Expert Analysis

Introduction

Oriented object detection in remote sensing differs fundamentally from generic horizontal-box detection: arbitrarily rotated objects (e.g., ships, aircraft, containers) are everywhere, and they pose distinct geometric and computational challenges. This work introduces a family of real-time end-to-end oriented detectors (O$^2$-RTDETR, O$^2$-DFINE, and O$^2$-DEIM) that tackle deficiencies in angle representation, geometric matching, and training stability. The models are positioned as the first real-time oriented object detection transformers, bringing the advantages of modern DETR-style detectors to demanding remote sensing settings while avoiding the latency penalty of post-processing steps such as rotated NMS.

Figure 1: Compared to existing real-time object detectors, the O$^2$ family achieves competitive accuracy-speed trade-offs.

Limitations of Prior Art and the Bottleneck of Rotated NMS

Traditional CNN-based real-time detectors (e.g., YOLOX-R, RTMDet-R, PP-YOLOE-R) merely append angle branches to standard bounding box heads, then rely on rotated NMS. Rotated NMS introduces execution time that grows superlinearly with the number of predicted boxes, complicates reproducibility due to sensitivity to confidence and IoU thresholds, and becomes the de facto runtime bottleneck in dense aerial scenes.
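The superlinear growth can be illustrated with a minimal sketch of greedy NMS. This is not the rotated NMS implementation benchmarked in the paper: the pairwise overlap matrix is assumed to be precomputed (a real rotated NMS would fill it with rotated-polygon IoU values), and the function name `greedy_nms` and the counter `checks` are hypothetical. The sketch only counts how many pairwise comparisons the greedy loop performs.

```python
import numpy as np

def greedy_nms(scores, overlap, iou_thr=0.5):
    """Greedy NMS over a precomputed pairwise-overlap matrix (assumed input).

    Returns the kept indices and the number of pairwise overlap checks.
    """
    order = np.argsort(-scores)          # highest score first
    keep, checks = [], 0
    for i in order:
        suppressed = False
        for j in keep:                   # compare against every kept box
            checks += 1
            if overlap[i, j] > iou_thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(int(i))
    return keep, checks

# Worst case: no box overlaps any other, so nothing is ever suppressed.
for n in (4, 8, 16):
    scores = np.linspace(1.0, 0.1, n)
    _, checks = greedy_nms(scores, np.zeros((n, n)))
    print(n, checks)                     # checks equals n * (n - 1) / 2
```

When few boxes are suppressed, the number of comparisons grows quadratically with the number of candidates, which is consistent with the steep runtime growth described above for dense scenes.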

Figure 2: The number of retained oriented boxes is sensitive to the confidence threshold in rotated NMS.

Figure 3: Rotated NMS’s execution time rises steeply with the number of candidate boxes on modern hardware.

In contrast, end-to-end transformers for horizontal detection eliminate NMS but, prior to this work, lacked explicit modeling for orientation, thus failing to address remote sensing’s requirements.

Angle Distribution Refinement

The cornerstone is Angle Distribution Refinement (ADR), which reformulates angle prediction as the iterative refinement of probability distributions over the oriented box's parameters, orientation included. Instead of regressing $\theta$ directly as a scalar, ADR predicts distributions for edge distances (external rectangle) and vertex offsets, then refines these distributions residually at each transformer decoder layer. This approach captures both uncertainty and fine granularity in rotation, unifies spatial and angular optimization, and addresses the unstable convergence of previous methods that treat the angle as a single point estimate (a Dirac delta distribution).
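The idea of residual distribution refinement can be sketched for a single scalar quantity. This is a simplified illustration, not the paper's parameterization: the uniform bin values, the function name `refine_and_decode`, and the hand-written residual logits are all assumptions, and the actual ADR distributions over edge distances and vertex offsets are not reproduced here. Each decoder layer adds a residual to the previous layer's logits, and the value is decoded as the expectation under the resulting distribution.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def refine_and_decode(init_logits, residual_logits, bin_values):
    """Residually refine logits layer by layer; decode the expected value.

    init_logits:     (K,) logits from the first prediction
    residual_logits: iterable of (K,) residuals, one per decoder layer
    bin_values:      (K,) value represented by each bin (assumed uniform)
    """
    logits = init_logits.copy()
    decoded = []
    for delta in residual_logits:
        logits = logits + delta              # residual update, not a re-prediction
        probs = softmax(logits)
        decoded.append(float(probs @ bin_values))
    return decoded

num_bins = 32                                # hypothetical setting for this sketch
bin_values = np.linspace(-1.0, 1.0, num_bins)
init = np.zeros(num_bins)                    # uniform distribution: maximal uncertainty
push_up = np.zeros(num_bins)
push_up[-1] = 5.0                            # a later layer favors the largest bin
values = refine_and_decode(init, [np.zeros(num_bins), push_up], bin_values)
print(values)
```

A flat distribution decodes to the center of the range, and each residual sharpens or shifts the distribution instead of replacing it, which is how uncertainty can be carried across layers.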

Figure 4: Overview of O$^2$-DFINE with ADR: Decoupled distributional learning for geometric parameters, refined across transformer layers.

Chamfer Distance Cost for Bipartite Matching

Classic label assignment in DETR variants uses L1, KL divergence, or Hausdorff distances, each with empirical failure cases for oriented boxes: L1 can favor a spatially incorrect match when its angle difference is small, KL divergence degenerates for squares at different angles, and Hausdorff distance is sensitive to outlier points.
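To see why a Gaussian-based KL cost can degenerate for squares, consider the conversion commonly used by Gaussian box losses, which is assumed here rather than taken from the paper. A box $(x, y, w, h, \theta)$ is mapped to $\mathcal{N}(\mu, \Sigma)$ with

$$
\mu = (x, y)^\top, \qquad \Sigma = R(\theta)\,\mathrm{diag}\!\left(\tfrac{w^2}{4}, \tfrac{h^2}{4}\right) R(\theta)^\top ,
$$

where $R(\theta)$ is the 2D rotation matrix. For a square, $w = h = s$, so

$$
\Sigma = \tfrac{s^2}{4}\, R(\theta) R(\theta)^\top = \tfrac{s^2}{4}\, I ,
$$

which no longer depends on $\theta$. Two squares with the same center and side length therefore have zero KL divergence at any relative angle, so the cost cannot tell them apart.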

This work introduces a Chamfer distance cost, converting oriented boxes to vertex point sets and calculating average bidirectional nearest-neighbor distances, thus realizing precise geometric alignment while being permutation-invariant and robust to outliers.
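A minimal sketch of a vertex-based Chamfer distance follows. The box parameterization `(cx, cy, w, h, theta)`, the function names, and the choice to sum the two directional means (rather than average them, or use squared distances) are assumptions; the paper's exact cost and its weighting inside the matcher are not reproduced.

```python
import numpy as np

def obb_vertices(cx, cy, w, h, theta):
    """Four corner points of an oriented box, theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    corners = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return corners @ rot.T + np.array([cx, cy])

def chamfer_distance(a, b):
    """Sum of mean nearest-neighbor distances in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (4, 4) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

box = obb_vertices(0.0, 0.0, 4.0, 2.0, 0.0)
flipped = obb_vertices(0.0, 0.0, 4.0, 2.0, np.pi)     # same rectangle, vertices reordered
shifted = obb_vertices(3.0, 0.0, 4.0, 2.0, 0.0)
square = obb_vertices(0.0, 0.0, 2.0, 2.0, 0.0)
square_45 = obb_vertices(0.0, 0.0, 2.0, 2.0, np.pi / 4)

print(chamfer_distance(box, flipped))      # about 0: vertex order does not matter
print(chamfer_distance(box, shifted))      # grows with the spatial offset
print(chamfer_distance(square, square_45)) # nonzero for a rotated square
```

Because each vertex is matched to its nearest neighbor in the other set, the distance is unaffected by vertex ordering, and it stays sensitive to the rotation of a square.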

Figure 5: The Chamfer distance cost ensures geometry-aligned bipartite matching compared to KL, L1, and Hausdorff metrics.

Oriented Contrastive Denoising for Stable Training

Hungarian bipartite assignment can be unstable: a ground truth may be matched to different queries across decoder layers (assignment fluctuation). To address this, the authors propose Oriented Contrastive Denoising (OCD). OCD generates positive/negative pairs by injecting controlled noise into the oriented box parameters, with four noise modes: box noise, angle noise, geometric (joint) noise, and probability noise (covariance perturbation). This perturbation scheme stabilizes training and sharpens positive/negative discrimination, and its effect is evaluated with a novel instability metric.
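Assignment fluctuation can be quantified in a few lines. The metric below is a hypothetical stand-in, since the paper's exact definition of its instability metric is not given in this summary: it assumes the matcher's output is available as an array of matched query indices with one row per decoder layer and one column per ground truth, and it reports the fraction of ground truths whose matched query changes between consecutive layers.

```python
import numpy as np

def assignment_instability(matched):
    """Fraction of (layer transition, ground truth) pairs whose query changed.

    matched: array-like of shape (num_layers, num_gt); entry [l, g] is the
             query index matched to ground truth g at decoder layer l.
    """
    matched = np.asarray(matched)
    if matched.shape[0] < 2 or matched.shape[1] == 0:
        return 0.0                       # no transitions or no objects to compare
    changed = matched[1:] != matched[:-1]
    return float(changed.mean())

# Three decoder layers, three ground truths (illustrative values only).
matched = [
    [0, 1, 2],
    [0, 1, 2],
    [0, 3, 2],   # ground truth 1 switches from query 1 to query 3
]
print(assignment_instability(matched))   # 1 change out of 6 comparisons
```

A value of 0 means every ground truth keeps the same query through the decoder, and larger values indicate more frequent reassignment.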

Figure 6: The OCD strategy introduces perturbations to anchor learning, demonstrated for each noise mode.

Figure 7: OCD significantly reduces matching instability across transformer layers.

Empirical Results: Accuracy, Efficiency, and Ablations

O$^2$-DFINE, O$^2$-RTDETR, and O$^2$-DEIM outperform prior art in both speed and accuracy across DOTA-v1.0, DOTA-v1.5, DIOR-R, and FAIR1M-v1.0, with AP$_{50}$ up to 80.15% on DOTA-v1.0 and inference speeds up to 297 FPS for smaller variants on a 2080 Ti GPU. Notably, these gains come at the same computational cost and parameter count as the non-oriented DETR baselines, which suggests the approach is robust to model scaling.

Key ablation findings:

Chamfer distance cost increases AP$_{50}$ by 1–2% over L1 or KL, with best performance using only the four vertex points.
Angle Distribution Refinement with 32 bins yields peak accuracy; excessive binning yields diminishing returns.
Box noise in OCD contributes the most to AP, whereas excessive geometric/probability noise is less effective.
Increasing OCD denoising queries improves AP, but with saturation beyond 200 queries.
All proposed modules yield monotonic AP improvements, with a cumulative gain of over 3.7% without extra FLOPs or latency.

Figure 8: Qualitative visualization of four OCD noise modes and their effect on box perturbation.

Figure 9: O$^2$-DFINE produces feature response maps with more object-aligned activation than competing methods.

Qualitative Analysis

The visualization demonstrates that O$^2$-DFINE and O$^2$-RTDETR provide more precise localizations, especially for large, densely distributed, and low-light objects, where classic frameworks typically yield redundant or misaligned oriented boxes.

Figure 10: Qualitative comparison on challenging scenarios: O$^2$ models yield tighter and more robust oriented bounding boxes.

Practical and Theoretical Implications

Practically, the presented approach enables robust oriented detection at real-time speeds with deterministic inference, making it deployable in settings requiring high-throughput and geometrically reliable remote sensing analysis. By eliminating NMS and embracing a unified transformer pipeline, operational complexity is reduced, and latency is predictable and stable.

Theoretically, the introduction of residual distributional refinement and geometry-aligned matching (via Chamfer distance) provides a paradigm for joint uncertainty modeling and matching in structured prediction tasks with rotation or affine degrees of freedom.

The OCD mechanism and its instability-based analysis open up new directions for controlling assignment stability, potentially benefiting other domains where set prediction is required.

Future Directions

Further research could extend distributional parameterization to richer geometric forms (polygons, splines), couple the orientation modeling to context-aware global features, or integrate assignment stability as a differentiable regularizer. Additionally, transfer to semi-supervised or weakly supervised settings and deployment on resource-constrained platforms are natural directions.

Conclusion

This series of real-time oriented object detection transformers closes the gap between high-precision rotation-aware detection and the operational demands of remote sensing. With Angle Distribution Refinement, Chamfer distance cost, and Oriented Contrastive Denoising, these models combine strong geometric modeling, assignment stability, and computational efficiency, providing a new foundation for deployable oriented detection in high-throughput aerial or satellite imagery analysis (2603.15497).
