- The paper introduces a novel detection framework using Angle Distribution Refinement (ADR) to iteratively refine angle predictions for oriented objects.
- It employs a Chamfer distance cost for precise geometric matching, outperforming traditional metrics in aligning oriented bounding boxes.
- The paper presents Oriented Contrastive Denoising (OCD) to stabilize training; the resulting models reach up to 80.15% AP50 and 297 FPS on remote sensing benchmarks.
Introduction
Oriented object detection in remote sensing is fundamentally distinct from generic planar detection, with the ubiquity of arbitrarily rotated objects (e.g., ships, aircraft, containers) posing unique geometric and computational challenges. This work introduces a real-time end-to-end oriented detection framework (O2-RTDETR, O2-DFINE, and O2-DEIM) that tackles deficiencies in angle representation, geometric matching, and training instability. These models are positioned as the first real-time oriented object detection transformers, bringing modern DETR-based advantages to demanding remote sensing settings while avoiding the latency penalties of post-processing techniques like rotated NMS.
Figure 1: Compared to existing real-time object detectors, the O2 family achieves competitive accuracy-speed trade-offs.
Limitations of Prior Art and the Bottleneck of Rotated NMS
Traditional CNN-based real-time detectors (e.g., YOLOX-R, RTMDet-R, PP-YOLOE-R) merely append angle branches to standard bounding box heads, then rely on rotated NMS. Rotated NMS introduces execution time that grows superlinearly with the number of predicted boxes, complicates reproducibility due to sensitivity to confidence and IoU thresholds, and becomes the de facto runtime bottleneck in dense aerial scenes.
Figure 2: The number of retained oriented boxes is sensitive to the confidence threshold in rotated NMS.
Figure 3: Rotated NMS's execution time rises steeply with the number of candidate boxes on modern hardware.
In contrast, end-to-end transformers for horizontal detection eliminate NMS but, prior to this work, lacked explicit modeling for orientation, thus failing to address remote sensing's requirements.
Model Architecture and Angle Distribution Refinement
The cornerstone is Angle Distribution Refinement (ADR), which reformulates angle prediction as the iterative refinement of probability distributions over both the orientation and the oriented box's parameters. Instead of regressing θ directly as a scalar, ADR predicts distributions for edge distances (external rectangle) and vertex offsets, then refines these distributions residually at each transformer decoder layer. This approach captures both uncertainty and fine granularity in rotation, unifies spatial and angular optimization, and efficiently addresses the unstable convergence properties of previous methods relying on Dirac delta angle regression.
Figure 4: Overview of O2-DFINE with ADR: Decoupled distributional learning for geometric parameters, refined across transformer layers.
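The residual refinement idea can be illustrated with a minimal sketch for a single geometric parameter. This is not the authors' implementation: the normalized range [-1, 1], the helper names, and the hypothetical `toy_residual` standing in for a decoder layer's output are all assumptions made for illustration. Only the 32-bin setting is taken from the ablation reported in the summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(logits, bin_centers):
    """Decode a scalar as the mean of the discrete distribution."""
    return sum(p * c for p, c in zip(softmax(logits), bin_centers))

def refine(prev_logits, residual_logits):
    """One decoder layer: add residual logits to the previous layer's logits."""
    return [a + b for a, b in zip(prev_logits, residual_logits)]

NUM_BINS = 32  # bin count reported as best in the ablation
# Assumption: the parameter is a normalized offset in [-1, 1].
BIN_WIDTH = 2.0 / NUM_BINS
bin_centers = [-1.0 + (i + 0.5) * BIN_WIDTH for i in range(NUM_BINS)]

def toy_residual(target, sigma=0.1):
    """Hypothetical layer output: log-evidence peaked at `target`."""
    return [-((c - target) ** 2) / (2 * sigma ** 2) for c in bin_centers]

logits = [0.0] * NUM_BINS                    # uninformative starting distribution
initial = expected_value(logits, bin_centers)

logits = refine(logits, toy_residual(0.3))   # decoder layer 1
after_layer1 = expected_value(logits, bin_centers)
peak_layer1 = max(softmax(logits))

logits = refine(logits, toy_residual(0.3))   # decoder layer 2 sharpens further
after_layer2 = expected_value(logits, bin_centers)
peak_layer2 = max(softmax(logits))
```

In this toy setting the decoded value moves from the center of the range toward the target, and the distribution becomes more peaked with each layer. That contrasts with regressing a single scalar, where no uncertainty is represented.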
Chamfer Distance Cost for Bipartite Matching
Classic label assignment in DETR variants uses L1, KL divergence, or Hausdorff distances, each with empirical failure cases for oriented boxes: L1 can favor spatially incorrect matches when the angle difference is small, KL divergence degenerates for squares with varying angles, and Hausdorff is sensitive to outlier points.
This work introduces a Chamfer distance cost, converting oriented boxes to vertex point sets and calculating average bidirectional nearest-neighbor distances, thus realizing precise geometric alignment while being permutation-invariant and robust to outliers.
Figure 5: The Chamfer distance cost ensures geometry-aligned bipartite matching compared to KL, L1, and Hausdorff metrics.
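A minimal sketch of a vertex-based Chamfer cost follows, assuming boxes are given as `(cx, cy, w, h, theta)` with `theta` in radians. The function names, the equal weighting of the two directions, and the brute-force `match` helper (a stand-in for the Hungarian algorithm, usable only for tiny examples) are illustrative assumptions rather than the paper's code. A real matching cost would also include a classification term.

```python
import math
from itertools import permutations

def obb_vertices(cx, cy, w, h, theta):
    """Four corner points of an oriented box."""
    c, s = math.cos(theta), math.sin(theta)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in corners]

def chamfer_distance(pts_a, pts_b):
    """Average of the two directed mean nearest-neighbor distances."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(pts_a, pts_b) + one_way(pts_b, pts_a))

def chamfer_cost(box_a, box_b):
    return chamfer_distance(obb_vertices(*box_a), obb_vertices(*box_b))

def match(preds, gts):
    """Brute-force minimum-cost assignment; result[j] is the query matched to gts[j]."""
    cost = [[chamfer_cost(p, g) for g in gts] for p in preds]
    return min(
        permutations(range(len(preds)), len(gts)),
        key=lambda perm: sum(cost[q][j] for j, q in enumerate(perm)),
    )

# A square rotated by 90 degrees occupies the same region: cost is ~0.
same_square = chamfer_cost((0, 0, 10, 10, 0.0), (0, 0, 10, 10, math.pi / 2))
# The same square rotated by 45 degrees is a different region: cost is > 0.
tilted_square = chamfer_cost((0, 0, 10, 10, 0.0), (0, 0, 10, 10, math.pi / 4))
# A pure translation by one unit moves every vertex by exactly one unit.
shifted = chamfer_cost((0, 0, 4, 2, 0.0), (1, 0, 4, 2, 0.0))

preds = [(10, 10, 4, 2, 0.0), (0.2, 0, 4, 2, 0.05), (5, 5, 4, 2, 1.0)]
gts = [(0, 0, 4, 2, 0.0), (10, 10, 4, 2, 0.1)]
assignment = match(preds, gts)
```

Because the cost operates on unordered point sets, it does not depend on which corner is listed first or on which of the equivalent `(w, h, theta)` parameterizations describes the box.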
Oriented Contrastive Denoising for Stable Training
To address the instability in Hungarian bipartite assignment, where a ground truth may be matched to different queries across decoder layers (assignment fluctuation), the authors propose Oriented Contrastive Denoising (OCD). OCD generates positive/negative pairs by injecting controlled noise into the oriented box parameters, with four noise modes: box noise, angle noise, geometric (joint) noise, and probability noise (covariance perturbation). This dual perturbation stabilizes training, sharpens positive/negative discrimination, and is comprehensively evaluated via a novel instability metric.
Figure 6: The OCD strategy introduces perturbations to anchor learning, demonstrated for each noise mode.
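The pair generation can be sketched as follows for three of the four noise modes. The summary does not give the noise magnitudes or sampling scheme, so everything here is an assumption: the thresholds `pos_max` and `neg_max`, the `ANGLE_SPAN` constant, and the convention that positives receive small noise while negatives receive larger noise (borrowed from contrastive denoising in horizontal DETR variants). Probability noise is omitted because the summary does not describe it in enough detail.

```python
import math
import random

ANGLE_SPAN = math.pi / 4  # assumption: largest angle perturbation, in radians

def perturb(box, mode, lo, hi, rng):
    """Return a noised copy of (cx, cy, w, h, theta) with noise magnitude in [lo, hi]."""
    cx, cy, w, h, theta = box

    def signed():
        magnitude = rng.uniform(lo, hi)
        return magnitude if rng.random() < 0.5 else -magnitude

    if mode in ("box", "geometric"):
        cx += signed() * w / 2           # shift the center by a fraction of the size
        cy += signed() * h / 2
        w *= 1.0 + 0.5 * signed()        # rescale, keeping the size positive
        h *= 1.0 + 0.5 * signed()
    if mode in ("angle", "geometric"):
        theta += signed() * ANGLE_SPAN   # rotate
    return (cx, cy, w, h, theta)

def make_pair(gt_box, mode, rng, pos_max=0.4, neg_max=1.0):
    """Positive query: small noise. Negative query: larger noise (hypothetical thresholds)."""
    positive = perturb(gt_box, mode, 0.0, pos_max, rng)
    negative = perturb(gt_box, mode, pos_max, neg_max, rng)
    return positive, negative

rng = random.Random(0)
gt = (50.0, 40.0, 20.0, 8.0, 0.3)
pairs = {mode: make_pair(gt, mode, rng) for mode in ("box", "angle", "geometric")}
```

In such a scheme the positive query would be trained to reconstruct the ground-truth box, while the negative query would be trained to predict "no object".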
Figure 7: OCD significantly reduces matching instability across transformer layers.
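The summary does not define the instability metric, so the following is only one plausible formulation, offered as an assumption: the fraction of ground-truth objects whose matched query changes between consecutive decoder layers. The function name and the input layout are hypothetical.

```python
def layer_instability(assignments):
    """Fraction of ground truths whose matched query changes between consecutive layers.

    assignments[l][g] is the query index matched to ground truth g at decoder layer l.
    """
    changes = 0
    comparisons = 0
    for prev, cur in zip(assignments, assignments[1:]):
        changes += sum(p != c for p, c in zip(prev, cur))
        comparisons += len(cur)
    return changes / comparisons if comparisons else 0.0

# Three decoder layers, three ground truths: one object switches query once.
fluctuating = layer_instability([[3, 7, 1], [3, 5, 1], [3, 5, 1]])
# Identical assignments at every layer.
stable = layer_instability([[3, 7, 1], [3, 7, 1], [3, 7, 1]])
```

Under this definition a value of 0 means every ground truth keeps the same query through all layers, and larger values indicate more assignment fluctuation.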
Empirical Results: Accuracy, Efficiency, and Ablations
O2-DFINE, O2-RTDETR, and O2-DEIM outperform prior art in both speed and accuracy across DOTA-v1.0, DOTA-v1.5, DIOR-R, and FAIR1M-v1.0, with AP50 up to 80.15% on DOTA-v1.0 and inference speeds up to 297 FPS for smaller variants on 2080Ti hardware. Notably, the performance gains are observed at the same computational cost and parameter count as their non-oriented DETR baselines, highlighting robustness to model scaling.
Key ablation findings:
- Chamfer distance cost increases AP50 by 1-2% over L1 or KL, with best performance using only the four vertex points.
- Angle Distribution Refinement with 32 bins yields peak accuracy; excessive binning yields diminishing returns.
- Box noise in OCD contributes the most to AP, whereas excessive geometric/probability noise is less effective.
- Increasing OCD denoising queries improves AP, but with saturation beyond 200 queries.
- All proposed modules yield monotonic AP improvements, with a cumulative gain of over 3.7% without extra FLOPs or latency.
Figure 8: Qualitative visualization of four OCD noise modes and their effect on box perturbation.
Figure 9: O2-DFINE produces feature response maps with more object-aligned activation than competing methods.
Qualitative Analysis
The visualization demonstrates that O2-DFINE and O2-RTDETR provide more precise localizations, especially for large, densely distributed, and low-light objects, where classic frameworks typically yield redundant or misaligned oriented boxes.
Figure 10: Qualitative comparison on challenging scenarios: O2 models yield tighter and more robust oriented bounding boxes.
Practical and Theoretical Implications
Practically, the presented approach enables robust oriented detection at real-time speeds with deterministic inference, making it deployable in settings requiring high-throughput and geometrically reliable remote sensing analysis. By eliminating NMS and embracing a unified transformer pipeline, operational complexity is reduced, and latency is predictable and stable.
Theoretically, the introduction of residual distributional refinement and geometry-aligned matching (via Chamfer distance) provides a paradigm for joint uncertainty modeling and matching in structured prediction tasks with rotation or affine degrees of freedom.
The OCD mechanism and its instability-based analysis open up new directions for controlling assignment stability, potentially benefiting other domains where set prediction is required.
Future Directions
Further research could extend distributional parameterization to richer geometric forms (polygons, splines), couple the orientation modeling to context-aware global features, or integrate assignment stability as a differentiable regularizer. Additionally, transfer to semi-supervised or weakly supervised settings and deployment on resource-constrained platforms are natural directions.
Conclusion
This series of real-time oriented object detection transformers closes the gap between high-precision rotation-aware detection and the operational demands of remote sensing. With Angle Distribution Refinement, Chamfer distance cost, and Oriented Contrastive Denoising, these models combine strong geometric modeling, assignment stability, and computational efficiency, providing a new foundation for deployable oriented detection in high-throughput aerial or satellite imagery analysis (2603.15497).