RGBX-Grounding Benchmark
- RGBX-Grounding Benchmark is a comprehensive evaluation framework that fuses RGB, TIR, depth, and event modalities to enhance object localization in adverse imaging conditions.
- It benchmarks multi-modal natural language object grounding across complex, cross-domain scenarios, emphasizing fusion, spatial reasoning, and modal robustness.
- The framework utilizes rigorously annotated datasets and protocols, driving advances in sensor-agnostic model training and cross-modal interoperability.
The RGBX-Grounding Benchmark designates a class of evaluation frameworks for visual grounding that move beyond conventional RGB-only visual content, systematically incorporating complementary sensing modalities such as depth, thermal infrared (TIR), and event-based vision. These datasets and their associated protocols define rigorous, multi-modal testbeds centered on natural language object localization in complex, real-world and cross-domain settings, explicitly quantifying the capacity of models to fuse spatial, spectral, and semantic information under adverse imaging conditions and distribution shifts (Miyanishi et al., 2023, Zhao et al., 31 Dec 2025, Wu et al., 31 Jan 2026).
1. Motivation and Scope
Traditional visual grounding tasks have been constrained to RGB data acquired under controlled or “clean” environments—conditions poorly representative of operational domains such as robotics, autonomous systems, surveillance, and all-day/all-weather perception. This constraint limits both the robustness and generalization capacity of models, especially in scenarios with low illumination, atmospheric degradation (fog, rain), occlusion, or noncentral object placement (Zhao et al., 31 Dec 2025). The RGBX paradigm was introduced to address this gap by integrating complementary modalities (X ∈ {depth, TIR, event}) that remain informative where RGB degrades, and by promoting architectures and learning protocols robust to cross-modal, cross-sensor, and cross-annotator domain shifts (Miyanishi et al., 2023, Wu et al., 31 Jan 2026).
2. Benchmark Definitions and Task Structure
RGBX-Grounding benchmarks are instantiated over supervised testbeds in which a model is presented with multi-modal sensory input (at minimum, spatially aligned RGB and one or more X modalities) together with a natural-language referring expression, and is tasked with predicting the spatial location (e.g., a 2D/3D bounding box) of the target object.
Key grounding task formulations include:
- RGB + Thermal (TIR) Grounding: As defined in RGBT-Ground, input pairs of RGB and TIR images (spatially aligned), each associated with a free-form referring expression and an object-level bounding box; the model predicts the box position from uni-modal or fused input (Zhao et al., 31 Dec 2025).
- RGB + Depth/Event Grounding: RGBX-R1 formalizes a sequence-based task: given a language query, a pair of template images (RGB and X-modality), and a temporal search set, the model returns box predictions in all sequential frames, leveraging both modalities at each step (Wu et al., 31 Jan 2026).
- 3D RGB-D Visual Grounding: Cross3DVG tasks emulate cross-dataset object grounding in RGB-D reconstructed scenes, emphasizing sensor and annotation shifts (Miyanishi et al., 2023).
Central to all variants is the strict evaluation of multi-modal fusion, domain adaptation, and spatial reasoning across modalities both seen and unseen by the model.
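The shared input/output structure of these task variants can be made concrete with a small data sketch. The schema below is hypothetical (none of the benchmarks publishes an official API; all field names are invented) and simply mirrors the tuple described above: aligned RGB and X-modality imagery, a referring expression, optional search frames for the sequence variant, and a ground-truth box.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates

@dataclass
class GroundingSample:
    """One multi-modal grounding query (hypothetical schema, not an official API)."""
    rgb: np.ndarray                      # H x W x 3 visible-light image
    x_modality: np.ndarray               # spatially aligned TIR / depth / event frame
    expression: str                      # free-form referring expression
    gt_box: Box                          # ground-truth target box
    search_frames: Optional[List[np.ndarray]] = None  # sequence variant (RGBX-R1-style)

# A minimal dummy instance: a 2D RGB+TIR pair with one referring expression.
sample = GroundingSample(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    x_modality=np.zeros((480, 640), dtype=np.float32),
    expression="the pedestrian on the left, under the streetlight",
    gt_box=(120.0, 200.0, 180.0, 340.0),
)
```

A model for any of the variants then maps `(rgb, x_modality, expression[, search_frames])` to one predicted `Box` per frame.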
3. Dataset Construction and Properties
3.1 RGBT-Ground
RGBT-Ground draws from the FLIR, M³FD, and MFAD detection sets to assemble 38,760 instances of spatially paired RGB and TIR images, each with high-quality, model-generated and human-verified referring expressions. Scene and object-level annotations encompass environmental (illumination, weather, occlusion) and geometric properties. The dataset is split into Train (26,604), Validation (2,032), and a stratified Test (10,124) with subsets for strong/weak lighting and small-object stress tests. Notably, the average referring expression length is 14.24 words—substantially higher than legacy RGB-only sets, reflecting the complexity of the task (Zhao et al., 31 Dec 2025).
3.2 RGBX-Grounding (RGBX-R1)
RGBX-R1 samples from LasHeR and RGBT234 (thermal), DepthTrack and RGBD2022 (depth), and VisEvent (event modality) to construct 7,432 multi-image grounding (MIG) sequences—each comprising a pair of template images and six search frames per modality. Rigorous annotation includes ground-truth bounding boxes and chain-of-thought (VM-CoT) reasoning steps, filtered by model box IoU and human review. The aggregate includes ≈59,000 images with train/test splits per modality. This design supports task-specific evaluation under conditions of severe scene or modality shift (Wu et al., 31 Jan 2026).
3.3 Cross3DVG and RIORefer
The Cross3DVG benchmark and RIORefer dataset extend to 3D point clouds: 1,380 RGB-D scans from 3RScan, 31,801 unique objects, and 63,602 free-form referring expressions. The annotation protocol includes object-unique linguistic descriptions, a verification pass (description-to-object grounding), and statistical accuracy validation (90.3% ground-truth accuracy; only 5% requiring re-annotation). Category distributions are long-tailed and sensor-specific, stressing cross-domain transfer (Miyanishi et al., 2023).
4. Evaluation Protocols and Metrics
RGBX-Grounding tasks adopt rigorous quantitative metrics to evaluate grounding performance:
- Intersection-over-Union (IoU): the primary localization criterion across all benchmarks, computed between the predicted and ground-truth boxes.
- [email protected]: given an IoU threshold τ (typically τ = 0.5), [email protected] is the fraction of queries, objects, or frames for which the predicted box satisfies IoU ≥ τ. For sequence tasks, [email protected] is averaged across all frames (Zhao et al., 31 Dec 2025, Wu et al., 31 Jan 2026, Miyanishi et al., 2023).
- Accuracy@K: for 3D grounding, Acc@K = (1/N) Σᵢ 1[rᵢ ≤ K], where N is the total number of queries and rᵢ is the predicted rank of the ground-truth box for query i.
- mIoU (mean IoU): optionally reported as the mean IoU over frames or instances; benchmarks emphasize [email protected] for primary comparison.
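The metrics above reduce to a few lines of code. The sketch below implements standard axis-aligned IoU and a generic [email protected] over a set of query predictions (function names are ours; the benchmarks define the metrics, not this implementation):

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_at(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

For sequence tasks, the same `precision_at` is computed per frame and averaged; mIoU replaces the thresholded hit count with the mean of the raw IoU values.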
Splits are constructed to ensure robustness evaluation (e.g., zero-shot sensor transfer in Cross3DVG, low-light/long-tail subsets in RGBT-Ground, cross-modal sequence masking in RGBX-R1).
5. Baselines, Model Architectures, and Training Paradigms
5.1 Model Classes
- Off-the-shelf MLLMs: e.g., LLaVA-OV-7B, InternVL2-8B, Qwen2.5-VL, Qwen3-Plus; incapable of non-RGB grounding without explicit adaptation (Wu et al., 31 Jan 2026).
- Multi-Image Grounding models: e.g., Migician, UniVG-R1; support temporal and multi-modal input.
- Fine-tuned models: Classical supervised fine-tuning (SFT) and hybrid pipelines integrating chain-of-thought (CoT) or reinforcement learning.
5.2 RGBT-VGNet
Implements CLIP-based ViT backbones with LoRA-style adapters for the RGB and TIR branches, asymmetric adapter ranks across the two modalities, and language-aware visual synergy (LAVS) modules for cross-modal attention. Grounding prediction uses fusion transformers and a regression head over all tokens plus a dedicated [Reg] token. The model is trained with AdamW for 120 epochs at a fixed input resolution (Zhao et al., 31 Dec 2025).
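The adapter idea can be illustrated with a generic LoRA layer: a frozen pretrained weight plus a trainable low-rank update, with a different rank per modality branch. This is a minimal NumPy sketch of standard LoRA, not RGBT-VGNet's implementation; the ranks, initialization, and dimensions below are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B A.
    Generic LoRA sketch; RGBT-VGNet's actual adapter ranks are not specified here."""
    def __init__(self, w: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                    # frozen pretrained weight
        self.a = rng.normal(0.0, 0.01, (rank, d_in))  # trainable down-projection
        self.b = np.zeros((d_out, rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank                     # so output starts equal to the base layer

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

# Asymmetric ranks for the two modality branches (illustrative values):
rgb_proj = LoRALinear(np.eye(768), rank=16)
tir_proj = LoRALinear(np.eye(768), rank=8)
```

Zero-initializing the up-projection means each adapted layer starts out exactly reproducing the frozen backbone, so fine-tuning only gradually injects modality-specific behavior.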
5.3 RGBX-R1
Employs a two-stage paradigm:
- Cold-Start Supervised Fine-Tuning (CS-SFT): Guides the reasoning process via VM-CoT, promoting emergence of modality-specific representations.
- Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT): utilizes the MuST reward, comprising format, modality-understanding, and spatio-temporal components (the latter penalizes inertial numeric guessing of box coordinates by weighting later frames more strongly). A PPO-style update enforces KL-regularization toward a reference CoT policy (Wu et al., 31 Jan 2026).
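The spatio-temporal component's frame weighting can be sketched as follows. This is our illustration of the stated idea—later frames count more, so copying the template box across the sequence scores poorly once the target moves—not the published MuST formula; the linear weight schedule is a hypothetical choice.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def spatio_temporal_reward(pred_boxes, gt_boxes):
    """Frame-weighted IoU reward (illustrative): later frames carry larger weights,
    so 'inertial guessing'—repeating the first box for every frame—is penalized
    relative to tracking the target's actual motion."""
    n = len(gt_boxes)
    weights = [(t + 1) / n for t in range(n)]  # hypothetical linearly increasing schedule
    total = sum(w * iou(p, g) for w, p, g in zip(weights, pred_boxes, gt_boxes))
    return total / sum(weights)
```

Under a uniform schedule, a model that nails early frames and drifts later would score the same as one that drifts early and recovers; the increasing weights break that tie in favor of late-sequence accuracy.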
5.4 Cross3DVG Architectures
- VoteNet+MLP: Pure 3D methods leveraging point cloud detection and GRU-encoded text fusion.
- VoteNet+Transformer/DETR3D+Transformer: Incorporate multi-head self- and cross-attention for localization; DETR3D anchor-free transformer detectors offer performance gains.
- CLIP-Enhanced Multi-View Fusion: Uses multi-view RGB representations via CLIP image and text encoders, matching transformers, and weighted 2D proposal fusion with camera-to-proposal selection (Miyanishi et al., 2023).
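The text-guided weighting at the heart of the multi-view fusion can be sketched as below. Function and argument names are invented for illustration; the sketch only captures the stated mechanism—score each view's CLIP image embedding against the query text embedding, normalize the scores, and fuse the per-view 2D proposals with those weights.

```python
import numpy as np

def fuse_view_proposals(text_emb, view_embs, view_boxes):
    """Text-guided view weighting (illustrative sketch of CLIP-based multi-view
    fusion): cosine-score each view against the text query, softmax the scores,
    and return the weighted average of the per-view (x1, y1, x2, y2) proposals."""
    def l2norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = l2norm(np.asarray(view_embs, dtype=float)) @ l2norm(np.asarray(text_emb, dtype=float))
    w = np.exp(sims - sims.max())   # numerically stable softmax
    w = w / w.sum()
    return (w[:, None] * np.asarray(view_boxes, dtype=float)).sum(axis=0)
```

Views whose image content matches the referring expression dominate the fused proposal, which is the role the camera-to-proposal selection plays in the full pipeline.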
6. Experimental Results and Analysis
6.1 Cross-Modal and Cross-Domain Robustness
Across all RGBX-Grounding benchmarks, zero-shot and cross-domain scenarios reveal substantial drops in performance compared to within-domain settings:
- Cross3DVG: [email protected] decreases by 8–12% absolute (e.g., from 36.6% to 24.8% for the best baseline transferring ScanRefer→RIORefer), with oracle upper bounds (oracle localization or detection) indicating headroom up to 78.5% [email protected] (Miyanishi et al., 2023).
- RGBX-R1: Attains 46.53% [email protected] averaged across the three modalities, a 22.71% improvement over baselines; relative gains per modality are 27.6% (thermal), 116.2% (depth), and 56.3% (event) over supervised fine-tuned baselines. Simple supervised fine-tuning on X-modality sequences yields less than 20% [email protected] (Wu et al., 31 Jan 2026).
- RGBT-Ground: RGBT-VGNet achieves 67.19% [email protected] on low-light (TestB) and 52.22% on small-object (TestC) splits, outperforming RGB-only or TIR-only baselines by 5–10% absolute. The average referring expression is notably longer than prior benchmarks, contributing to grounding complexity. Models trained from scratch on COCO-derived data fail to generalize ([email protected] <30%) (Zhao et al., 31 Dec 2025).
6.2 Ablation and Design Insights
Key findings include:
- Multi-view and multi-modal fusion: increasing the number of views (or modalities) contributes up to a 1–2% gain in [email protected]; text-guided selection of CLIP image features is essential.
- Normalized geometric input: Robust performance is observed when per-point normals are provided, while raw RGB can hurt cross-sensor transfer.
- Chain-of-thought supervision: Guidance chains (VM-CoT) are critical for spatial and modality alignment; reinforcement learning without CoT degrades performance.
7. Current Limitations and Directions for Expansion
RGBX benchmarks expose open challenges:
- Long-tail object categories: Underrepresented, sensor-specific classes produce grounding failures, demonstrating the need for data expansion and explicit long-tail modelling (Miyanishi et al., 2023).
- Spatio-temporal consistency: MLLMs tend to “drift” predicted box sequences in sequence tasks; dedicated reward terms such as MuST’s offset weighting are necessary to prevent inertial guessing (Wu et al., 31 Jan 2026).
- Modality-agnostic learning: Pre-trained MLLMs lack innate capacity to interpret or fuse X-modality streams; as little as 1% of target modality data suffices for transfer in the presence of proper VM-CoT supervision (Wu et al., 31 Jan 2026).
- Broader modality inclusion: Existing benchmarks center on RGB, TIR, depth, and event; future domains include radar, LiDAR, and region-masked or segmentation variants (Zhao et al., 31 Dec 2025).
- Unified cross-modal pre-training: Effective pre-training over large-scale heterogeneous (web + 3D scene) corpora remains an open research area (Miyanishi et al., 2023).
Table: Comparative Properties of Major RGBX Benchmarks
| Benchmark | Modalities | Data Size | Task Type |
|---|---|---|---|
| Cross3DVG | RGB-D (3D scans) | 63.6K expr. | 3D object grounding |
| RGBX-Grounding | RGB+TIR/Depth/Event | 7,432 MIGs | Sequence image grounding |
| RGBT-Ground | RGB+TIR (image pairs) | 38,760 | Uni-/multi-modal 2D VG |
The systematic integration of diverse sensing modalities into grounding benchmarks, as institutionalized by RGBX-Grounding, has established a new standard for evaluating multi-modal vision-language models under real-world, robustness-critical regimes, and demonstrates the necessity of modality-aware architecture and learning design (Miyanishi et al., 2023, Zhao et al., 31 Dec 2025, Wu et al., 31 Jan 2026).