RSHR-Bench: Ultra-High-Res RS Evaluation

Updated 26 December 2025
  • RSHR-Bench is a benchmark that evaluates vision-language models on ultra-high-resolution remote sensing imagery, emphasizing realistic operational-scale images and complex visual reasoning.
  • It assembles over 5,300 full-scene images from diverse RS sources, retaining native resolutions up to 29,200×27,620 pixels to ensure authentic evaluation conditions.
  • The benchmark supports multiple task families—including multiple-choice and open-ended VQA, image captioning, and single-image evaluation—and employs rigorous adversarial filtering and human verification to mitigate text-only biases.

RSHR-Bench is a benchmark for evaluating vision-language models (VLMs) and multimodal LLMs (MLLMs) on ultra-high-resolution remote sensing (RS) imagery. Developed to address deficiencies in prior RS benchmarks, which typically rely on low-resolution datasets or contain inadequately designed reasoning tasks, RSHR-Bench targets the operational scale encountered in real-world satellite and UAV remote-sensing workflows. The resource comprises thousands of full-scene images at their native, extreme resolutions and is structured to enable comprehensive assessment of visual understanding, visual reasoning, and scene-level interpretation in challenging RS scenarios (Dang et al., 19 Dec 2025).

1. Motivation and Benchmark Design

RSHR-Bench was introduced to counter two main limitations observed in prior RS benchmarks: dependence on downsampled (low-resolution) scenes and reasoning tasks solvable by text-only LLMs. The benchmark is designed to evaluate genuine visual understanding on operational-scale RS data, where image sizes routinely exceed 4,000 pixels on the long side with up to approximately $3 \times 10^8$ native pixels per image. This supports systematic study of VLM and MLLM capabilities in scenarios that demand extensive spatial reasoning, fine-grained object perception, and robust multi-turn dialog with large-scale scenes.

2. Corpus Assembly and Properties

The RSHR-Bench image corpus comprises 5,329 full-scene images drawn from six widely used RS sources: DOTA v1.0/v2.0, XLRS-Bench, MiniFrance, FAIR1M, HRSCD, and an in-house UAV collection (with per-frame resolution near 100 MP). All images retain their original resolution, with an average dimension of approximately 8,700 × 8,065 pixels and a maximum up to 29,200 × 27,620 pixels. This ensures coverage across various domains and sensing platforms. Image selection was governed by the criteria $\max(W, H) \ge 4000$ and $W \times H \le 3 \times 10^{8}$, directly reflecting operational usage.
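
Expressed as code, the size criteria amount to a simple per-image filter. The sketch below is illustrative only, assuming a Pillow-based header read; it is not part of the released corpus-construction tooling.

```python
from PIL import Image

# Disable Pillow's decompression-bomb guard, since RS scenes legitimately exceed it.
Image.MAX_IMAGE_PIXELS = None

# Selection thresholds stated in the benchmark description.
MIN_LONG_SIDE = 4000        # max(W, H) >= 4,000 px
MAX_PIXELS = 3 * 10**8      # W * H <= 3e8 native pixels

def meets_rshr_size_criteria(path: str) -> bool:
    """Return True if an image satisfies the stated RSHR-Bench size constraints."""
    with Image.open(path) as img:   # reads the header only; pixels are decoded lazily
        w, h = img.size
    return max(w, h) >= MIN_LONG_SIDE and w * h <= MAX_PIXELS
```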

3. Task Families and Annotation Protocol

RSHR-Bench defines four distinct task families designed to probe a wide range of perception and reasoning competencies:

  1. Multiple-choice VQA: 3,864 closed-set questions targeting fine-grained perception and reasoning.
  2. Open-ended VQA: Multiple-choice items reformulated as free-form prompts, yielding 1,932 vetted question–answer pairs.
  3. Image Captioning: 3,913 images annotated with both holistic scene summaries and directional (top, bottom, left, right) region descriptions.
  4. Single-image Evaluation: 50 images (4K–200 MP) each annotated with ten subtasks (totaling 500 human-written question–answer pairs) spanning perception, reasoning, and captioning.

Within VQA, perception tasks are categorized as Color Detection, Shape/Margin Recognition, Orientation Detection, Object Classification, Object Spatial Relationship, Object Grounding, Regional Grounding, Object Counting, and Regional Counting. Reasoning tasks are partitioned into Anomaly Detection/Interpretation, Future Prediction, Multi-Region Joint Contrast (covering both multi-image and single-image multi-box settings), and Object State Judgment. The design accommodates multi-turn dialogue (e.g., sequential anomaly and future-prediction queries) and multi-image fusion, reflecting realistic RS evaluation workflows.
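
To make the annotation structure concrete, a single benchmark item can be pictured roughly as follows; the field names and values are illustrative assumptions rather than the released annotation schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSHRItem:
    """Illustrative item layout; field names are assumptions, not the released schema."""
    image_paths: list[str]               # one scene, or several for multi-image fusion tasks
    task_family: str                     # "mc_vqa", "open_vqa", "captioning", or "single_image"
    category: str                        # e.g. "Object Counting" or "Anomaly Detection/Interpretation"
    question: str
    choices: Optional[list[str]] = None  # populated only for multiple-choice VQA
    answer: str = ""
    dialogue_turn: int = 0               # > 0 for multi-turn anomaly / future-prediction queries
```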

4. Adversarial Filtering and Human Verification

To mitigate the risk of language-prior exploitation—where a model answers based on textual patterns rather than genuine visual input—a rigorous two-stage adversarial filtering and validation process was employed:

  • Stage 1 (Adversarial Filtering): Strong text-only LLMs (Qwen3-8B, Llama3-8B) answered each VQA item without images. Items for which models performed too well were revised or discarded.
  • Stage 2 (Human Verification): Six trained annotators, spending approximately 300 hours, independently produced and audited question–answer pairs. The criteria included correctness, precise visual grounding, avoidance of textual hints, and unambiguous phrasing. Generation incorporated Qwen2.5-VL-7B and GPT-5 Thinking, with provenance tracking and ~100 GPU hours of model calls.

Iterative rewriting reduced text-only answerability to below 30% accuracy while retaining the visual complexity necessary for meaningful RS evaluation.
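
A minimal sketch of the Stage-1 logic is shown below, assuming a generic text-only answering callable (`ask_text_only` is a placeholder, not the authors' released code): items a text-only model answers correctly are flagged for revision, and the pool-level accuracy is what iterative rewriting drives below 30%.

```python
def filter_text_only_answerable(items, ask_text_only):
    """Stage-1 adversarial filtering sketch: flag items answerable without the image.

    `items` are dicts with "question", "choices", and "answer" keys; `ask_text_only`
    stands in for a text-only LLM call (e.g. Qwen3-8B or Llama3-8B).
    """
    kept, flagged = [], []
    for item in items:
        # Query the model with the question and options only, never the image.
        prediction = ask_text_only(item["question"], item["choices"])
        if prediction == item["answer"]:
            flagged.append(item)   # answerable from text priors: revise or discard
        else:
            kept.append(item)
    text_only_accuracy = len(flagged) / max(len(items), 1)
    return kept, flagged, text_only_accuracy
```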

5. Model Evaluation and Performance Analysis

Fourteen representative VLMs and MLLMs were evaluated on RSHR-Bench, spanning remote-sensing-specific VLMs (EarthDial, GeoChat, GeoLLaVA-8K, VHM), open-source general-purpose VLMs (InternVL, MiniCPM2, Phi-Vision, Qwen2.5-VL, DeepSeek-VL, VILA-HD), and closed-source MLLMs (GPT-5, GPT-4o, GPT-4o mini, Gemini-2.5-pro), alongside two text-only LLM baselines (Llama3-8B, Qwen3-8B):

Task                                    Open-source VLMs    Closed-source VLMs                Text-only LLMs
Multiple-choice VQA (accuracy)          ~25%                up to ~50%                        --
Open-ended VQA (perception/reasoning)   <50% / 40–60%       <50% / 40–60%                     30–40% (reasoning)
Captioning (BLEU-4)                     ≤5                  ≤5                                --
Captioning (METEOR / ROUGE-L)           ≤35                 ≤35                               --
Single-image evaluation (accuracy)      ≈30% (4K–8K px)     degrades further at 100–200 MP    --

Observed performance gaps are substantial across perception and reasoning tasks, with open-source models typically clustering at chance-level accuracy and closed-source systems peaking near 50%. Captioning scores confirm limited scene understanding. Notably, text-only LLMs achieved 30–40% on reasoning tasks in the absence of images, underscoring the effectiveness of the adversarial filtering pipeline and revealing a persistent challenge in visual grounding.
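
For orientation, captioning scores of the kind reported above (on a 0–100 scale) can be approximated with standard open-source metric implementations; the snippet below is an illustrative recipe, not the benchmark's official evaluation script, and omits METEOR, which additionally requires NLTK's WordNet data.

```python
# Illustrative captioning metrics (not the official RSHR-Bench evaluation script).
# Requires: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def caption_scores(reference: str, candidate: str) -> dict:
    """Compute BLEU-4 and ROUGE-L F1 for one caption pair, scaled to 0-100."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    bleu4 = sentence_bleu(
        [ref_tokens], cand_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
    return {"BLEU-4": 100.0 * bleu4, "ROUGE-L": 100.0 * rouge_l}
```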

6. Significance and Availability

RSHR-Bench establishes a rigorous, high-resolution testbed for remote-sensing visual understanding, supporting a spectrum of tasks from perception to complex reasoning, and both multi-turn and multi-image dialog contexts. The benchmark’s emphasis on operational-scale images and adversarially filtered, human-validated queries addresses critical mismatches identified in prior benchmarks. All code, datasets, prompts, and evaluation scripts are publicly disseminated at https://github.com/Yunkaidang/RSHR, thereby providing an extensible platform for future developments in RS VLM and MLLM research (Dang et al., 19 Dec 2025).

7. Implications for Future Research

The persistent model shortfalls observed on RSHR-Bench, particularly under ultra-high-resolution and complex reasoning conditions, highlight the unsolved challenge of visually grounded intelligence at industrial RS scales. A plausible implication is the necessity for new architectures or representation strategies capable of handling extreme image resolution, robust visual grounding, and reasoning under severe visual diversity. The inclusion of multi-turn and multi-image dialogue tasks indicates future directions in dialog-centric and situational inference paradigms. The strong baseline established by RSHR-Bench is positioned to catalyze targeted advances in multimodal RS intelligence, improved fusion mechanisms, and adversarially robust evaluation methodologies.
