MSVQA: Multimodal Visual Understanding Dataset
- MSVQA is a large-scale multimodal visual question answering and grounding dataset designed to evaluate continual learning in rapidly changing real-world scenarios.
- It integrates four distinct settings—High Altitude, Underwater, Low Altitude, and Indoor—to stress-test model robustness against varying object scales, occlusions, and semantic complexities.
- The dataset employs rigorous annotation protocols and introduces the UNIFIER architecture to mitigate catastrophic forgetting in multimodal large language models.
The Multimodal Visual Understanding Dataset (MSVQA) is a large-scale multimodal visual question answering (VQA) and grounding benchmark constructed to rigorously evaluate Multimodal LLMs (MLLMs) under the challenging setting of continual learning with real-world scenario shifts. MSVQA is explicitly designed to expose catastrophic forgetting, a prominent obstacle in the sequential adaptation of MLLMs to evolving environments. The benchmark integrates four distinct visual scenarios—High Altitude, Underwater, Low Altitude, and Indoor—providing a range of object densities, perspectives, occlusions, and semantic requirements. Each scenario is systematically re-annotated for fine-grained VQA and object grounding tasks, ensuring clear differentiation and enabling controlled study of scenario-adaptive model behavior (Jiang et al., 23 Nov 2025).
1. Dataset Composition and Scenario Selection
MSVQA incorporates four semantically and visually differentiated scenarios, each selected for unique characteristics that stress varying capacities of visual recognition and reasoning in MLLMs:
- High Altitude: Derived from FAIR1M's remote-sensing airport imagery, this scenario features crowded scenes densely populated with small aircraft (≤30 pixels), often exceeding 30 objects per image. The orthogonal perspective and minimal contextual cues emphasize fine-grained classification and precise localization under high density and small scale.
- Underwater: Sourced from the RUOD23 dataset, this scenario presents images degraded by color attenuation, scatter, and heavy camouflage. Object counts can exceed 80 per image. Tasks involve fine-grained species classification, quantification within overlapping schools, and grounding of small marine species.
- Low Altitude: Based on the Drones Detection & Tracking Challenge (DTM), this scenario includes UAV imagery with oblique angles, frequent occlusions, and densely arranged objects (vehicles and pedestrians), again up to 80+ per frame. This setting evaluates robust object localization under clutter, occlusion, and varying perspectives.
- Indoor: Extracted from EPIC-KITCHENS first-person video key-frames. Indoor images are characterized by dynamic viewpoints, frequent occlusions (e.g., hands, utensils), and lens aberrations. Tasks are tailored to action speculation (verb–noun pair recognition among multiple-choice distractors) and standard object grounding, in scenes presenting high semantic complexity.
Each scenario is constructed by re-annotating images from the originating corpus using standardized templates for VQA and object grounding. Typical scenario sizes are: High Altitude (~12,500 images), Underwater (~15,000), Low Altitude (~18,000), and Indoor (~10,000), totaling approximately 55,000 images. Every image is paired with five QA templates, yielding on the order of 250,000 distinct question–answer pairs (Jiang et al., 23 Nov 2025).
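The reported per-scenario sizes can be tallied directly; the small gap between the raw template count and the quoted "on the order of 250,000" figure is consistent with some pairs being removed during quality control (an inference, not a stated fact):

```python
# Per-scenario image counts as reported in the text (approximate).
scenario_images = {
    "High Altitude": 12_500,
    "Underwater": 15_000,
    "Low Altitude": 18_000,
    "Indoor": 10_000,
}
QA_TEMPLATES_PER_IMAGE = 5  # each image is paired with five QA templates

total_images = sum(scenario_images.values())
total_qa_pairs = total_images * QA_TEMPLATES_PER_IMAGE

print(total_images)    # 55500  (~55,000 images)
print(total_qa_pairs)  # 277500 (on the order of 250,000 after filtering)
```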
2. Annotation Protocols and Quality Control
Data annotation follows a rigorous multi-stage protocol:
- Preprocessing: Source images are filtered by semantic content (e.g., airport-only scenes), resized and cropped to ≤1500×1500 px using overlapping sliding windows with 200 px stride, ensuring small targets are fully captured in at least one patch.
- Human Verification: Annotators verify and correct bounding boxes for targets in each crop. VQA question–answer pairs are instantiated from scenario-specific templates, with outputs stored in structured JSON format.
- Quality Assurance: Cross-validation by a secondary annotator is performed on 15% of samples, and automated consistency checks (e.g., ground-truth count vs. bounding boxes for grounding tasks) filter out annotation mismatches. Only samples passing both human- and programmatic validation are included in the released data (Jiang et al., 23 Nov 2025).
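The sliding-window cropping in the preprocessing step can be sketched as follows. The window size and stride come from the text (≤1500×1500 px, 200 px stride); the edge-handling logic is an assumption of this sketch:

```python
def sliding_windows(width, height, window=1500, stride=200):
    """Enumerate overlapping crop boxes (x0, y0, x1, y1) for one image.

    Window size and stride follow the MSVQA preprocessing description;
    snapping the final window to the far edge is an assumption here.
    Any target narrower than window - stride is guaranteed to fall
    fully inside at least one crop.
    """
    def starts(extent):
        if extent <= window:
            return [0]
        ss = list(range(0, extent - window + 1, stride))
        if ss[-1] != extent - window:  # cover the far edge exactly
            ss.append(extent - window)
        return ss

    return [(x, y, min(x + window, width), min(y + window, height))
            for y in starts(height) for x in starts(width)]

boxes = sliding_windows(2000, 1600)
```

With a 200 px stride and a 1500 px window, consecutive crops overlap by 1300 px, so any target up to 1300 px wide (far larger than the ≤30 px aircraft) appears intact in at least one patch.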
3. Tasks, Question Types, and Evaluation Metrics
MSVQA encompasses five principal question types for the outdoor scenarios, plus one Indoor-specific type, designed to probe both low-level perception and high-level reasoning:
- Counting: Quantitative queries over object populations (1–80+ objects).
- Classification: Multi-label selection among candidate classes given in the question.
- True/False: Binary questions regarding presence/absence of specific subtypes.
- Visual Grounding: Outputting lists of bounding boxes for all targets.
- Fine-Grained Visual Grounding: Localizing a specified subtype of target via a single bounding box.
- Action Speculation (Indoor only): Four-way multiple-choice questions on verb–noun activity pairs with systematically constructed distractors.
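Section 2 states that QA pairs are instantiated from scenario-specific templates and stored as structured JSON. A minimal sketch of such instantiation, with entirely hypothetical field names (the paper does not publish its schema):

```python
import json

def make_counting_qa(image_id, category, boxes):
    """Instantiate a counting question from grounded box annotations.

    Field names and question wording are illustrative assumptions;
    the source only states that QA pairs are template-instantiated
    and stored as structured JSON.
    """
    return {
        "image_id": image_id,
        "task": "counting",
        "question": f"How many {category}s are visible in the image?",
        "answer": len(boxes),
        "boxes": boxes,  # one [x0, y0, x1, y1] box per instance
    }

record = make_counting_qa("ha_000123", "aircraft",
                          [[10, 12, 38, 40], [60, 15, 88, 44]])
serialized = json.dumps(record)
```

The automated consistency check from the QA stage (ground-truth count vs. number of boxes) holds by construction here, since the answer is derived from the box list.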
Evaluation Metrics:
| Task Type | Metric Description |
|---|---|
| Counting | Graded exact match: 1.0 for an exact count, 0.5 for a near miss, 0.0 otherwise |
| Classification (multi-label) | Normalized score over selected labels with a false-positive penalty |
| True/False, Action Speculation | Exact match: 1 if correct, else 0 |
| Visual Grounding | Predicted boxes matched to ground-truth boxes at IoU ≥ 0.5 |
| Fine-Grained Grounding | As above, averaged with standard grounding in the three outdoor scenarios |
Indoor differs, presenting only Action Speculation and grounding tasks (Jiang et al., 23 Nov 2025).
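The counting and grounding metrics above can be sketched as follows. The near-miss tolerance, the greedy matching strategy, and F1 aggregation are assumptions of this sketch; the source specifies only the score levels and the IoU ≥ 0.5 matching threshold:

```python
def counting_score(pred, gt, tol=1):
    """Counting metric: 1.0 on exact match, 0.5 on a near miss
    (assumed here to mean within `tol` of the truth), else 0.0."""
    if pred == gt:
        return 1.0
    if abs(pred - gt) <= tol:
        return 0.5
    return 0.0

def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_f1(preds, gts, thr=0.5):
    """Greedy one-to-one matching at IoU >= thr, aggregated as F1.
    Greedy matching and F1 are illustrative choices, not the paper's
    confirmed formulation."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```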
4. Continual-Learning Protocols and Scenario Shifts
Data are partitioned by image to ensure split integrity. For continual learning, the images are divided into sequential "tasks" or "steps" (T = 5, 10, 20), and at each step t, only images from the respective scenario are provided. After each step, the model is evaluated across all scenario test sets, supporting quantification of forward transfer (to new scenarios) and backward transfer (retention on prior scenarios). One-step cross-scenario evaluation is also performed (training on A, immediate testing on A and B, iterated over all 4×4 scenario pairs) (Jiang et al., 23 Nov 2025).
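Given the per-step evaluation matrix this protocol produces, backward transfer can be computed with the standard continual-learning definition (the paper's exact formula is not reproduced in the text, so this is the conventional form):

```python
def backward_transfer(acc):
    """Backward transfer from a T x T accuracy matrix `acc`, where
    acc[t][s] is accuracy on scenario s's test set after training
    step t. Conventional definition (assumed, not quoted from the
    paper): mean over earlier scenarios of final accuracy minus the
    accuracy measured right after that scenario was learned.
    Negative values indicate catastrophic forgetting."""
    T = len(acc)
    return sum(acc[T - 1][s] - acc[s][s] for s in range(T - 1)) / (T - 1)

# Toy 3-step run in which early-scenario accuracy erodes over time.
acc = [
    [0.80, 0.10, 0.05],
    [0.55, 0.75, 0.10],
    [0.40, 0.60, 0.70],
]
bwt = backward_transfer(acc)  # (0.40-0.80 + 0.60-0.75) / 2 = -0.275
```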
MSVQA's scenario structure introduces distributional shifts along several axes:
- Object scale and density: High Altitude vs. Low Altitude shift object size and instance density dramatically.
- Appearance and illumination: Underwater scenes introduce unique variances in color and texture.
- Occlusion and perspective: Indoor first-person data presents frequent occlusions, dynamic viewpoints, and lens artifacts.
- Semantic complexity: Tasks range from rote counting to nuanced verb–noun pair determination.
Such scenario dynamics are a primary cause of catastrophic forgetting in models using naive fine-tuning, as self-attention and feed-forward representations drift after domain adaptation, leading to loss of previously acquired scenario competence (Jiang et al., 23 Nov 2025).
5. Architectural Motivations: Unifier and Scenario Decoupling
To counteract forgetting induced by scenario shifts, MSVQA is accompanied by the UNIFIER architecture, which modifies the standard Transformer vision backbone. Each vision block is endowed with an S-branch Cross-Scenario Representation (CSR) module, where S is the number of scenarios.
Formally, each modified block routes its input through the S scenario branches, with only the branch for the current scenario receiving updates; all other branches remain frozen. A global prototype is computed from the branch outputs, and two consistency losses constrain the branches: one softly aligns each scenario branch to the global prototype, and the other penalizes deviations between updated and prior representations. Together these constraints maintain scenario-specific plasticity while preserving previously acquired knowledge (Jiang et al., 23 Nov 2025).
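A toy numeric sketch of the CSR idea follows. The branch form (an elementwise residual scaling), the mean prototype, and the squared-error losses are all illustrative assumptions, not the paper's exact formulation:

```python
# S parallel per-block branches; only the active scenario's branch
# "trains", and two consistency terms tie branches together.
S, D = 4, 8                               # scenarios, feature dim
branches = [[0.0] * D for _ in range(S)]  # one weight vector per scenario

def branch_out(w, x):
    # Stand-in "adapter": elementwise residual scaling of the input.
    return [xi + wi * xi for xi, wi in zip(x, w)]

def prototype(x):
    # Global prototype: mean over all scenario-branch outputs.
    outs = [branch_out(w, x) for w in branches]
    return [sum(col) / S for col in zip(*outs)]

def consistency_losses(s, x, prev_branch):
    """Alignment loss (branch vs. prototype) and drift loss
    (updated branch vs. its pre-update snapshot)."""
    proto = prototype(x)
    cur = branch_out(branches[s], x)
    prev = branch_out(prev_branch, x)
    align = sum((c - p) ** 2 for c, p in zip(cur, proto)) / D
    drift = sum((c - q) ** 2 for c, q in zip(cur, prev)) / D
    return align, drift

x = [1.0] * D
snapshot = list(branches[2])   # freeze a copy before "training" branch 2
branches[2] = [0.1] * D        # only the active scenario's branch moves
align, drift = consistency_losses(2, x, snapshot)
```

Both losses are zero when the active branch neither deviates from the shared prototype nor drifts from its previous state, which is the mechanism that trades plasticity against retention.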
6. Significance and Future Applications
MSVQA is the first large-scale benchmark purposely structured to emulate real-world dynamics of continual multimodal perception and reasoning. Its analytic protocols quantify both knowledge acquisition and retention, and dataset variation enables probing of generalization under real deployment conditions encountered by MLLMs. The pronounced modality, density, and semantic shifts provide a demanding test bed for new continual learning algorithms and domain-adaptive architectures. The dataset's design and UNIFIER's architectural response provide a framework for future research in robust lifelong learning, cross-modal reasoning, and the mitigation of catastrophic forgetting in multimodal agents (Jiang et al., 23 Nov 2025).