Memory Quality Control (MQC-SAM)

Updated 24 January 2026

The paper introduces MQC-SAM, a two-stage memory-quality-aware framework that integrates motion-consistency calibration and decoupled attention memory integration to boost RS-RVOS performance.
It leverages CroBIM-V with SAM2 to generate an initial mask calibrated by motion trajectories, refining segmentation for small, low-contrast targets using first-frame-only prompts.
Experimental results on RS-RVOS Bench demonstrate substantial gains over state-of-the-art methods by mitigating error accumulation and ensuring high-quality memory updates.

Memory Quality Control with Segment Anything Model (MQC-SAM) is a specialized online framework for remote sensing referring video object segmentation (RS-RVOS). Developed to address the unique challenges of segmenting small, low-saliency objects in satellite videos using only causal (first-frame) linguistic prompts, MQC-SAM combines temporal motion consistency calibration with memory-quality-aware sequential mask inference. The method is tightly coupled with the CroBIM-V system and uses the SAM2 segmentation model to deliver state-of-the-art performance on the first large-scale remote-sensing RVOS benchmark, RS-RVOS Bench (Jiang et al., 17 Jan 2026).

1. RS-RVOS Bench: Dataset and Evaluation Protocol

MQC-SAM was proposed jointly with RS-RVOS Bench, the domain’s first large-scale, causality-aware benchmark for RS-RVOS. RS-RVOS Bench comprises 111 satellite video sequences with approximately 25,000 frames and 213,000 temporal referring annotations. Its protocol restricts every linguistic prompt to the first frame, which excludes access to future or global context and enforces a strict online-inference regime. Targets span 11 fine-grained semantic classes (e.g., aircraft, yacht, vehicle), with dense annotation (≈8.5 expressions per frame) and domain-specific spatial referencing (e.g., “center-right vehicle”). Challenging conditions such as occlusion and distractors are prevalent.

Online RVOS is defined as producing a binary segmentation mask $R_t$ for each frame given only a prompt $Q$ generated from the initial frame $I_0$ and inputs up to time $t$. Evaluation metrics include:

Region similarity $\mathcal{J}$: Frame-wise IoU between prediction and ground truth, averaged over all frames with the target present.
Contour accuracy $\mathcal{F}$: Boundary-based F-measure.
Combined score: $\tfrac{1}{2}(\mathcal{J}+\mathcal{F})$.
Recall at a fixed threshold, reported for both region and contour metrics (J-Rec and F-Rec in the results table).
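
The region metric and the combined score can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the benchmark's evaluation code: the function names are hypothetical, the recall threshold of 0.5 is an assumed default borrowed from common video segmentation practice, and the boundary F-measure is omitted because it needs a contour-matching step the text does not specify.

```python
import numpy as np

def region_similarity(pred, gt):
    """IoU between two boolean masks (defined here as 1.0 when both are empty)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def sequence_scores(preds, gts, recall_threshold=0.5):
    """Mean J and J-recall over frames whose ground truth contains the target.

    recall_threshold=0.5 is an assumption, not a value stated in the source.
    Expects at least one frame with the target present.
    """
    ious = np.array(
        [region_similarity(p, g) for p, g in zip(preds, gts) if g.any()],
        dtype=float,
    )
    j_mean = float(ious.mean())
    j_recall = float((ious > recall_threshold).mean())
    return j_mean, j_recall

def combined_score(j, f):
    """Combined score: the arithmetic mean of J and F."""
    return 0.5 * (j + f)
```

Frames in which the target is absent are skipped, mirroring the "target present" condition in the definition of $\mathcal{J}$.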

Comparison with state-of-the-art baselines (ReferFormer, SOC, ReferDINO, SAMWISE) revealed substantial room for improvement, especially under these uniquely causal and visually challenging conditions (Jiang et al., 17 Jan 2026).

2. Problem Motivation: Error Accumulation and Memory Contamination

RS-RVOS in the remote sensing context is marked by two primary obstacles:

Biased Memory Initialization: Early misalignment between mask and actual object due to weak saliency, low contrast, or environmental noise results in defective initial memory, impairing downstream frame-by-frame instance localization.
Indiscriminate Memory Accumulation: Integrating visual features from all past frames without quality control, particularly when occlusion, distractors, or segmentation failures occur, encodes noise and propagates errors—compounding deviation and deteriorating segmentation fidelity.

MQC-SAM directly addresses both phenomena via motion-validated (not just semantic) calibration and multi-factorial dynamic attention control.

3. MQC-SAM: Two-Stage Memory-Quality-Aware Framework

MQC-SAM consists of two hierarchical stages: Initial Memory Calibration and Sequential Quality-Controlled Integration.

3.1 Stage I: Initialization and Motion-Consistency Calibration (TMCC)

Vision–Language Fusion: CroBIM with SAM2 produces the initial mask $R_0$ using the first-frame-only prompt and image.
Motion-Consistency Calibration (TMCC): $R_0$ is refined with short-term, multi-frame motion trajectory priors to correct structural deviations and firmly anchor the memory to the true target.

Key steps and notation:

Compute the optimal window length $n^*$ for motion statistics by minimizing the overlap deviation beyond a threshold $\tau_{\mathrm{motion}}$.
Calculate the pixelwise averaged motion-vector magnitude map $D(x,y)$.
Identify the confidently moving core $R_{\mathrm{moving}}$ and extract the mean $\mu_d$ and standard deviation $\sigma_d$ of motion within $R_{\mathrm{moving}}$.
Expand $R_{\mathrm{moving}}$ to $R_{\mathrm{expand}}$ by including pixels with $|D-\mu_d|<\alpha\sigma_d$, then refine to $R_{\mathrm{refine}}$ by removing outliers with $|D-\mu_d|>\beta\sigma_d$.
Dual verification: for each connected masklet $C_k$, compute the following score and select the component that maximizes it:

$S(C_k, R_{\mathrm{anchor}}) = \frac{|C_k \cap R_{\mathrm{anchor}}|}{\min(|C_k|, |R_{\mathrm{anchor}}|)} + \lambda \exp\left(-\frac{d_\partial(C_k, R_{\mathrm{anchor}})}{\sigma_s}\right).$

The highest-scoring component $C_{k^*}$ is taken as $R_{\mathrm{motion}}$; the final mask is $R_{\mathrm{calibrated}} = R_{\mathrm{anchor}} \cup R_{\mathrm{motion}}$, smoothed with morphological operators.
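
The expansion, refinement, and dual-verification steps can be illustrated with a small NumPy sketch. Several details are assumptions made for illustration only: the function names and default values of `alpha`, `beta`, `lam`, and `sigma_s` are hypothetical, expansion is applied over the whole image rather than to spatial neighbours, $d_\partial$ is taken to be the minimum Euclidean distance between boundary pixels, and $R_{\mathrm{anchor}}$ is simply passed in as a mask because the text does not define how it is obtained. Connected-component extraction and morphological smoothing are left out.

```python
import numpy as np

def expand_and_refine(D, moving, alpha=2.0, beta=3.0):
    """Grow the moving core by motion similarity, then drop motion outliers."""
    mu, sigma = D[moving].mean(), D[moving].std()
    dev = np.abs(D - mu)
    expanded = moving | (dev < alpha * sigma)   # R_expand
    return expanded & ~(dev > beta * sigma)     # R_refine

def boundary(mask):
    """Pixels of `mask` with at least one 4-neighbour outside the mask."""
    p = np.pad(mask, 1, constant_values=False)
    interior = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask & ~interior

def dual_score(C, anchor, lam=0.5, sigma_s=5.0):
    """Overlap ratio plus a boundary-proximity term (both masks must be non-empty)."""
    overlap = (C & anchor).sum() / min(C.sum(), anchor.sum())
    bc, ba = np.argwhere(boundary(C)), np.argwhere(boundary(anchor))
    # Assumed d_boundary: smallest pairwise distance between boundary pixels.
    d = np.sqrt(((bc[:, None, :] - ba[None, :, :]) ** 2).sum(-1)).min()
    return float(overlap + lam * np.exp(-d / sigma_s))

def calibrate(anchor, components, lam=0.5, sigma_s=5.0):
    """Union of the anchor with the best-scoring motion component."""
    best = max(components, key=lambda C: dual_score(C, anchor, lam, sigma_s))
    return anchor | best
```

Normalizing the overlap by the smaller of the two areas keeps a small motion component from being penalized merely for being small, while the exponential term rewards components that lie close to the anchor even when they do not overlap it.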

3.2 Stage II: Decoupled Attention Memory Integration (DAMI)

At each timestep $t$:

Construct a memory bank $\mathcal{M}_t$ with high-quality mask features.
Fuse three orthogonal attention mechanisms:

$F_{\mathrm{fused}} = \omega_1\,\mathrm{Attn}_{\mathrm{sem}}(q_t,A_{\mathrm{sem}}) + \omega_2\,\mathrm{Attn}_{\mathrm{short}}(q_t,\mathcal{F}_{\mathrm{mem}}) + \omega_3\,\mathrm{Attn}_{\mathrm{disc}}(q_t,\mathcal{F}_{\mathrm{disc}}).$

- Semantic anchor attention: static vision-language fused embedding from $(I_0, \tilde R_{\mathrm{calibrated}})$.
- Short-term evolution attention: FIFO buffer of per-frame mask representations for the $L$ most recent frames, updated only if the frame-level innovation score $G(I_t)$ (quantifying deviation from the historical mean) passes the threshold $\tau_{\mathrm{gain}}$.

$G(I_t)=\frac{1}{|\Omega|}\sum_p\|f_t(p)-\hat f_t(p)\|_2$

- Discriminative prototype attention: Prototype pool updated only for high-confidence, low-ambiguity predictions.
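
The weighted fusion of the three attention branches can be sketched as follows. This is a minimal illustration under stated assumptions: each branch is modelled as single-head scaled dot-product attention in which the bank serves as both keys and values, the weights $\omega_i$ are fixed scalars (the text does not say whether they are learned), and the names `attend` and `fuse` are hypothetical.

```python
import numpy as np

def attend(q, bank):
    """Scaled dot-product attention of queries q (N, d) over a bank (M, d)."""
    scores = q @ bank.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ bank

def fuse(q, A_sem, F_mem, F_disc, omega=(0.4, 0.4, 0.2)):
    """Weighted sum of semantic, short-term, and discriminative attention outputs.

    The omega values are placeholders, not values from the source.
    """
    return (omega[0] * attend(q, A_sem)
            + omega[1] * attend(q, F_mem)
            + omega[2] * attend(q, F_disc))
```

Keeping the three banks separate means a contaminated short-term buffer can only influence the output through its own weighted term, while the static semantic anchor remains untouched.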

The dynamic memory update blends the previous memory with the current fused features, weighted by a sigmoid-modulated innovation weight $\alpha_t$.

This design ensures that only high-confidence, non-redundant features enter the memory, filtering out unreliable observations commonly caused by occlusion or segmentation drift.
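
A minimal sketch of innovation-gated memory is given below. It assumes that $\hat f_t$ is the mean of the buffered features (one reading of "deviation from historical mean"), that the first frame is always admitted, and that $\alpha_t$ takes the form $\sigma(k\,(G - \tau_{\mathrm{gain}}))$ with a hypothetical slope `k`; the source does not give the exact expression for $\alpha_t$. The class and function names are illustrative.

```python
import numpy as np
from collections import deque

def innovation(f_t, f_hat):
    """G(I_t): mean over positions of the L2 distance between f_t and f_hat."""
    return float(np.linalg.norm(f_t - f_hat, axis=-1).mean())

class ShortTermMemory:
    """FIFO buffer of the L most recent admitted frame features."""

    def __init__(self, length, tau_gain):
        self.buffer = deque(maxlen=length)  # oldest entry is evicted when full
        self.tau_gain = tau_gain

    def historical_mean(self):
        return np.mean(np.stack(list(self.buffer)), axis=0)

    def maybe_update(self, f_t):
        """Admit f_t only if it is sufficiently novel; return True if admitted."""
        if not self.buffer:  # assumption: the first frame always enters
            self.buffer.append(f_t)
            return True
        if innovation(f_t, self.historical_mean()) > self.tau_gain:
            self.buffer.append(f_t)
            return True
        return False

def blend(memory, fused, g, tau_gain, k=10.0):
    """Blend previous memory and fused features with an assumed sigmoid weight."""
    alpha = 1.0 / (1.0 + np.exp(-k * (g - tau_gain)))
    return (1.0 - alpha) * memory + alpha * fused
```

With this form, frames that barely differ from the buffered history are skipped as redundant, and the blend weight rises smoothly as the innovation score moves past the threshold.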

4. Experimental Results and Ablation

On the RS-RVOS Bench test set, MQC-SAM outperforms all compared methods by a substantial margin: its combined score of 0.712 exceeds the strongest baseline, ReferDINO (Swin-B) at 0.562, by 0.150. Results on the full test set:

Method	½(𝒥+ℱ)↑	𝒥↑	J-Rec↑	ℱ↑	F-Rec↑
ReferFormer	0.526	0.296	0.196	0.756	0.811
OnlineRefer	0.324	0.149	0.083	0.500	0.551
VISA	0.215	0.118	0.014	0.312	0.241
SAMWISE	0.481	0.317	0.306	0.645	0.682
SOC	0.539	0.348	0.267	0.730	0.761
ReferDINO (Swin-B)	0.562	0.303	0.210	0.821	0.880
MQC-SAM	0.712	0.506	0.541	0.918	0.952
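
As a quick consistency check, the combined column can be recomputed from the 𝒥 and ℱ columns. The snippet below copies the reported values from the table; the variable names are illustrative. Because the table entries are themselves rounded to three decimals, agreement is only expected up to rounding.

```python
# (combined, J, F) as reported in the table above
rows = {
    "ReferFormer": (0.526, 0.296, 0.756),
    "OnlineRefer": (0.324, 0.149, 0.500),
    "VISA": (0.215, 0.118, 0.312),
    "SAMWISE": (0.481, 0.317, 0.645),
    "SOC": (0.539, 0.348, 0.730),
    "ReferDINO (Swin-B)": (0.562, 0.303, 0.821),
    "MQC-SAM": (0.712, 0.506, 0.918),
}

# Largest gap between the reported combined score and the mean of J and F
max_error = max(abs(c - 0.5 * (j + f)) for c, j, f in rows.values())

# Margin of MQC-SAM over the strongest baseline on the combined score
best_baseline = max(c for name, (c, _, _) in rows.items() if name != "MQC-SAM")
margin = rows["MQC-SAM"][0] - best_baseline
```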

Component-wise ablation demonstrates:

Baseline (CroBIM+SAM2): 0.655 in ½(𝒥+ℱ)
+TMCC only: 0.699 (+0.044)
+DAMI only: 0.678 (+0.023)
Full (TMCC + DAMI): 0.712 (+0.057)

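This suggests that both motion-consistency calibration and quality-controlled attention integration contribute meaningful improvements, with TMCC accounting for the larger share. The gains are complementary but not fully additive: the individual improvements sum to 0.067, while the combined improvement is 0.057.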

5. Context, Limitations, and Future Perspectives

RS-RVOS Bench enforces first-frame-only prompting and online-only inference, driving development away from traditional global-context or retrospective methods, and towards robust, error-tolerant temporal modeling. MQC-SAM leverages temporal motion statistics to combat the initialization issue, alongside dynamic gating and ambiguity detection to prevent error propagation—a class of techniques likely to be broadly useful in causality-aware segmentation and tracking.

A plausible implication is that memory-quality control concepts (dynamic attention, innovation-based gating, dual semantic–motion verification) could generalize to other scenarios: long-term RVOS, open-world segmentation, and video-language understanding problems in which memory contamination and error cascade are prevalent.

Observed limitations include a dependency on reliable motion cues (weak or ambiguous motion may limit calibration gains) and a risk of degraded predictions when prototypes or semantic anchors are misaligned in scenes with dense distractors. Domain-adaptive motion encoding and fine-grained semantic discrimination are likely areas for future work.

6. Significance for Remote Sensing and General RVOS

By establishing both a new benchmark and a top-performing methodology, MQC-SAM and RS-RVOS Bench provide a foundation for principled progress in remote sensing RVOS. The rigorous causality constraints and the focus on small, low-contrast targets direct future research toward memory-quality management, hybrid semantic–motion integration, and domain-specific evaluation criteria (Jiang et al., 17 Jan 2026). Adapting these strategies to other domains, especially those where long-term temporal stability, occlusion robustness, and annotation density are central, remains an important direction for the field.

References (1)

CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation (2026)
