CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection

Published 2 Apr 2026 in cs.CV | (2604.02160v1)

Abstract: Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.


Authors (5)

Summary

The paper introduces a training-free method that uses competition-aware posterior calibration and geometry token consistency for robust open-vocabulary change detection.
It improves on the strongest previous training-free baseline by up to 4.98 F1$_C$ points across four benchmarks, while running faster and with lower peak memory than proposal- and matching-based pipelines.
The approach integrates semantic calibration with region-level consensus, reframing change detection for scalable and reliable remote sensing applications.

Consistency-Regularized Posterior Reasoning for Open-Vocabulary Change Detection

Motivation and Problem Setting

Open-vocabulary change detection (OVCD) in remote sensing seeks to identify where land-cover semantics change across time for arbitrary queried concepts, removing the need for a fixed label space. Traditional change detection (CD) methods, grounded in binary or closed-set semantic change detection paradigms, are restricted by their predetermined class spaces and heavy reliance on pixel-level supervision. Recent foundation models have prompted open-vocabulary paradigms that use pretrained vision-language interfaces for open-world recognition. However, directly transferring such models to dense change localization is difficult, chiefly because dense responses are not consistent, comparable, or semantically aligned across time, particularly under variable appearance conditions (illumination, seasonality, atmospheric effects) and weak cross-concept competition.

CoRegOVCD addresses these challenges by regularizing dense posterior differencing with structural and regional consensus, thus mitigating pseudo changes and enhancing semantic reliability without training or adaptation. The pivotal innovation lies in recasting OVCD as competition-aware posterior calibration and leveraging geometry tokens for robust structural verification.

Figure 1: Paradigm comparison illustrating the shift from explicit mask/instance matching to dense posterior-based reasoning in OVCD.

Methodological Framework

CoRegOVCD implements a fully training-free, modular pipeline with the following interrelated stages:

Dense Concept Score Construction: A prompt-conditioned Segment Anything Model 3 (SAM 3) produces per-pixel concept confidence scores, aggregated from its instance and dense semantic branches over a prompt vocabulary.
Competitive Posterior Calibration (CPC): Rather than comparing raw scores, CPC normalizes the queried concept's score against the strongest non-query concept and raises the result to a power, which sharpens the suppression of contested responses. This yields stable, temporally comparable queried-concept posteriors.
Semantic Posterior Delta (SPD): The core semantic change signal is quantified as the absolute difference between CPC posteriors across time. SPD may be maximized over sets of semantically related prompts for robustness.
Geometry-Token Consistency Gate (GeoGate): For structural verification, a geometric encoder instantiated with Depth Anything 3 extracts spatial tokens; the cosine distance between temporally paired tokens forms a gate map highlighting structural discrepancies. Only semantically and structurally consistent changes are retained downstream.
Regional Consensus Discrepancy (RCD): SPD and GeoGate outputs are fused using a parameterized gating mechanism with an additive compensation term to recover weak semantic cues when geometry cues are strong. SLIC superpixels, computed over the mean temporal image, impose local region-level consensus, regularizing spatial coherence.
Lightweight Final Mask Inference: Post-fusion, 8-bit quantization and thresholding convert the coherent score map into a binary change mask. Morphological filtering further prunes spurious fragments, retaining compact, reliable regions.

Figure 2: Overview of CoRegOVCD pipeline combining CPC, SPD, GeoGate, and RCD stages for query-conditioned bi-temporal inference.
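
As a rough illustration of the CPC and SPD stages, the following NumPy sketch computes a competition-aware posterior for a queried concept and its cross-temporal difference. The ratio form, the exponent `gamma`, the `eps` stabilizer, and the function names `cpc_posterior` and `spd` are assumptions made for illustration; the paper's exact normalization may differ. Dense scores are taken as given arrays of shape (K, H, W), one slice per prompt, rather than produced by SAM 3 here.

```python
import numpy as np


def cpc_posterior(scores, query_idx, rival_idx, gamma=2.0, eps=1e-6):
    """Competition-aware posterior of one queried concept (assumed form).

    scores:    (K, H, W) non-negative dense concept scores for one date.
    query_idx: index of the queried prompt.
    rival_idx: indices of the non-query prompts that compete with it.
    """
    s_query = scores[query_idx]
    # Strongest competing concept at every pixel.
    s_rival = scores[rival_idx].max(axis=0)
    # Assumed normalization: query share against its strongest rival,
    # raised to a power so contested pixels are pushed towards zero.
    ratio = s_query / (s_query + s_rival + eps)
    return ratio ** gamma


def spd(scores_t1, scores_t2, query_set, gamma=2.0):
    """Absolute posterior difference, maximized over related prompts."""
    num_concepts = scores_t1.shape[0]
    # Related prompts are kept out of the rival set (a choice of this sketch).
    rivals = [k for k in range(num_concepts) if k not in query_set]
    deltas = [
        np.abs(
            cpc_posterior(scores_t1, q, rivals, gamma)
            - cpc_posterior(scores_t2, q, rivals, gamma)
        )
        for q in query_set
    ]
    return np.max(deltas, axis=0)
```

Keeping the related prompts out of the rival set is a design choice of this sketch, made so that near-synonymous prompts do not suppress one another; at least one non-query prompt must remain as a rival.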

Empirical Evaluation and Component Analysis

Quantitative Performance

CoRegOVCD establishes new state-of-the-art results among training-free OVCD methods on four challenging benchmarks: LEVIR-CD, WHU-CD-256, DSIFN, and SECOND. Across these benchmarks, F1$_C$ gains over the strongest previous training-free baseline range from 2.24 to 4.98 points; earlier baselines include OmniOVCD and AdaptOVCD. On SECOND, which covers six semantic categories, CoRegOVCD achieves a class-average F1$_C$ of 47.50% and an IoU$_C$ of 31.67%, with especially marked improvements on visually and semantically ambiguous classes (tree, water, low vegetation).

Ablation studies show that both CPC and GeoGate contribute, with GeoGate having by far the larger effect: omitting CPC lowers the SECOND class-average F1$_C$ by 0.85 points (47.50% to 46.65%), while excluding GeoGate causes a drastic drop to 36.21%. Removing the additive term in RCD reduces DSIFN F1$_C$ from 64.45% to 37.74%, underscoring the importance of both semantic calibration and structural consensus.
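
For reference, change-class metrics of the kind reported above can be computed from a pair of binary masks as sketched below. This assumes F1$_C$ and IoU$_C$ denote the F1 score and IoU of the positive (change) class for the queried concept. The function name `change_f1_iou` and the per-mask-pair accumulation are illustrative choices; a benchmark protocol may instead pool counts over a whole dataset before averaging across classes.

```python
import numpy as np


def change_f1_iou(pred, gt):
    """F1 and IoU of the change (positive) class for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # change predicted and present
    fp = np.logical_and(pred, ~gt).sum()   # change predicted but absent
    fn = np.logical_and(~pred, gt).sum()   # change missed
    total = tp + fp + fn
    if total == 0:
        # No change in either mask: both metrics are reported as 0 here
        # (an assumed convention for the degenerate case).
        return 0.0, 0.0
    f1 = 2.0 * tp / (2.0 * tp + fp + fn)
    iou = tp / total
    return float(f1), float(iou)
```

For a single confusion matrix the two metrics are tied by $\mathrm{F1} = 2\,\mathrm{IoU}/(1+\mathrm{IoU})$; class-averaged figures need not satisfy this identity exactly, since the average is taken after the per-class computation.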

Prompt and Query Robustness

A prompt substitution analysis (Figure 3) supports the semantic generality of CoRegOVCD's modeling: multiple query variants yield competitive results per class, which validates prompt set aggregation. However, performance degrades as semantic alignment weakens, indicating that prompt engineering and vocabulary selection remain non-trivial for certain classes, especially composite or visually heterogeneous categories.

Figure 3: Query substitution results on SECOND showing class-dependent robustness of alternative text prompts.

Efficiency and Scalability

CoRegOVCD's dense, posterior-based inference path offers favorable speed–accuracy trade-offs. It runs approximately 2.4–8.1$\times$ faster and with lower peak memory than previous proposal- and matching-based pipelines, while achieving equal or superior accuracy at high-throughput inference.

Qualitative Analysis

Qualitative comparisons (cf. the qualitative comparison figure in the original manuscript) highlight the advantages of posterior reasoning: output masks are more spatially complete and show fewer unsupported or fragmented predictions than those of DynamicEarth, AdaptOVCD, and other competitors. Visualization of intermediate responses reveals that GeoGate robustly attenuates appearance-driven false positives, and RCD consolidates region-level agreement before the final binarization.
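
The roles of GeoGate and RCD described here can be sketched in NumPy as follows. Everything beyond the source's description is an assumption: the gate is the cosine distance clipped to [0, 1], the fusion rule `spd * gate**alpha + beta * gate` is one hypothetical form of a gated product with an additive compensation term, regional consensus is taken as the per-superpixel mean, and `alpha`, `beta`, and `thr` are placeholder values. Geometry tokens and the superpixel label map are assumed to be precomputed (a label map could come, for instance, from `skimage.segmentation.slic`), and the morphological clean-up is omitted.

```python
import numpy as np


def geo_gate(tokens_t1, tokens_t2, eps=1e-8):
    """Cosine distance between temporally paired tokens, (H, W, D) -> (H, W)."""
    dot = (tokens_t1 * tokens_t2).sum(axis=-1)
    norms = np.linalg.norm(tokens_t1, axis=-1) * np.linalg.norm(tokens_t2, axis=-1)
    cosine = dot / (norms + eps)
    # Clipping to [0, 1] is an assumption of this sketch.
    return np.clip(1.0 - cosine, 0.0, 1.0)


def region_consensus(score, labels):
    """Replace each pixel score by the mean over its region (assumed rule).

    labels: (H, W) array of non-negative integer region ids.
    """
    flat = labels.ravel()
    sums = np.bincount(flat, weights=score.ravel())
    counts = np.bincount(flat)
    means = sums / np.maximum(counts, 1)
    return means[labels]


def change_mask(spd_map, gate, labels, alpha=1.0, beta=0.1, thr=128):
    """Fuse semantic and geometric evidence, smooth by region, binarize."""
    # Hypothetical fusion: gated semantic evidence plus an additive
    # geometry term that lifts weak semantic cues where geometry changed.
    fused = np.clip(spd_map * gate ** alpha + beta * gate, 0.0, 1.0)
    smoothed = region_consensus(fused, labels)
    # 8-bit quantization followed by a fixed threshold.
    quantized = np.round(smoothed * 255.0).astype(np.uint8)
    return quantized >= thr
```

In this form a high semantic delta at a pixel whose geometry tokens are unchanged is gated out, which mirrors the attenuation of appearance-driven false positives noted above.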

Practical and Theoretical Implications

CoRegOVCD demonstrates that dense posterior-level inference, regularized by cross-temporal geometric evidence and regional consensus, outperforms explicit instance-matching strategies for training-free OVCD. The practical advantages include reduced system complexity, improved efficiency, and enhanced robustness to pseudo changes induced by atmospheric or phenological variability—critical for scalable remote sensing change monitoring.

The theoretical contribution lies in reframing open-vocabulary localization as a competition-aware, posterior-differencing task, where joint semantic and geometric constraints suffice for robust training-free change detection. This decouples OVCD performance from the limitations of hard mask correspondences and enables scalable inference over arbitrary user queries.

Prospects for Future Research

Several open directions are revealed. Structured threshold transfer and automatic prompt set optimization could further enhance flexibility. Joint modeling of temporal coherence beyond pairwise inference (e.g., on video or multi-temporal stacks), and the integration of additional sensor modalities (e.g., SAR, LiDAR) within the structural verification stage, represent promising avenues. Additionally, adaptive region partitioning and uncertainty quantification at the posterior level may drive further improvements for open-world remote sensing applications.

Conclusion

CoRegOVCD advances training-free open-vocabulary change detection by shifting the modeling focus to regularized, competition-aware posterior differencing. The synergy between semantic calibration, geometric verification, and region-level consensus enables robust, efficient, and concept-generalizable change detection—establishing a new methodological paradigm for open-world analysis in remote sensing.
