
Visual Semantic Adaptive Watermark (VISA-Mark)

Updated 19 January 2026
  • VISA-Mark is a framework that adaptively embeds watermarks by aligning signals with local semantic and visual evidence in digital content.
  • It employs techniques such as prefix-tuning for LVLMs, conditional VAE generation, and patchwise frequency-domain embedding to enhance robustness.
  • Empirical results demonstrate high detection accuracy and minimal visual distortion, even under advanced generative and paraphrase-based attacks.

A Visual Semantic Adaptive Watermark (VISA-Mark) is a watermarking framework characterized by the adaptive alignment of watermark signals with the underlying semantic and visual structures of digital content, whether natural images or the outputs of large vision-language models (LVLMs). VISA-Mark implementations steer the embedding procedure so that watermark signals are injected or distributed specifically in regions or token selections that are maximally supported by visual evidence or semantic stability. Prominent technical instantiations include prefix-tuning modules for LVLMs (Zheng et al., 12 Jan 2026), conditional-VAE-generated visible marks constrained by bi-level optimization (Liu et al., 3 Jun 2025), and patchwise frequency-domain strategies focused on non-melting points robust to paraphrase-based generative attacks (Dixit et al., 28 Jun 2025). VISA-Mark aims to reconcile strict detectability standards with the preservation of visual and semantic fidelity under threat models encompassing both classical and generative attacks.

1. Foundational Principles and Definitions

The term VISA-Mark denotes watermarking methods that adapt the generation or embedding of watermark signals not merely to the global content, but to the local semantic or visually salient evidence. This adaptation manifests in distinct approaches across modalities:

  • In LVLMs, VISA-Mark aligns watermark signals with tokens that are visually supported, modulating injection strength by both visual evidence weights and model uncertainty (Zheng et al., 12 Jan 2026).
  • In image watermarking, VISA-Mark restricts modifications to regions stable under semantic paraphrasing, termed "Non-Melting Points" (NMPs), and/or frameworks where watermark symbol generation and placement are governed by a learned generator conditioned on semantic attributes (Liu et al., 3 Jun 2025, Dixit et al., 28 Jun 2025).

A common methodological core is evidence quantification—computing weights or selecting regions based on metrics such as cosine similarity to visual embeddings, paraphrase-driven stability, or generative reconstruction hardness.
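
The cosine-similarity variant of this evidence-quantification step can be sketched as follows. This is an illustrative construction, not the published implementation; in particular, the min-max normalization of the raw similarities is an assumption.

```python
import numpy as np

def evidence_weights(token_embs: np.ndarray, visual_emb: np.ndarray) -> np.ndarray:
    """Score each candidate token by cosine similarity to a pooled visual
    embedding, then normalize the scores to [0, 1] evidence weights.
    `token_embs` has shape (V, d); `visual_emb` has shape (d,)."""
    tok = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    vis = visual_emb / np.linalg.norm(visual_emb)
    sims = tok @ vis                                    # cosine similarity per token
    # min-max normalize so weights can directly scale a watermark bias
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)
```

The same interface could be backed by paraphrase-stability or reconstruction-hardness scores instead of cosine similarity.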

2. Technical Architectures and Adaptive Mechanisms

VISA-Mark for LVLMs introduces a prefix-tuning module $\phi$ that extracts dynamic Visual-Evidence Weights $w(i)$ for each token based on visual and caption context. This evidence drives:

  • Adaptive Vocabulary Partitioning: At every decoding step $t$, the token vocabulary $\mathcal V$ is partitioned into a “green list” $\mathcal G_t$ and “red list” $\mathcal R_t$, adjusted via top-K evidence selection and entropy regularization to ensure high-evidence tokens remain favored.
  • Logit Perturbation: Only tokens in $\mathcal G_t$ receive an adaptive bias $\delta_{t,v}$ proportional to their evidence weight and the normalized token entropy $H_{\mathrm{norm}}$, yielding perturbed logits $\ell'_t$.
  • Offline Prefix Training: The prefix module is trained by aligning its logit shift patterns with static relevance scores derived offline from noun-phrase entity similarity in image-caption pairs.
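
A single decoding step of this green-list scheme can be sketched as below. This is a hedged approximation of the general KGW-style mechanism with evidence-adaptive bias; the seeding scheme, the bias scale `delta_max`, and the use of uniform random partitioning (rather than the paper's top-K evidence selection) are simplifying assumptions.

```python
import numpy as np

def watermark_logits(logits, weights, gamma=0.5, delta_max=2.0, key=0, step=0):
    """One decoding step of an evidence-adaptive green/red-list watermark.
    A keyed RNG partitions the vocabulary; evidence weights w(i) and the
    normalized entropy H_norm jointly scale the green-list bias."""
    V = logits.shape[0]
    rng = np.random.default_rng(key * 100003 + step)     # hypothetical PRF seed
    green = rng.permutation(V)[: int(gamma * V)]         # green list G_t
    p = np.exp(logits - logits.max()); p /= p.sum()
    h_norm = -(p * np.log(p + 1e-12)).sum() / np.log(V)  # normalized entropy
    biased = logits.copy()
    biased[green] += delta_max * weights[green] * h_norm # adaptive bias delta_{t,v}
    return biased, green
```

High-entropy steps (where the model is uncertain) receive the full bias, while near-deterministic steps are perturbed only weakly, preserving visually grounded tokens.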

Two complementary paradigms emerge:

  • Visible Watermark Generation via Conditional VAE: The watermark $\delta$ is generated by $G_w(z, c)$, where $z$ is a latent code and $c$ encodes shape and spatial conditions. Predominantly, $\delta$ denotes a legally interpretable digit, character, or initial.
  • Bi-level Objective:
    • Outer Level: Minimizes the similarity $s(x^*(\delta), x_T)$, maximizing reconstruction error after inpainting.
    • Inner Level: Solves $x^*(\delta) = \arg\max_x \log p_G(x \mid y(\delta); \lambda)$ via generative priors (flow/diffusion models).
  • Differentiable Masking/Meta-Learning: The mask $M_\delta$ is made differentiable (via sigmoids), and the inner optimization is unrolled for gradient access.
  • Semantic Region Localization (NMPs): Saliency detectors (e.g. XRAI) extract candidate regions; paraphrased variants undergo stability scoring by IoU. Only highly stable NMPs are selected for watermarking.
  • Frequency-Domain Multi-Channel Embedding: Watermark spectra are injected across RGB channels in each NMP patch via

$F'_{i}^{c}(u,v) = F_i^c(u,v)+\alpha_{i,c} W^c(u,v),$

followed by inverse DFT to reconstruct the watermarked patch.

  • Noisy Burnishing: Band-limited noise or adversarial pixel perturbation is applied to obfuscate the spatial distribution of NMPs, preventing attacker relocalization without degrading the watermark strength.
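
The frequency-domain embedding rule above can be sketched directly from the formula $F'^{c}_i(u,v) = F^c_i(u,v) + \alpha_{i,c} W^c(u,v)$. This is an illustrative sketch, assuming a scalar strength per patch and a 2-D DFT per RGB channel; the actual per-channel strength selection is not reproduced here.

```python
import numpy as np

def embed_patch(patch: np.ndarray, wmark: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Embed a watermark spectrum into one RGB NMP patch in the frequency
    domain. `patch` and `wmark` are (H, W, 3); each channel's 2-D DFT is
    additively perturbed, then inverted to reconstruct the patch."""
    out = np.empty_like(patch, dtype=float)
    for c in range(3):                        # one spectrum per RGB channel
        F = np.fft.fft2(patch[..., c])
        F_marked = F + alpha * wmark[..., c]  # F' = F + alpha * W^c
        out[..., c] = np.fft.ifft2(F_marked).real
    return out
```

Because the perturbation is additive in the spectrum, setting $\alpha = 0$ recovers the original patch exactly (up to floating-point error), which makes the distortion directly tunable.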

3. Detection, Fidelity, and Evaluation

LVLM Implementation

Detection operates by recomputing PRF-seeded green lists for a candidate text sequence $y_{1:T}$ and applying a z-score test to the count of green-list token selections. The fidelity of the watermarked text is quantified by the KL divergence to the unwatermarked model ($\mathcal{L}_{\mathrm{fid}}$), with empirical reductions in the CHAIR-I hallucination metric.
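
The z-score test itself is the standard green-list statistic: under the null hypothesis (unwatermarked text), each token lands in the green list independently with probability $\gamma$, so the hit count is Binomial$(T, \gamma)$ and can be normalized as follows.

```python
import math

def detect_zscore(green_hits: int, total: int, gamma: float = 0.5) -> float:
    """One-sided z statistic for green-list hits. Under H0 the hit count
    has mean gamma*T and variance T*gamma*(1-gamma); a large z indicates
    the watermark is present."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_hits - expected) / std
```

For example, 90 green hits in 100 tokens at $\gamma = 0.5$ yields $z = 8$, far beyond any conventional significance threshold.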

Image Implementation

Detection in patchwise frequency-domain VISA-Mark computes normalized correlations $\rho_{j,c}$ between extracted spectra and expected watermark patterns $W^c$. A logistic score determines presence, tuned by a threshold $\tau$ and slope $\beta$. Probabilistic visible watermark schemes measure attack difficulty via $v_{\mathrm{PSNR}}$, the difference between the inpainted and original image PSNR.
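
A per-patch score of this form can be sketched as below; the specific values of $\tau$ and $\beta$, and the mean-centered correlation, are illustrative assumptions rather than the published settings.

```python
import numpy as np

def patch_score(extracted: np.ndarray, expected: np.ndarray,
                tau: float = 0.2, beta: float = 10.0) -> float:
    """Normalized correlation rho between an extracted spectrum and the
    expected watermark pattern W^c, squashed through a logistic with
    threshold tau and slope beta; returns a presence probability."""
    e = extracted.ravel() - extracted.mean()
    w = expected.ravel() - expected.mean()
    rho = float(e @ w / (np.linalg.norm(e) * np.linalg.norm(w) + 1e-12))
    return 1.0 / (1.0 + np.exp(-beta * (rho - tau)))
```

A perfectly matching spectrum ($\rho \approx 1$) maps to a score near 1, while an anti-correlated or unrelated spectrum falls below the decision boundary set by $\tau$.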

4. Robustness Against Advanced Threats

VISA-Mark frameworks explicitly target attack surfaces beyond standard distortions (JPEG, noise, brightness):

  • Visual Paraphrase Attacks (Dixit et al., 28 Jun 2025): By focusing watermark injection within NMPs, semantically central regions that remain stable under paraphrasing, the framework maintains a high detection probability (WDP of 0.90–0.85 at paraphrase strengths $s = 0.1$–$0.2$) where comparable frequency schemes (ZoDiac, Meta WAM) fail.
  • Generative Inpainting Removal (Liu et al., 3 Jun 2025): The bi-level optimization in Harvim yields visible marks that are substantially harder to reconstruct by flow and diffusion methods (e.g., a $\Delta v_{\mathrm{PSNR}} = 5.45$ dB reduction vs. Flow-R on CelebA).
  • Adversarial Saliency Distortion: Noisy burnishing further obfuscates NMP localization, frustrating reverse-engineering and extraction attempts even by sophisticated saliency detectors.

5. Implementation Details and Empirical Performance

LVLM VISA-Mark (Zheng et al., 12 Jan 2026) achieves competitive trade-offs:

  • Visual Consistency: 7.8% CHAIR-I improvement over KGW.
  • Attack Resilience: 99.3% detection AUC under word-level textual attacks.
  • Efficiency: Latency overhead is approximately 1–1.45 s per 256-token generation on LLaVA and Qwen3-VL, respectively.

Patchwise image VISA-Mark (Dixit et al., 28 Jun 2025) delivers:

  • Distortion-Free Watermarking: PSNR $\approx$ 29.84 dB, SSIM $\approx$ 0.93.
  • Detection Probability: WDP $\geq$ 0.99 pre-attack; $\geq$ 0.85 under strong generative paraphrase.
  • Failure of Blind Removal: SLBR and DeNet exhibit negligible $v_{\mathrm{PSNR}}$ improvement, indicating practical irrecoverability of Harvim-generated watermark regions.

6. Interpretability and Legibility Mechanisms

All VISA-Mark systems incorporate mechanisms to ensure that watermark signals are visually or semantically interpretable:

  • Legibility Constraints (Liu et al., 3 Jun 2025): Watermark shapes are sampled from the output range of a VAE trained for human readability.
  • Distortion Regulation (Dixit et al., 28 Jun 2025): Adaptive enhancement algorithms tune reconstructed images to maximize SSIM above a threshold $s^*$ while maintaining watermark strength.
  • Coverage and Smoothness Controls: Differentiable mask parameters and regularization terms ensure that coverage does not overwhelm visual content and that mask transitions remain smooth.
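
The differentiable mask with coverage and smoothness controls can be sketched as follows. This is a minimal sketch under stated assumptions: the sigmoid temperature and the total-variation-style smoothness proxy are hypothetical choices, not the paper's exact regularizers.

```python
import numpy as np

def soft_mask(logits: np.ndarray, temperature: float = 0.1):
    """Differentiable watermark mask M_delta: a sigmoid over learnable
    per-pixel logits gives a smooth (0, 1) mask. Returns the mask plus
    two penalty terms: mean coverage (to keep the mark from overwhelming
    the image) and a total-variation proxy (to keep transitions smooth)."""
    mask = 1.0 / (1.0 + np.exp(-logits / temperature))       # soft mask in (0, 1)
    coverage = mask.mean()                                   # fraction covered
    smooth = np.abs(np.diff(mask, axis=0)).mean() \
           + np.abs(np.diff(mask, axis=1)).mean()            # TV-style smoothness
    return mask, coverage, smooth
```

In training, `coverage` and `smooth` would enter the loss as weighted penalties alongside the bi-level objective, with gradients flowing through the sigmoid to the mask logits.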

A plausible implication is that these design choices help satisfy legal standards for copyright marking and facilitate robust forensic detection in high-contention AI-generated media environments.

7. Comparative Perspective and Methodological Significance

VISA-Mark frameworks represent a divergence from prior watermarking strategies that impose indiscriminate pseudo-random biases (e.g., KGW), rely on costly semantic-aware sampling, or embed signals in global, non-adaptive patterns. By integrating evidence-adaptive signal placement, entropy modulation, and semantic region targeting, VISA-Mark methods reconcile the detectability–fidelity trade-off across both text and image modalities. This is substantiated by empirical superiority in both detection (AUC, WDP) and fidelity (CHAIR-I, SSIM, $v_{\mathrm{PSNR}}$) across public benchmarks and strong attack models (Zheng et al., 12 Jan 2026, Liu et al., 3 Jun 2025, Dixit et al., 28 Jun 2025).

Ongoing research continues to refine VISA-Mark adaptivity mechanisms in response to advancing generative attacks and legal requirements for AI-generated content provenance.
