Lexicon-Guided Masking: Methods & Applications
- Lexicon-guided masking is a technique that integrates curated lexica into masking processes, selectively ablating tokens to focus model learning on key linguistic or visual elements.
- It is applied across domains such as masked language modeling, multimodal transformer probing, neural machine translation, and image inpainting, and can yield substantial accuracy gains (e.g., >74% top-5 accuracy in verb recovery, versus 36% for text-only probing).
- While offering enhanced control and interpretability, challenges remain with static lexica and coarse token selection, prompting research into adaptive and context-sensitive masking strategies.
Lexicon-guided masking is a family of masking strategies in deep learning that leverages curated token sets, terminologies, or dictionaries—collectively referred to as “lexica”—to directly control which regions or tokens in linguistic or multimodal representations are ablated or hidden. This mechanism enables probing or supervision of models with respect to specific lexical categories, fosters improved representation learning for compositional phrase structures, facilitates controlled generation or translation, and expands the interpretability of pretrained models by routing attention to linguistically or semantically distinct units. Lexicon-guided masking has been adopted in masked language modeling (MLM), multimodal transformer probing, lexically constrained sequence-to-sequence modeling, and segmentation-guided inpainting in computer vision.
1. Formal Definitions and Core Principles
At the core of lexicon-guided masking is the integration of external lexical knowledge into the masking process for neural models. Let $x = (x_1, \dots, x_n)$ be a token sequence, and let $\mathcal{L}$ be a lexicon: a set of tokens, n-grams, or terms of interest. The masking process defines a set of mask indices $M \subseteq \{1, \dots, n\}$ such that $i \in M$ iff $x_i \in \mathcal{L}$ (possibly up to n-gram or phrase span matching).
Given a masking operation producing a corrupted input $\tilde{x}$ in which selected tokens are replaced by a special symbol (e.g., [MASK]), the learning objective, especially in pretraining or probing, is typically to reconstruct the masked tokens:

$$\mathcal{J}(\theta) = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)$$
Lexicon-guided masking thus allows targeted attention to rare, functionally critical, or semantically interpretable tokens and has been adopted for analytical, augmentation, or efficiency purposes across domains (Beňová et al., 2024, Levine et al., 2020, Lee et al., 2021).
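As a minimal sketch of this masking operation (the tokenization, lexicon contents, and single-token matching rule here are illustrative assumptions, not any particular paper's implementation):

```python
import random

def lexicon_guided_mask(tokens, lexicon, mask_token="[MASK]", mask_prob=1.0):
    """Mask every token that matches the lexicon (single-token matching only).

    Returns the corrupted sequence and the mask-index set M.
    """
    corrupted = list(tokens)
    masked_indices = []
    for i, tok in enumerate(tokens):
        # Phrase/n-gram matching would require span lookup; single tokens suffice here.
        if tok.lower() in lexicon and random.random() < mask_prob:
            corrupted[i] = mask_token
            masked_indices.append(i)
    return corrupted, masked_indices

tokens = ["A", "dog", "chases", "a", "ball"]
verb_lexicon = {"chases", "runs", "jumps"}
corrupted, M = lexicon_guided_mask(tokens, verb_lexicon)
# corrupted -> ["A", "dog", "[MASK]", "a", "ball"], M -> [2]
```

With `mask_prob < 1`, only a random subset of lexicon matches is masked, which is the usual choice in pretraining rather than probing.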
2. Guided Masking in Multimodal Transformer Probing
Beňová et al. ("Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking" (Beňová et al., 2024)) introduce a principled probing technique, guided masking, treated here as lexicon-guided masking in the domain of multimodal transformer models.
Given an input image $v$ and caption $w = (w_1, \dots, w_m)$, with a lexicon $\mathcal{L}$ of all verbs occurring in the test set, guided masking proceeds as follows:
- Identify positions $i$ such that $w_i \in \mathcal{L}$.
- Replace $w_i$ at these positions with [MASK], and present the pair $(v, \tilde{w})$ to the pretrained multimodal model.
- Use the model’s masked language head to predict $w_i$ at the masked indices, measuring top-$k$ accuracy.
This approach probes word classes (verbs, nouns, adjectives, numerals, etc.) and, through ablation of visual tokens such as subject ROIs, quantifies the degree of multimodal grounding for each class. It demonstrates, for instance, that modern multimodal transformers recover the correct verb among their top-5 predictions with roughly 74% (SVO-Probes) and 80% (V-COCO) accuracy under guided masking. This substantially exceeds text-only probing (BERT: 36.1%/58.5%), and the performance drop when visual cues are ablated further exposes the extent of multimodal grounding (Beňová et al., 2024).
| Model | SVO-Probes top-5 (guided masking) | V-COCO top-5 (guided masking) |
|---|---|---|
| ViLBERT | 73.9% | 81.1% |
| LXMERT | 74.6% | 80.5% |
| UNITER | 74.4% | 81.3% |
| VisualBERT | 74.3% | 80.2% |
| BERT only | 36.1% | 58.5% |
Guided masking, as a probe, does not require adversarial data or architectural modification and can measure model competence on any lexical category identified by a POS tagger and optionally post-processed through WordNet or domain-specific dictionaries.
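The probe's scoring step reduces to a top-$k$ accuracy over masked positions. A sketch, where `ranked_predictions` stands in for the ranked output of a real model's masked-language head (obtaining those predictions is not shown):

```python
def top_k_accuracy(gold_tokens, ranked_predictions, k=5):
    """Fraction of masked positions whose gold token appears in the model's top-k list."""
    hits = sum(gold in preds[:k] for gold, preds in zip(gold_tokens, ranked_predictions))
    return hits / len(gold_tokens)

# Two masked verb positions; each prediction is a model-ranked candidate list.
gold = ["chases", "eats"]
preds = [["runs", "chases", "watches"], ["drinks", "holds", "reads"]]
score = top_k_accuracy(gold, preds, k=5)  # 0.5: "chases" recovered, "eats" missed
```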
3. Lexicon-Guided Masking in LLM Pretraining
The PMI-Masking framework ("PMI-Masking: Principled masking of correlated spans" (Levine et al., 2020)) generalizes masking strategies in MLMs by synthesizing corpus statistics and external lexica. Instead of uniform random masking, tokens and phrases that exhibit high collocation, measured across the corpus via pointwise mutual information

$$\mathrm{PMI}(w_1, \dots, w_n) = \log \frac{p(w_1, \dots, w_n)}{\prod_{i=1}^{n} p(w_i)},$$

are masked as units, forcing the model to discover deeper interdependencies.
Lexicon-guided adaptation proceeds by incorporating lexicon-derived spans into the masking vocabulary:
- Seed the masking set with lexicon entries (specialized terms, entities, etc.).
- Score spans by PMI and optionally filter to retain only meaningful, contextually dependent candidate units.
- During masking, prioritize lexicon entries that are rare or particularly salient, even when they would not pass minimum frequency thresholds required for PMI reliability.
- Optionally assign a higher sampling weight to lexicon-derived units during span selection.
Empirical results indicate that such informed masking accelerates convergence and improves end-task performance on SQuAD, RACE, and GLUE benchmarks (Table 3, (Levine et al., 2020)). The framework can be further extended with dynamic lexica or combined with external terminologies for domain adaptation.
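The scoring and seeding steps above can be sketched for the bigram case (the toy corpus, raw-count estimation, threshold, and restriction to bigrams are simplifications of the paper's n-gram treatment):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI(a, b) = log p(a, b) / (p(a) p(b)), estimated from raw corpus counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (a, b): math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }

def masking_candidates(tokens, lexicon, pmi_threshold=0.5):
    """Seed the masking set with lexicon spans, then add high-PMI bigrams."""
    pmi = bigram_pmi(tokens)
    candidates = set(lexicon)
    candidates |= {bg for bg, score in pmi.items() if score > pmi_threshold}
    return candidates

corpus = "new york is larger than new york was".split()
cands = masking_candidates(corpus, lexicon={("masked", "language")})
# ("new", "york") co-occurs consistently, so it clears the PMI threshold
```

A production variant would also apply the minimum-frequency filtering and lexicon up-weighting described above.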
4. Lexicon-Guided Masking in Sequence-to-Sequence Learning and Translation
Lee et al. ("Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction" (Lee et al., 2021)) apply lexicon-guided masking for robust terminology control in neural machine translation (NMT). In this context, a lexicon of source-target term pairs directs training or evaluation masking, particularly for multi-word and domain-specific terms.
The Source-Conditioned Masked Span Prediction (SSP) approach implements the following:
- At each training update, select spans in the target sequence $y$ via random span sampling (SpanBERT-style, not necessarily aligned to the lexicon).
- Mask these spans and train the model simultaneously with maximum-likelihood (full data) and mask-reconstruction objectives, conditioned on the source $x$.
- Although the span masking is stochastic and not strictly lexicon-guided, the methodology is directly extensible: lexicon entries can be prioritized for span masking to further enforce their correct learning.
This technique consistently improves both overall SacreBLEU and domain-terminology adherence metrics (Term% and LSM-2) across diverse languages and specialized domains, especially for terms longer than two tokens, and without the need for constrained decoding at inference (Lee et al., 2021).
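The span-selection step can be sketched as SpanBERT-style geometric sampling, with a hook for the lexicon-prioritized extension suggested above (the parameters and the priority rule are illustrative assumptions, not the paper's exact procedure):

```python
import random

def sample_mask_spans(seq_len, mask_ratio=0.15, p_geom=0.2, max_span=10,
                      lexicon_spans=()):
    """Return a sorted list of target-token indices to mask.

    Lexicon spans (start, end) are masked first; the remaining budget is
    filled with random spans whose lengths follow a clipped geometric law.
    """
    budget = max(1, int(seq_len * mask_ratio))
    masked = set()
    for start, end in lexicon_spans:          # prioritize terminology spans
        masked.update(range(start, min(end, seq_len)))
    while len(masked) < budget:
        length = 1                            # sample length ~ Geometric(p_geom)
        while random.random() > p_geom and length < max_span:
            length += 1
        start = random.randrange(0, seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return sorted(masked)

random.seed(0)
indices = sample_mask_spans(seq_len=100, lexicon_spans=[(40, 43)])
```

The masked positions would then feed both the reconstruction loss and the standard translation loss during training.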
5. Guided Masking and Segmentation in Vision-Language Generation
InstructVTON ("InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On" (Han et al., 24 Sep 2025)) extends lexicon-guided masking concepts to interactive image inpainting, where lexicon guidance arises from free-text instructions parsed by vision-language models (VLMs). The process combines:
- Parsing style-related lexica from user instructions to extract structured “sub-instructions” attached to garment types.
- Multi-level partitioning: human-body part segmentation and clothing region segmentation, modeled as partitions $P_{\text{body}}$ and $P_{\text{cloth}}$ over the image, respectively.
- Rule-based mapping of instructions and garment types to specific image regions to mask ($M$), by selecting relevant body parts and clothing traces.
The mask $M$ is constructed as the union of the selected segments:

$$M = \bigcup_{p \in P_{\text{body}}^{\mathrm{sel}}} p \;\cup\; \bigcup_{q \in P_{\text{cloth}}^{\mathrm{sel}}} q,$$

where $P^{\mathrm{sel}}$ denotes the subset of segments chosen by the rule-based mapping.
This lexicon-guided masking enables optimal, style-harmonized region selection, yielding superior mask efficiency (0.82 on dresses, 0.89 on upper-body; see Table below) relative to prior approaches (Han et al., 24 Sep 2025).
| Category | CatVTON | IDM-VTON | InstructVTON |
|---|---|---|---|
| Dresses | 0.6876 | 0.7334 | 0.8269 |
| Upper body | 0.8379 | 0.8196 | 0.8924 |
| VITON-HD Total | 0.6877 | 0.6889 | 0.7808 |
InstructVTON demonstrates that lexicon-guided, VLM-informed masking outperforms hand-crafted or input-dependent segmentation strategies in complex, iterative generative pipelines.
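The rule-based mask construction can be sketched as follows (the segment names, pixel-set representation, and selection rule are illustrative; the actual system derives segments from real images and VLM-parsed instructions):

```python
def build_inpainting_mask(body_segments, cloth_segments,
                          selected_body, selected_cloth, height, width):
    """Union the selected body-part and clothing segments into one binary mask M.

    Segments are dicts mapping a region name to a set of (row, col) pixels.
    """
    mask = [[0] * width for _ in range(height)]
    for name in selected_body:
        for r, c in body_segments.get(name, ()):
            mask[r][c] = 1
    for name in selected_cloth:
        for r, c in cloth_segments.get(name, ()):
            mask[r][c] = 1
    return mask

# Toy 3x3 "image": the instruction maps to torso + upper-garment regions.
body = {"torso": {(1, 1)}, "arms": {(0, 0)}}
cloth = {"upper": {(1, 2)}, "lower": {(2, 2)}}
mask = build_inpainting_mask(body, cloth, ["torso"], ["upper"], 3, 3)
# Only the torso and upper-garment pixels are set; all other cells remain 0
```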
6. Methodological Distinctions, Applications, and Limitations
Lexicon-guided masking can be instantiated as:
- Probing frameworks (e.g., MLM heads for specific POS categories or domain terms),
- Training-time augmentation in pretraining or multitask learning,
- Rule-based or learned binary mask generation in conditional image synthesis,
- Masking policies for enhanced terminology preservation in NMT,
- Segment ablation for interpretability or adversarial testing in multimodal systems.
Across domains, the use of external lexica or domain knowledge directly counteracts the weaknesses of purely random or corpus-statistical masking (e.g., low coverage of rare or compositional units, inefficient learning of grounded representations).
However, limitations recurrently observed include:
- Static lexica that may overlook dynamic or context-specific expressions (Levine et al., 2020).
- Coarseness when only high-frequency or high-PMI terms are eligible for masking, leaving rare idiomatic or compositional terms underexposed (Levine et al., 2020, Lee et al., 2021).
- In image-based systems, mask region definition is often limited by the granularity of available segmenters or heuristic rules rather than learned optimization (Han et al., 24 Sep 2025).
- Probe-only masking does not provide corrective learning, creating a gap between probing diagnosis and model improvement (Beňová et al., 2024).
A plausible implication is that integration of adaptive lexicon discovery, differentiable mask optimization, and contextual lexicon weighting constitutes an open research direction.
7. Outlook and Generalization
Lexicon-guided masking constitutes an extensible mechanism for fine-grained control and interpretability in deep models, with demonstrated efficacy in language modeling, translation, multimodal understanding, and conditional generation. Its design supports flexible specialization to new domains (e.g., medical, legal, instructional), new modalities (audio, video via aligned lexica), and dynamic updating as model knowledge evolves.
Ongoing lines of research encompass:
- Dynamic lexicon construction during training to capture emerging phrases or relationships (Levine et al., 2020).
- Integrating external knowledge bases for enhanced context—allied with mask sampling and objective modulation.
- Addressing sampling efficiency, computational cost, and mask quality routing in segmentation-guided pipelines.
In all cases, lexicon-guided masking is underpinned by the principle of targeted, knowledge-driven ablation to foster both robust learning and interpretable model probing (Beňová et al., 2024, Levine et al., 2020, Lee et al., 2021, Han et al., 24 Sep 2025).