Contamination-Aware Assessment in SLMs
- Contamination-aware assessment is a quantitative framework that injects controlled syntactic and semantic corruptions into fine-tuning datasets to evaluate SLM robustness.
- It employs metrics like accuracy, semantic similarity, grammatical correctness, and pattern adherence to measure performance degradation.
- Empirical findings reveal that even minimal syntactic corruption drastically degrades SLM performance, highlighting challenges in deploying instruction-tuned models.
Contamination-aware assessment is a rigorous quantitative framework developed to diagnose and measure the impact of corrupted fine-tuning data on the behavioral robustness of instruction-tuned small language models (SLMs) (Scaria et al., 10 Nov 2025). Distinct from generic robustness evaluation, contamination-aware protocols explicitly inject controlled amounts and types of corruption—syntactic or semantic—into fine-tuning data, then systematically benchmark model degradation across multiple dimensions of output quality, adherence to harmful patterns, and core linguistic competence. This approach is central for reliable deployment of SLMs in resource-constrained settings, where data integrity cannot be assumed and the risk of performance collapse is acute.
1. Formalism and Core Metrics
At its foundation, contamination-aware assessment defines a clean instruction-tuning dataset D of size N and, for each contamination type T, constructs a fully transformed (i.e., corrupted) version D_T. A contamination fraction φ parametrizes the mixed training set D_φ, in which a fraction φ of clean examples is replaced by their corrupted counterparts. Four key metrics are evaluated post-tuning:
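A minimal sketch of the D_φ construction, assuming D and D_T are parallel lists of examples and that corruption replaces (rather than appends to) clean items; the sampling scheme is an illustrative choice, not the paper's exact implementation:

```python
import random

def mix_dataset(clean, corrupted, phi, seed=0):
    """Build D_phi: replace a fraction phi of the clean examples with
    their corrupted counterparts, keeping the dataset size fixed at N."""
    assert len(clean) == len(corrupted)
    n_corrupt = round(phi * len(clean))
    rng = random.Random(seed)  # fixed seed for reproducible mixtures
    idx = set(rng.sample(range(len(clean)), n_corrupt))
    return [corrupted[i] if i in idx else ex for i, ex in enumerate(clean)]

# Toy placeholder examples, not real instruction-tuning data:
clean = [f"clean-{i}" for i in range(100)]
corrupted = [f"bad-{i}" for i in range(100)]
d_25 = mix_dataset(clean, corrupted, 0.25)  # 25% contaminated mixture
```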
- Accuracy (A): fraction of correctly answered test examples.
- Semantic similarity (S): mean cosine similarity of output embeddings (all-mpnet-base-v2) to references.
- Grammatical correctness (G): fraction judged grammatically correct by a validated LLM judge.
- Pattern adherence (P): fraction of outputs matching the injected transformation.
- Secondary: Lexical overlap scores (BLEU, ROUGE, METEOR).
Performance drops (ΔA = A_clean − A_φ), failure rates (1 − A), and metric degradation curves as functions of the contamination fraction φ are central to the evaluation protocol.
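The metric bookkeeping can be sketched as follows; the `cosine` helper stands in for scoring with all-mpnet-base-v2 embeddings, and the metric values are illustrative placeholders, not paper results:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (the paper scores S
    with all-mpnet-base-v2 embeddings; plain lists stand in here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def degradation(clean, contaminated):
    """Per-metric performance drop: Delta_m = m_clean - m_phi
    for each metric m in {A, S, G, P}."""
    return {m: clean[m] - contaminated[m] for m in clean}

# Illustrative metric dictionaries only (not figures from the paper):
clean_scores = {"A": 0.85, "S": 0.90, "G": 0.95}
at_phi = {"A": 0.01, "S": 0.30, "G": 0.40}
drops = degradation(clean_scores, at_phi)
```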
2. Transformation Categories and Implementation
Contamination-aware assessment rigorously controls both syntactic and semantic corruption modes:
- Syntactic:
- Character reversal (crev): each answer string c1 c2 … cn is mapped to cn … c2 c1.
- Word reversal (wrev): each answer w1 w2 … wm is mapped to wm … w2 w1, with individual tokens left intact.
- Semantic:
- Irrelevant response (irr): each question q_i is paired with a random answer a_j drawn from the corpus (j ≠ i).
- Counterfactual (cfact): Responses generated via adversarial prompts to Gemini 2.5 Flash simulating alternate realities.
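These transformations can be sketched in runnable form (a reconstruction, not the paper's exact pseudocode; the counterfactual case is shown only as a prompt stub, since it requires an external generator such as Gemini 2.5 Flash, and the prompt wording is illustrative):

```python
import random

def crev(answer):
    """Character reversal: reverse the full answer string."""
    return answer[::-1]

def wrev(answer):
    """Word reversal: flip token order, keep each token intact."""
    return " ".join(answer.split()[::-1])

def irr(answers, seed=0):
    """Irrelevant response: reassign answers so each question gets a
    random answer a_j with j != i (a derangement of the answer list)."""
    rng = random.Random(seed)
    while True:
        perm = answers[:]
        rng.shuffle(perm)
        if all(a != b for a, b in zip(answers, perm)):
            return perm

def cfact_prompt(question):
    """Counterfactual stub: adversarial prompt for an external LLM
    (illustrative wording, not the paper's exact prompt)."""
    return f"Answer as if in an alternate reality where facts differ: {question}"
```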
3. Experimental Protocols and Model Families
Comprehensive assessment was performed on 23 SLMs (270M–4B parameters) spanning six families (Gemma3, Llama3.2, OLMo2, Phi4, Qwen2.5, SmolLM2), with base and instruction-tuned variants. Each model underwent fine-tuning on all four contamination types T ∈ {crev, wrev, irr, cfact} crossed with four contamination fractions, yielding 16 contaminated settings plus a clean baseline per model; training ran for five epochs with the AdamW optimizer (cosine learning-rate schedule, weight decay 0.1).
The test set comprised 2,018 diverse QA items generated by GPT-4o and then cleaned. Evaluation strictly followed an automated LLM-as-Judge protocol (Gemini 2.0 Flash) validated against human annotator agreement.
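The 4 × 4 experimental grid above can be enumerated directly; a minimal sketch, where the fraction values {25%, 50%, 75%, 100%} are an assumption consistent with the 16 settings and the fractions cited in the findings below:

```python
from itertools import product

TYPES = ["crev", "wrev", "irr", "cfact"]
FRACTIONS = [0.25, 0.50, 0.75, 1.00]  # assumed values; 4 x 4 = 16 settings

def run_grid():
    """Enumerate every (type, fraction) fine-tuning configuration,
    plus the clean baseline (phi = 0)."""
    settings = [{"type": None, "phi": 0.0}]  # clean baseline
    settings += [{"type": t, "phi": p} for t, p in product(TYPES, FRACTIONS)]
    return settings

grid = run_grid()  # 17 runs per model: 16 contaminated + 1 clean
```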
4. Empirical Findings: Asymmetry and Capability Curse
Contamination-aware protocols reveal profound asymmetries in SLM vulnerability:
| Transformation | Clean accuracy | Accuracy @ 25% contamination | Absolute drop |
|---|---|---|---|
| crev | 85% | 1% | 84% |
| wrev | 85% | 45% | 40% |
| cfact | 85% | 60% | 25% |
| irr | 85% | 80% | 5% |
- Syntactic contamination (crev/wrev) causes catastrophic failure at only 25% contamination: crev accuracy collapses from 85% to roughly 1%, and wrev degrades further as the contamination fraction rises toward 75%.
- Semantic corruption (cfact, irr) exhibits threshold resilience: accuracy at 25% contamination remains at 60–80% and declines only gradually as the fraction increases.
Semantic similarity and grammaticality mirror these accuracy curves: the syntactic transformations (crev, wrev) drive both metrics down sharply even at low contamination fractions, whereas the semantic transformations largely preserve grammatical correctness and erode semantic similarity only gradually as contamination grows.
The “capability curse”: larger models (e.g., Phi4_Mini_IT) adhere more strictly to harmful semantic transformations, exhibiting substantially higher pattern adherence P than smaller models (e.g., SmolLM2_360M_IT).
5. Alignment Effects: Inconsistent Robustness Gains
Comparison of base versus instruction-tuned models under contamination reveals non-uniform effects. For crev at 25% contamination, the Llama3.2_3B base and instruction-tuned variants both degrade severely, with no consistent advantage for either.
Gemma3_4B_IT slightly outperforms its base counterpart on grammaticality for wrev at 25% contamination, but this improvement is inconsistent across families. Broadly, alignment neither reliably enhances nor degrades contamination resistance. No statistical significance tests beyond standard-error shading of performance curves were reported.
6. Protocols for Contamination-Robust Training and Deployment
A contamination-aware SLM evaluation protocol comprises:
- Assemble a clean instruction-tuning dataset D; identify the relevant transformation set T ⊆ {crev, wrev, irr, cfact}.
- For each contamination fraction φ under test, create the mixed dataset D_φ.
- Fine-tune model variants identically on these mixtures.
- Measure A, S, G, and P on a held-out test set.
- Plot each metric's degradation curve against φ; set contamination thresholds beyond which degradation is deemed unacceptable (e.g., accuracy falling below a preset floor).
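The final step of the protocol can be sketched as a sweep over an accuracy-versus-φ curve; the curves and the 50% accuracy floor below are illustrative placeholders, not paper numbers:

```python
def max_safe_phi(curve, floor=0.50):
    """Largest contamination fraction at which accuracy stays at or above
    `floor`; returns 0.0 if no tested fraction is safe.

    `curve` maps phi -> accuracy, e.g. {0.25: 0.45, 0.50: 0.10, ...}.
    """
    safe = [phi for phi, acc in curve.items() if acc >= floor]
    return max(safe, default=0.0)

# Illustrative degradation curves (not figures from the paper):
crev_curve = {0.25: 0.01, 0.50: 0.00, 0.75: 0.00, 1.00: 0.00}
irr_curve = {0.25: 0.80, 0.50: 0.65, 0.75: 0.40, 1.00: 0.20}
```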
Mitigation strategies include data validation filters (e.g., n-gram anomaly detection), adversarial curriculum contamination, contamination-aware curriculum scheduling, and benchmark-integrated monitoring.
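As one concrete instance of the data-validation filters mentioned above, a character n-gram anomaly detector can catch syntactic corruptions such as crev cheaply; this is a minimal sketch, and the choice of n = 3 and the 50% known-trigram threshold are illustrative:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_reference(corpus, n=3):
    """Character-trigram counts over a trusted clean corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(char_ngrams(text.lower(), n))
    return counts

def is_anomalous(text, ref, min_known=0.5):
    """Flag text if fewer than `min_known` of its trigrams appear in the
    reference; character-reversed answers yield mostly unseen trigrams."""
    grams = char_ngrams(text.lower())
    if not grams:
        return False
    known = sum(g in ref for g in grams)
    return known / len(grams) < min_known

# Toy trusted corpus standing in for a validated clean dataset:
ref = build_reference(["the cat sat on the mat",
                       "paris is the capital of france"])
```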
7. Implications, Limitations, and Recommendations
Contamination-aware assessment demonstrates that minimal syntactic pattern injection causes universal collapse in SLM performance, far outstripping the degradation from semantic corruption. The capability curse cautions against the naive assumption that larger, more capable models are intrinsically more robust to fine-tuning errors; instead, they amplify pattern adherence to harmful transformations. Alignment procedures may not confer additional resilience and can, in specific cases, reduce it.
For credible instruction-tuned SLM deployment, contamination-aware protocols must be integrated into development and evaluation cycles. Empirical evidence mandates strict data-quality controls, explicit measurement and reporting of contamination-related degradation, and renewed scrutiny of alignment processes—especially for SLMs intended for high-stakes, resource-constrained environments (Scaria et al., 10 Nov 2025).