
Context Sensitivity Fingerprints

Updated 22 January 2026
  • Context Sensitivity Fingerprints (CSF) are defined as compact, model‐specific summaries that capture variation in stereotype-selection rates over dimensions like time, location, style, and audience.
  • The CSF methodology systematically evaluates bias dispersion and paired contrasts using a high-resolution grid of context framings, supported by rigorous bootstrap and FDR statistical controls.
  • Empirical findings reveal that context-induced shifts in bias scores can significantly alter safety claims, underscoring the need for context-sensitive evaluations in model deployment.

A Context Sensitivity Fingerprint (CSF) is a compact, model-specific summary quantifying how the stereotype-selection rate of an LLM varies as prompt context is systematically altered along controlled dimensions such as place, time, style, and audience. Rather than reporting a single summary “bias score,” a CSF captures both the dispersion of bias over these context dimensions and the magnitude of contrast effects between particular context framings. This approach exposes the contexts in which a model’s risk of stereotype selection is most variable, providing a stress test for how robustly alignment toward anti-stereotypical behavior generalizes under non-adversarial, routine context variation. The CSF protocol formalizes metrics, significance correction, and pragmatic workflows for robust evaluation on the Contextual StereoSet benchmark (Basu et al., 15 Jan 2026).

1. Formal Structure and Definition

A CSF comprises two main classes of statistics: (1) per-dimension dispersion, quantifying how much the stereotype-selection rate varies across levels of a context dimension for a given model, and (2) mean paired contrasts, measuring the average shift in bias score when moving between specific context framings. These are computed over a test grid that holds stereotype content fixed while systematically varying context along multiple axes.

Let $I$ denote the set of stimulus items, and for each $i \in I$, let $C_i$ be the set of context framings. The stereotype-selection rate for model $m$, item $i$, and context $c$ is:

$$SS_{m,i,c} = \mathbb{I}\{y_{m,i,c} = S\}$$

where $y_{m,i,c}$ is the model’s chosen completion and $S$ denotes the stereotypical option.

For dimension $d$ (e.g., year), with levels $\ell \in L_d$, aggregate rates:

$$SS_{m,i,\ell} = \frac{1}{|C_{i,\ell}|} \sum_{c \in C_{i,\ell}} SS_{m,i,c}$$

The per-item standard deviation across these levels is:

$$\sigma_{m,i}^{(d)} = \sqrt{ \frac{1}{|L_d|} \sum_{\ell \in L_d} \left( SS_{m,i,\ell} - \bar{SS}_{m,i}^{(d)} \right)^2 }$$

where $\bar{SS}_{m,i}^{(d)} = \frac{1}{|L_d|} \sum_{\ell} SS_{m,i,\ell}$.

The reported dispersion metric per dimension is:

$$\sigma_{d} = \frac{1}{|I|} \sum_{i \in I} \sigma_{m,i}^{(d)}$$

Mean paired contrasts between context levels aa and bb for a chosen dimension are:

$$\Delta SS^{a-b}_m = \frac{1}{|I|} \sum_{i \in I} \left( SS_{m,i,a} - SS_{m,i,b} \right)$$
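As a concrete sketch, the statistics above can be computed with a short script. All data below are synthetic, and the "year" levels and cell size are illustrative, not taken from the benchmark:

```python
# Sketch of the CSF dispersion and contrast statistics on synthetic data.
import random
import statistics

random.seed(0)

items = range(20)                        # stimulus item ids (illustrative)
levels = [1990, 2000, 2010, 2020, 2030]  # levels of the "year" dimension

# SS[i][l]: stereotype-selection rate for item i at year level l,
# simulated here as the mean of 10 binary choices per (i, l) cell.
SS = {
    i: {l: statistics.mean(random.choice([0, 1]) for _ in range(10))
        for l in levels}
    for i in items
}

# Per-item population std-dev across levels (matches the 1/|L_d| formula).
sigma_item = {i: statistics.pstdev(SS[i].values()) for i in items}

# Reported per-dimension dispersion: mean of per-item std-devs over items.
sigma_year = statistics.mean(sigma_item.values())

# Mean paired contrast between levels a = 1990 and b = 2030.
delta = statistics.mean(SS[i][1990] - SS[i][2030] for i in items)

print(f"sigma_year = {sigma_year:.3f}, delta(1990-2030) = {delta:+.3f}")
```

Using the population standard deviation (`pstdev`) rather than the sample one matches the $1/|L_d|$ normalization in the dispersion formula.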

2. Contextual Dimensions and Experimental Grid

The underlying evaluation grid systematically explores controlled context variables:

  • Location: 12 levels (e.g., various places)
  • Year: 5 levels (e.g., 1990, 2000, ..., 2030)
  • Style: 3 levels (e.g., “gossip,” “direct,” etc.)
  • Observer Framing: 2 levels (out-group vs similar audience)

This creates a combinatorial grid of 360 unique context framings per item for detailed diagnostic evaluation, and a budgeted protocol with a subset for large-scale screening.

High-resolution diagnostic analysis with the full 360-context grid supports detection of contexts with maximal stereotype-signal dispersion (e.g., $\sigma_{loc} \approx 0.07$, $\sigma_{obs}$ up to 0.12), while budgeted evaluations (exp2; 72 contexts + 2 baselines) offer broader coverage with coarser but scalable estimates. Notably, model-family heterogeneity is observed, with magnitudes and even directions of effects (e.g., “gossip” vs “direct” style) diverging between model types.
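The grid can be enumerated directly. In the sketch below only the per-dimension counts come from the text; the individual level labels (beyond the styles and framings named above) are hypothetical placeholders:

```python
# Sketch of the combinatorial context grid: 12 x 5 x 3 x 2 = 360 framings.
from itertools import product

locations = [f"loc_{k}" for k in range(12)]  # 12 location levels (placeholders)
years = [1990, 2000, 2010, 2020, 2030]       # 5 year levels
styles = ["gossip", "direct", "neutral"]     # 3 styles; third label is assumed
observers = ["out-group", "similar"]         # 2 observer framings

grid = list(product(locations, years, styles, observers))
print(len(grid))  # prints 360
```

Each tuple in `grid` is one context framing to combine with every stimulus item; the budgeted protocol would sample a subset of this grid instead.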

3. Statistical Inference: Bootstrap and Multiple Testing

Uncertainty in CSF metrics is quantified via nonparametric bootstrap over items. For each statistic $\theta$ (either a dispersion or a contrast), the empirical distribution is generated by $B$ bootstrap replicates:

  • Draw a bootstrap sample $I^{(b)}$ from $I$ with replacement
  • Recompute $\theta^{(b)}$ on $I^{(b)}$

Percentile-based bootstrap confidence intervals are reported:

$$\mathrm{CI}_{1-\alpha}(\theta) = [\theta_{(\alpha/2)}, \theta_{(1-\alpha/2)}]$$
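A minimal percentile-bootstrap implementation, using synthetic per-item differences as the data and the mean paired contrast as the statistic (both illustrative):

```python
# Percentile bootstrap CI over items for a statistic theta.
import random

random.seed(1)
# Synthetic per-item differences SS_{i,a} - SS_{i,b} (illustrative data).
diffs = [random.gauss(0.03, 0.1) for _ in range(200)]

def percentile_ci(data, stat, B=2000, alpha=0.05):
    """Resample items with replacement B times; return percentile interval."""
    reps = sorted(
        stat([random.choice(data) for _ in data]) for _ in range(B)
    )
    lo = reps[int((alpha / 2) * B)]
    hi = reps[int((1 - alpha / 2) * B) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
lo, hi = percentile_ci(diffs, mean)
print(f"theta = {mean(diffs):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Because resampling is over items (not contexts), the interval reflects item-level sampling uncertainty, matching the protocol's bootstrap-over-items design.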

Multiple hypothesis testing for the primary paired contrasts is controlled via the Benjamini–Hochberg (BH) false discovery rate procedure at level $q$. For $m$ p-values, the largest $k$ such that $p_{(k)} \le \frac{k}{m} q$ is determined; all contrasts up to rank $k$ are flagged significant, with $q$-values reported as the minimum FDR level at which each test is rejected.
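The BH step-up rule can be sketched in a few lines (the p-values below are illustrative, not from the paper):

```python
# Benjamini-Hochberg step-up procedure at FDR level q.
def bh_reject(pvals, q=0.05):
    """Return a boolean rejection flag per p-value, in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    # Find the largest rank k with p_(k) <= (k/m) * q.
    k_max = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis with rank <= k_max.
    rejected = [False] * m
    for rank, j in enumerate(order, start=1):
        if rank <= k_max:
            rejected[j] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.7]
print(bh_reject(pvals, q=0.05))
# -> [True, True, False, False, False, False]
```

Note the step-up character: a hypothesis can be rejected even if its own p-value exceeds its per-rank threshold, as long as some later-ranked p-value passes.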

4. CSF Computation Workflow

Practical computation proceeds as follows:

data ← load all items I and context grid C
for each model m:
  for each item i in I:
    for each context c in C_i:
      y[c] ← model m’s stereotype/anti/unrelated choice
      SS[c] ← (y[c]=='S') ? 1 : 0
    for each dimension d:
      for each level ℓ in levels(d):
        C_{i,ℓ} ← {c ∈ C_i | context c has level ℓ}
        SS_{i,ℓ} ← mean of SS[c] over c in C_{i,ℓ}
      σ_{m,i}^{(d)} ← std-dev of {SS_{i,ℓ} : ℓ ∈ L_d}
    record SS_{i,c} and SS_{i,ℓ} for contrasts
  for each dimension d:
    σ_d ← mean of σ_{m,i}^{(d)} over i
  for each contrast (a,b):
    ΔSS^{a-b} ← mean over i of (SS_{i,a} - SS_{i,b})
  # bootstrap CIs
  for b in 1..B:
    I_b ← sample-with-replacement(I)
    recompute σ_d^{(b)}, ΔSS^{(b)} using I_b
  CIs ← percentile intervals from bootstrap replicates
  # significance tests & BH–FDR
  for each contrast:
    p_val ← sign-flip permutation test on {SS_{i,a} - SS_{i,b}}
  q_vals ← BH_correct({p_val}, family_size = number of primary contrasts)
  report CSF = {SS_m, σ_loc, σ_yr, σ_style, σ_obs, ΔSS^{gossip-direct}, ΔSS^{dissimilar-similar}, ΔSS^{1990-2030}, CIs, q_vals}
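The sign-flip permutation test used for the paired contrasts exploits the fact that, under the null of no context effect, each per-item difference is symmetric about zero, so randomly flipping signs yields the null distribution of the mean. A minimal sketch on illustrative data:

```python
# Sign-flip permutation test for a mean paired contrast.
import random

random.seed(2)

def sign_flip_pvalue(diffs, n_perm=5000):
    """Two-sided p-value for H0: mean difference is zero."""
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under H0, each item's difference is equally likely to be +/-.
        flipped = [d * random.choice((-1, 1)) for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction for a valid p-value

# Illustrative per-item differences SS_{i,a} - SS_{i,b}.
diffs = [0.1, 0.05, 0.2, -0.02, 0.12, 0.08, 0.15, 0.03]
p = sign_flip_pvalue(diffs)
print(f"p = {p:.4f}")
```

The resulting p-values feed the BH correction in the final step of the workflow above.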

5. Empirical Findings and Model Implications

Contextual StereoSet evaluation across 13 models revealed several robust context effects. For instance, anchoring to “1990” (vs “2030”) consistently increased stereotype-selection across all models ($p<0.05$), gossip style increased it in 5 of 6 models tested under the full-grid protocol, and out-group observer framing produced shifts up to 13 percentage points. These shifts replicated across vignettes in hiring, lending, and help-seeking scenarios. In budgeted screening, dispersion for location and year was often amplified, while style effects could reverse in direction for certain open-weight models. Temporal effects (1990 > 2030) exhibited high robustness across evaluation tracks.

High dispersion ($\sigma_d$) or large, statistically significant paired contrasts ($\Delta SS$) flag unreliability of aggregate bias scores, indicating that model safety claims from fixed-condition tests may not generalize across real-world prompt variability. This supports a methodological shift from “Is this model biased?” toward “Under what conditions does bias appear?” (Basu et al., 15 Jan 2026).

6. Methodological Significance and Practical Use

CSFs offer a methodological advancement in the robustness analysis of LLM alignment. Rather than providing a single summary statistic, CSFs constitute a multidimensional diagnostic that surfaces vulnerabilities to non-adversarial context shifts. Two evaluation tracks enable both high-resolution diagnostic mapping (360-context grid) and scalable screening (budgeted protocol, 4,229 items). All reported metrics are accompanied by bootstrap CIs and FDR-corrected significance, ensuring robust quantification of context dependency in stereotype selection.

The practical recommendation is to evaluate not only bias magnitude but its context-sensitivity profile before real-world deployment. A plausible implication is that models judged ‘safe’ by simplistic, fixed-context bias benchmarks may remain brittle when exposed to routine, real-world contextual diversity.

7. Comparison and Context Within Bias Evaluation

Conventional stereotype bias metrics often collapse across context, implicitly assuming invariance to variations in prompt framing. CSF breaks from this approach by operationalizing “contextual fragility” as a first-class object of study. By systematically probing context dimensions and quantifying effect sizes with rigorous statistical controls, CSF enables a granular understanding of when and why bias emerges.

No claim is made regarding ground-truth bias rates—CSF is strictly a tool for stress-testing evaluation robustness rather than dictating normative bias or alignment levels. This distinguishes CSF from prior work focused solely on average or fixed-condition measurements, instead shifting evaluation focus to the interaction between bias and everyday prompt diversity (Basu et al., 15 Jan 2026).

