In-Context Probing (ICP)
- In-Context Probing (ICP) is a framework that systematically manipulates LLM activations to uncover the neural basis of in-context learning.
- ICP employs causal interventions, behavioral heatmaps, and structured prompt manipulation to isolate and quantify key subsystems within language models.
- ICP supports applications in data attribution, membership inference, and robust classifier construction, bridging theoretical insights with practical model auditing.
In-Context Probing (ICP) is a methodology and analytical framework designed to interrogate, quantify, and explain the mechanisms underlying in-context learning (ICL) in LLMs and related neural architectures. ICP encompasses behavioral, mechanistic, and attributional approaches, uniting causal interventions, structured prompt manipulation, and detailed representational analysis to isolate and characterize the subsystems responsible for ICL, including their architectural dependencies and vulnerabilities. ICP also plays emergent roles in privacy auditing, data attribution, and robust classifier construction, extending well beyond classical probing paradigms.
1. Conceptual Foundations and Definitions
At its core, ICP formalizes probing as the systematic manipulation and interrogation of an LLM's representations, activations, or outputs in response to controlled in-context prompts. Unlike basic ICL benchmarks—which typically measure surface-level accuracy after appending demonstrations to a prompt—ICP explicitly seeks to determine (a) where within the model the capability for ICL is encoded, (b) how this encoding enables the model to generalize or transfer across tasks, and (c) how these mechanisms vary across model architectures and task typologies.
In “Understanding In-Context Learning Beyond Transformers,” functional ICP constructs are developed via the extraction of function vectors (FVs) associated with specific heads in attention or state-space modules. For a given head $h$ and task $t$, the function vector is $v_{t,h} = \mathbb{E}_{p \in P_t}[h(p)]$, where $h(p)$ denotes the head’s final-token value vector output for a prompt $p$ and $P_t$ is a set of task demonstrations (Wang et al., 27 Oct 2025).
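The expectation above reduces to a simple average once per-prompt head outputs are in hand. A minimal sketch, assuming the final-token outputs of one head have already been extracted as numpy arrays (the extraction hook itself is model-specific and omitted here):

```python
import numpy as np

def function_vector(head_outputs: list) -> np.ndarray:
    """Average a head's final-token output over a set of task
    demonstration prompts to form its function vector v_{t,h}."""
    return np.mean(np.stack(head_outputs), axis=0)

# Toy example: three demonstration prompts, head dimension 4.
rng = np.random.default_rng(0)
demos = [rng.normal(size=4) for _ in range(3)]
v_th = function_vector(demos)
assert v_th.shape == (4,)
```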
Separately, recent instantiations of ICP for data attribution measure the change in model log-likelihoods when prepending candidate training examples into the context for held-out test queries. The empirical improvement in log-likelihood acts as a proxy for influence-based data valuation (Jiao et al., 2024, Lu et al., 18 Dec 2025).
2. Methodological Pillars and Causal Interventions
ICP comprises intertwined behavioral and mechanistic methodologies:
- Behavioral Probing (AIE Heatmaps): The Average Indirect Effect (AIE) of a head $h$ for task $t$ is the expected logit shift toward the correct answer when the head’s FV $v_{t,h}$ is injected into its representation for a corrupted prompt $\tilde{p}$. This is computed via $\mathrm{AIE}_{t,h} = \mathbb{E}_{\tilde{p}}\big[\mathrm{logit}_y(\tilde{p};\, v_{t,h}) - \mathrm{logit}_y(\tilde{p})\big]$, where $\mathrm{logit}_y(\tilde{p};\, v_{t,h})$ is the logit for the ground-truth label $y$ after injection and $\mathrm{logit}_y(\tilde{p})$ is the uncorrected logit (Wang et al., 27 Oct 2025).
- Causal Interventions:
- Steering: Addition of the FV to an attention or state-space head’s output in a demonstration-free context.
- Ablation: Replacing the head’s final-token output with zero or a mean vector to quantify necessity.
These interventions are systematically applied to high-AIE heads (top quantiles) versus randomly selected heads. Sharp recovery or collapse of ICL performance upon these interventions pinpoints the causal substrates for ICL and exposes their dependence on architecture and task category (Wang et al., 27 Oct 2025).
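The behavioral and causal pieces above can be sketched together. Here `logits_fn` is a hypothetical stand-in for a model call that optionally injects a vector into one head's output; a real implementation would use forward hooks on the model:

```python
import numpy as np

def average_indirect_effect(logits_fn, corrupted_prompts, v_th, y):
    """AIE: mean shift of the ground-truth logit y when the function
    vector v_th is injected on corrupted, demonstration-free prompts."""
    effects = [logits_fn(p, inject=v_th)[y] - logits_fn(p, inject=None)[y]
               for p in corrupted_prompts]
    return float(np.mean(effects))

def steer(head_out, v_th, alpha=1.0):
    """Steering: add the FV to a head's output in a zero-shot context."""
    return head_out + alpha * v_th

def ablate(head_out, mean_vec=None):
    """Ablation: replace the head's final-token output with zero (or a
    mean vector) to test whether the head is necessary for ICL."""
    return np.zeros_like(head_out) if mean_vec is None else mean_vec

# Toy model: injecting a vector adds its sum to the logit of label 1.
def toy_logits(prompt, inject=None):
    logits = np.zeros(3)
    if inject is not None:
        logits[1] += inject.sum()
    return logits

aie = average_indirect_effect(toy_logits, ["p1", "p2"], np.ones(2), y=1)
# aie == 2.0: injection shifts the label-1 logit by 2 on every prompt
```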
3. Applications: Data Attribution, Membership Inference, and Classification
ICP underpins several high-impact applications beyond classical language understanding tasks:
- Data Attribution: ICP can serve as a gradient-free proxy for influence-based data selection. For candidate data $z$, the ICP score with respect to test set $D_{\text{test}}$ is $\mathrm{ICP}(z) = \ell_{\text{with}} - \ell_{\text{without}}$, where $\ell_{\text{with}}$ and $\ell_{\text{without}}$ are the mean per-token log-likelihoods of $D_{\text{test}}$ with and without $z$ in context. This yields strong Spearman correlation (e.g., ρ=0.729) with gradient-based influence measures (Jiao et al., 2024).
- Membership Inference Attacks (MIAs): ICP harnesses the Optimization Gap, defined as the difference in a sample’s loss before and after a hypothetical one-step fine-tuning update. In black-box ICP-MIA, probe contexts are constructed either from reference pools (retrieval-based) or via self-perturbation (masking/generation), and the membership decision is based on minimal log-likelihood improvement: a sample $x$ is flagged as a member when $\ell(x \mid c) - \ell(x) < \tau$ for probe context $c$ and threshold $\tau$. This approach achieves state-of-the-art AUC (up to 0.965) and TPR@1%FPR (up to 0.518) across datasets, outperforming prior surface-level attacks (Lu et al., 18 Dec 2025).
- Robust Classifier Construction: Instead of decoding output labels per prompt as in standard ICL, ICP aggregates hidden states for inputs contextualized by instructions, then trains a lightweight probe (e.g., linear softmax) atop the frozen LLM representation. This yields classifiers with lower variance to prompt design, greater sample efficiency, and competitive or superior accuracy compared to full model fine-tuning (Amini et al., 2023).
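A minimal sketch of the three applications above, assuming only a hypothetical `loglik_fn(query, context=...)` interface returning a mean per-token log-likelihood (in practice a wrapper around an LLM API), plus pooled hidden states `H` for the probe; names and thresholds are illustrative, not the papers' exact setups:

```python
import numpy as np

def icp_score(loglik_fn, candidate, test_queries):
    """Data attribution: mean log-likelihood improvement on held-out
    queries when the candidate example is prepended to the context."""
    with_z = np.mean([loglik_fn(q, context=candidate) for q in test_queries])
    without = np.mean([loglik_fn(q, context=None) for q in test_queries])
    return float(with_z - without)

def is_member(loglik_fn, sample, probe_context, tau):
    """Membership inference: a sample the model already fits well gains
    little from a probe context, so a gap below tau flags membership."""
    gap = loglik_fn(sample, context=probe_context) - loglik_fn(sample, context=None)
    return gap < tau

def train_softmax_probe(H, y, n_classes, lr=0.1, steps=500):
    """Classifier construction: fit a linear softmax probe on frozen,
    instruction-contextualized hidden states H of shape (n, d)."""
    W = np.zeros((H.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = H @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        W -= lr * H.T @ (P - Y) / len(H)              # cross-entropy gradient step
    return W
```

All three routines need only forward-pass access, which is what makes ICP usable against limited-access APIs.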
4. Architectural and Task-Type Sensitivity
ICP reveals pronounced architecture-dependent and task-type-dependent effects in ICL:
- For parametric retrieval tasks (e.g., factual look-ups), injected FVs in attention or state-space heads robustly control ICL behavior. In transformers and single-headed state-space models, top-AIE heads are both necessary and sufficient for ICL; steering and ablation produce dramatic effects.
- For contextual understanding tasks (e.g., sentiment, hate-speech), FV control is diffuse or ineffective, indicating that ICL circuits are more distributed and may recruit value pathways, cross-head dynamics, or other mechanisms.
- Architectural idiosyncrasies are observed: in hybrid models, transformer streams dominate ICL, whereas SSM streams are inert or even disruptive. Mamba2 exhibits high AIE but fails to respond to steering, suggesting head fragmentation or alternative SSM update rules override FV-centric ICL (Wang et al., 27 Oct 2025).
5. ICP in Model Analysis and Interpretability
ICP has emerged as a principled approach for mapping information flow and functional specialization within LLMs:
- Tools such as Differentiable Subset Pruning (DSP) provide attention-head importance estimates under in-context prompting, revealing that certain linguistic properties (e.g., entity-type information) are more centrally encoded and leveraged by the LM, as measured by downstream cross-entropy increases upon head removal (Li et al., 2022).
- ICP facilitates layer-wise and head-wise localization of linguistic, factual, or algorithmic properties, offering a high-resolution map of what information is encoded by which parts of the network and to what extent it supports model output (Li et al., 2022).
- Comparisons to diagnostic probes (model-based) demonstrate that ICP-based, prompt-based, and hybrid approaches are highly selective: they "extract" existing knowledge rather than "learn" new mappings, as evidenced by near-majority-class performance when probing random or untrained models (Li et al., 2022).
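The selectivity point can be illustrated with a toy check: a simple probe over class-separated ("trained") representations recovers the labels, while the same probe over random representations stays near the majority-class baseline. The nearest-centroid probe and synthetic features below are illustrative assumptions, not the cited papers' exact setup:

```python
import numpy as np

def centroid_probe_acc(H, y):
    """Accuracy of a nearest-class-centroid probe, a minimal stand-in
    for a diagnostic probe over frozen representations."""
    classes = np.unique(y)
    C = np.stack([H[y == c].mean(axis=0) for c in classes])
    dists = ((H[:, None, :] - C[None]) ** 2).sum(-1)
    pred = classes[np.argmin(dists, axis=1)]
    return float((pred == y).mean())

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
H_trained = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
H_random = rng.normal(0, 1, (100, 8))   # stand-in for an untrained model
acc_t = centroid_probe_acc(H_trained, y)
acc_r = centroid_probe_acc(H_random, y)
# acc_t is near 1.0; acc_r hovers near the 0.5 majority-class baseline
```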
6. Experimental Frameworks, Implementation Practices, and Metrics
Rigorous ICP studies emphasize:
- Diversity of probe types: including behavioral manipulations, direct hidden-state interventions, structured synthetic tasks (linear regression, Gaussian kernels, nonlinear dynamics), and real-world downstream tasks (NLI, hate speech).
- Metrics spanning generalization (MSE, F1, AUC), attribution (log-likelihood improvement, binary "helpfulness" voting), and efficiency (FLOPs, inference time, memory footprint).
- Careful control of confounders, e.g., context length, task dimensionality, input scaling, and curriculum learning schedules to avoid trivial shortcuts and to stress specific architectural inductive biases (Liu et al., 10 May 2025).
- Black-box compatibility (for ICP-MIA, data attribution), requiring only log-likelihood access—enabling application in limited-access APIs (Lu et al., 18 Dec 2025, Jiao et al., 2024).
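As an illustration of confounder control in synthetic probe tasks, a generator for the linear-regression ICL setup might expose context length, dimensionality, and input scaling as explicit knobs (the function name and signature are illustrative):

```python
import numpy as np

def make_linear_regression_prompt(rng, dim=4, n_demos=8, scale=1.0):
    """Build one synthetic ICL episode: n_demos (x_i, w.x_i) pairs plus
    a held-out query, with dimensionality and input scale controlled."""
    w = rng.normal(size=dim)                         # latent task vector
    X = rng.normal(scale=scale, size=(n_demos + 1, dim))
    y = X @ w
    return (X[:-1], y[:-1]), (X[-1], y[-1])          # demonstrations, query

rng = np.random.default_rng(0)
(demo_X, demo_y), (qx, qy) = make_linear_regression_prompt(rng, dim=3, n_demos=5)
assert demo_X.shape == (5, 3) and demo_y.shape == (5,)
```

Sweeping `n_demos`, `dim`, and `scale` independently is what lets a study attribute performance changes to a specific inductive bias rather than a trivial shortcut.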
7. Implications, Limitations, and Future Directions
ICP brings several practical and theoretical benefits:
- For practitioners, ICP presents a computationally lightweight, training-free mechanism to dissect, manipulate, or audit model behaviors—usable for data curation, privacy risk assessment, and rapid classifier construction (Lu et al., 18 Dec 2025, Jiao et al., 2024, Amini et al., 2023).
- ICP underscores that ICL is not monolithic: its mechanisms, localization, and controllability are heterogeneous across architectures and task categories, cautioning against simple transfer of interpretability hypotheses (Wang et al., 27 Oct 2025).
- Major limitations include the need for precise task demarcation (parametric vs. contextual), dependence on reference data quality (for reference-based probing), and constraints on the generalizability of fine-grained circuit insights to large, heterogeneous LLMs.
- Open problems involve extending ICP to groupwise or batched interventions, richer probing of value-pathway or non-attention circuits, exploring its reach in pretraining settings, and integrating it with automated prompt search or neural pruning methodologies for more comprehensive causal discovery (Wang et al., 27 Oct 2025, Lu et al., 18 Dec 2025, Jiao et al., 2024).
In sum, In-Context Probing provides robust theoretical, empirical, and practical frameworks to interrogate, explain, and operationalize the foundations of in-context learning and its multifaceted implications for neural NLP systems.