In-Context Representation Learning
- In-context representation learning is a framework in which large neural models, such as Transformers, adapt their internal representations to new data purely from context supplied at inference time.
- It requires no gradient-based parameter updates: a frozen model reshapes its query-dependent latent embeddings in response to demonstrations, enabling adaptation where fine-tuning is infeasible.
- This paradigm supports zero- and few-shot generalization, task inference, and multimodal reasoning, and underpins both interpretability and practical deployment of context-driven models.
In-context representation learning refers to the process by which large neural models, such as Transformers, induce semantically meaningful representations of new data exclusively from context provided at inference time—typically, a set of examples and associated outputs—without any explicit gradient-based parameter updates. This paradigm generalizes the well-established concept of in-context learning (ICL) by emphasizing how internal vector embeddings within the model adapt to structure and regularities present in the provided context, thereby supporting zero- or few-shot generalization, task inference, and multimodal reasoning. Recent research positions in-context representation learning as the basis for flexible, data-driven adaptation in both language and multimodal models, enabling deployment in settings where gradient-based fine-tuning is infeasible or undesirable.
1. Theoretical Foundations and Formalism
The core mechanism in in-context representation learning is the implicit, gradient-free adaptation of internal representations based solely on contextual information. For a model such as a Transformer, an input sequence—including demonstration pairs (x_1, y_1), …, (x_k, y_k) and a query x_q—is projected through stacked attention layers, producing query-dependent latent embeddings at each position. Unlike in standard training or fine-tuning, model parameters remain frozen; adaptation occurs via manipulation of the input context. Mathematical analyses leverage kernel methods, mixed-effects decomposition, contrastive dualities, and energy-minimization analogies to characterize the transformation of these internal states.
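The frozen-parameter mechanism can be made concrete with a minimal pure-Python sketch of a single attention head whose weights never change; only the context does. The scalar `w_k`/`w_v` stand in for the key and value projection matrices and are illustrative simplifications, not the parameterization of any cited paper.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_embed(context, query, w_k, w_v):
    """One frozen attention head: the query's latent embedding is a
    softmax-weighted mixture of the context values. No parameter is
    updated; changing the context changes the embedding."""
    keys = [[w_k * x for x in c] for c in context]   # toy stand-in for W_K
    vals = [[w_v * x for x in c] for c in context]   # toy stand-in for W_V
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * v[d] for w, v in zip(weights, vals)) for d in range(dim)]

# Same frozen weights, two different contexts -> two different embeddings.
ctx_a = [[1.0, 0.0], [0.0, 1.0]]
ctx_b = [[5.0, 5.0], [-5.0, 5.0]]
q = [1.0, 0.0]
emb_a = attention_embed(ctx_a, q, w_k=1.0, w_v=1.0)
emb_b = attention_embed(ctx_b, q, w_k=1.0, w_v=1.0)
print(emb_a, emb_b)
```

Swapping `ctx_a` for `ctx_b` changes the query's embedding even though every weight is fixed, which is the essence of adaptation via context manipulation.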
In the contrastive perspective, each example's key and value representations (k_i, v_i) become "views" whose distance to the query drives the representational update. The loss minimized by a single attention layer aligns the representation of the query with those of the exemplars, with overall ICL performance mediated by the distribution of pairwise distances in the latent space (Miyanishi et al., 2024, Ren et al., 2023). The mixed-effect modeling framework further decomposes the contribution of in-context semantics and input formatting to the observed representation shift, elucidating the context-driven adaptation of model predictions (Miyanishi et al., 2024).
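The kernel reading of this alignment can be sketched as follows: a hypothetical one-layer readout weights each demonstration's label by an RBF-style similarity to the query, so the prediction is dominated by the nearest exemplars and performance tracks pairwise latent distances. The function name and the temperature `tau` are illustrative assumptions, not notation from the cited papers.

```python
import math

def rbf_icl_predict(demos, query, tau=1.0):
    """Kernel-regression view of a one-layer attention readout:
    each exemplar's label y_i is weighted by a softmax over
    negative squared distances between its input x_i and the query."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    logits = [-sq_dist(query, x) / tau for x, _ in demos]
    m = max(logits)
    ws = [math.exp(l - m) for l in logits]
    z = sum(ws)
    return sum(w / z * y for w, (_, y) in zip(ws, demos))

demos = [([0.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
print(rbf_icl_predict(demos, [0.9, 0.9]))  # pulled toward the nearer label 1.0
```

A query equidistant from both exemplars yields the average of their labels, making explicit how the distribution of latent distances mediates the output.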
2. Algorithms and Architectures
Recent advances propose explicit algorithms for extracting, composing, and deploying in-context representations. Methods include:
- Contrastive Multimodal ICL (MCICL): Models key–value pairs as contrastive “views,” with representational shifts driven by their latent distance. A contrastive loss is formulated without the need for negative samples, aligning the query with in-context exemplars. Mixed-effect linear models are used to separate semantic and formatting contributions to both prediction and representation shift (Miyanishi et al., 2024).
- Credibility Transformer with ICL: Batch construction alternates between context sets (nearest-neighbor queries in the embedding space) and target sets. Decorated "CLS"-style tokens are formed by aggregating learned representations, augmented with outcome information, and processed through cross-batch attention, allowing context-driven "re-centering" of representations around actuarially coherent peers (Padayachy et al., 9 Sep 2025).
- Semi-supervised In-context Representation Learning (IC-SSL): Stage I uses Transformer layers to compute affinity matrices (e.g., RBF graph Laplacians) over the context, learning geometric structure; stage II implements functional gradient descent in an RKHS via attention, propagating label information across the learned embedding (Fan et al., 17 Dec 2025).
- Implicit In-context Learning (I2CL): Demonstrations are compressed into a context vector via mean aggregation of layerwise multi-head or MLP activations. During inference, this vector is injected through learned residual scaling at each layer, enabling few-shot generalization at the computational cost of zero-shot inference (Li et al., 2024).
- Task Vector Construction: Task representations are encoded as weighted linear combinations of attention head outputs. These “Learnable Task Vectors” are learned via gradient descent on a fixed (frozen) model, empirically supporting cross-modality generalization (text and functional regression) (Saglam et al., 8 Feb 2025).
- Continuous Vector In-Context Learning (Vector-ICL): Arbitrary modality-specific encoders produce continuous embeddings, projected into the LLM's input space via a trainable (typically linear) projector. These embeddings are incorporated directly into the LLM's input pipeline as pseudo-token representations, allowing for unified in-context inference across text, numerical, molecular, and neuroimaging modalities (Zhuang et al., 2024).
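As one concrete sketch of the I2CL idea above, demonstrations can be mean-pooled into a per-layer context vector and added back into the residual stream at inference. The helper names are invented, and the per-layer scale is fixed here rather than learned as in the paper.

```python
def build_context_vector(demo_activations):
    """I2CL-style compression (sketch): mean-pool each layer's
    demonstration activations into a single context vector per layer.
    demo_activations is indexed [demo][layer][dim]."""
    n = len(demo_activations)
    layers = len(demo_activations[0])
    dim = len(demo_activations[0][0])
    return [[sum(demo_activations[d][l][i] for d in range(n)) / n
             for i in range(dim)] for l in range(layers)]

def inject(residual_stream, context_vecs, scales):
    """Add the context vector into each layer's residual stream,
    modulated by a per-layer scale (learned in the paper; fixed here)."""
    return [[h + s * c for h, c in zip(layer, cvec)]
            for layer, cvec, s in zip(residual_stream, context_vecs, scales)]

demos = [[[1.0, 2.0]], [[3.0, 4.0]]]    # 2 demos, 1 layer, dim 2
cv = build_context_vector(demos)         # [[2.0, 3.0]]
out = inject([[0.0, 0.0]], cv, scales=[0.5])
print(cv, out)
```

Because the demonstrations are consumed once at compression time, inference proceeds at the token cost of a zero-shot query, which is the efficiency claim of I2CL.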
3. Multimodal, Visual, and Cross-domain Extensions
In-context representation learning has been extended to multimodal tasks, including vision, language, and chemistry.
- Unified Multimodal Pipelines: Discrete tokenizations of both text (e.g., via BPE) and images (e.g., via VQGAN or similar quantization) are interleaved, embedded in a shared latent space, and modeled autoregressively by a decoder-only Transformer. Sparse mixture-of-experts FFNs are often employed to mitigate task interference (Sheng et al., 2023).
- Training-free Non-textual ICRL: Foundation model (FM) representations are injected into an LLM at prompt time via random or optimal-transport-aligned projections, supporting test-time, training-free multimodal inference. Key factors for successful deployment include diversity in FM representation, proper alignment, and prompt composition (Zhang et al., 22 Sep 2025).
- Visual In-context Retrieval: Empirically, ICL in vision models is highly sensitive to prompt composition. Supervised and unsupervised retrieval systems, often feature-space nearest neighbors or contrastive-learned retrievers, are used to select in-context exemplars that optimize downstream performance, with retrieval quality strongly influencing latent representation clustering (Zhang et al., 2023, Zhou et al., 2024).
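The retrieval step described above can be sketched as plain cosine-similarity nearest-neighbor ranking over exemplar features; the candidate pool, feature vectors, and function names here are illustrative assumptions rather than any specific retriever from the cited work.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den

def retrieve_exemplars(query_feat, pool, k=2):
    """Feature-space nearest-neighbor prompt retrieval (sketch):
    rank candidate exemplars by cosine similarity to the query
    and keep the top k as in-context demonstrations."""
    ranked = sorted(pool, key=lambda item: cosine(query_feat, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical exemplar pool with 2-d feature vectors.
pool = [("dog", [1.0, 0.1]), ("car", [0.0, 1.0]), ("wolf", [0.9, 0.2])]
print(retrieve_exemplars([1.0, 0.0], pool, k=2))
```

Learned retrievers replace the fixed cosine kernel with a trained scoring function, but the selection loop is the same.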
4. Mechanistic and Interpretability Insights
Recent work addresses the mechanism by which in-context representations are formed, reshaped, and deployed in LLMs.
- Representation Dynamics and Energy Minimization: Studies show that as context grows, a model’s latent geometry may undergo a “phase transition” from encoding pretraining semantics to reflecting in-context structure, quantitatively matching predictions from Dirichlet energy minimization on context-specified graphs (Park et al., 2024). This behavior is demonstrated both in language and in vision, with PCA and distance correlation metrics used to track alignment between internal activations and problem structure (Lepori et al., 4 Feb 2026).
- Contrastive Kernel Duality: The forward pass through a Transformer layer is provably equivalent to a one-step gradient update on a dual model under a contrastive loss, providing a theoretical explanation for the rapid adaptation and alignment mediated by contextual demonstrations (Ren et al., 2023).
- Representational Robustness and Limitations: While models reliably induce in-context representations that mirror novel semantic or relational structure, a gap persists between representation induction and downstream deployment, particularly across task boundaries, context interruptions, or shifts in prompt mode (Lepori et al., 4 Feb 2026). Explicitly describing the structure in the prompt recovers deployment capacity, but otherwise the latent representations often remain "inert".
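A toy version of the Dirichlet-energy account can illustrate the reported drift: assuming scalar node states and a hypothetical context-specified path graph, repeated gradient steps on E(z) = Σ w_ij (z_i − z_j)² pull representations toward the graph structure, and the energy falls monotonically.

```python
def dirichlet_energy(z, edges):
    """E(z) = sum over edges of w_ij * (z_i - z_j)^2 for scalar states."""
    return sum(w * (z[i] - z[j]) ** 2 for i, j, w in edges)

def smooth_step(z, edges, lr=0.1):
    """One gradient-descent step on the Dirichlet energy: each node
    moves toward its graph neighbors, mimicking representations
    drifting toward context-specified structure."""
    grad = [0.0] * len(z)
    for i, j, w in edges:
        g = 2 * w * (z[i] - z[j])
        grad[i] += g
        grad[j] -= g
    return [zi - lr * gi for zi, gi in zip(z, grad)]

edges = [(0, 1, 1.0), (1, 2, 1.0)]   # hypothetical context graph: a path
z = [0.0, 5.0, 1.0]                  # initial "pretraining" geometry
e0 = dirichlet_energy(z, edges)
for _ in range(50):
    z = smooth_step(z, edges)
e1 = dirichlet_energy(z, edges)
print(e0, e1)                        # energy decreases toward zero
```

The phase-transition claim in the cited work concerns high-dimensional model activations; this scalar sketch only demonstrates the energy-minimization dynamic itself.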
5. Practical Considerations and Limitations
The efficiency, robustness, and transferability of in-context representation learning depend on several technical factors:
- Prompt and Label Representation: Label representation (e.g., how well the label tokens encode the target classes) and demonstration count act as largely independent axes: the former sets baseline accuracy, while the latter determines the learning slope. Optimizing label representation raises the baseline, and larger models extract more improvement per demonstration (Marinescu et al., 9 Oct 2025).
- Retrieval and Compositionality: Automated prompt retrieval—by feature similarity or learned ranking—substantially boosts vision ICL, but performance is highly sensitive to example pose, style, and spatial-structural match (Zhang et al., 2023). Multimodal prompt composition via compact intent-oriented summaries enables longer context and higher per-token efficiency (Zhou et al., 2024).
- Scalability: Embedding-level injections and summary tokenization allow efficient scaling to larger context windows and multimodal settings; however, homogeneity or poor alignment of external representations constrains effectiveness (Zhang et al., 22 Sep 2025, Park et al., 2024).
- Generalization and Transfer: In unsupervised or cross-domain meta-learning, in-context representation learners that employ compositional augmentation (mixup, iterative task construction) and sequence modeling generalize better to domains with scarce or unseen labels (Vettoruzzo et al., 2024).
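Compositional augmentation of the mixup kind can be sketched in a few lines: the Beta(α, α) mixing weight and the pairing of features and labels under one coefficient λ follow standard mixup, while the function signature and seeding are illustrative assumptions.

```python
import random

def mixup(example_a, example_b, alpha=0.4, rng=random.Random(0)):
    """Mixup-style compositional augmentation (sketch): convexly
    combine two (features, label) pairs to synthesize a new example,
    as used to build tasks for cross-domain in-context meta-learning."""
    lam = rng.betavariate(alpha, alpha)   # mixing weight in (0, 1)
    xa, ya = example_a
    xb, yb = example_b
    x = [lam * u + (1 - lam) * v for u, v in zip(xa, xb)]
    y = lam * ya + (1 - lam) * yb
    return x, y, lam

x, y, lam = mixup(([0.0, 0.0], 0.0), ([1.0, 1.0], 1.0))
print(x, y, lam)   # features and label share the same mixing weight
```

Iterating such combinations over a small labeled pool yields a stream of synthetic tasks, which is the compositional task-construction step the cited work builds on.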
6. Open Problems and Future Directions
Several challenges and research frontiers remain:
- Deployment Gap: Current models can encode but not reliably deploy in-context representations across prompt boundaries or for complex composite tasks. Identifying architectural or training interventions that encourage persistent representation and flexible deployment remains a key challenge (Lepori et al., 4 Feb 2026).
- Mechanistic Localization: Path-patching and attribution analysis may offer mechanistic localization of context-induced representational shifts, clarifying how token-level dynamics are altered by key–value distances (Miyanishi et al., 2024).
- Negative Sampling and Richer Contrastive Objectives: Integrating negative sampling and fully supervised contrastive losses into in-context modules may further improve representation quality and task discrimination, closing the gap with training-time supervised contrastive learning (Miyanishi et al., 2024, Ren et al., 2023).
- Unified Multimodal ICL: Extending end-to-end unified pipelines to more modalities (e.g., audio, video, tabular) with discrete or continuous tokenization would generalize the framework of in-context representation learning (Sheng et al., 2023, Zhuang et al., 2024).
- Representation-Deployment Bifurcation: Systematic study of when, and by which mechanisms, latent representations become decoupled from model output—and how to create “routing” or “binding” modules that maintain context-relevant geometry over prompt mode transitions (Lepori et al., 4 Feb 2026)—remains an active area.
In summary, in-context representation learning is a foundational paradigm unifying representation learning and context-driven adaptation in large-scale models, with ongoing theoretical, algorithmic, and empirical progress in both unimodal and multimodal domains. It underpins both interpretability and practical deployment of zero- and few-shot capabilities, and will likely remain a focus as models transition toward architecture-agnostic, task-general data-driven inference.