- The paper establishes contexture theory, providing a mathematical formulation that explains how associations between inputs and context variables drive representation learning in foundation models.
- It details a spectral decomposition of the context and variational objectives that recover its top eigenfunctions, connecting diverse pretraining methods to one mechanism and explaining the diminishing returns of scaling.
- The study further introduces methods for mixing contexts, spectrally transformed kernel regression (STKR) for semi-supervised learning, and distributionally and outlier robust optimization (DORO) to enhance generalization under distribution shift and outlier contamination.
This paper, "Contextures: The Mechanism of Representation Learning" (2504.19792), establishes a theoretical framework called the contexture theory to mathematically characterize the mechanism of representation learning, commonly known as pretraining, especially in the context of large foundation models. Despite the empirical success of these models, there has been a lack of clear understanding regarding what representations they learn and why these representations are useful for various disparate downstream tasks. The paper argues that a scientific understanding is crucial for future progress, especially as scaling alone yields diminishing returns.
Introduction to the Contexture Theory
The central argument of the contexture theory is that representations are learned from the association between the input $X$ and a context variable $A$; this association is referred to as a contexture. The theory posits that representation learning captures System 1 thinking, which is fast, automatic, and associative, consistent with the empirical observation that large neural networks perform such tasks rapidly. The theory aims to answer key questions about the nature and utility of learned representations, variational objectives for learning them, implications for scaling laws, methods for improving models beyond scaling, and statistical guarantees.
Contexts: Definition and Spectral Properties
A context is defined by an input space $\mathcal{X}$, a context space $\mathcal{A}$, and their joint distribution $P^+(x, a)$. Examples of contexts include labels in supervised learning, transformed/augmented inputs in self-supervised learning, related samples in graphs, and features from teacher models.
The joint distribution $P^+$ induces an expectation operator $T_{P^+}\colon L^2(P_\mathcal{A}) \to L^2(P_\mathcal{X})$ defined by $(T_{P^+} g)(x) = \mathbb{E}[g(A) \mid x]$, and its adjoint $T_{P^+}^*\colon L^2(P_\mathcal{X}) \to L^2(P_\mathcal{A})$ defined by $(T_{P^+}^* f)(a) = \mathbb{E}[f(X) \mid a]$. These operators lead to two positive semi-definite kernels: the positive-pair kernel $K_\mathcal{A}(a, a') = \int \frac{P^+(a \mid x)\, P^+(a' \mid x)}{P_\mathcal{A}(a)\, P_\mathcal{A}(a')} \, dP_\mathcal{X}(x)$ on $\mathcal{A} \times \mathcal{A}$, and the dual kernel $K_\mathcal{X}(x, x') = \int \frac{P^+(a \mid x)\, P^+(a \mid x')}{P_\mathcal{A}(a)} \, da$ on $\mathcal{X} \times \mathcal{X}$. The integral operators of these kernels, $T_{K_\mathcal{A}} = T_{P^+}^* T_{P^+}$ and $T_{K_\mathcal{X}} = T_{P^+} T_{P^+}^*$, share the same non-zero eigenvalues. The square roots of these eigenvalues, $s_i$, are the singular values of $T_{P^+}$. The set of eigenvalues is called the spectrum of the context.
The paper proves a spectral decomposition $P^+(x, a) = \sum_i s_i \mu_i(x) \nu_i(a) P_\mathcal{X}(x) P_\mathcal{A}(a)$, where $\mu_i$ and $\nu_i$ are the orthonormal eigenfunctions of $T_{K_\mathcal{X}}$ and $T_{K_\mathcal{A}}$, respectively, corresponding to eigenvalue $s_i^2$. The shape of the spectrum (the eigenvalue decay rate) is determined by the strength of the association between $X$ and $A$: a stronger association leads to slower decay.
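The decomposition can be checked numerically for a finite toy context, where $T_{P^+}$ is a matrix and the decomposition reduces to an SVD of the marginal-whitened joint distribution (an illustrative sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite toy context: |X| = 6 inputs, |A| = 4 context values.
P = rng.random((6, 4))
P /= P.sum()                          # joint distribution P+(x, a)
pX, pA = P.sum(axis=1), P.sum(axis=0) # marginals

# Whitened joint matrix B[x, a] = P+(x, a) / sqrt(pX(x) pA(a)).
# Its SVD gives the singular values s_i of T_{P+} and, after
# rescaling by the sqrt-marginals, the eigenfunctions mu_i, nu_i.
B = P / np.sqrt(np.outer(pX, pA))
U, s, Vt = np.linalg.svd(B)

# The top singular value is always 1, with mu_0 constant.
mu = U / np.sqrt(pX)[:, None]   # mu_i(x) = U[x, i] / sqrt(pX(x))
print(np.round(s, 4))           # spectrum of the context
print(np.round(mu[:, 0], 4))    # constant +/-1 vector
```

The `mu` columns are orthonormal in $L^2(P_\mathcal{X})$, matching the decomposition's normalization.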
An encoder $\Phi\colon \mathcal{X} \to \mathbb{R}^d$ is said to "learn the contexture" if the span of its centered components $\tilde{\phi}_1, \ldots, \tilde{\phi}_d$ recovers the linear space spanned by the top-$d$ eigenfunctions $\mu_1, \ldots, \mu_d$ of $T_{K_\mathcal{X}}$ (excluding the constant eigenfunction $\mu_0 \equiv 1$).
Types of Access and Variational Objectives
Contexts can be accessed in different ways:
- Pair access: access to i.i.d. samples $(x_i, a_i)$ from $P^+$.
- Kernel access (k-access): access to a kernel function $k(x, x')$ approximating $K_\mathcal{X}$.
- Transformation access (T-access): the ability to sample $A \sim P^+(\cdot \mid x)$ for any $x$.
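As a rough illustration (the interface names below are hypothetical, not from the paper), the three access modes can be sketched as Python protocols, with Gaussian-noise augmentation as a simple T-access context that also yields pair access once combined with samples from $P_\mathcal{X}$:

```python
import numpy as np
from typing import Protocol, Tuple

# Hypothetical interfaces for the three access modes (illustrative
# names; they do not come from the paper's code).
class PairAccess(Protocol):
    def sample_pair(self) -> Tuple[np.ndarray, np.ndarray]: ...

class KernelAccess(Protocol):
    def kernel(self, x: np.ndarray, x2: np.ndarray) -> float: ...

class TransformationAccess(Protocol):
    def transform(self, x: np.ndarray) -> np.ndarray: ...

class GaussianNoiseContext:
    """T-access context: A ~ P+(. | x) adds Gaussian noise to x,
    i.e. a simple data augmentation."""
    def __init__(self, sigma: float, seed: int = 0):
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def transform(self, x: np.ndarray) -> np.ndarray:
        return x + self.sigma * self.rng.normal(size=x.shape)

    def sample_pair(self, x_samples: np.ndarray):
        # T-access plus draws from P_X yields pair access.
        x = x_samples[self.rng.integers(len(x_samples))]
        return x, self.transform(x)
```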
Existing pretraining objectives are shown to implicitly learn the contexture for specific contexts. For instance:
- Mean squared error regression (linear probe) with a context $A \in \mathbb{R}^{d_\mathcal{A}}$ learns the top-$d$ eigenspace of $T_{P^+} \Lambda T_{P^+}^*$, where $\Lambda$ depends on the loss kernel and data imbalance. A balanced MSE objective can learn the exact $T_{K_\mathcal{X}}$ eigenspace.
- Graph node representation learning minimizing $\|\Phi(u) - \Phi(v)\|_2^2$ for connected nodes $(u, v)$ learns the top-$d$ eigenspace of the graph Laplacian (related to $T_{K_\mathcal{X}}$ for graph contexts).
- Multi-view learning objectives like the spectral contrastive loss, or non-contrastive learning with orthonormality constraints on an encoder $\Psi\colon \mathcal{A} \to \mathbb{R}^d$, learn the top-$d$ eigenspace of $T_{P^+}^* T_{P^+}$. The averaged encoder $\Phi = T_{P^+} \Psi$ then learns the contexture of $P^+$.
- Reconstruction objectives mapping $A$ back to $X$ learn the top-$d$ eigenspace of $T_{P^+} \Lambda T_{P^+}^*$, where $\Lambda$ is related to the linear kernel on $\mathcal{X}$.
Two general variational objectives are proposed:
- Single-View Multi-Encoder (SVME): for pair access, minimize $\mathbb{E}_{(X, A) \sim P^+} \big[ \|\Phi(X) - \Psi(A)\|_2^2 \big]$ subject to $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$. This trains $\Phi$ directly and is equivalent to KISE when $\Psi$ is optimal for a fixed $\Phi$.
- Kernel-Integral Single-Encoder (KISE): for k-access, minimize $\mathbb{E}_{X \sim P_\mathcal{X}} \big[ \|\tilde{\Phi}(X)\|_2^2 - \langle \tilde{\Phi}(X), (T_k \tilde{\Phi})(X) \rangle \big]$ subject to $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$, where $k$ approximates $K_\mathcal{X}$.
The orthonormality constraint $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$ can be approximated using objectives like VICReg, although perfect enforcement is challenging. Learning the contexture with these objectives requires expressive function approximators (like deep neural networks) and effective optimizers. The paper suggests that scaling up model size primarily helps align the learned representation space with the theoretical top-$d$ eigenspace, explaining the diminishing returns once sufficient alignment is achieved.
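As a sketch of how the constraint is relaxed in practice (a simplified stand-in for VICReg-style regularization, not the paper's exact objective), one can penalize the Frobenius distance between the empirical feature covariance and the identity:

```python
import numpy as np

def orthonormality_penalty(feats: np.ndarray) -> float:
    """Penalty ||Cov(Phi) - I||_F^2, a soft surrogate for the exact
    constraint Cov_{P_X}[Phi] = I. feats: (n, d) encoder outputs."""
    centered = feats - feats.mean(axis=0)
    cov = centered.T @ centered / len(feats)
    return float(np.linalg.norm(cov - np.eye(cov.shape[0]), "fro") ** 2)

def svme_penalized(phi: np.ndarray, psi: np.ndarray, lam: float = 1.0) -> float:
    """SVME with the constraint folded into a penalty:
    E ||Phi(X) - Psi(A)||^2 + lam * ||Cov(Phi) - I||_F^2."""
    fit = float(np.mean(np.sum((phi - psi) ** 2, axis=1)))
    return fit + lam * orthonormality_penalty(phi)
```

Perfectly whitened features incur zero penalty; any anisotropy or collapse is penalized, which is what keeps the encoder from degenerating to a constant.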
Knowledge Distillation and Context Conversion
Teacher models $\Phi_t\colon \mathcal{X} \to \mathbb{R}^{d_t}$ can be viewed as providing a context, even if their original pretraining context is unknown. Their knowledge can be distilled by querying $\Phi_t$ and constructing its centered linear kernel $k_t(x, x') = \langle \tilde{\Phi}_t(x), \tilde{\Phi}_t(x') \rangle$. KISE, or a distillation objective minimizing $\mathbb{E}_{X \sim P_\mathcal{X}} \big[ \|W \Phi(X) - \Phi_t(X)\|_2^2 \big]$, can then be used to learn the top-$d$ eigenspace of $T_{k_t}$. This provides a practical way to obtain a context with k-access from any pretrained encoder.
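A minimal sketch of the kernel construction, assuming we can query the teacher on a batch of inputs (the random-feature teacher below is a toy stand-in):

```python
import numpy as np

def teacher_kernel(teacher_feats: np.ndarray) -> np.ndarray:
    """Centered linear kernel k_t(x, x') = <Phi_t~(x), Phi_t~(x')>
    built from queried teacher representations (one row per input)."""
    centered = teacher_feats - teacher_feats.mean(axis=0)
    return centered @ centered.T

# Toy teacher: random features for n = 50 inputs, d_t = 8.
rng = np.random.default_rng(0)
K = teacher_kernel(rng.normal(size=(50, 8)))

# The student targets the top-d eigenspace of T_{k_t}; note the
# kernel has rank at most d_t, so its spectrum truncates at 8.
eigvals = np.linalg.eigvalsh(K)[::-1]   # descending
```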
Mixing Multiple Contexts
To obtain better contexts, in particular ones with moderate association built from contexts whose association is too strong or too weak, the paper proposes mixing existing contexts using three base operations:
- Convolution: composing multiple contexts sequentially, analogous to applying multiple data augmentations in order. For contexts $P_1^+, \ldots, P_r^+$ with dual kernels $k_1, \ldots, k_r$ (where $k_j$ involves a heuristic inverse $Q_j^+$ under T-access), the dual kernel of the convolution is related to $T_{k_r} \cdots T_{k_1}$. Learning requires propagating features through the sequence of transformations or kernels. Used when the contexts have strong associations.
- Convex combination: forming a weighted sum of contexts $\sum_j w_j P_j^+$. Learning minimizes a weighted sum of the individual objectives $\sum_j w_j \mathcal{L}_j(\Phi, \Psi_j)$, which extracts the top-$d$ eigenspace of the combined kernel $\sum_j w_j k_j$. Used when the contexts have mixed weak/strong associations. Optimal weights can be found via a minimax game.
- Concatenation: training separate encoders $\Phi_j$ for each context and concatenating their outputs, $\Phi(x) = [\Phi_1(x), \ldots, \Phi_r(x)]$. The dual kernel of the concatenation has eigenvalues equal to the union of the individual eigenvalues, so the combined spectrum decays more slowly. Used when the contexts have weak associations.
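A toy numerical check of the concatenation claim, under the idealized assumption that the two contexts' dual kernels share an orthonormal eigenbasis but are supported on disjoint eigenfunctions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# A shared orthonormal basis; each context's dual kernel lives on a
# disjoint set of eigenfunctions (idealized, for illustration).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
lam1 = np.array([3.0, 2.0, 1.0])   # spectrum of context 1
lam2 = np.array([2.5, 1.5])        # spectrum of context 2
K1 = (Q[:, :3] * lam1) @ Q[:, :3].T
K2 = (Q[:, 3:5] * lam2) @ Q[:, 3:5].T

# Concatenating the encoders sums the (centered linear) dual kernels,
# so the nonzero eigenvalues are the union of both spectra, and the
# combined spectrum decays more slowly than either alone.
K = K1 + K2
eig = np.sort(np.linalg.eigvalsh(K))[::-1][:5]
```

Here `eig` interleaves the two spectra as {3, 2.5, 2, 1.5, 1}, which decays more slowly than either {3, 2, 1} or {2.5, 1.5}.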
Empirical results on tabular data demonstrate that mixing contexts, particularly convex combination of SCARF/Cutmix (strong association) with Y-Linear (weak association) and concatenation with XGBoost teacher models (moderate association), can lead to performance improvements over state-of-the-art methods like XGBoost and MLP.
Statistical Learning Bounds
The paper provides theoretical bounds on the generalization error of contexture learning in the finite-sample regime ($m$ pretraining samples, $n$ downstream samples). The error decomposes into an approximation error and an estimation error. A key quantity is the "context complexity" $\kappa_\mathcal{T} = \|K_\mathcal{X}\|_\infty^{1/2}$, which bounds $\sum_i s_i^2 \mu_i(x)^2$ for $P_\mathcal{X}$-almost all $x$. $\kappa_\mathcal{T}$ measures the "smoothness" or "peakiness" of the eigenfunctions; a higher $\kappa_\mathcal{T}$ indicates less smooth eigenfunctions and higher sample complexity.
Generalization bounds for learning the top-$d$ eigenspace via kernel PCA (assuming k-access to $K_\mathcal{X}$) are derived. The approximation error depends on $s_{d+1}^2$ and on terms involving $\kappa_\mathcal{T}$ and $m$. The estimation error (of the linear probe on downstream data) depends on $d$, $\kappa_\mathcal{T}$, and $n$. These bounds formalize the trade-off between approximation error (decreasing in $d$) and estimation error (increasing in $d$ and $\kappa_\mathcal{T}$). They also show that generalization degrades with higher $\kappa_\mathcal{T}$, which is often exponential in the data dimensionality, highlighting a discrepancy with the practical success of deep learning in high dimensions.
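A small sketch of the quantities appearing in these bounds, using an RBF kernel as a stand-in for k-access to $K_\mathcal{X}$ (illustrative assumptions, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# m pretraining samples; an RBF kernel stands in for K_X.
m, d = 200, 5
X = rng.normal(size=(m, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Plug-in estimate of the context complexity
# kappa = sup_x K_X(x, x)^{1/2}; for this RBF kernel K(x, x) = 1.
kappa_hat = np.sqrt(K.diagonal().max())

# Kernel PCA: the top-d eigenvectors of K/m estimate the top-d
# eigenfunctions mu_1..mu_d that the downstream linear probe uses.
eigvals, eigvecs = np.linalg.eigh(K / m)
top = eigvecs[:, ::-1][:, :d]   # estimated top-d eigenspace (m x d)
s2 = eigvals[::-1][:d]          # estimated squared singular values
```

Growing $d$ shrinks the truncation term $s_{d+1}^2$ but inflates the estimation term that scales with $d$ and $\kappa_\mathcal{T}$, which is the trade-off the bounds formalize.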
Spectrally Transformed Kernel Regression (STKR)
Contexture learning (truncating the spectrum of $T_{K_\mathcal{X}}$) is a specific instance of regression with a spectrally transformed kernel (STK). An STK $k_s(x, x') = \sum_i s(\lambda_i) \mu_i(x) \mu_i(x')$ uses the same eigenfunctions as a base kernel $k$ but transforms its eigenvalues $\lambda_i$ via a function $s$. STKR fits a predictor in the RKHS of $k_s$. This framework is particularly relevant for semi-supervised learning, where unlabeled data can be used to estimate the kernel $k$ and its spectrum.
The paper shows that STKR can be more effective than standard kernel ridge regression (KRR), which uses only labeled data, by leveraging the spectral structure of the kernel estimated from all data. Efficient iterative algorithms (STKR-Prop) are proposed for polynomial spectral transformations, including the inverse-Laplacian transformation popular in semi-supervised learning. Generalization bounds are derived for STKR, showing how the spectral transformation $s$ and the context complexity $\kappa_\mathcal{T}$ affect the error. Empirical studies on graph node classification demonstrate the effectiveness and efficiency of STKR-Prop, suggesting that capturing multi-step similarity (higher powers of the kernel) is beneficial.
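A simplified sketch of the idea for a polynomial transform (not the paper's STKR-Prop implementation): transforming the spectrum by $s(\lambda) = \sum_j c_j \lambda^j$ amounts to taking a matrix polynomial of the empirical kernel, so no eigendecomposition is needed, and the unlabeled points contribute through the kernel powers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled + unlabeled data; the kernel is built on all of it.
n_lab, n_all = 20, 120
X = rng.normal(size=(n_all, 1))
y_lab = np.sin(X[:n_lab, 0])
K = np.exp(-((X[:, None, 0] - X[None, :, 0]) ** 2) / 2.0)

def stkr_predict(K, y_lab, s_coeffs, reg=1e-3):
    """Ridge regression with a polynomial spectral transform
    s(lambda) = sum_j c_j lambda^(j+1), applied as matrix powers of K."""
    n = K.shape[0]
    Ks = sum(c * np.linalg.matrix_power(K / n, j + 1)
             for j, c in enumerate(s_coeffs))
    m = len(y_lab)
    # Fit on the labeled block; the transformed kernel already mixes in
    # the unlabeled points via the powers of K.
    alpha = np.linalg.solve(Ks[:m, :m] + reg * np.eye(m), y_lab)
    return Ks[:, :m] @ alpha   # predictions on all n points

pred = stkr_predict(K, y_lab, s_coeffs=[1.0, 0.5, 0.25])
```

The choice `s_coeffs=[1.0, 0.5, 0.25]` is an arbitrary illustrative polynomial; higher-order coefficients weight multi-step similarity.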
Generalization Under Distribution Shift
The contexture theory largely assumes a fixed data distribution $P_\mathcal{X}$, an assumption often violated in practice (distribution shift). The paper focuses on subpopulation shift, where the test distribution $Q$ is absolutely continuous with respect to the training distribution $P$ ($Q \ll P$). Standard approaches are discussed: importance weighting (reweighting samples by $dQ/dP$) and distributionally robust optimization (DRO), which minimizes the worst-case risk over distributions $Q$ close to $P$.
However, the paper presents theoretical and empirical results suggesting that reweighting and standard DRO methods may not improve over standard Empirical Risk Minimization (ERM) in overparameterized deep learning, particularly with common loss functions like logistic loss. This is because without sufficient regularization, these methods tend to converge to models very similar to the ERM solution (e.g., the max-margin classifier for classification). Significant regularization or early stopping is needed for them to diverge and potentially improve performance on target subgroups.
Furthermore, standard DRO methods, especially $f$-divergence-based ones such as CVaR-DRO and $\chi^2$-DRO, are shown to be highly sensitive to outliers in the training data. Because outliers often incur high losses, DRO objectives prioritize them, leading to unstable training and poor generalization.
Distributionally and Outlier Robust Optimization (DORO)
To address DRO's outlier sensitivity, the paper introduces Distributionally and Outlier Robust Optimization (DORO). DORO models the observed training distribution $P$ as an $\epsilon$-contamination of a clean distribution, i.e., $P = (1 - \epsilon) P' + \epsilon \tilde{P}$ for some arbitrary outlier distribution $\tilde{P}$. DORO then minimizes $\inf_{P'} \mathcal{R}_{\mathrm{DRO}}(f; P')$, where $P'$ ranges over candidate clean distributions supported within $P$ with $\mathrm{TV}(P, P') \le \epsilon / (1 - \epsilon)$. For the Cressie-Read family of $f$-divergences, a dual formulation of the DORO risk is derived, which can be minimized by an algorithm (Algorithm 3) that effectively ignores a fraction of the highest-loss samples in each batch. Theoretical guarantees show that the DORO risk upper-bounds the worst-group risk under the clean distribution. Empirical results on benchmark datasets (COMPAS, CelebA, CivilComments) demonstrate that DORO consistently outperforms standard DRO in both average and worst-group accuracy as well as training stability, confirming its robustness to outliers.
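A minimal sketch of the outlier-filtering idea for CVaR on a single batch (simplified in the spirit of Algorithm 3; the paper's thresholds and batching details are omitted):

```python
import numpy as np

def doro_cvar_loss(losses: np.ndarray, eps: float, alpha: float) -> float:
    """Simplified one-batch DORO objective for CVaR:
    1) discard the eps fraction of highest-loss samples as suspected outliers;
    2) apply CVaR-DRO to the rest, i.e. average the top alpha fraction."""
    n = len(losses)
    order = np.argsort(losses)[::-1]      # indices, descending by loss
    kept = order[int(eps * n):]           # drop suspected outliers
    k = max(1, int(alpha * len(kept)))
    return float(losses[kept][:k].mean()) # highest remaining losses

# One outlier (loss 100.0) among an otherwise well-behaved batch.
losses = np.array([100.0, 3.0, 2.5, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1])
plain_cvar = np.sort(losses)[::-1][:2].mean()       # dominated by the outlier
doro = doro_cvar_loss(losses, eps=0.1, alpha=0.2)   # outlier excluded
```

Plain CVaR averages the outlier into the objective, while the DORO variant falls back to the hardest clean sample, which is what stabilizes training.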
Conclusion and Open Problems
The contexture theory provides a unified mathematical framework for understanding representation learning as the recovery of spectral properties of the association between input and context. It connects diverse pretraining objectives and offers insights into scaling laws and context design. Key findings highlight the importance of moderate context association and reveal that mixing existing contexts can yield improvements. Generalization analysis in finite data and under distribution shift reveals significant challenges for both contexture learning and standard DRO methods, motivating approaches like STKR and DORO.
Limitations include the lack of analysis on the impact of optimization dynamics and model architecture on the learned representation beyond general scaling observations. Open problems include:
- Characterizing the oscillating representations learned by deep networks at the edge of stability.
- Formalizing the inductive bias of arbitrary neural network architectures as a context.
- Achieving true context scaling by obtaining complex, real-world contexts.
- Extending the theory to model and improve System 2 thinking (reasoning) in AI systems, potentially involving test-time scaling and sequential processing.
The paper concludes that while mixing existing contexts is a useful step, revolutionary breakthroughs likely require discovering fundamentally new contexts. DRG remains a hard problem, and careful consideration of implementation details and dataset properties is needed for effective out-of-distribution generalization.