Contextures: The Mechanism of Representation Learning

Published 28 Apr 2025 in cs.LG, cs.AI, and stat.ML | arXiv:2504.19792v1

Abstract: This dissertation establishes the contexture theory to mathematically characterize the mechanism of representation learning, or pretraining. Despite the remarkable empirical success of foundation models, it is not very clear what representations they learn, and why these representations are useful for various downstream tasks. A scientific understanding of representation learning is critical, especially at this point when scaling up the model size is producing diminishing returns, and designing new pretraining methods is imperative for further progress. Prior work treated different representation learning methods quite differently, whereas the contexture theory provides a unified framework for analyzing these methods. The central argument is that a representation is learned from the association between the input X and a context variable A. We prove that if an encoder captures the maximum information of this association, in which case we say that the encoder learns the contexture, then it will be optimal on the class of tasks that are compatible with the context. We also show that a context is the most useful when the association between X and A is neither too strong nor too weak. The important implication of the contexture theory is that increasing the model size alone will achieve diminishing returns, and further advancements require better contexts. We demonstrate that many pretraining objectives can learn the contexture, including supervised learning, self-supervised learning, generative models, etc. Then, we introduce two general objectives -- SVME and KISE, for learning the contexture. We also show how to mix multiple contexts together, an effortless way to create better contexts from existing ones. Then, we prove statistical learning bounds for representation learning. Finally, we discuss the effect of the data distribution shift from pretraining to the downstream task.

Summary

  • The paper establishes contexture theory, providing a mathematical formulation that explains how associations between inputs and context variables drive representation learning in foundation models.
  • It details a spectral decomposition and variational objectives that recover the top eigenfunctions, linking pretraining methods to improved model performance and scalability.
  • The study further introduces methods for mixing contexts and robust optimization techniques (STKR and DORO) to enhance generalization under distribution shifts and outlier scenarios.

This paper, "Contextures: The Mechanism of Representation Learning" (2504.19792), establishes a theoretical framework called the contexture theory to mathematically characterize the mechanism of representation learning, commonly known as pretraining, especially in the context of large foundation models. Despite the empirical success of these models, there has been a lack of clear understanding regarding what representations they learn and why these representations are useful for various disparate downstream tasks. The paper argues that a scientific understanding is crucial for future progress, especially as scaling alone yields diminishing returns.

Introduction to the Contexture Theory

The central argument of the contexture theory is that representations are learned from the association between the input $X$ and a context variable $A$. This association is referred to as a contexture. The theory posits that representation learning captures System 1 thinking, which is fast, automatic, and associative, aligning with the empirical observation that large neural networks can perform such tasks rapidly. The theory aims to answer key questions about the nature and utility of learned representations, variational objectives for learning them, implications for scaling laws, methods for improving models beyond scaling, and statistical guarantees.

Contexts: Definition and Spectral Properties

A context is defined by an input space $\mathcal{X}$, a context space $\mathcal{A}$, and their joint distribution $P^+(x, a)$. Examples of contexts include labels in supervised learning, transformed/augmented inputs in self-supervised learning, related samples in graphs, and features from teacher models.

The joint distribution $P^+$ induces an expectation operator $T_{P^+} : L^2(P_\mathcal{A}) \to L^2(P_\mathcal{X})$ defined by $(T_{P^+} g)(x) = E[g(A) \mid x]$, and its adjoint $T_{P^+}^* : L^2(P_\mathcal{X}) \to L^2(P_\mathcal{A})$ defined by $(T_{P^+}^* f)(a) = E[f(X) \mid a]$. These operators lead to two positive semi-definite kernels: the positive-pair kernel $K_\mathcal{A}(a, a') = \int P^+(a \mid x) P^+(a' \mid x) \, dP_\mathcal{X}(x) / (P_\mathcal{A}(a) P_\mathcal{A}(a'))$ on $\mathcal{A} \times \mathcal{A}$, and the dual kernel $K_\mathcal{X}(x, x') = \int P^+(a \mid x) P^+(a \mid x') / P_\mathcal{A}(a) \, da$ on $\mathcal{X} \times \mathcal{X}$. The integral operators of these kernels, $T_{K_\mathcal{A}} = T_{P^+}^* T_{P^+}$ and $T_{K_\mathcal{X}} = T_{P^+} T_{P^+}^*$, share the same non-zero eigenvalues. The square roots of these eigenvalues, $s_i$, are the singular values of $T_{P^+}$. The set of eigenvalues is called the spectrum of the context.
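In the discrete case these objects are plain matrices, which makes the shared-spectrum claim easy to check numerically. The following sketch (a toy joint distribution with arbitrary illustrative numbers, not an example from the paper) forms the matrix representing $T_{P^+}$ and verifies that $T_{K_\mathcal{X}}$ and $T_{K_\mathcal{A}}$ share their non-zero eigenvalues:

```python
import numpy as np

# Toy discrete context: 4 inputs, 3 context values (illustrative numbers only).
rng = np.random.default_rng(0)
J = rng.random((4, 3))
J /= J.sum()                      # joint distribution P^+(x, a)
p_x = J.sum(axis=1)               # marginal P_X
p_a = J.sum(axis=0)               # marginal P_A

# In the discrete case, T_{P^+} acts (after the natural change of basis) as
# the matrix B[x, a] = P^+(x, a) / sqrt(p_x[x] * p_a[a]); its singular
# values s_i are the singular values of the context.
B = J / np.sqrt(np.outer(p_x, p_a))
s = np.linalg.svd(B, compute_uv=False)

# T_{K_X} = B B^T and T_{K_A} = B^T B share the non-zero eigenvalues s_i^2.
eig_KX = np.sort(np.linalg.eigvalsh(B @ B.T))[::-1]
eig_KA = np.sort(np.linalg.eigvalsh(B.T @ B))[::-1]
print(np.allclose(eig_KX[:3], eig_KA[:3]))  # shared spectrum
print(np.isclose(s[0], 1.0))                # top singular value is always 1
```

The top singular value is always 1, attained by the constant functions $\mu_0 \equiv 1$ and $\nu_0 \equiv 1$, which is why the contexture is defined over the remaining eigenfunctions.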

The paper proves a spectral decomposition $P^+(x,a) = \sum_i s_i \mu_i(x) \nu_i(a) P_\mathcal{X}(x) P_\mathcal{A}(a)$, where $\mu_i$ and $\nu_i$ are the orthonormal eigenfunctions of $T_{K_\mathcal{X}}$ and $T_{K_\mathcal{A}}$, respectively, corresponding to $s_i^2$. The shape of the spectrum (the eigenvalue decay rate) is determined by the strength of the association between $X$ and $A$: stronger association leads to slower decay.
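For a finite context, the decomposition can be verified directly with an SVD. A minimal sketch on a synthetic joint distribution (illustrative numbers only, not from the paper):

```python
import numpy as np

# Check P^+(x,a) = sum_i s_i mu_i(x) nu_i(a) P_X(x) P_A(a) on a toy context.
rng = np.random.default_rng(1)
J = rng.random((5, 4))
J /= J.sum()                           # joint distribution P^+(x, a)
p_x, p_a = J.sum(axis=1), J.sum(axis=0)

B = J / np.sqrt(np.outer(p_x, p_a))    # discrete representation of T_{P^+}
U, s, Vt = np.linalg.svd(B)
mu = U / np.sqrt(p_x)[:, None]         # eigenfunctions, orthonormal in L^2(P_X)
nu = Vt.T / np.sqrt(p_a)[:, None]      # eigenfunctions, orthonormal in L^2(P_A)

# Reconstruct the joint distribution from the spectral decomposition.
recon = np.einsum('i,xi,ai->xa', s, mu[:, :4], nu) * np.outer(p_x, p_a)
print(np.allclose(recon, J))  # True
```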

An encoder $\Phi: \mathcal{X} \to \mathbb{R}^d$ is said to "learn the contexture" if the span of its centered components $[\tilde{\phi}_1, \dots, \tilde{\phi}_d]$ recovers the linear space spanned by the top-$d$ eigenfunctions $\mu_1, \dots, \mu_d$ of $T_{K_\mathcal{X}}$ (excluding the constant function $\mu_0 \equiv 1$).

Types of Access and Variational Objectives

Contexts can be accessed in different ways:

  1. Pair access: Access to i.i.d. samples $(x_i, a_i)$ from $P^+$.
  2. Kernel access (k-access): Access to a kernel function $k(x, x')$ approximating $K_\mathcal{X}$.
  3. Transformation access (T-access): Ability to sample $A \sim P^+(\cdot \mid x)$ for any $x$.

Existing pretraining objectives are shown to implicitly learn the contexture for specific contexts. For instance:

  • Mean squared error regression (linear probe) with a context $A \in \mathbb{R}^{d_\mathcal{A}}$ learns the top-$d$ eigenspace of $T_{P^+} \Lambda T_{P^+}^*$, where $\Lambda$ depends on the loss kernel and data imbalance. A balanced MSE objective can learn the exact $T_{K_\mathcal{X}}$ eigenspace.
  • Graph node representation learning minimizing $\|\Phi(u) - \Phi(v)\|_2^2$ for connected nodes $(u, v)$ learns the top-$d$ eigenspace of the graph Laplacian (related to $T_{K_\mathcal{X}}$ for graph contexts).
  • Multi-view learning objectives like the spectral contrastive loss, or non-contrastive learning with orthonormality constraints on the encoder $\Psi: \mathcal{A} \to \mathbb{R}^d$, learn the top-$d$ eigenspace of $T_{P^+}^* T_{P^+}$. The average encoder $\Phi = T_{P^+} \Psi$ then learns the contexture of $P^+$.
  • Reconstruction objectives mapping $A$ back to $X$ learn the top-$d$ eigenspace of $T_{P^+} \Lambda T_{P^+}^*$, where $\Lambda$ is related to the linear kernel on $\mathcal{X}$.

Two general variational objectives are proposed:

  • Single-View Multi-Encoder (SVME): For pair access, minimizes $E_{(X,A) \sim P^+} [\|\Phi(X) - \Psi(A)\|_2^2]$ subject to $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$. This trains $\Phi$ directly and is equivalent to KISE when $\Psi$ is optimal for a fixed $\Phi$.
  • Kernel-Integral Single-Encoder (KISE): For k-access, minimizes $E_{X \sim P_\mathcal{X}} [\|\tilde{\Phi}(X)\|_2^2 - \langle \tilde{\Phi}(X), (T_k \tilde{\Phi})(X) \rangle]$ subject to $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$, where $k$ approximates $K_\mathcal{X}$.

The orthonormality constraint $\mathrm{Cov}_{P_\mathcal{X}}[\Phi] = I$ can be approximated using objectives like VICReg, although perfect enforcement is challenging. Learning the contexture with these objectives requires expressive function approximators (such as deep neural networks) and effective optimizers. The paper suggests that scaling up model size primarily helps align the learned representation space with the theoretical top-$d$ eigenspace, which explains the diminishing returns once sufficient alignment is achieved.
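As an illustration, the SVME loss with a soft covariance penalty can be written in a few lines. This is a hypothetical sketch, not the paper's implementation: `Phi_X` and `Psi_A` stand in for encoder outputs on a batch of positive pairs, and the quadratic penalty is a crude VICReg-style surrogate for the covariance constraint:

```python
import numpy as np

# Hypothetical encoder outputs on a batch of 256 positive pairs (x_i, a_i).
rng = np.random.default_rng(0)
Phi_X = rng.normal(size=(256, 8))   # Phi(x_i)
Psi_A = rng.normal(size=(256, 8))   # Psi(a_i)

def svme_loss(phi, psi, reg=1.0):
    # alignment term: E ||Phi(X) - Psi(A)||^2
    align = np.mean(np.sum((phi - psi) ** 2, axis=1))
    # soft orthonormality: push the empirical Cov[Phi] toward the identity
    cov = np.cov(phi, rowvar=False)
    penalty = np.sum((cov - np.eye(phi.shape[1])) ** 2)
    return align + reg * penalty

print(svme_loss(Phi_X, Psi_A))
```

In practice both terms would be minimized jointly over the parameters of $\Phi$ and $\Psi$ with a stochastic optimizer; the sketch only evaluates the objective.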

Knowledge Distillation and Context Conversion

Teacher models $\Phi_t: \mathcal{X} \to \mathbb{R}^{d_t}$ can be viewed as providing a context, even if their original pretraining context is unknown. Their knowledge can be distilled by querying $\Phi_t$ and constructing its centered linear kernel $k_t(x, x') = \langle \tilde{\Phi}_t(x), \tilde{\Phi}_t(x') \rangle$. KISE, or a distillation objective minimizing $E_{X \sim P_\mathcal{X}} [\|W \Phi(X) - \Phi_t(X)\|_2^2]$, can then be used to learn the top-$d$ eigenspace of $T_{k_t}$. This provides a practical way to obtain a context with k-access from any pretrained encoder.
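A minimal sketch of this kernel construction under stated assumptions (the `teacher` map below is a random placeholder, not a real pretrained encoder, and the data is synthetic):

```python
import numpy as np

# Stand-in for any pretrained teacher encoder Phi_t (here: a random map).
rng = np.random.default_rng(0)
W_t = rng.normal(size=(10, 16))
teacher = lambda X: np.tanh(X @ W_t)

X = rng.normal(size=(100, 10))          # unlabeled pretraining samples
F = teacher(X)
F = F - F.mean(axis=0)                  # center the teacher features

K_t = F @ F.T                           # centered linear kernel k_t(x, x')
# Top-d eigenspace of T_{k_t}: what KISE / distillation would recover.
eigvals, eigvecs = np.linalg.eigh(K_t)  # ascending eigenvalue order
d = 8
top_d = eigvecs[:, ::-1][:, :d]         # leading d eigenvectors
print(top_d.shape)                      # (100, 8)
```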

Mixing Multiple Contexts

To obtain better contexts, especially ones with moderate association from those with strong or weak ones, the paper proposes mixing existing contexts using three base operations:

  1. Convolution: Composing multiple contexts sequentially, analogous to applying multiple data augmentations in order. For contexts $P_1^+, \dots, P_r^+$ with dual kernels $k_1, \dots, k_r$ (where $k_j$ involves a heuristic inverse $Q_j^+$ if only T-access is available), the dual kernel of the convolution is related to $T_{k_r} \cdots T_{k_1}$. Learning requires propagating features through the sequence of transformations or kernels. Used when contexts have strong associations.
  2. Convex Combination: Forming a weighted sum of contexts $\sum_j w_j P_j^+$. Learning involves minimizing a weighted sum of individual objectives $\sum_j w_j \mathcal{L}_j(\Phi, \Psi_j)$, which extracts the top-$d$ eigenspace of the combined kernel $\sum_j w_j k_j$. Used when contexts have mixed weak/strong associations. Optimal weights can be found via a minimax game.
  3. Concatenation: Training separate encoders $\Phi_j$ for each context and concatenating their outputs, $\Phi(x) = [\Phi_1(x), \dots, \Phi_r(x)]$. The dual kernel of the concatenation has eigenvalues that are the union of the individual eigenvalues, making the combined spectrum decay more slowly. Used when contexts have weak associations.
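At the kernel-matrix level, the three operations can be sketched on synthetic PSD matrices (stand-ins for dual kernels, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_psd(n, k):
    """Random rank-k PSD matrix, a stand-in for a dual kernel on n samples."""
    A = rng.normal(size=(n, k))
    return A @ A.T / k

n = 50
K1 = rand_psd(n, 5)    # fast-decaying spectrum: strong association
K2 = rand_psd(n, 30)   # slow-decaying spectrum: weak association

conv = K2 @ K1                    # convolution: operator product T_{k2} T_{k1}
combo = 0.5 * K1 + 0.5 * K2       # convex combination of the kernels
# Concatenation: separate encoders are stacked, so the combined spectrum
# pools the eigenvalues of both kernels (and hence decays more slowly).
eig_union = np.sort(np.concatenate([
    np.linalg.eigvalsh(K1), np.linalg.eigvalsh(K2)]))[::-1]

print(eig_union[:5])
```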

Empirical results on tabular data demonstrate that mixing contexts, particularly convex combination of SCARF/Cutmix (strong association) with Y-Linear (weak association) and concatenation with XGBoost teacher models (moderate association), can lead to performance improvements over state-of-the-art methods like XGBoost and MLP.

Statistical Learning Bounds

The paper provides theoretical bounds on the generalization error of contexture learning in the finite-sample regime ($m$ pretraining samples, $n$ downstream samples). The error decomposes into approximation and estimation errors. A key quantity is the "context complexity" $\kappa_\mathcal{T} = \|K_\mathcal{X}\|_\infty^{1/2}$, which bounds $\sum_i s_i^2 \mu_i(x)^2$ for $P_\mathcal{X}$-almost all $x$. $\kappa_\mathcal{T}$ measures the "smoothness" or "peakiness" of the eigenfunctions; higher $\kappa_\mathcal{T}$ indicates less smooth eigenfunctions and higher sample complexity.
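On a discrete toy context, $\kappa_\mathcal{T}$ can be read off the diagonal of the dual kernel, and the identity $K_\mathcal{X}(x, x) = \sum_i s_i^2 \mu_i(x)^2$ checked directly (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(2)
J = rng.random((6, 4))
J /= J.sum()                            # joint distribution P^+(x, a)
p_x, p_a = J.sum(axis=1), J.sum(axis=0)

# K_X(x, x') = sum_a P(a|x) P(a|x') / P_A(a)
P_a_given_x = J / p_x[:, None]
K_X = (P_a_given_x / p_a) @ P_a_given_x.T
kappa = np.sqrt(K_X.diagonal().max())   # context complexity kappa_T

# Cross-check the diagonal against the spectral form sum_i s_i^2 mu_i(x)^2.
B = J / np.sqrt(np.outer(p_x, p_a))
U, s, _ = np.linalg.svd(B)
mu = U / np.sqrt(p_x)[:, None]
diag_spec = (mu[:, :4] ** 2 * s ** 2).sum(axis=1)
print(np.allclose(diag_spec, K_X.diagonal()))  # True
```

Note that $\kappa_\mathcal{T} \ge 1$ always, since the constant eigenfunction contributes $s_0^2 \mu_0(x)^2 = 1$ to every diagonal entry.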

Generalization bounds for learning the top-$d$ eigenspace via kernel PCA (assuming k-access to $K_\mathcal{X}$) are derived. The approximation error depends on $s_{d+1}^2$ and on terms involving $\kappa_\mathcal{T}$ and $m$. The estimation error (of the linear probe on downstream data) depends on $d$, $\kappa_\mathcal{T}$, and $n$. These bounds formalize the trade-off between approximation error (decreasing with $d$) and estimation error (increasing with $d$ and $\kappa_\mathcal{T}$). They also show that generalization performance degrades with higher $\kappa_\mathcal{T}$, which is often exponential in the data dimensionality, highlighting a discrepancy with the practical success of deep learning in high dimensions.

Spectrally Transformed Kernel Regression (STKR)

Contexture learning (truncating the spectrum of $T_{K_\mathcal{X}}$) is a specific instance of spectrally transformed kernel (STK) regression. An STK $k_s(x, x') = \sum_i s(\lambda_i) \mu_i(x) \mu_i(x')$ uses the same eigenfunctions as a base kernel $k$ but transforms its eigenvalues $\lambda_i$ via a function $s$. STKR fits a predictor in the RKHS of $k_s$. This framework is particularly relevant for semi-supervised learning, where unlabeled data can be used to estimate the kernel $k$ and its spectrum.

The paper shows that STKR can be more effective than standard kernel ridge regression (KRR), which only uses labeled data, by leveraging the structure captured by the kernel's spectrum across all data. Efficient iterative algorithms (STKR-Prop) are proposed for polynomial spectral transformations, including the inverse-Laplacian transformation popular in semi-supervised learning. Generalization bounds are derived for STKR, showing how the spectral transformation $s$ and the context complexity $\kappa_\mathcal{T}$ affect the error. Empirical studies on graph node classification demonstrate the effectiveness and efficiency of STKR-Prop, suggesting that capturing multi-step similarity (higher powers of the kernel) is beneficial.
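A bare-bones illustration of the STK idea, using a polynomial transformation $s(\lambda) = \lambda^3$ on a Gaussian base kernel and plain ridge regression in the transformed kernel. This is a simplification on synthetic data, not the paper's STKR-Prop algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_lab = 80, 20
A = rng.normal(size=(n, 6))                         # all points (few labeled)
sq_dist = np.sum((A[:, None] - A[None]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dist)                          # base kernel k

# Spectrally transformed kernel: same eigenvectors, eigenvalues lambda^3.
lam, V = np.linalg.eigh(K)
K_s = (V * lam ** 3) @ V.T

y = np.sin(A[:, 0])                                 # synthetic targets
idx = np.arange(n_lab)                              # labeled subset
alpha = np.linalg.solve(K_s[np.ix_(idx, idx)] + 0.1 * np.eye(n_lab), y[idx])
pred = K_s[:, idx] @ alpha                          # predictions on all points
print(pred.shape)                                   # (80,)
```

The cubed spectrum emphasizes directions supported by the whole dataset (multi-step similarity), which is what the unlabeled points contribute over plain KRR on the labeled subset.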

Generalization Under Distribution Shift

The contexture theory largely assumes a fixed data distribution $P_\mathcal{X}$, an assumption often violated in practice (distribution shift). The paper focuses on subpopulation shift, where the test distribution $Q$ is absolutely continuous with respect to the training distribution $P$ ($Q \ll P$). Standard approaches are discussed, including importance weighting (reweighting samples by $Q/P$) and distributionally robust optimization (DRO), which minimizes the worst-case risk over distributions $Q$ close to $P$.

However, the paper presents theoretical and empirical results suggesting that reweighting and standard DRO methods may not improve over standard Empirical Risk Minimization (ERM) in overparameterized deep learning, particularly with common loss functions like logistic loss. This is because without sufficient regularization, these methods tend to converge to models very similar to the ERM solution (e.g., the max-margin classifier for classification). Significant regularization or early stopping is needed for them to diverge and potentially improve performance on target subgroups.

Furthermore, standard DRO methods, especially those based on $f$-divergences such as CVaR and the $\chi^2$-divergence, are shown to be highly sensitive to outliers in the training data. Because outliers often incur high losses, DRO objectives prioritize them, leading to unstable training and poor generalization.

Distributionally and Outlier Robust Optimization (DORO)

To address DRO's outlier sensitivity, the paper introduces Distributionally and Outlier Robust Optimization (DORO). DORO models the observed data as an $\epsilon$-contamination $\mathcal{P} = \{(1-\epsilon) P + \epsilon \tilde{P}\}$ of some clean distribution $P$. Specifically, DORO minimizes $\inf_{P'} \mathrm{DRO}(f; P')$, where $P'$ is a distribution in the support of the contaminated data $\mathcal{P}$ with $TV(P, P') \le \epsilon/(1-\epsilon)$. For the Cressie-Read family of $f$-divergences, a dual formulation of the DORO risk is derived; it can be minimized by an algorithm (Algorithm 3) that effectively ignores a fraction of the highest-loss samples in each batch. Theoretical guarantees show that the DORO risk upper-bounds the worst-group risk under the clean distribution $P$. Empirical results on benchmark datasets (COMPAS, CelebA, CivilComments) demonstrate that DORO consistently outperforms standard DRO in both average and worst-group accuracy and in training stability, confirming its robustness to outliers.
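The outlier-filtering step can be sketched as follows. This is a simplified caricature of the idea behind Algorithm 3 for the CVaR case, with made-up parameter values and synthetic losses:

```python
import numpy as np

def cvar_doro(losses, eps=0.05, alpha=0.2):
    """Drop the eps fraction of highest-loss samples as suspected outliers,
    then average the worst alpha fraction of what remains (CVaR)."""
    losses = np.sort(losses)[::-1]          # descending
    n = len(losses)
    keep = losses[int(eps * n):]            # discard suspected outliers
    k = max(1, int(alpha * len(keep)))
    return keep[:k].mean()                  # CVaR over the kept samples

rng = np.random.default_rng(0)
losses = rng.exponential(size=1000)
losses[:10] = 100.0                         # inject a few extreme outliers
print(cvar_doro(losses))                    # insensitive to the injected outliers
```

Plain CVaR over the same batch would average the injected outliers into the risk; discarding the top $\epsilon$ fraction first keeps the objective focused on genuine high-loss subgroups.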

Conclusion and Open Problems

The contexture theory provides a unified mathematical framework for understanding representation learning as the recovery of spectral properties of the association between input and context. It connects diverse pretraining objectives and offers insights into scaling laws and context design. Key findings highlight the importance of moderate context association and reveal that mixing existing contexts can yield improvements. Generalization analysis in finite data and under distribution shift reveals significant challenges for both contexture learning and standard DRO methods, motivating approaches like STKR and DORO.

Limitations include the lack of analysis on the impact of optimization dynamics and model architecture on the learned representation beyond general scaling observations. Open problems include:

  1. Characterizing the oscillating representations learned by deep networks at the edge of stability.
  2. Formalizing the inductive bias of arbitrary neural network architectures as a context.
  3. Achieving true context scaling by obtaining complex, real-world contexts.
  4. Extending the theory to model and improve System 2 thinking (reasoning) in AI systems, potentially involving test-time scaling and sequential processing.

The paper concludes that while mixing existing contexts is a useful step, revolutionary breakthroughs likely require discovering fundamentally new contexts. Distributionally robust generalization remains a hard problem, and careful consideration of implementation details and dataset properties is needed for effective out-of-distribution generalization.
