
COD: Clustering-On-Difficulty Framework

Updated 27 January 2026
  • The paper introduces the COD framework, which partitions LLM samples and variables using difficulty and covariance profiles for robust performance prediction.
  • It employs modified MeanShift and sCOD thresholding algorithms to form clusters, enabling stable scaling law fitting and accurate subset-to-full mapping.
  • COD achieves minimax-optimal cluster recovery and outperforms competing methods through explicit separation thresholds and calibrated quartic mapping.

The Clustering-On-Difficulty (COD) framework encompasses a family of model-based clustering techniques that leverage difficulty or similarity metrics to robustly partition samples or variables for downstream performance prediction, covariance structure recovery, or high-dimensional exploratory analysis. COD has been developed and rigorously analyzed in contexts including LLM scaling prediction and model-assisted variable clustering, offering minimax-optimal recovery, explicit cluster separation thresholds, and end-to-end accuracy extrapolation pipelines (Xu et al., 24 Feb 2025, Bunea et al., 2015).

1. Modeling by Difficulty Features and Covariance Profiles

COD’s foundational principle is the clustering of samples (in LLM evaluation) or variables (in variable clustering) based on features that encode “difficulty” or “similarity” as measured by model predictions or covariance profiles.

  • In LLM performance prediction, each evaluation sample is characterized by its passrate vector across a suite of small models $\{M_1, \dots, M_S\}$. The difficulty feature for sample $p$ is $\mathbf{x}_p = [p_1(p), \dots, p_S(p)]^\top \in \mathbb{R}^S$, where $p_j(p)$ is the empirical passrate of $M_j$ on $p$, estimated via repeated few-shot stochastic trials. This vector is typically nondecreasing in index, as model performance increases with scale (Xu et al., 24 Feb 2025).
  • In high-dimensional covariance modeling, the G-block covariance model posits that the $p$-dimensional random vector $X$ admits

$$\Sigma = A C^* A^\top + \Gamma,$$

with $A$ the $p \times K$ indicator membership matrix, $C^* \in \mathbb{R}^{K \times K}$ the latent factor covariance, and $\Gamma$ diagonal noise. This structure clusters variables whose covariance profiles are similar across all other variables (Bunea et al., 2015).
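As a minimal numerical illustration of the G-block structure (the sizes $p=6$, $K=2$ and all parameter values below are hypothetical, chosen only to make the block pattern visible):

```python
import numpy as np

# Hypothetical sizes: p = 6 variables in K = 2 groups of 3.
p, K = 6, 2
A = np.zeros((p, K))
A[:3, 0] = 1          # variables 0-2 belong to group 1
A[3:, 1] = 1          # variables 3-5 belong to group 2
C_star = np.array([[1.0, 0.3],
                   [0.3, 2.0]])          # latent factor covariance C*
Gamma = np.diag(np.full(p, 0.05))        # diagonal idiosyncratic noise
Sigma = A @ C_star @ A.T + Gamma

# Two variables in the same group have identical covariance profiles with
# respect to every other variable -- the signal COD clusters on.
assert np.allclose(Sigma[0, 2:], Sigma[1, 2:])
```

Off the diagonal, $\Sigma$ is constant within each block (here 1.0 inside group 1, 2.0 inside group 2, 0.3 across groups), which is exactly the profile similarity the clustering exploits.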

2. Difficulty-Driven or Similarity-Based Clustering Algorithms

COD clusters are constructed by partitioning points with similar difficulty or similarity profiles, automatically adapting both the cluster count and outlier handling.

  • For LLMs: The passrate matrix $X \in \mathbb{R}^{N \times S}$ (samples $\times$ small models) is clustered using an improved MeanShift algorithm based on Euclidean distance. The steps are:

    1. All samples start unassigned.
    2. Standard MeanShift is run on the unassigned set with bandwidth $R$.
    3. Points within $R$ of each new center $c$ are assigned to the cluster; others remain unassigned.
    4. Clusters with fewer than $K$ points are dissolved; their members revert to unassigned.
    5. Repeat until convergence; remaining unassigned samples are marked as outliers (Xu et al., 24 Feb 2025).
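Steps 1–5 can be sketched in a few dozen lines. The following is a minimal numpy illustration under assumed simplifications (a plain flat-kernel mean shift, hypothetical function names); it is not the authors' implementation:

```python
import numpy as np

def mean_shift_modes(X, R, iters=100):
    """Flat-kernel mean shift: each point drifts to the mean of its
    R-neighborhood; nearby converged centers are deduplicated into modes."""
    centers = X.copy()
    for _ in range(iters):
        new = []
        for c in centers:
            nb = X[np.linalg.norm(X - c, axis=1) <= R]
            new.append(nb.mean(axis=0) if len(nb) else c)
        new = np.array(new)
        if np.allclose(new, centers):
            break
        centers = new
    modes = []
    for c in centers:
        if not any(np.linalg.norm(c - m) <= R / 2 for m in modes):
            modes.append(c)
    return modes

def cod_clusters(X, R, K):
    """Modified MeanShift loop (steps 1-5): assign points within R of each
    mode, dissolve clusters smaller than K, rerun on the unassigned set;
    whatever remains unassigned at convergence is an outlier (label -1)."""
    labels = -np.ones(len(X), dtype=int)
    next_label = 0
    while True:
        unassigned = np.where(labels == -1)[0]
        if len(unassigned) < K:
            break
        progressed = False
        for mode in mean_shift_modes(X[unassigned], R):
            members = unassigned[np.linalg.norm(X[unassigned] - mode, axis=1) <= R]
            members = members[labels[members] == -1]  # no double assignment
            if len(members) >= K:                     # step 4: dissolve small clusters
                labels[members] = next_label
                next_label += 1
                progressed = True
        if not progressed:
            break                                     # step 5: convergence
    return labels
```

For COD's passrate features, `X` would be the $N \times S$ passrate matrix; the paper's recommended regime (intra-cluster diameter $\lesssim 0.2$, cluster size $\geq 10$) corresponds roughly to $R \approx 0.2$, $K = 10$.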
  • For variable clustering: COD operates by identifying pairs of variables $(a, b)$ with minimal

$$\hat{s}_{\mathrm{COD}}(a, b) = \max_{c \ne a,b} \frac{\left| \hat\Sigma_{a,c} - \hat\Sigma_{b,c} \right|}{\sqrt{\left( \hat\Sigma_{a,a} + \hat\Sigma_{b,b} - 2\hat\Sigma_{a,b} \right) \hat\Sigma_{c,c}}}$$

and iteratively extracts clusters based on a fixed threshold $\alpha$. A singleton cluster is declared when even the seed's most similar partner exceeds the threshold; otherwise, a cluster is built by merging variables whose pairwise sCOD falls below $\alpha$ with respect to at least one seed variable (Bunea et al., 2015).
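A minimal numpy sketch of the statistic and the greedy extraction (function names are illustrative, and the paper's exact iteration differs in details such as how the threshold interacts with re-extraction):

```python
import numpy as np

def scod(S, a, b):
    """sCOD statistic for a candidate pair (a, b) given a covariance estimate S."""
    denom = np.sqrt(S[a, a] + S[b, b] - 2 * S[a, b])
    return max(abs(S[a, c] - S[b, c]) / (denom * np.sqrt(S[c, c]))
               for c in range(S.shape[0]) if c not in (a, b))

def cod_variable_clusters(S, alpha):
    """Greedy extraction: take the active pair with smallest sCOD; if even
    that pair exceeds alpha, the seed is a singleton; otherwise merge every
    active variable within alpha of the seed, remove the cluster, repeat."""
    active = list(range(S.shape[0]))
    clusters = []
    while active:
        if len(active) == 1:
            clusters.append(active)
            break
        pairs = {(a, b): scod(S, a, b) for i, a in enumerate(active)
                 for b in active[i + 1:]}
        (a, b), best = min(pairs.items(), key=lambda kv: kv[1])
        if best > alpha:
            group = [a]                                   # singleton cluster
        else:
            group = [a] + [v for v in active if v != a
                           and pairs[(min(a, v), max(a, v))] <= alpha]
        clusters.append(sorted(group))
        active = [v for v in active if v not in group]
    return clusters
```

On a noiseless G-block covariance, within-group pairs have sCOD exactly zero (their covariance profiles coincide), so any small $\alpha$ recovers the partition.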

3. Cluster-Wise Performance Extrapolation and Scaling Laws

Only clusters with regular, predictable scaling are considered for further extrapolation, yielding stable predictions on subset performance.

  • Within each cluster, a scaling law is fit of the form:

$$y(C) = g + (1-g) \exp(-aC^{-b} - c), \quad a>1,\ b>0.1,\ 0<c<1,$$

where $y(C)$ denotes the expected accuracy at compute $C$. The random-guess floor $g$ and asymptotic offset $c$ are essential for accurate fits. Only clusters whose parameters and fitted curves satisfy these constraints, and that are monotonic and extrapolatable, are retained. Cluster-wise predictions $y(C_0)$ are aggregated as a cluster-size-weighted average for the subset (Xu et al., 24 Feb 2025).
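As a concrete illustration, the cluster-wise fit can be done by holding $g$ at the known random-guess floor, linearizing the law, and grid-searching the exponent $b$ (a minimal numpy sketch; the grid-search strategy and function names are assumptions, not the paper's fitting procedure):

```python
import numpy as np

def fit_scaling_law(C, y, g=0.25, b_grid=np.linspace(0.11, 1.0, 90)):
    """Fit y(C) = g + (1-g)*exp(-a*C**-b - c) with g fixed. Linearizing gives
    z = -log((y-g)/(1-g)) = a*C**-b + c, so for each candidate b the pair
    (a, c) is an ordinary least-squares solve. Fits violating the paper's
    constraints a>1, b>0.1, 0<c<1 are rejected."""
    z = -np.log((y - g) / (1 - g))
    best = None
    for b in b_grid:
        M = np.column_stack([C ** -b, np.ones_like(C)])
        (a, c), *_ = np.linalg.lstsq(M, z, rcond=None)
        err = np.sum((M @ np.array([a, c]) - z) ** 2)
        if a > 1 and 0 < c < 1 and (best is None or err < best[0]):
            best = (err, a, b, c)
    return None if best is None else best[1:]   # None => not extrapolatable

def predict_accuracy(C0, params, g=0.25):
    a, b, c = params
    return g + (1 - g) * np.exp(-a * C0 ** -b - c)
```

Clusters for which `fit_scaling_law` returns `None` would be excluded from extrapolation and handled instead by the subset-to-full mapping of the next section.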

4. Mapping Subset Prediction to Full Evaluation Set

Since not all samples are in extrapolatable clusters, a mapping is learned to translate subset accuracy $T'$ to full-set accuracy $T$.

  • The mapping function $f$ is a quartic polynomial constrained by $f(0) = 0$, $f(1) = 1$:

$$f(x) = \alpha_1 x^4 + \alpha_2 x^3 + \alpha_3 x^2 + (1-\alpha_1-\alpha_2-\alpha_3)x.$$

  • Coefficients $\alpha_{1,2,3}$ are fit by least squares or interpolation on anchor points $(T_m', T_m)$ obtained from mid-sized or external models. This calibration reduces bias, particularly when anchors are out-of-distribution models; empirically, out-of-distribution anchors reduce error by approximately 40% relative to mapping without anchors (Xu et al., 24 Feb 2025).
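A minimal sketch of the constrained calibration (function names are illustrative): subtracting the identity and fitting against a basis $x^k - x$ that vanishes at both endpoints bakes in $f(0)=0$, $f(1)=1$, leaving an unconstrained least-squares problem for $\alpha_{1,2,3}$.

```python
import numpy as np

def fit_quartic_map(t_sub, t_full):
    """Fit f(x) = a1*x^4 + a2*x^3 + a3*x^2 + (1-a1-a2-a3)*x from anchor pairs
    (subset accuracy T'_m, full-set accuracy T_m). Each basis function
    x**k - x vanishes at 0 and 1, so f(0)=0 and f(1)=1 hold for any
    coefficients and a plain least-squares solve suffices."""
    x = np.asarray(t_sub, dtype=float)
    y = np.asarray(t_full, dtype=float)
    B = np.column_stack([x**4 - x, x**3 - x, x**2 - x])
    alphas, *_ = np.linalg.lstsq(B, y - x, rcond=None)
    return alphas

def quartic_map(x, alphas):
    a1, a2, a3 = alphas
    return a1 * x**4 + a2 * x**3 + a3 * x**2 + (1 - a1 - a2 - a3) * x
```

With three free coefficients, three or more anchor points determine the map; additional anchors are absorbed in the least-squares sense.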

5. Minimax Thresholds, Theoretical Guarantees, and Empirical Results

COD is rigorously analyzed for optimality guarantees and demonstrated empirically to match the tightest known error rates.

  • Minimax-Optimality (variable clustering):
    • The $\operatorname{MCOD}(\Sigma)$ and $\Delta(C^*)$ metrics govern the cluster separation: $\operatorname{MCOD}(\Sigma) = \min_{a \not\sim b} \max_{c \ne a,b} |\Sigma_{a,c} - \Sigma_{b,c}|$, and $\Delta(C^*) = \min_{j<k} [C^*_{j,j} + C^*_{k,k} - 2C^*_{j,k}]$.
    • For the class $\mathcal{M}(m,\eta)$, no estimator can guarantee exact recovery if $\eta < c \sqrt{\log p / n}$. COD attains exact recovery for $\operatorname{MCOD}(\Sigma) \gtrsim \sqrt{\log p / n}$, matching the minimax threshold (Theorem 3.1, Bunea et al., 2015).
  • Downstream task prediction (LLMs):
    • On eight LLM benchmarks (GSM8K, MATH, BBH, TriviaQA, MBPP, AGIEval, DROP, MMLU-pro), COD achieves a mean absolute error of 1.63 percentage points (1.36% as reported in the abstract), with a maximum error never exceeding 2.4 points, outperforming end-to-end, passrate-only, and loss-intermediate baselines.
    • Key ablation findings include the critical importance of fitting both $g$ and $c$, and the superiority of the quartic mapping over lower- or higher-degree interpolants (Xu et al., 24 Feb 2025).

6. Comparison with Competing Methods

COD is compared against PECOK, a penalized SDP estimator tailored to the $\Delta(C^*)$ metric, and corrected spectral clustering.

| Algorithm | Specialized Metric | Exact Recovery Threshold | Computational Regime |
|---|---|---|---|
| COD | $\operatorname{MCOD}$ | $\gtrsim \sqrt{\log p / n}$ | $O(p^3)$, no SDP, cluster sizes down to $1$ |
| PECOK (SDP relaxation) | $\Delta(C^*)$ | $\gtrsim \sqrt{(K \vee \log p)/(mn)}$ | SDP ($O(p^3)$), clusters $\gtrsim 10$ or balanced |
| Corrected spectral | Latent eigengap | Stronger than minimax; no exact recovery at threshold | Low constant factors, needs larger separation |

COD provides optimal recovery for MCOD-type separation with minimal assumptions and computational simplicity. PECOK is optimal for $\Delta(C^*)$ separation when clusters are balanced and $K = O(\log p)$, but incurs SDP complexity. Corrected spectral clustering is computationally attractive but theoretically requires larger separation and does not attain minimax thresholds (Bunea et al., 2015).

7. Practical Implementation and Use Cases

COD is applied by:

  1. Gathering difficulty or covariance profile matrices.
  2. Clustering via improved MeanShift (LLM) or sCOD thresholding (variable clustering).
  3. Fitting cluster-wise scaling laws.
  4. Aggregating extrapolatable cluster predictions.
  5. Mapping to full set predictions via anchor-calibrated quartic.
  6. Recommending parameter choices: pre-filter all-zero passrates, smooth passrates across checkpoints, and set the MeanShift $R$ and $K$ for intra-cluster diameter $\lesssim 0.2$ and cluster size $\geq 10$.

LLM experiments use small models from 122M to 12B parameters, predict full-model performance at 70B scale, and produce actionable insights for efficient resource allocation and pretraining monitoring (Xu et al., 24 Feb 2025). In fMRI variable clustering, COD identifies meaningful brain networks and outperforms classical clustering across sparsity regimes (Bunea et al., 2015).

COD thus serves as a state-of-the-art methodology for both principled model-based clustering and reliable performance extrapolation, combining minimax theoretical performance with robust empirical accuracy.
