COD: Clustering-On-Difficulty Framework
- The paper introduces the COD framework, which partitions LLM samples and variables using difficulty and covariance profiles for robust performance prediction.
- It employs modified MeanShift and sCOD thresholding algorithms to form clusters, enabling stable scaling law fitting and accurate subset-to-full mapping.
- COD achieves minimax-optimal cluster recovery and outperforms competing methods through explicit separation thresholds and calibrated quartic mapping.
The Clustering-On-Difficulty (COD) framework encompasses a family of model-based clustering techniques that leverage difficulty or similarity metrics to robustly partition samples or variables for downstream performance prediction, covariance structure recovery, or high-dimensional exploratory analysis. COD has been developed and rigorously analyzed in contexts including LLM scaling prediction and model-assisted variable clustering, offering minimax-optimal recovery, explicit cluster separation thresholds, and end-to-end accuracy extrapolation pipelines (Xu et al., 24 Feb 2025, Bunea et al., 2015).
1. Modeling by Difficulty Features and Covariance Profiles
COD’s foundational principle is the clustering of samples (in LLM evaluation) or variables (in variable clustering) based on features that encode “difficulty” or “similarity” as measured by model predictions or covariance profiles.
- In LLM performance prediction, each evaluation sample $i$ is characterized by its passrate vector across a suite of $N$ small models. The difficulty feature for sample $i$ is $s_i = (s_{i1}, \ldots, s_{iN})$, where $s_{ij}$ is the empirical passrate of model $j$ on sample $i$, estimated via repeated few-shot stochastic trials. This vector is typically nondecreasing in the model index, as performance increases with scale (Xu et al., 24 Feb 2025).
- In high-dimensional covariance modeling, the G-block covariance model posits that the $p$-dimensional random vector $X$ has covariance
$$\Sigma = A C A^{\top} + \Gamma,$$
with $A \in \{0,1\}^{p \times K}$ the indicator membership matrix, $C$ the $K \times K$ latent factor covariance, and $\Gamma$ diagonal noise. This structure clusters variables whose covariance profiles are similar across all other variables (Bunea et al., 2015).
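As a quick sanity check of this structure, the block model can be simulated directly. The dimensions, seed, and noise levels below are illustrative choices, not values from the paper:

```python
import numpy as np

# Illustrative simulation (assumed setup, not the paper's code): build a
# covariance with G-block structure Sigma = A C A^T + Gamma.
rng = np.random.default_rng(0)

p, K = 12, 3
labels = np.repeat(np.arange(K), p // K)     # 3 clusters of 4 variables each
A = np.eye(K)[labels]                        # p x K indicator membership matrix

B = rng.standard_normal((K, K))
C = B @ B.T                                  # latent factor covariance (PSD)
Gamma = np.diag(rng.uniform(0.1, 0.5, p))    # diagonal noise covariance
Sigma = A @ C @ A.T + Gamma                  # G-block covariance

# Variables in the same cluster have identical covariance profiles with every
# other variable, which is exactly the similarity that COD thresholds on.
off = list(range(2, p))                      # all variables except 0 and 1
assert np.allclose(Sigma[0, off], Sigma[1, off])
```

Off-diagonal rows of same-cluster variables coincide exactly; only the diagonal (noise) entries differ.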
2. Difficulty-Driven or Similarity-Based Clustering Algorithms
COD clusters are constructed by partitioning points with similar difficulty or similarity profiles, automatically adapting both the cluster count and outlier handling.
- For LLMs: The passrate matrix (samples × small models) is clustered using an improved MeanShift algorithm based on Euclidean distance. The steps are:
- All samples start unassigned.
- Standard MeanShift is run on the unassigned set with a fixed bandwidth.
- Points within a fixed radius of each new center are assigned to that cluster; others remain unassigned.
- Clusters with fewer than a minimum number of points are dissolved; their members revert to unassigned.
- Repeat until convergence; remaining unassigned samples are marked as outliers (Xu et al., 24 Feb 2025).
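The loop above can be sketched in a few dozen lines. The function name `cod_meanshift` is not from the paper, the assignment radius is taken equal to the bandwidth for simplicity, and all parameter values are illustrative:

```python
import numpy as np

def cod_meanshift(X, bandwidth, min_size, max_rounds=10, iters=50):
    """Outlier-aware MeanShift sketch: cluster what is regular, leave the rest.

    Unassigned points keep label -1 (outliers); parameters are illustrative."""
    labels = -np.ones(len(X), dtype=int)
    next_id = 0
    for _ in range(max_rounds):
        idx = np.flatnonzero(labels == -1)
        if len(idx) == 0:
            break
        pts = X[idx]
        # Plain mean-shift: repeatedly move each point toward the mean of its
        # neighbors within the bandwidth.
        modes = pts.copy()
        for _ in range(iters):
            d = np.linalg.norm(modes[:, None] - pts[None], axis=-1)
            w = (d < bandwidth).astype(float)
            w_sum = w.sum(axis=1, keepdims=True)
            modes = np.where(w_sum > 0, (w @ pts) / np.maximum(w_sum, 1.0), modes)
        # Merge near-duplicate modes into candidate centers.
        centers = []
        for m in modes:
            if all(np.linalg.norm(m - c) >= bandwidth / 2 for c in centers):
                centers.append(m)
        # Assign points near each center; dissolve undersized clusters.
        assigned_any = False
        for c in centers:
            near = idx[np.linalg.norm(pts - c, axis=1) < bandwidth]
            if len(near) >= min_size:
                labels[near] = next_id
                next_id += 1
                assigned_any = True
        if not assigned_any:
            break          # remaining unassigned points are outliers
    return labels
```

Points that never join a sufficiently large cluster stay labeled -1 and are excluded from scaling-law fitting.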
For variable clustering: COD operates by identifying the pair of variables $(a, b)$ with minimal
$$\mathrm{sCOD}(a, b) = \max_{c \neq a, b} \left| \widehat{\Sigma}_{ac} - \widehat{\Sigma}_{bc} \right|,$$
and iteratively extracts clusters based on a fixed threshold $\alpha$. A variable whose minimal sCOD to all remaining variables exceeds $\alpha$ is declared a singleton; otherwise, a cluster is built by merging all variables whose sCOD with respect to at least one of the two seed variables falls below $\alpha$ (Bunea et al., 2015).
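A minimal sketch of this thresholding procedure, assuming the max-absolute-difference-of-covariance-profiles form of sCOD described above (function names are illustrative, not the paper's):

```python
import itertools
import numpy as np

def cod_cluster(S, alpha):
    """Sketch of COD thresholding on an estimated covariance matrix S.

    cod(a, b) = max over c != a, b of |S[a, c] - S[b, c]| (assumed form)."""
    def cod(a, b, active):
        others = [c for c in active if c not in (a, b)]
        if not others:
            return 0.0
        return max(abs(S[a, c] - S[b, c]) for c in others)

    active = list(range(len(S)))
    clusters = []
    while len(active) > 1:
        # Seed pair: the two most similar remaining variables.
        a, b = min(itertools.combinations(active, 2),
                   key=lambda p: cod(*p, active))
        if cod(a, b, active) > alpha:
            clusters.extend([v] for v in active)   # all remaining are singletons
            return clusters
        # Merge every variable close to at least one seed.
        group = [c for c in active
                 if c in (a, b)
                 or min(cod(a, c, active), cod(b, c, active)) <= alpha]
        clusters.append(group)
        active = [c for c in active if c not in group]
    clusters.extend([v] for v in active)
    return clusters
```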
3. Cluster-Wise Performance Extrapolation and Scaling Laws
Only clusters with regular, predictable scaling are considered for further extrapolation, yielding stable predictions on subset performance.
- Within each cluster, a parametric scaling law $\mathrm{Acc}(C)$ is fit, where $\mathrm{Acc}(C)$ denotes the expected accuracy at training compute $C$. A random-guess floor and an asymptotic offset from perfect accuracy are essential for accurate fits. Only clusters whose fitted parameters satisfy these constraints, and whose curves are monotonic and extrapolatable, are retained. Cluster-wise predictions are aggregated as a cluster-size-weighted average for the subset (Xu et al., 24 Feb 2025).
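Since the exact functional form is not reproduced here, the sketch below assumes one plausible sigmoid-in-log-compute parameterization with a guess floor `g` and an asymptotic ceiling offset `eps`; the paper's actual law may differ:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log_c, g, eps, w, b):
    """Assumed sigmoidal form: rises from the guess floor g toward 1 - eps."""
    return g + (1.0 - g - eps) / (1.0 + np.exp(-(w * log_c + b)))

def fit_cluster_curve(log_c, acc):
    """Fit the per-cluster law by bounded nonlinear least squares."""
    p0 = [0.2, 0.1, 1.0, -float(np.median(log_c))]   # heuristic initialization
    bounds = ([0.0, 0.0, 1e-3, -100.0], [0.5, 0.5, 10.0, 100.0])
    popt, _ = curve_fit(scaling_law, log_c, acc, p0=p0, bounds=bounds)
    return popt  # (g, eps, w, b)
```

The bounds keep the floor and offset in plausible ranges; clusters whose fits hit the bounds or fail to be monotonic would be excluded from extrapolation.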
4. Mapping Subset Prediction to Full Evaluation Set
Since not all samples fall in extrapolatable clusters, a mapping is learned to translate predicted subset accuracy into full-set accuracy.
- The mapping function is a quartic polynomial constrained at its endpoints, so that the boundary accuracies map consistently (zero subset accuracy to zero, perfect subset accuracy to one).
- Coefficients are fit by least squares or interpolation on anchor points obtained from mid-sized or external models. This calibration reduces bias, particularly when anchors are out-of-distribution models; empirically, out-of-distribution anchors reduce error by approximately 40% relative to no-anchor mapping (Xu et al., 24 Feb 2025).
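A sketch of the anchor-based calibration, assuming the endpoint constraints take the form f(0)=0 and f(1)=1: writing f(x) = x + x(1-x)(c0 + c1 x + c2 x^2) makes the quartic satisfy those constraints for any coefficients, so only the three free coefficients need fitting:

```python
import numpy as np

# Assumed parameterization (not necessarily the paper's): a quartic map with
# built-in endpoint constraints f(0) = 0 and f(1) = 1.
def quartic_map(x, c):
    x = np.asarray(x)
    return x + x * (1 - x) * (c[0] + c[1] * x + c[2] * x**2)

def fit_quartic_map(sub_acc, full_acc):
    """Least-squares fit of the free coefficients on anchor pairs
    (subset accuracy, full-set accuracy)."""
    x = np.asarray(sub_acc)
    basis = (x * (1 - x))[:, None] * np.stack(
        [np.ones_like(x), x, x**2], axis=1)
    c, *_ = np.linalg.lstsq(basis, np.asarray(full_acc) - x, rcond=None)
    return c
```

With this form the endpoint constraints cannot be violated no matter how few or how noisy the anchors are, which is one way to keep the calibration stable.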
5. Minimax Thresholds, Theoretical Guarantees, and Empirical Results
COD is rigorously analyzed for optimality guarantees and demonstrated empirically to match the tightest known error rates.
- Minimax-Optimality (variable clustering):
- The COD metric and its minimal between-cluster value MCOD govern cluster separation: $\mathrm{COD}(a,b) = \max_{c \neq a,b} |\Sigma_{ac} - \Sigma_{bc}|$ and $\mathrm{MCOD}(\Sigma) = \min_{a \not\sim b} \mathrm{COD}(a,b)$, where the minimum ranges over pairs of variables in different clusters.
- Over the G-block model class, no estimator can guarantee exact recovery when $\mathrm{MCOD}(\Sigma)$ falls below a threshold of order $\sqrt{\log p / n}$. COD attains exact recovery once MCOD exceeds this order, matching the minimax threshold up to constants (Theorem 3.1, Bunea et al., 2015).
- Downstream task prediction (LLMs):
- On eight LLM benchmarks (GSM8K, MATH, BBH, TriviaQA, MBPP, AGIEval, DROP, MMLU-pro), COD achieves a mean absolute error of 1.63 percentage points (1.36% in the abstract) and a maximum error never exceeding 2.4 points, outperforming end-to-end, passrate-only, and loss-intermediate baselines.
- Key ablation findings include the critical importance of fitting both and , and the superiority of quartic mapping over lower or higher-degree interpolants (Xu et al., 24 Feb 2025).
6. Comparative Analysis with Related Algorithms
COD is compared against PECOK, a penalized SDP relaxation tailored to a related separation metric, and corrected spectral clustering.
| Algorithm | Separation Metric | Exact Recovery Threshold | Computational Regime |
|---|---|---|---|
| COD | MCOD | Order $\sqrt{\log p / n}$, minimax-optimal | No SDP; polynomial time; allows singleton clusters |
| PECOK (SDP relaxation) | Within/between covariance gap | Optimal for its metric given balanced clusters | Requires solving an SDP |
| Corrected Spectral | Latent eigengap | Stronger than minimax; no exact recovery at threshold | Low constant factors; needs larger separation |
COD provides optimal recovery for MCOD-type separation with minimal assumptions and computational simplicity. PECOK is optimal for its own separation metric when clusters are balanced, but incurs the cost of solving an SDP. Corrected spectral clustering is computationally attractive but theoretically requires larger separation and does not attain minimax thresholds (Bunea et al., 2015).
7. Practical Implementation and Use Cases
COD is applied by:
- Gathering difficulty or covariance profile matrices.
- Clustering via improved MeanShift (LLM) or sCOD thresholding (variable clustering).
- Fitting cluster-wise scaling laws.
- Aggregating extrapolatable cluster predictions.
- Mapping to full-set predictions via the anchor-calibrated quartic polynomial.
- Recommended parameter choices include pre-filtering all-zero passrates, smoothing passrates across training checkpoints, and tuning the MeanShift bandwidth together with the intra-cluster diameter and minimum cluster size.
LLM experiments use small models from 122M to 12B parameters, predict full-model performance at 70B scale, and produce actionable insights for efficient resource allocation and pretraining monitoring (Xu et al., 24 Feb 2025). In fMRI variable clustering, COD identifies meaningful brain networks and outperforms classical clustering across sparsity regimes (Bunea et al., 2015).
COD thus serves as a state-of-the-art methodology for both principled model-based clustering and reliable performance extrapolation, combining minimax theoretical performance with robust empirical accuracy.