COD: Clustering-On-Difficulty Framework
- The paper introduces the COD framework, which partitions LLM samples and variables using difficulty and covariance profiles for robust performance prediction.
- It employs modified MeanShift and sCOD thresholding algorithms to form clusters, enabling stable scaling law fitting and accurate subset-to-full mapping.
- COD achieves minimax-optimal cluster recovery and outperforms competing methods through explicit separation thresholds and calibrated quartic mapping.
The Clustering-On-Difficulty (COD) framework encompasses a family of model-based clustering techniques that leverage difficulty or similarity metrics to robustly partition samples or variables for downstream performance prediction, covariance structure recovery, or high-dimensional exploratory analysis. COD has been developed and rigorously analyzed in contexts including LLM scaling prediction and model-assisted variable clustering, offering minimax-optimal recovery, explicit cluster separation thresholds, and end-to-end accuracy extrapolation pipelines (Xu et al., 24 Feb 2025, Bunea et al., 2015).
1. Modeling by Difficulty Features and Covariance Profiles
COD’s foundational principle is the clustering of samples (in LLM evaluation) or variables (in variable clustering) based on features that encode “difficulty” or “similarity” as measured by model predictions or covariance profiles.
- In LLM performance prediction, each evaluation sample $i$ is characterized by its passrate vector across a suite of $N$ small models. The difficulty feature for sample $i$ is $s_i = (s_{i1}, \ldots, s_{iN})$, where $s_{ij}$ is the empirical passrate of model $j$ on sample $i$, estimated via repeated few-shot stochastic trials. This vector is typically nondecreasing in the model index, as performance increases with scale (Xu et al., 24 Feb 2025).
- In high-dimensional covariance modeling, the G-block covariance model posits that the $p$-dimensional random vector $X$ has covariance
$$\Sigma = A C A^{\top} + \Gamma,$$
with $A \in \{0,1\}^{p \times K}$ the indicator membership matrix, $C$ the $K \times K$ latent factor covariance, and $\Gamma$ diagonal noise. This structure clusters variables whose covariance profiles are similar across all other variables (Bunea et al., 2015).
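As a quick sanity check of this structure, the block model can be simulated directly. The dimensions, seed, and noise levels below are illustrative choices, not values from the paper:

```python
import numpy as np

# Illustrative simulation (assumed setup, not the paper's code): build a
# covariance with G-block structure Sigma = A C A^T + Gamma.
rng = np.random.default_rng(0)

p, K = 12, 3
labels = np.repeat(np.arange(K), p // K)     # 3 clusters of 4 variables each
A = np.eye(K)[labels]                        # p x K indicator membership matrix

B = rng.standard_normal((K, K))
C = B @ B.T                                  # latent factor covariance (PSD)
Gamma = np.diag(rng.uniform(0.1, 0.5, p))    # diagonal noise covariance
Sigma = A @ C @ A.T + Gamma                  # G-block covariance

# Variables in the same cluster have identical covariance profiles with every
# other variable, which is exactly the similarity that COD thresholds on.
off = list(range(2, p))                      # all variables except 0 and 1
assert np.allclose(Sigma[0, off], Sigma[1, off])
```

Off-diagonal rows of same-cluster variables coincide exactly; only the diagonal (noise) entries differ.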
2. Difficulty-Driven or Similarity-Based Clustering Algorithms
COD clusters are constructed by partitioning points with similar difficulty or similarity profiles, automatically adapting both the cluster count and outlier handling.
- For LLMs: The passrate matrix (samples × small models) is clustered using an improved MeanShift algorithm based on Euclidean distance. The steps are:
- All samples start unassigned.
- Standard MeanShift is run on the unassigned set with a fixed bandwidth.
- Points within a fixed radius of each new center are assigned to that cluster; others remain unassigned.
- Clusters with fewer than a minimum number of points are dissolved; their members revert to unassigned.
- Repeat until convergence; remaining unassigned samples are marked as outliers (Xu et al., 24 Feb 2025).
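The loop above can be sketched in a few dozen lines. The function name `cod_meanshift` is not from the paper, the assignment radius is taken equal to the bandwidth for simplicity, and all parameter values are illustrative:

```python
import numpy as np

def cod_meanshift(X, bandwidth, min_size, max_rounds=10, iters=50):
    """Outlier-aware MeanShift sketch: cluster what is regular, leave the rest.

    Unassigned points keep label -1 (outliers); parameters are illustrative."""
    labels = -np.ones(len(X), dtype=int)
    next_id = 0
    for _ in range(max_rounds):
        idx = np.flatnonzero(labels == -1)
        if len(idx) == 0:
            break
        pts = X[idx]
        # Plain mean-shift: repeatedly move each point toward the mean of its
        # neighbors within the bandwidth.
        modes = pts.copy()
        for _ in range(iters):
            d = np.linalg.norm(modes[:, None] - pts[None], axis=-1)
            w = (d < bandwidth).astype(float)
            w_sum = w.sum(axis=1, keepdims=True)
            modes = np.where(w_sum > 0, (w @ pts) / np.maximum(w_sum, 1.0), modes)
        # Merge near-duplicate modes into candidate centers.
        centers = []
        for m in modes:
            if all(np.linalg.norm(m - c) >= bandwidth / 2 for c in centers):
                centers.append(m)
        # Assign points near each center; dissolve undersized clusters.
        assigned_any = False
        for c in centers:
            near = idx[np.linalg.norm(pts - c, axis=1) < bandwidth]
            if len(near) >= min_size:
                labels[near] = next_id
                next_id += 1
                assigned_any = True
        if not assigned_any:
            break          # remaining unassigned points are outliers
    return labels
```

Points that never join a sufficiently large cluster stay labeled -1 and are excluded from scaling-law fitting.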
For variable clustering: COD operates by identifying the pair of variables $(a, b)$ with minimal
$$\mathrm{sCOD}(a, b) = \max_{c \neq a, b} \left| \widehat{\Sigma}_{ac} - \widehat{\Sigma}_{bc} \right|,$$
and iteratively extracts clusters based on a fixed threshold $\alpha$. A variable whose minimal sCOD to all remaining variables exceeds $\alpha$ is declared a singleton; otherwise, a cluster is built by merging all variables whose sCOD with respect to at least one of the two seed variables falls below $\alpha$ (Bunea et al., 2015).
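A minimal sketch of this thresholding procedure, assuming the max-absolute-difference-of-covariance-profiles form of sCOD described above (function names are illustrative, not the paper's):

```python
import itertools
import numpy as np

def cod_cluster(S, alpha):
    """Sketch of COD thresholding on an estimated covariance matrix S.

    cod(a, b) = max over c != a, b of |S[a, c] - S[b, c]| (assumed form)."""
    def cod(a, b, active):
        others = [c for c in active if c not in (a, b)]
        if not others:
            return 0.0
        return max(abs(S[a, c] - S[b, c]) for c in others)

    active = list(range(len(S)))
    clusters = []
    while len(active) > 1:
        # Seed pair: the two most similar remaining variables.
        a, b = min(itertools.combinations(active, 2),
                   key=lambda p: cod(*p, active))
        if cod(a, b, active) > alpha:
            clusters.extend([v] for v in active)   # all remaining are singletons
            return clusters
        # Merge every variable close to at least one seed.
        group = [c for c in active
                 if c in (a, b)
                 or min(cod(a, c, active), cod(b, c, active)) <= alpha]
        clusters.append(group)
        active = [c for c in active if c not in group]
    clusters.extend([v] for v in active)
    return clusters
```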
3. Cluster-Wise Performance Extrapolation and Scaling Laws
Only clusters with regular, predictable scaling are considered for further extrapolation, yielding stable predictions on subset performance.
- Within each cluster, a parametric scaling law $\mathrm{Acc}(C)$ is fit, where $\mathrm{Acc}(C)$ denotes the expected accuracy at training compute $C$. A random-guess floor and an asymptotic offset from perfect accuracy are essential for accurate fits. Only clusters whose fitted parameters satisfy these constraints, and whose curves are monotonic and extrapolatable, are retained. Cluster-wise predictions are aggregated as a cluster-size-weighted average for the subset (Xu et al., 24 Feb 2025).
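Since the exact functional form is not reproduced here, the sketch below assumes one plausible sigmoid-in-log-compute parameterization with a guess floor `g` and an asymptotic ceiling offset `eps`; the paper's actual law may differ:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log_c, g, eps, w, b):
    """Assumed sigmoidal form: rises from the guess floor g toward 1 - eps."""
    return g + (1.0 - g - eps) / (1.0 + np.exp(-(w * log_c + b)))

def fit_cluster_curve(log_c, acc):
    """Fit the per-cluster law by bounded nonlinear least squares."""
    p0 = [0.2, 0.1, 1.0, -float(np.median(log_c))]   # heuristic initialization
    bounds = ([0.0, 0.0, 1e-3, -100.0], [0.5, 0.5, 10.0, 100.0])
    popt, _ = curve_fit(scaling_law, log_c, acc, p0=p0, bounds=bounds)
    return popt  # (g, eps, w, b)
```

The bounds keep the floor and offset in plausible ranges; clusters whose fits hit the bounds or fail to be monotonic would be excluded from extrapolation.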
4. Mapping Subset Prediction to Full Evaluation Set
Since not all samples fall in extrapolatable clusters, a mapping is learned to translate predicted subset accuracy into full-set accuracy.
- The mapping function is a quartic polynomial constrained at its endpoints, so that the boundary accuracies map consistently (zero subset accuracy to zero, perfect subset accuracy to one).
- Coefficients are fit by least squares or interpolation on anchor points obtained from mid-sized or external models. This calibration reduces bias, particularly when anchors are out-of-distribution models; empirically, out-of-distribution anchors reduce error by approximately 40% relative to no-anchor mapping (Xu et al., 24 Feb 2025).
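A sketch of the anchor-based calibration, assuming the endpoint constraints take the form f(0)=0 and f(1)=1: writing f(x) = x + x(1-x)(c0 + c1 x + c2 x^2) makes the quartic satisfy those constraints for any coefficients, so only the three free coefficients need fitting:

```python
import numpy as np

# Assumed parameterization (not necessarily the paper's): a quartic map with
# built-in endpoint constraints f(0) = 0 and f(1) = 1.
def quartic_map(x, c):
    x = np.asarray(x)
    return x + x * (1 - x) * (c[0] + c[1] * x + c[2] * x**2)

def fit_quartic_map(sub_acc, full_acc):
    """Least-squares fit of the free coefficients on anchor pairs
    (subset accuracy, full-set accuracy)."""
    x = np.asarray(sub_acc)
    basis = (x * (1 - x))[:, None] * np.stack(
        [np.ones_like(x), x, x**2], axis=1)
    c, *_ = np.linalg.lstsq(basis, np.asarray(full_acc) - x, rcond=None)
    return c
```

With this form the endpoint constraints cannot be violated no matter how few or how noisy the anchors are, which is one way to keep the calibration stable.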
5. Minimax Thresholds, Theoretical Guarantees, and Empirical Results
COD is rigorously analyzed for optimality guarantees and demonstrated empirically to match the tightest known error rates.
- Minimax-Optimality (variable clustering):
- The COD metric and its minimal between-cluster value MCOD govern cluster separation: $\mathrm{COD}(a,b) = \max_{c \neq a,b} |\Sigma_{ac} - \Sigma_{bc}|$ and $\mathrm{MCOD}(\Sigma) = \min_{a \not\sim b} \mathrm{COD}(a,b)$, where the minimum ranges over pairs of variables in different clusters.
- Over the G-block model class, no estimator can guarantee exact recovery when $\mathrm{MCOD}(\Sigma)$ falls below a threshold of order $\sqrt{\log p / n}$. COD attains exact recovery once MCOD exceeds this order, matching the minimax threshold up to constants (Theorem 3.1, Bunea et al., 2015).
- Downstream task prediction (LLMs):
- On eight LLM benchmarks (GSM8K, MATH, BBH, TriviaQA, MBPP, AGIEval, DROP, MMLU-pro), COD achieves a mean absolute error of 1.63 percentage points (1.36% in the abstract) and a maximum error never exceeding 2.4 points, outperforming end-to-end, passrate-only, and loss-intermediate baselines.
- Key ablation findings include the critical importance of fitting both and , and the superiority of quartic mapping over lower or higher-degree interpolants (Xu et al., 24 Feb 2025).
6. Comparative Analysis with Related Algorithms
COD is compared against PECOK, a penalized SDP relaxation tailored to a related separation metric, and corrected spectral clustering.
| Algorithm | Separation Metric | Exact Recovery Threshold | Computational Regime |
|---|---|---|---|
| COD | MCOD | Order $\sqrt{\log p / n}$, minimax-optimal | No SDP; polynomial time; allows singleton clusters |
| PECOK (SDP relaxation) | Within/between covariance gap | Optimal for its metric given balanced clusters | Requires solving an SDP |
| Corrected Spectral | Latent eigengap | Stronger than minimax; no exact recovery at threshold | Low constant factors; needs larger separation |
COD provides optimal recovery for MCOD-type separation with minimal assumptions and computational simplicity. PECOK is optimal for its own separation metric when clusters are balanced, but incurs the cost of solving an SDP. Corrected spectral clustering is computationally attractive but theoretically requires larger separation and does not attain minimax thresholds (Bunea et al., 2015).
7. Practical Implementation and Use Cases
COD is applied by:
- Gathering difficulty or covariance profile matrices.
- Clustering via improved MeanShift (LLM) or sCOD thresholding (variable clustering).
- Fitting cluster-wise scaling laws.
- Aggregating extrapolatable cluster predictions.
- Mapping to full-set predictions via the anchor-calibrated quartic polynomial.
- Recommended parameter choices include pre-filtering all-zero passrates, smoothing passrates across training checkpoints, and tuning the MeanShift bandwidth together with the intra-cluster diameter and minimum cluster size.
LLM experiments use small models from 122M to 12B parameters, predict full-model performance at 70B scale, and produce actionable insights for efficient resource allocation and pretraining monitoring (Xu et al., 24 Feb 2025). In fMRI variable clustering, COD identifies meaningful brain networks and outperforms classical clustering across sparsity regimes (Bunea et al., 2015).
COD thus serves as a state-of-the-art methodology for both principled model-based clustering and reliable performance extrapolation, combining minimax theoretical performance with robust empirical accuracy.