DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Published 9 Oct 2025 in cs.LG and cs.AI | (2510.07959v1)

Abstract: Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. DISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.

Summary

The paper demonstrates that condensing an evaluation dataset to the samples on which models disagree most can cut evaluation cost by over 99% while keeping the error in estimated accuracy to about 1.07 percentage points.
The DISCO method leverages Predictive Diversity Score and Jensen-Shannon Divergence to simplify sample selection compared to traditional, clustering-based approaches.
The approach enables frequent and sustainable model evaluations across language and vision benchmarks, promoting faster innovation cycles in AI development.

DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

The paper "DISCO: Diversifying Sample Condensation for Efficient Model Evaluation" presents a novel approach to efficiently evaluate machine learning models by condensing evaluation datasets through sample selection that maximizes model disagreement. The proposed method, DISCO, addresses the escalating cost of model evaluation by significantly reducing the number of samples used while maintaining accurate performance predictions.

Problem and Motivation

The increasing size and complexity of machine learning models have led to prohibitive evaluation costs, often involving thousands of GPU hours per model. Existing efficient-evaluation approaches select an anchor subset of data and learn a mapping from accuracy on this subset to the final test result. Their anchor selection depends on clustering, which can be complex and sensitive to design choices. The paper argues that promoting diversity among the samples themselves is not essential; what matters is selecting samples that maximize diversity in model responses.

Figure 1: Problem overview: Selecting a smaller evaluation dataset while maintaining close estimated performances.

DISCO: Method Overview

DISCO comprises two main components: dataset selection and performance prediction. The dataset selection step identifies a reduced subset of the evaluation data that is most informative for performance prediction. It does so by keeping the samples on which different models' predictions disagree most. The paper supports this choice with an information-theoretic argument: inter-model disagreement is an optimal criterion for this kind of greedy, sample-wise selection.

Figure 2: DISCO overview: First, select informative samples, then predict unseen models' performance.

Dataset Selection

The selection process measures model disagreement on each sample with the Predictive Diversity Score (PDS) or the Jensen-Shannon Divergence (JSD). The top-k samples with the highest disagreement are kept. Because each score is computed per sample, the procedure needs no global clustering, which makes it simpler than clustering-based anchor selection.
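The selection step can be illustrated with a minimal sketch. It assumes each model's output on a sample is a probability distribution over classes, uses the standard generalized JSD with uniform weights, and takes PDS to be the sum over classes of the largest probability any model assigns (an assumed form, not confirmed by this summary). The function names and toy data are hypothetical and do not come from the DISCO codebase.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def jsd(dists):
    """Generalized Jensen-Shannon divergence with uniform weights:
    entropy of the mean distribution minus the mean of the entropies."""
    m = len(dists)
    mean = [sum(col) / m for col in zip(*dists)]
    return entropy(mean) - sum(entropy(p) for p in dists) / m

def pds(dists):
    """Assumed PDS form: sum over classes of the max probability across models."""
    return sum(max(col) for col in zip(*dists))

def select_top_k(outputs, k, score=jsd):
    """outputs[i] holds the per-model distributions for sample i.
    Returns the indices of the k highest-scoring samples, highest first."""
    scores = [score(d) for d in outputs]
    return sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)[:k]

# Toy data: 3 samples, 3 models, 2 classes (illustrative values only).
outputs = [
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],  # models agree
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],  # strong disagreement
    [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]],  # mild disagreement
]
selected = select_top_k(outputs, k=2)
```

On this toy input both scores rank the strongly contested sample first and the unanimous sample last, so the unanimous sample is the one dropped.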

Performance Prediction

Performance prediction in DISCO does not reduce a model to a scalar summary such as its accuracy on an anchor set. Instead, it builds a model signature (the model's outputs on the selected samples, concatenated into one vector) and feeds it to a standard predictor such as kNN or Random Forest, which estimates the model's performance on the full benchmark. The predictor is fit on models whose full-benchmark performance is already known.
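As a rough illustration of signature-based prediction, the sketch below builds signatures by concatenating output vectors on the selected samples and estimates a new model's accuracy as the mean accuracy of its k nearest source models. The Euclidean distance, the unweighted average, and all names and numbers are assumptions made for this example, not details taken from the paper.

```python
import math

def signature(model_outputs, selected):
    """Concatenate a model's output vectors on the selected sample indices."""
    return [v for i in selected for v in model_outputs[i]]

def knn_predict(train_sigs, train_accs, query_sig, k=3):
    """Mean full-benchmark accuracy of the k source models whose
    signatures are closest (Euclidean distance) to the query signature."""
    dists = [math.dist(s, query_sig) for s in train_sigs]
    nearest = sorted(range(len(dists)), key=lambda j: dists[j])[:k]
    return sum(train_accs[j] for j in nearest) / len(nearest)

# Toy data: 3 source models with known accuracy, 4 samples, 2 classes.
source_outputs = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.9, 0.1]],
    [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4], [0.8, 0.2]],
    [[0.5, 0.5], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]],
]
source_accs = [0.80, 0.70, 0.40]   # hypothetical full-benchmark accuracies
selected = [1, 3]                  # indices kept by the selection step

new_model = [[0.9, 0.1], [0.75, 0.25], [0.6, 0.4], [0.85, 0.15]]

train_sigs = [signature(m, selected) for m in source_outputs]
estimate = knn_predict(train_sigs, source_accs, signature(new_model, selected), k=2)
```

Only the selected samples have to be run through the new model, which is where the cost saving comes from; the source models are evaluated on the full benchmark once, ahead of time.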

Experiments and Results

The experimental evaluation covers both language and vision benchmarks. On the MMLU language benchmark, DISCO reduced evaluation cost by 99.3% while keeping the error in estimated accuracy to 1.07 percentage points, outperforming prior methods. Similar results were observed in the vision domain on ImageNet, with substantial cost reductions and small estimation error.

Figure 3: MMLU performance estimation versus compression rate, showing the correlation between true and estimated model rankings together with the mean absolute error (MAE).
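The two quantities named in the caption can be computed as in the sketch below. It assumes the ranking agreement is measured with Spearman rank correlation and that MAE is reported in percentage points; the accuracy values are made up for illustration and are not results from the paper.

```python
def mae_pp(true_acc, est_acc):
    """Mean absolute error in percentage points (accuracies given in [0, 1])."""
    total = sum(abs(t - e) for t, e in zip(true_acc, est_acc))
    return 100.0 * total / len(true_acc)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        result[i] = pos
    return result

def spearman(x, y):
    """Spearman rank correlation via the no-ties formula
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical true and estimated accuracies for four models.
true_acc = [0.62, 0.71, 0.55, 0.80]
est_acc = [0.63, 0.69, 0.56, 0.81]

error = mae_pp(true_acc, est_acc)
rank_corr = spearman(true_acc, est_acc)
```

The two metrics answer different questions: MAE measures how far the estimated accuracies are from the true ones, while rank correlation measures whether the estimates order the models correctly, which is what matters for leaderboards.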

Applications and Implications

The DISCO framework has practical implications for efficient and scalable model evaluation. By drastically reducing evaluation cost with little loss in estimation accuracy, it enables more frequent and inclusive model evaluations, supporting faster innovation cycles and a smaller environmental footprint. The method is particularly beneficial in low-resource settings, where it supports end-user model checks and frequent performance assessments during training.

Conclusion

DISCO presents a practical, scalable solution for model evaluation by optimizing sample selection based on model output diversity. Its application can lead to more sustainable and accessible AI development processes. Future work could focus on enhancing model adaptability to handle distribution shifts and integrating adaptive learning techniques for continuous model updates.

The paper contributes to the methodological advancement of model evaluation in machine learning, with the potential to significantly impact AI research and deployment strategies.