- The paper formally defines overtuning as the excess test error of the final incumbent relative to the best incumbent observed earlier in the hyperparameter optimization run.
- It empirically shows that severe overtuning, where the final configuration generalizes worse than the default or first one, occurs in roughly 10% of runs, particularly in small-data settings and with flexible models.
- The study evaluates mitigation strategies such as repeated cross-validation and noise-aware Bayesian optimization to reduce overtuning in practical applications.
The paper "Overtuning in Hyperparameter Optimization" (2506.19540) provides a rigorous treatment of overtuning, a phenomenon in which hyperparameter optimization (HPO) overfits to stochastic validation error estimates, resulting in the selection of hyperparameter configurations (HPCs) that generalize worse than earlier or default configurations. The work offers a formal definition of overtuning, distinguishes it from related concepts such as meta-overfitting and test regret, and presents a comprehensive empirical analysis of its prevalence and determinants across a range of HPO benchmarks. The authors further discuss mitigation strategies and their practical implications for AutoML and HPO workflows.
The paper introduces overtuning as a distinct form of overfitting at the HPO level. Unlike meta-overfitting, which quantifies the gap between validation and test error for the selected HPC, overtuning specifically measures the extent to which the final incumbent's test error exceeds that of any previously observed incumbent in the HPO trajectory. This is formalized as:
- Overtuning: The difference between the test error of the current incumbent and the minimum test error among all previous incumbents.
- Relative Overtuning: The overtuning effect normalized by the maximum test error improvement achieved during the HPO run, enabling comparison across datasets and metrics.
This distinction is critical: overtuning can occur even when the validation-test gap (meta-overfitting) is mild, and substantial meta-overfitting does not necessarily entail overtuning. The paper also clarifies the relationship between overtuning, trajectory test regret, and oracle test regret, providing a nuanced taxonomy for analyzing HPO inefficiencies.
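The two definitions above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the authors' code; `test_errors` is assumed to hold the test errors of successive incumbents along one HPO run:

```python
def overtuning(test_errors):
    """Overtuning of the final incumbent: its test error minus the minimum
    test error over the whole incumbent trajectory (0 if it is the best)."""
    return max(0.0, test_errors[-1] - min(test_errors))

def relative_overtuning(test_errors):
    """Overtuning normalized by the best test-error improvement over the
    first configuration, making values comparable across datasets/metrics."""
    improvement = test_errors[0] - min(test_errors)
    if improvement <= 0:
        return float("inf") if overtuning(test_errors) > 0 else 0.0
    return overtuning(test_errors) / improvement

# Incumbent test errors over a run that first improves, then degrades.
traj = [0.30, 0.24, 0.20, 0.23]
print(round(overtuning(traj), 6))           # 0.03
print(round(relative_overtuning(traj), 6))  # 0.3
```

A relative overtuning of 0.3 means the final incumbent gave back 30% of the best test-error improvement the run had achieved.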
Empirical Analysis
A large-scale reanalysis of public HPO benchmark data (including FCNet, LCBench, WDTB, TabZilla, TabRepo, and others) quantifies the prevalence and severity of overtuning. Key findings include:
- In approximately 10% of HPO runs, overtuning is severe: the final selected HPC generalizes worse than the default or first configuration.
- No overtuning is observed in about 60% of runs, and roughly 70% of runs exhibit relative overtuning below 0.1, so most runs are affected only mildly.
- Overtuning is more pronounced in small-data regimes, with simple holdout validation, and when using flexible models (e.g., neural networks, CatBoost) and certain metrics (accuracy, ROC AUC).
- More robust resampling strategies (e.g., repeated cross-validation) and larger datasets substantially reduce both the frequency and magnitude of overtuning.
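The mechanism behind these findings is a winner's curse: the optimizer selects the configuration with the lowest *noisy* validation estimate, which is biased toward lucky measurements. The toy simulation below (entirely synthetic, not from the paper) makes the effect concrete; all configurations share the same true error, yet the selected one looks better than it is, and averaging repeated measurements, as repeated CV does, shrinks the bias:

```python
import random
import statistics

def selection_bias(n_configs=50, noise=0.05, n_repeats=1, rng=None):
    """Return the optimistic bias of argmin selection over noisy validation
    scores. All configurations have the same true error, so any apparent
    improvement of the selected one is pure noise-chasing."""
    rng = rng or random.Random(0)
    true_error = 0.25
    # Validation estimate = mean of `n_repeats` noisy measurements;
    # n_repeats > 1 mimics repeated CV / reshuffled splits.
    val = [statistics.mean(rng.gauss(true_error, noise) for _ in range(n_repeats))
           for _ in range(n_configs)]
    return true_error - min(val)

rng = random.Random(42)
bias_holdout = statistics.mean(selection_bias(n_repeats=1, rng=rng) for _ in range(200))
bias_repeated = statistics.mean(selection_bias(n_repeats=5, rng=rng) for _ in range(200))
print(bias_holdout > bias_repeated)  # averaging shrinks the optimistic bias
```

Larger noise (small datasets, simple holdout) and more evaluated configurations (longer runs) both inflate this bias, matching the empirical determinants listed above.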
The authors employ generalized linear mixed-effects models to identify determinants of overtuning. The analysis reveals that longer HPO runs increase overtuning risk, but the effect plateaus. Bayesian optimization (BO) methods slightly increase the probability of overtuning but reduce its magnitude compared to random search (RS). Early stopping and reshuffling of resampling splits can further mitigate overtuning, especially in the small-data/holdout regime.
Mitigation Strategies
The paper systematically reviews and empirically evaluates mitigation strategies, which fall into three categories:
- Objective Function Modification: Using more robust resampling (e.g., repeated CV), regularization, or noise-aware surrogate modeling in BO.
- Incumbent Selection: Early stopping, conservative selection criteria (e.g., LOOCVCV), or selection based on surrogate posterior means.
- Optimizer Modification: Employing BO over RS, adaptive resampling, or racing techniques to allocate evaluation budgets more efficiently.
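As a minimal sketch of the first category, the repeated-CV objective can be written as below. The split generator is stdlib-only; `fit_and_score` is a hypothetical user-supplied callable (not from the paper) that trains a configuration on the train indices and returns its validation error:

```python
import random

def repeated_cv_indices(n, k=5, repeats=3, seed=0):
    """Yield (train, val) index pairs for `repeats` independently shuffled
    k-fold partitions of n examples."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            val = idx[fold::k]              # every k-th index of this shuffle
            val_set = set(val)
            train = [i for i in idx if i not in val_set]
            yield train, val

def repeated_cv_score(config, n, fit_and_score, k=5, repeats=3):
    """Validation objective for HPO: the error of `config` averaged over
    all k * repeats folds, which lowers the variance the optimizer chases."""
    scores = [fit_and_score(config, tr, va)
              for tr, va in repeated_cv_indices(n, k, repeats)]
    return sum(scores) / len(scores)
```

The extra cost is a factor of `repeats` more model fits per configuration, which is the computational price the paper notes for this mitigation.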
Empirical results indicate that repeated CV is the most effective and practical mitigation, albeit at increased computational cost. BO with noise-aware selection, as well as reshuffling of resampling splits, also shows promise, particularly for flexible models and noisy metrics. By contrast, a dedicated selection set or overly conservative selection criteria can degrade generalization, either by reducing the effective training set size or by passing over genuinely better configurations.
Implications and Future Directions
The formalization and empirical quantification of overtuning have several implications for both research and practice:
- Benchmarking and Reporting: Overtuning should be routinely reported in HPO and AutoML studies, especially in small-data settings or when using noisy validation metrics.
- AutoML System Design: Automated systems should default to robust resampling strategies and incorporate overtuning-aware incumbent selection, particularly for tabular and small-scale tasks.
- Theoretical Analysis: The taxonomy and metrics introduced enable more precise theoretical analysis of HPO inefficiencies and the development of regret bounds for overtuning.
- Algorithm Development: There is scope for new HPO algorithms that explicitly control overtuning, e.g., by adaptively adjusting the aggressiveness of search or by integrating uncertainty quantification in incumbent selection.
The findings challenge the assumption that aggressive minimization of validation error in HPO always leads to improved generalization. The results underscore the need for overtuning-aware HPO protocols, especially as AutoML systems are increasingly deployed in domains with limited data and high-stakes decision-making.
Speculation on Future Developments
Future research may focus on:
- Adaptive HPO Protocols: Methods that dynamically adjust resampling, stopping, and selection strategies based on overtuning risk estimates.
- Meta-Learning for Overtuning Mitigation: Leveraging meta-features to predict overtuning risk and select appropriate mitigation strategies per task.
- Integration with Foundation Models: As foundation models are increasingly used for tabular and small-data tasks, overtuning-aware fine-tuning and adaptation protocols will become essential.
- Theoretical Guarantees: Development of tighter theoretical bounds on overtuning and its interaction with model complexity, data size, and search space structure.
In summary, the paper provides a comprehensive and rigorous treatment of overtuning in HPO, offering both formal tools and practical guidance for mitigating its impact in real-world machine learning workflows.