
Nonparametric Bandits with Single-Index Rewards: Optimality and Adaptivity

Published 31 Dec 2025 in math.ST (arXiv:2512.24669v1)

Abstract: Contextual bandits are a central framework for sequential decision-making, with applications ranging from recommendation systems to clinical trials. While nonparametric methods can flexibly model complex reward structures, they suffer from the curse of dimensionality. We address this challenge using a single-index model, which projects high-dimensional covariates onto a one-dimensional subspace while preserving nonparametric flexibility. We first develop a nonasymptotic theory for offline single-index regression for each arm, combining maximum rank correlation for index estimation with local polynomial regression. Building on this foundation, we propose a single-index bandit algorithm and establish its convergence rate. We further derive a matching lower bound, showing that the algorithm achieves minimax-optimal regret independent of the ambient dimension $d$, thereby overcoming the curse of dimensionality. We also establish an impossibility result for adaptation: without additional assumptions, no policy can adapt to unknown smoothness levels. Under a standard self-similarity condition, however, we construct a policy that remains minimax-optimal while automatically adapting to the unknown smoothness. Finally, as the dimension $d$ increases, our algorithm continues to achieve minimax-optimal regret, revealing a phase transition that characterizes the fundamental limits of single-index bandit learning.

Summary

  • The paper introduces a single-index bandit model that reduces high-dimensional contextual problems to one-dimensional nonparametric learning, leading to minimax regret optimality.
  • It combines Maximum Rank Correlation estimation with local polynomial regression to accurately estimate the index vectors and link functions under mild regularity conditions.
  • The work reveals intrinsic adaptivity challenges, proving impossibility without self-similarity while proposing undersmoothing methods that preserve optimal regret rates and practical efficiency.


Introduction and Problem Formulation

The paper "Nonparametric Bandits with Single-Index Rewards: Optimality and Adaptivity" (2512.24669) addresses a central challenge in contextual bandit problems: striking a balance between model expressivity and statistical efficiency in high-dimensional settings. While contextual bandits are essential for sequential decision-making with side information, the sample complexity of nonparametric approaches scales poorly with the context dimension $d$. To mitigate this, the authors propose the single-index bandit model, in which the potentially high-dimensional covariates are projected onto a univariate subspace through an unknown linear index per arm, followed by an unknown nonparametric link function. This structure preserves modeling flexibility while circumventing the curse of dimensionality, a claim supported by sharp theoretical results.

The paper formalizes the $K$-armed contextual bandit problem under a single-index reward model: for each arm $k$, the expected reward is $g_k(X) = f_k(v_k^\top X)$, where $f_k$ is an unknown Hölder link function and $v_k \in \mathbb{R}^d$ is an unknown index vector. The objective is to construct a policy that minimizes cumulative regret, measured relative to the optimal arm selection, while observing only the rewards of the chosen arms at each round.
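To make the reward model concrete, a minimal simulation can be sketched as follows. The index vectors, link functions, and noise level below are illustrative choices for the sketch, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 3

# One unknown unit index vector per arm (illustrative).
V = rng.standard_normal((K, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# One unknown (here: hand-picked) link function per arm.
links = [np.tanh, lambda t: t / (1 + np.abs(t)), np.arctan]

def pull(x, k):
    """Observe a noisy reward for arm k at context x: f_k(v_k^T x) + noise."""
    return links[k](V[k] @ x) + 0.1 * rng.standard_normal()

x = rng.standard_normal(d)
rewards = [pull(x, k) for k in range(K)]
```

A learner only sees `pull(x, k)` for the arm it chooses, never the index vectors `V` or the links directly.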

Single-Index Regression: Estimation Theory

A core technical contribution is the development of a nonasymptotic theory for single-index regression suited to bandit contexts. The key estimator combines Han's Maximum Rank Correlation (MRC) procedure for index estimation with local polynomial regression (LPE) conditioned on the estimated index. The MRC directly estimates the latent $v_k$ for each arm by maximizing the empirical rank correlation of $(Y^{(k)}, v_k^\top X)$ over a sample split.
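As a sketch of the MRC step, the following maximizes Han's rank-correlation objective by random search over the unit sphere. The brute-force search is a stand-in for whatever non-convex solver one would use in practice, and the function names are illustrative:

```python
import numpy as np

def mrc_objective(v, X, y):
    """Han's empirical rank correlation: fraction of ordered pairs (i, j)
    with y_i > y_j and v^T x_i > v^T x_j."""
    idx = X @ v
    conc = (y[:, None] > y[None, :]) & (idx[:, None] > idx[None, :])
    n = len(y)
    return conc.sum() / (n * (n - 1))

def mrc_estimate(X, y, n_candidates=2000, seed=0):
    """Maximize the MRC objective over random unit vectors
    (an illustrative stand-in for a proper non-convex optimizer)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_candidates, X.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    scores = [mrc_objective(v, X, y) for v in V]
    return V[int(np.argmax(scores))]
```

Note that the objective depends on the link only through ranks, which is why the index can be estimated without knowing $f_k$.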

The authors establish, under mild regularity conditions on the distribution of $X$ and the noise, that the MRC estimator achieves $\ell_2$ error of order $\widetilde{O}(\sqrt{d/n})$ for $v_k$ with high probability. Subsequent LPE along the estimated index achieves a uniform sup-norm error rate of

$$\widetilde{O}\left( \left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}} \right)$$

for estimating the link, matching the minimax-optimal rate of one-dimensional nonparametric regression. The uniform bound is independent of $d$ provided $d \ll n^{\frac{2\beta-1}{2\beta+1}}$, demonstrating sharp dimension reduction. Notably, the argument handles the covariate distribution shifts introduced by index estimation error, a scenario not covered by classical regression theory.
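After projection, the LPE step reduces to one-dimensional local polynomial smoothing. A minimal local-linear version, with an Epanechnikov kernel (both simplifying choices for this sketch), looks like:

```python
import numpy as np

def local_linear(t0, t, y, h):
    """Local linear regression estimate of f(t0) from pairs (t_i, y_i),
    using an Epanechnikov kernel with bandwidth h."""
    u = (t - t0) / h
    w = np.maximum(0.0, 1.0 - u ** 2)               # kernel weights, zero for |u| > 1
    A = np.column_stack([np.ones_like(t), t - t0])  # local design: intercept + slope
    sw = np.sqrt(w)
    # Weighted least squares via sqrt-weighted ordinary least squares.
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[0]                                  # intercept = fitted value at t0
```

In the bandit setting, `t` would be the projections $\hat v_k^\top X_i$ onto the estimated index, which is exactly where the distribution-shift analysis is needed.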

Online Single-Index Bandit Algorithm

Building on the regression estimator, the paper proposes a batched elimination bandit algorithm that adaptively alternates between exploration and exploitation epochs. In each stage, data are used to fit arm-specific single-index regressors. The key innovation over existing methods is a pre-selection step that filters plausible arms via confidence bounds derived from the single-index regression, thereby avoiding explicit bin discretization in $d$ dimensions. After pre-selection, arms are further eliminated based on refined estimates.

Crucially, the batch scheduler and confidence intervals are designed to ensure that, at each epoch, the estimator's accuracy is sufficient in the relevant decision regions. The approach leverages regularity and margin assumptions to ensure that the region over which arms are plausible is well-behaved for local regression, facilitating error control.
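The pre-selection idea can be sketched in a few lines: at a given context, an arm survives if its upper confidence bound reaches the best lower confidence bound. In the actual algorithm the per-arm estimates and widths would come from the single-index regressors; here they are illustrative inputs:

```python
import numpy as np

def preselect(estimates, widths):
    """Keep arms whose UCB reaches the best LCB; the remaining arms are
    plausibly optimal at this context, up to the confidence level."""
    ucb = estimates + widths
    lcb = estimates - widths
    return [k for k in range(len(estimates)) if ucb[k] >= lcb.max()]
```

Repeating this each epoch with shrinking confidence widths yields a batched elimination schedule of the kind described above.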

Minimax Regret Analysis

The authors provide a comprehensive regret analysis matching upper and lower bounds:

  • Upper Bound: The proposed algorithm attains cumulative expected regret

$$\widetilde{O}\left( n^{1-\frac{(\alpha+1)\beta}{2\beta+1}} \right)$$

where $\alpha$ is a margin parameter describing the measure of near-tie points between arms. This rate is minimax-optimal up to logarithmic factors and is independent of $d$ (within the fixed-dimension regime), signifying complete circumvention of the curse of dimensionality.

  • Lower Bound: They establish a matching lower bound for the single-index model under margin and regularity conditions, confirming that the one-dimensional nonparametric regret rate is tight for this class.
  • Phase Transition: When $d$ increases with $n$, a two-phase minimax rate emerges. In the "nonparametric" phase (moderate $d$), regret matches the one-dimensional nonparametric rate. In the "parametric" phase (large $d$), it transitions to the linear bandit rate $n\left(\frac{d}{n}\right)^{\frac{1+\alpha}{2}}$. The boundary between phases is at $d \asymp n^{1/(2\beta+1)}$, reflecting the relative hardness of index vs. link learning.
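The two regret exponents agree at the stated boundary. Writing $d = n^{\gamma}$ (a parameterization chosen here purely for illustration), a quick consistency check:

```python
def nonparametric_exponent(beta, alpha):
    # Regret ~ n^{1 - (alpha+1) beta / (2 beta + 1)}: the dimension-free phase.
    return 1 - (alpha + 1) * beta / (2 * beta + 1)

def parametric_exponent(alpha, gamma):
    # Regret ~ n (d/n)^{(1+alpha)/2} with d = n^gamma: the large-d phase.
    return 1 - (1 + alpha) * (1 - gamma) / 2

beta, alpha = 2.0, 1.0
gamma_star = 1 / (2 * beta + 1)   # boundary d ~ n^{1/(2 beta + 1)}
lhs = nonparametric_exponent(beta, alpha)
rhs = parametric_exponent(alpha, gamma_star)
```

For these example values both exponents coincide at the boundary, so the minimax rate is continuous across the phase transition.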

Adaptivity and Smoothness Estimation

A major theoretical finding is an impossibility result: in the absence of further regularity, no bandit policy can be uniformly minimax-adaptive to unknown smoothness $\beta$. This contrasts with classical regression, where smoothness adaptation is often achievable. The authors prove lower bounds extending recent impossibility results in nonparametric bandits to the single-index setting.

However, under a self-similarity condition, which ensures that the function's bias cannot decay too rapidly with bandwidth, rate-optimal adaptation becomes feasible. The paper develops an undersmoothed estimator of $\beta$ using Lepski's method on the projected data, and demonstrates that plugging this estimate into the main bandit algorithm preserves the minimax regret rate (up to additional logarithmic factors).
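A bare-bones version of Lepski's rule on one-dimensional (projected) data, with a box kernel and a known noise level (both simplifications for illustration), might read:

```python
import numpy as np

def nw_estimate(t0, t, y, h):
    """Local-average (box-kernel Nadaraya-Watson) estimate at t0."""
    w = np.abs(t - t0) <= h
    return y[w].mean() if w.any() else 0.0

def lepski_bandwidth(t0, t, y, bandwidths, sigma):
    """Lepski's rule: take the largest bandwidth whose estimate stays within
    the combined confidence widths of every smaller-bandwidth estimate."""
    hs = sorted(bandwidths)
    ests, widths = [], []
    for h in hs:
        m = max(int((np.abs(t - t0) <= h).sum()), 1)   # local sample size
        ests.append(nw_estimate(t0, t, y, h))
        widths.append(sigma * np.sqrt(2 * np.log(len(t)) / m))
    chosen = hs[0]
    for j in range(len(hs)):
        if all(abs(ests[j] - ests[i]) <= widths[i] + widths[j] for i in range(j)):
            chosen = hs[j]
        else:
            break
    return chosen
```

The selected bandwidth maps back to a smoothness estimate through the usual calibration $h \asymp n^{-1/(2\beta+1)}$; the paper additionally undersmooths so that the bandit's confidence bounds remain valid.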

Practical Algorithms and Numerical Performance

The authors provide detailed algorithmic constructions amenable to practice, despite some non-convexity in MRC optimization. Empirical evaluation on synthetic examples demonstrates:

  • Systematic improvement in index estimation as data accumulates.
  • Significant regret reduction over standard $d$-dimensional nonparametric bandit algorithms, substantiating the benefit of effective dimension reduction.
  • Nearly minimax performance under smoothness adaptation.

Theoretical Implications and Connections

The results bridge bandit theory, single-index regression, and high-dimensional statistics:

  • Sufficient Dimension Reduction: The paper demonstrates, both statistically and algorithmically, the possibility of reducing high-dimensional contextual bandits to effectively one-dimensional nonparametric learning under a single-index reward model.
  • Phase Transitions: It rigorously characterizes where model complexity transitions from being dominated by link estimation to being dominated by index estimation, paralleling developments in minimax classification and regression with unknown subspaces.
  • Adaptivity and Impossibility: The work clarifies the interplay between incomplete feedback and structural assumptions, proving impossibility of adaptation in the absence of self-similarity and practical adaptivity under it.

Future Directions

Potential future research avenues include:

  • Extensions to Multiple-Index Models: Investigating whether analogous efficient dimension reduction and regret rates can be achieved for models with multiple indices.
  • Computational-Statistical Tradeoffs: Addressing computational bottlenecks in non-convex optimization, possibly via relaxations or stochastic optimization.
  • Generic Nonlinear Structures: Generalizing the analysis to other forms of low-dimensional nonlinear reward structures beyond single-index.
  • Instance-Dependent Regret: Further sharpening of bounds exploiting instance-dependent characteristics such as arm gap distributions and structural sparsity.

Conclusion

This work fundamentally advances the theory and methodology of nonparametric contextual bandits in high-dimensional settings by demonstrating that single-index structures can yield dimension-independent minimax regret. The paper contributes novel estimation theory for single-index regression under bandit sampling, provides tight lower bounds, reveals intricate adaptivity phenomena, and develops practical algorithms with strong empirical performance. Its insights are likely to impact both the statistical learning and online decision-making communities, particularly in applications where high-dimensional covariates exhibit intrinsic low-dimensional structures.
