
Nonparametric Clustering with Bandit Feedback

Updated 19 January 2026
  • The paper introduces a nonparametric clustering framework that uses bandit-style queries and kernel mean embeddings to partition arms without parametric assumptions.
  • It details algorithms like KABC and TaS-FW, which achieve instance-optimal sample complexity and adaptive performance under minimal noise assumptions.
  • The study extends these methods to practical applications such as adaptive crowdsourcing and dynamic recommendation systems with strong theoretical guarantees.

Nonparametric clustering with bandit feedback is concerned with partitioning a set of “arms” or items into groups without parametric restrictions on their underlying distributions, relying solely on sequential, bandit-style queries. This setting can handle arbitrary noise distributions and is critical for applications such as adaptive crowdsourcing, customer segmentation, and dynamic recommendation systems. Recent research advances demonstrate that both kernel-based methods (e.g. RKHS approaches) and distributional divergence frameworks yield instance-optimal algorithms that can adaptively identify underlying clusters, with theoretical guarantees under minimal assumptions (Thuot et al., 12 Jan 2026, Yavas et al., 2024, Thuot et al., 2024).

1. Formal Problem Description

Let there be $N$ arms, each arm $i$ associated with an unknown data-generating distribution $\nu_i$ on a space $\mathcal{X}$. The learner sequentially selects arms and observes samples $X_t \sim \nu_{A_t}$. The true partition $\mathcal{C}^*$ of arms into $K$ clusters is such that $\nu_i = \nu_j$ iff $i, j$ share a cluster. The learner's objective is to identify $\mathcal{C}^*$ with probability at least $1-\delta$ (the $\delta$-PAC criterion), while minimizing the expected total number of samples $\tau$.
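
The interaction protocol above can be mocked up in a few lines. This is an illustrative sketch only; all names (`BanditClusteringEnv`, `pull`) are hypothetical, not from the cited papers:

```python
import random

class BanditClusteringEnv:
    """Simulates the bandit clustering protocol: N arms, each with a
    hidden sampling distribution; equal distributions share a cluster."""

    def __init__(self, samplers, seed=0):
        self.samplers = samplers          # one sampling callable per arm
        self.rng = random.Random(seed)
        self.num_pulls = 0                # total budget consumed (tau)

    def pull(self, arm):
        """Query arm `arm` once and observe one sample X_t ~ nu_arm."""
        self.num_pulls += 1
        return self.samplers[arm](self.rng)

# Three arms: arms 0 and 1 share a distribution, arm 2 differs,
# so the true partition C* is {{0, 1}, {2}} with K = 2.
env = BanditClusteringEnv([
    lambda r: r.gauss(0.0, 1.0),
    lambda r: r.gauss(0.0, 1.0),
    lambda r: r.gauss(3.0, 1.0),
])
x = env.pull(0)
```

A learner interacts only through `pull`, which is what makes the feedback "bandit-style": distributional information must be accumulated one sample at a time.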

Nonparametric scope refers to algorithms operating without any a priori parametric assumption on $\nu_i$ (such as Gaussian or sub-Gaussian families), requiring tools capable of capturing distributional equality under arbitrary settings.

2. Kernel Mean Embedding and MMD in Nonparametric Bandit Clustering

A principal mechanism for distributional comparison in the nonparametric setting is the kernel mean embedding (KME):

  • Select a continuous positive-definite kernel $k:\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (e.g. Gaussian RBF).
  • For $\nu_i$, define its KME as $\mu_i = \mathbb{E}_{X\sim \nu_i}[\phi(X)] \in \mathcal{H}$, where $\phi(x) = k(x, \cdot)$ and $\mathcal{H}$ is the RKHS.
  • The maximum mean discrepancy (MMD) between $\nu_i$ and $\nu_j$ is $\mathrm{MMD}(\nu_i, \nu_j) = \|\mu_i - \mu_j\|_{\mathcal{H}}$.

Assuming $k$ is characteristic, clustering the $\nu_i$ by equality is equivalent to clustering the $\mu_i$ in $\mathcal{H}$ (Thuot et al., 12 Jan 2026). This allows the learner to sequentially estimate empirical KMEs via samples and cluster using MMD-based connectivity.
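
With samples in hand, the squared MMD has a simple plug-in estimator. A minimal sketch with a Gaussian RBF kernel on scalars, using the biased V-statistic (the unbiased U-statistic would drop the diagonal terms):

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * (x - y)^2) on scalars."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of MMD^2(nu_i, nu_j) =
    ||mu_i - mu_j||_H^2 from samples xs ~ nu_i and ys ~ nu_j."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

# Identical samples give exactly 0; well-separated samples give a large value.
same = mmd_squared([0.0, 0.1, -0.1], [0.0, 0.1, -0.1])
far  = mmd_squared([0.0, 0.1, -0.1], [5.0, 5.1, 4.9])
```

Because the RBF kernel is characteristic, the population MMD is zero iff the two distributions coincide, which is exactly the equality relation the clustering algorithms test.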

The sample complexity is governed by a signal-to-noise ratio
$$s_*^2 = \min_{i<j,\; \mu_i \neq \mu_j} \left\{ \frac{\Delta_{i,j}^2}{\mathcal{V}_i^* \vee \mathcal{V}_j^*},\quad \frac{2\Delta_{i,j}}{\sqrt{\bar{g}}} \right\}$$
where $\Delta_{i,j} = \|\mu_i - \mu_j\|_{\mathcal{H}}$ and $\mathcal{V}_i^* = \mathbb{E}_{X\sim \nu_i}\big[\|\phi(X) - \mu_i\|^2_{\mathcal{H}}\big]$.

3. General Divergence-Based Frameworks: KL and Adaptive Allocation

Another approach models clustering as a sequential composite hypothesis test over partitions. Each hypothesis $\sigma$ specifies clusters; constraints are expressed as equality of the empirical distributions over arms within a block. For finite-alphabet $\mathcal{X}$, define for any partition $\sigma$ and allocation $w \in \Sigma_K$
$$g_P^\sigma(w) = \sum_{m=1}^M \min_{Q \in \mathcal{P}(\mathcal{X})} \sum_{i\in A_m^\sigma} w_i D(P_i \| Q)$$
with $D(\cdot \| \cdot)$ the KL divergence. The sample complexity lower bound and optimal allocation depend on the hardest alternative clustering's KL distinguishability ("hardness"):
$$T^* = \max_{w \in \Sigma_K} \min_{\sigma' \neq \sigma_{\text{true}}} \sum_m G(P_{A_m^{\sigma'}}, w_{A_m^{\sigma'}})$$
(Yavas et al., 2024).
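
Since KL divergence is convex in its second argument, the inner minimizer $Q$ for each block is the $w$-weighted mixture of that block's distributions, so $g_P^\sigma(w)$ is computable in closed form on finite alphabets. A minimal sketch (the function names are illustrative):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between finite distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def g_sigma(P, blocks, w):
    """Compute g_P^sigma(w) = sum over blocks of min_Q sum_i w_i D(P_i || Q).
    The inner minimizer is the w-weighted mixture Q*(x) proportional to
    sum_i w_i P_i(x), which makes the minimum closed-form."""
    total = 0.0
    for block in blocks:
        wsum = sum(w[i] for i in block)
        if wsum == 0:
            continue
        q = [sum(w[i] * P[i][x] for i in block) / wsum
             for x in range(len(P[block[0]]))]
        total += sum(w[i] * kl(P[i], q) for i in block)
    return total

P = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]   # arm distributions on {0, 1}
w = [1/3, 1/3, 1/3]                        # uniform allocation
true_part  = g_sigma(P, [[0, 1], [2]], w)  # correct clustering: zero cost
wrong_part = g_sigma(P, [[0, 2], [1]], w)  # merging unlike arms: positive
```

The quantity is zero exactly for partitions consistent with the true distributional equalities, which is why minimizing over alternatives $\sigma' \neq \sigma_{\text{true}}$ measures how hard the instance is to distinguish.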

4. Algorithms: KABC and TaS-FW

Kernel Active Bandit Clustering (KABC)

KABC (Thuot et al., 12 Jan 2026) iteratively increases per-arm budgets, estimating empirical KMEs $(\hat{\mu}_i)$ and variances, and clusters arms by MMD thresholding:

  • At each round $k$, sample each arm $n_k$ times.
  • Build a graph connecting arms $i, j$ if the empirical $\widehat{\Delta}_{i,j}^2$ falls below a variance-dependent threshold.
  • Return connected components as clusters when $K$ components emerge.
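
The loop above can be sketched end to end. This is a heavily simplified stand-in: the doubling schedule and the threshold constant are placeholders, not the paper's calibrated, variance-dependent choices:

```python
import math, random
from itertools import combinations

def rbf(x, y):
    return math.exp(-((x - y) ** 2))

def mmd_sq(xs, ys):
    """V-statistic estimate of the squared MMD between two samples."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b) for a in xs for b in xs) / n**2
    kyy = sum(rbf(a, b) for a in ys for b in ys) / m**2
    kxy = sum(rbf(a, b) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2 * kxy

def kabc_sketch(samplers, K, delta=0.05, max_rounds=8, seed=0):
    """Doubling-budget loop in the spirit of KABC: sample every arm,
    connect arm pairs whose empirical MMD^2 falls below a threshold,
    and stop once exactly K connected components emerge."""
    rng = random.Random(seed)
    N = len(samplers)
    data = [[] for _ in range(N)]
    for r in range(max_rounds):
        n_k = 2 ** (r + 3)                      # doubling per-arm budget
        for i in range(N):
            data[i] += [samplers[i](rng) for _ in range(n_k)]
        thr = 4.0 * math.log(N / delta) / len(data[0])  # crude placeholder
        parent = list(range(N))                 # union-find over arms
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in combinations(range(N), 2):
            if mmd_sq(data[i], data[j]) < thr:
                parent[find(i)] = find(j)
        comps = {find(i) for i in range(N)}
        if len(comps) == K:
            return [sorted(j for j in range(N) if find(j) == c)
                    for c in sorted(comps)]
    return None

samplers = [lambda r: r.gauss(0.0, 1.0),
            lambda r: r.gauss(0.0, 1.0),
            lambda r: r.gauss(5.0, 1.0)]
result = kabc_sketch(samplers, K=2)   # expect the partition [[0, 1], [2]]
```

As the budget doubles, the threshold shrinks while the true gaps $\widehat{\Delta}_{i,j}^2$ concentrate, so the graph eventually splits into exactly the $K$ true components.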

KABC is $\delta$-PAC and satisfies
$$\tau = O\Big(\frac{N}{s_*^2} \ln \frac{N}{\delta}\Big)$$
and is adaptive: it does not require prior knowledge of $s_*^2$.

Track-and-Stop + Frank–Wolfe (TaS-FW)

TaS-FW (Yavas et al., 2024) maintains empirical allocations and empirical distributions, tracks the best current partition $\sigma$, and uses Frank–Wolfe iterations to approximate the optimal allocation (maximizing distinguishability). Arms are selected by chasing the optimal allocation vector, with forced exploration ensuring well-conditioned estimates. TaS-FW stops when the evidence statistic $Z(t)$ passes an adaptive threshold, and outputs the current guess. This approach achieves lower-bound optimality up to second-order corrections:
$$\mathbb{E}[\tau] \leq \frac{\ln(1/\delta)}{T^*} (1 + o(1))$$
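
The Frank–Wolfe subroutine itself is generic: maximize a smooth concave objective over the allocation simplex by repeatedly stepping toward the best vertex. The sketch below uses a toy separable objective rather than TaS-FW's actual max-min distinguishability, purely to show the mechanics; the step-size offset keeps iterates in the interior:

```python
def frank_wolfe_simplex(grad, dim, iters=200):
    """Maximize a smooth concave objective over the probability simplex
    by Frank-Wolfe: at each step move toward the vertex e_j with the
    largest partial derivative, with a diminishing step size."""
    w = [1.0 / dim] * dim                 # start at the uniform allocation
    for t in range(iters):
        g = grad(w)
        j = max(range(dim), key=lambda i: g[i])  # linear maximization oracle
        step = 2.0 / (t + 3.0)            # offset avoids collapsing to a vertex
        w = [(1.0 - step) * wi for wi in w]
        w[j] += step
    return w

# Toy concave objective f(w) = sum_i c_i * log(w_i);
# its maximizer over the simplex is w_i = c_i / sum(c) = (1/6, 2/6, 3/6).
c = [1.0, 2.0, 3.0]
w_opt = frank_wolfe_simplex(lambda w: [ci / wi for ci, wi in zip(c, w)], 3)
```

In TaS-FW the gradient oracle is taken with respect to the current empirical distributions, so the allocation being chased adapts as evidence accumulates.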

5. Extensions: Active Clustering, Feature Selection, Context-Dependence

Recent work investigates:

  • Clustering under feature-selection feedback: the learner selects an item and feature at each round, observes a noisy value, and leverages sequential halving to efficiently identify relevant features and partitions (Graf et al., 14 Mar 2025).
  • Active clustering with bandit feedback under unknown cluster numbers, higher dimensions, and sub-Gaussian (but not strictly parametric) noise. The ACB algorithm grows a representative set via repeated two-sample tests, then classifies remaining arms by estimated center proximity (Thuot et al., 2024).
  • Context-dependent clustering for recommendation tasks, wherein clusters arise as data-driven neighborhoods depending on context vectors, and updates are performed via recursive least squares on user feedback, yielding regret bounds scaling with the expected number of clusters rather than users (Gentile et al., 2016).
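
The sequential-halving primitive used for feature selection is simple enough to sketch directly. The instance below (one informative feature among noise) is hypothetical, and the budget split is a standard equal-per-round choice, not the specific calibration of Graf et al.:

```python
import math, random

def sequential_halving(pull, num_features, budget, seed=0):
    """Sequential halving: split the budget over ceil(log2(d)) rounds;
    in each round, sample every surviving feature equally, then keep
    the half with the largest empirical mean |signal|."""
    rng = random.Random(seed)
    alive = list(range(num_features))
    rounds = max(1, math.ceil(math.log2(num_features)))
    for _ in range(rounds):
        if len(alive) == 1:
            break
        per = max(1, budget // (rounds * len(alive)))
        means = {f: sum(pull(f, rng) for _ in range(per)) / per
                 for f in alive}
        alive.sort(key=lambda f: abs(means[f]), reverse=True)
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0]

# Toy instance: feature 3 carries signal 1.0, the other 7 are pure noise.
def noisy_value(f, rng):
    return (1.0 if f == 3 else 0.0) + rng.gauss(0.0, 0.5)

best = sequential_halving(noisy_value, num_features=8, budget=4000)
```

Because surviving features get a doubled per-feature budget each round, the informative feature's empirical mean concentrates faster than the halving can eliminate it.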

6. Bayesian Nonparametric Mixture Models and Thompson Sampling

Bayesian nonparametric approaches posit independent Dirichlet-process Gaussian mixtures for each arm’s distribution, allowing for unknown, potentially multimodal reward models (Urteaga et al., 2018).

  • Posterior inference is achieved via Gibbs sampling over mixture components and cluster assignments per arm.
  • Thompson sampling is performed by drawing from the posterior predictive mixtures.
  • Theoretical regret bounds match parametric rates up to log corrections, and empirical evaluation on both simulated and real data (including heavy-tailed, exponential, and clinical arms) demonstrates performance gains in both mean and variance of cumulative regret.
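
A Gibbs-sampled Dirichlet-process mixture posterior is too involved for a short sketch, so the example below strips the model down to a single conjugate Gaussian posterior per arm (unit noise, standard normal prior) just to illustrate the Thompson-sampling loop itself; in the paper's method, the posterior draw would come from the DP mixture instead:

```python
import random

def thompson_gaussian(reward_fns, horizon, seed=0):
    """Thompson sampling with a conjugate N(0, 1) prior on each arm's
    mean and unit observation noise: draw a mean from each posterior,
    play the argmax, then update with the observed reward."""
    rng = random.Random(seed)
    K = len(reward_fns)
    n = [0] * K          # pull counts per arm
    s = [0.0] * K        # reward sums per arm
    pulls = []
    for _ in range(horizon):
        # Posterior for arm i's mean is N(s_i / (n_i + 1), 1 / (n_i + 1)).
        draws = [rng.gauss(s[i] / (n[i] + 1), (1.0 / (n[i] + 1)) ** 0.5)
                 for i in range(K)]
        a = max(range(K), key=lambda i: draws[i])
        r = reward_fns[a](rng)
        n[a] += 1
        s[a] += r
        pulls.append(a)
    return pulls

# Two arms with means 0.2 and 0.8: sampling should concentrate on arm 1.
pulls = thompson_gaussian(
    [lambda r: r.gauss(0.2, 1.0), lambda r: r.gauss(0.8, 1.0)], horizon=500)
```

The nonparametric version keeps this loop unchanged; only the posterior being sampled is richer, which is what lets it track multimodal or heavy-tailed reward distributions.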

7. Analysis of Sample Complexity and Regret Rates

Across methods, sample complexity and regret depend on:

  • Distributional separation between true clusters (measured in MMD, KL, or mean gap).
  • Noise or within-cluster variance.
  • Cluster balance, number of arms, and desired error δ\delta.

Kernel-based methods attain an $O\big((N/s_*^2) \ln(N/\delta)\big)$ budget (Thuot et al., 12 Jan 2026). KL-divergence and cluster-minimax methods yield instance-adaptive allocation, with lower bounds matching leading terms (Yavas et al., 2024). Bayesian nonparametric mixture TS achieves $O(|A| \sqrt{T} (\log T)^\kappa)$ regret (Urteaga et al., 2018). Algorithms leveraging feature selection or active representative identification strictly outperform uniform sampling in high-dimensional regimes (Thuot et al., 2024; Graf et al., 14 Mar 2025). Context-dependent clustering replaces worst-case $O(\sqrt{nT})$ regret with $O(\sqrt{[m(X)]\,T})$ scaling when context-induced clusters are few (Gentile et al., 2016).

References Table

Approach                      | Key Principle               | Representative Paper
------------------------------|-----------------------------|----------------------------
Kernel mean embedding (MMD)   | RKHS metric, nonparametric  | (Thuot et al., 12 Jan 2026)
KL-based allocation (TaS-FW)  | Sequential partition test   | (Yavas et al., 2024)
Active feature selection      | Sequential halving          | (Graf et al., 14 Mar 2025)
Bayesian nonparametric mix    | Dirichlet process mixture   | (Urteaga et al., 2018)
Context-dependent clustering  | Neighborhood estimation     | (Gentile et al., 2016)
Active representatives (ACB)  | Two-sample cluster tests    | (Thuot et al., 2024)

A plausible implication is that nonparametric bandit clustering is no longer constrained by parametric distribution modeling, and instance-adaptive algorithms now provably match (up to logarithmic factors) the fundamental limits set by divergence and kernel-based inequalities. Extending these results to cluster overlap, unknown numbers of clusters, and heavy-tailed arms remains an active research direction.
