WIQL: Whittle Index-based Q-learning
- WIQL is a reinforcement learning algorithm that estimates Whittle indices for restless multi-armed bandits using a two-timescale stochastic approximation approach.
- It unifies fast Q-learning updates with slower index adjustments, supporting both tabular and deep function approximation for scalable, near-optimal policy implementation.
- Empirical studies show WIQL achieves rapid convergence and efficiency in diverse applications such as wireless scheduling, machine repair, and federated learning.
Whittle Index-based Q-learning (WIQL) is a class of reinforcement learning algorithms for learning Whittle indices in restless multi-armed bandit problems (RMABPs) via model-free, two-timescale stochastic approximation, unifying the computational power of Q-learning with the index policy structure of the Whittle heuristic. WIQL directly estimates Whittle indices online for each arm and state, allowing rapid and scalable implementation of near-optimal index policies in environments with unknown dynamics and rewards. Modern variants include both tabular and deep (function approximation) versions, with convergence guarantees and strong empirical performance in large-scale RMABPs (Relaño et al., 2024).
1. Restless Bandits and the Whittle Index Principle
A restless multi-armed bandit consists of N arms, each evolving as an MDP whether or not it is selected, with the agent allowed to activate at most M arms at each timestep. This PSPACE-hard control problem is rendered tractable by Whittle's Lagrangian relaxation, which decouples the joint activation constraint into separate single-arm problems via a per-time-step subsidy λ paid for passivity. For a given state s of an arm, the Whittle index W(s) is defined as the critical subsidy at which the arm is equally attractive to play or not to play:
W(s) = the value of λ at which Q_λ(s, 1) = Q_λ(s, 0),
where Q_λ(s, a) is the state-action value function of the single-arm problem parameterized by the subsidy λ. If an arm is indexable (i.e., the set of states in which the passive action is optimal grows monotonically in λ), the Whittle index W(s) exists, and the policy that activates the M arms with the highest indices W(s) is asymptotically optimal as N → ∞ (Relaño et al., 2024, Xiong et al., 2023).
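Because the index is the subsidy at which Q_λ(s, 1) = Q_λ(s, 0), it can be computed for a known model by binary search over λ, solving each subsidized single-arm problem by value iteration. A minimal sketch on a hypothetical two-state arm (the dynamics, rewards, and search bounds are illustrative, and indexability is assumed):

```python
import numpy as np

def q_values(P, R, lam, gamma=0.9, iters=500):
    """Solve the subsidy-lam single-arm problem by value iteration.
    P[a] is the transition matrix and R[a] the reward vector for
    action a (0 = passive, 1 = active); passivity earns the subsidy."""
    Q = np.zeros((R[0].shape[0], 2))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = np.stack([R[0] + lam + gamma * (P[0] @ V),
                      R[1] + gamma * (P[1] @ V)], axis=1)
    return Q

def whittle_index(P, R, s, lo=-5.0, hi=5.0, tol=1e-4):
    """Binary-search the critical subsidy at which state s is
    indifferent between the active and passive actions."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        Q = q_values(P, R, lam)
        if Q[s, 1] > Q[s, 0]:   # active still preferred: index is larger
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

# Hypothetical two-state arm: the active action pays the state's reward
# and tends to reset the arm; the passive action lets it recover.
P = [np.array([[0.5, 0.5], [0.1, 0.9]]),   # passive transitions
     np.array([[0.9, 0.1], [0.9, 0.1]])]   # active transitions
R = [np.zeros(2), np.array([0.0, 1.0])]
w0, w1 = whittle_index(P, R, 0), whittle_index(P, R, 1)
```

Here the active reward is higher in state 1, so its index exceeds that of state 0; WIQL's purpose is to recover such indices without knowing P and R.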
2. Two-Timescale WIQL Architecture and Algorithm
WIQL exploits the Whittle index indifference condition by maintaining, for each state s, (i) online Q-learning of Q_λ(s, a) and (ii) a slow-timescale update of λ(s) that drives the difference Q_λ(s, 1) − Q_λ(s, 0) to zero.
Generic tabular WIQL (QWI) (Relaño et al., 2024):
- Fast (Q-value) timescale: For the fixed current subsidy λ_n(s), update
Q_{n+1}(s, a) ← Q_n(s, a) + α_n [ r + λ_n(s)·1{a = 0} + γ max_{a'} Q_n(s', a') − Q_n(s, a) ]
for the observed transition (s, a, r, s'), with step-sizes α_n satisfying Σ_n α_n = ∞, Σ_n α_n² < ∞.
- Slow (Whittle-index) timescale: For the current Q_n, update
λ_{n+1}(s) ← λ_n(s) + β_n [ Q_n(s, 1) − Q_n(s, 0) ]
with Σ_n β_n = ∞, Σ_n β_n² < ∞, and β_n/α_n → 0, so the indices move on a slower timescale than the Q-values.
- Action selection: at each timestep, activate the M arms whose current states carry the largest index estimates λ_n(s), typically with ε-greedy exploration.
Empirically, WIQL requires orders of magnitude less storage than Q-learning on the joint state space and reduces the search space from exponential in the number of arms N to linear in N (Avrachenkov et al., 2020).
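The two-timescale scheme can be sketched for a single arm; the two-state environment, step-size exponents, and exploration rate below are illustrative choices that satisfy the stated conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state arm: active (a=1) earns the state's reward and
# tends to reset the arm; passive (a=0) lets it recover.
P = {0: np.array([[0.5, 0.5], [0.1, 0.9]]),   # passive transitions
     1: np.array([[0.9, 0.1], [0.9, 0.1]])}   # active transitions
R = {0: np.zeros(2), 1: np.array([0.0, 1.0])}

nS, gamma = 2, 0.9
Q = np.zeros((nS, 2))     # fast timescale: Q-values under current subsidies
lam = np.zeros(nS)        # slow timescale: per-state Whittle index estimates
visits = np.zeros((nS, 2))

s = 0
for n in range(1, 30001):
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = int(rng.choice(nS, p=P[a][s]))
    visits[s, a] += 1
    alpha = visits[s, a] ** -0.6          # fast step size
    beta = 1.0 / (1.0 + n ** 0.9)         # slower step size: beta/alpha -> 0
    # Fast timescale: Q-learning with the subsidy lam[s] paid for passivity.
    target = R[a][s] + (lam[s] if a == 0 else 0.0) + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    # Slow timescale: move lam[s] toward the indifference point Q(s,1)=Q(s,0).
    lam[s] += beta * (Q[s, 1] - Q[s, 0])
    s = s_next
```

With the active reward higher in state 1, the learned index estimates should order the states correctly (lam[1] > lam[0]); in the multi-arm setting, the same estimates feed a top-M index policy.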
3. Function Approximation and Deep WIQL
For large or continuous state spaces, WIQL is extended to deep neural function approximation. In QWINN (WIQL with Neural Networks) (Relaño et al., 2024), Q-values Q_θ(s, a) (parameter vector θ) are fitted using standard DQN minibatch SGD steps, while Whittle indices are represented by a separate "index network" λ_φ(s). On the slow timescale, φ is updated by minimizing the squared error between the neural output λ_φ(s) and the indifference-point estimate λ̂(s) extracted from the current Q-network:
φ ← φ − β_n ∇_φ ( λ_φ(s) − λ̂(s) )²,
with φ updated less frequently, or at a lower learning rate, to ensure two-timescale separation. All local minima of the Bellman error in QWINN are locally stable equilibria, a notable theoretical advance for DQN-based index schemes (Relaño et al., 2024).
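A minimal sketch of this slow-timescale objective, substituting a linear index model λ_φ(s) = φᵀψ(s) for the deep index network (the feature vector ψ(s), target value, and learning rate are all hypothetical):

```python
import numpy as np

def index_network_step(phi, psi_s, lam_hat, lr=0.01):
    """One SGD step on the squared error (lam_phi(s) - lam_hat)^2, where
    lam_phi(s) = phi @ psi_s stands in for QWINN's deep index network and
    lam_hat is the indifference-point estimate from the current Q-network."""
    grad = 2.0 * (phi @ psi_s - lam_hat) * psi_s   # gradient w.r.t. phi
    return phi - lr * grad

phi = np.zeros(2)
psi_s = np.array([1.0, 0.5])     # hypothetical state features
for _ in range(2000):
    phi = index_network_step(phi, psi_s, lam_hat=0.7)
```

Repeated steps drive λ_φ(s) toward the target λ̂(s); in QWINN these steps interleave with the (faster) DQN updates that move λ̂(s) itself.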
Convergence guarantees for deep WIQL have been established in recent work. In Neural-Q-Whittle, a two-layer ReLU network approximates the Q-function, and the Whittle index estimates are updated per iteration as λ_{n+1}(s) = λ_n(s) + β_n [ f(ψ(s, 1); θ_n) − f(ψ(s, 0); θ_n) ] (where f(·; θ) is the neural approximation and ψ(s, a) is a state-action embedding). Finite-time analysis yields a non-asymptotic error bound, decaying polynomially in the iteration count, for both the neural parameters and the index estimates, accounting for Markovian sampling and network approximation error (Xiong et al., 2023).
4. Convergence Guarantees and Rate Analysis
Convergence proofs for WIQL and its extensions rely on two-timescale stochastic approximation. Under standard assumptions (ergodicity, bounded rewards, Lipschitz transitions, sufficient exploration, and step-sizes α_n, β_n with Σα_n = Σβ_n = ∞, Σ(α_n² + β_n²) < ∞, and β_n/α_n → 0), tabular WIQL converges almost surely to the true Whittle indices and the corresponding optimal Q-values (Relaño et al., 2024, Avrachenkov et al., 2020). For neural and linear function approximation, convergence is to neighborhoods of the true indices (due to approximation error), with finite-time rates established for overparameterized networks (Xiong et al., 2023, Xiong et al., 2022).
In the average-reward setting, relative value iteration (RVI) Q-learning is used on the fast timescale, with a normalization term that pins down the value biases; convergence is established under the unichain assumption (Avrachenkov et al., 2020).
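A sketch of the average-reward fast timescale on a hypothetical two-state arm, using the reference-entry normalization f(Q) = Q(s_ref, a_ref) (one common choice; the arm's dynamics and the fixed subsidy below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-state arm (same structure as a machine that is reset
# by the active action and recovers under the passive one).
P = {0: np.array([[0.5, 0.5], [0.1, 0.9]]),
     1: np.array([[0.9, 0.1], [0.9, 0.1]])}
R = {0: np.zeros(2), 1: np.array([0.0, 1.0])}

Q = np.zeros((2, 2))
lam, alpha, ref = 0.2, 0.05, (0, 1)   # fixed subsidy; ref entry pins the biases
s = 0
for _ in range(20000):
    a = int(rng.integers(2))                  # uniform exploration
    r = R[a][s]
    s_next = int(rng.choice(2, p=P[a][s]))
    subsidy = lam if a == 0 else 0.0
    # RVI Q-learning: subtract the reference entry Q[ref] each step so the
    # value biases stay normalized in the average-reward setting.
    target = r + subsidy + Q[s_next].max() - Q[ref]
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next
```

Without the Q[ref] subtraction, the undiscounted targets would drift; with it, the iterates remain bounded and approximate the relative value biases.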
5. Variants, Acceleration, and Practical Implementation
Advanced Q-learning schemes can be integrated on the fast timescale for improved sample efficiency:
- Speedy Q-Learning (SQL): Maintains two past iterates to accelerate contraction (Kakarapalli et al., 2024).
- Generalized SQL (GSQL): Adds a relaxation parameter to further improve convergence constants.
- Phase Q-Learning (PhaseQL): Averages Bellman targets over simulated transitions; in synchronous settings it achieves the fastest empirical convergence of the three (Kakarapalli et al., 2024).
Exploration strategies include ε-greedy, softmax, ε-softmax, and UCB. UCB-based policies typically accelerate convergence and index estimation (Kakarapalli et al., 2024), especially for rarely visited state-action pairs. For efficient implementation, step-sizes must be chosen to respect the two-timescale separation (β_n/α_n → 0). In large state spaces, periodic random resets or state aggregation accelerate coverage (Mittal et al., 2024, Xiong et al., 2022).
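A minimal sketch of UCB-style action selection for the fast timescale (the bonus constant c and the visit-count floor are illustrative choices):

```python
import numpy as np

def ucb_action(Q, counts, s, t, c=2.0):
    """UCB-style selection: state-action pairs with few visits receive an
    exploration bonus, which speeds up Q-value and index estimation for
    rarely visited states. counts[s, a] is the visit count and t the
    global timestep."""
    bonus = c * np.sqrt(np.log(max(t, 2)) / np.maximum(counts[s], 1))
    return int(np.argmax(Q[s] + bonus))
```

When Q-values are close, the rarely tried action wins via its bonus; once an action's value estimate dominates the shrinking bonus, greedy selection takes over.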
Function approximation is crucial in high-dimensional problems: linear architectures (state aggregation or handcrafted features) yield practical WIQL variants with provable finite-time bounds (Xiong et al., 2022); deep neural architectures (QWINN, Neural-Q-Whittle) enable extrapolation across states, with empirical runtime dominated by the DQN-like updates and roughly 10% overhead for the index-network updates (Relaño et al., 2024, Xiong et al., 2023).
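The linear case can be sketched as semi-gradient Q-learning with features, with the subsidy paid for the passive action as in tabular WIQL (the single-state chain and its parameters below are illustrative):

```python
import numpy as np

def linear_q_update(w, feats, a, r, lam, feats_next, gamma=0.9, alpha=0.05):
    """Semi-gradient Q-learning with linear features, Q(s, a) = w[a] @ feats;
    the subsidy lam is paid for the passive action (a = 0)."""
    q_next = max(w[0] @ feats_next, w[1] @ feats_next)
    subsidy = lam if a == 0 else 0.0
    td = r + subsidy + gamma * q_next - w[a] @ feats
    w[a] = w[a] + alpha * td * feats
    return w

# Sanity check on a single-state chain: active pays reward 1, passive pays
# the subsidy 0.2, so the fixed point is Q(1) = 1/(1-0.9) = 10, Q(0) = 9.2.
w = [np.zeros(1), np.zeros(1)]
feats = np.array([1.0])
for k in range(5000):
    a = k % 2
    w = linear_q_update(w, feats, a, r=float(a), lam=0.2, feats_next=feats)
```

State aggregation is the special case where the features are indicator vectors over bins of states, recovering a small tabular problem.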
6. Empirical Performance and Applications
WIQL algorithms have been evaluated in various RMABP domains:
| Domain/Environment | WIQL Variant | Sample Efficiency | Performance (vs. oracle Whittle) |
|---|---|---|---|
| Machine repair, queueing | QWI/QWINN | – | Converges faster than DQN (Relaño et al., 2024) |
| Wireless sensor scheduling | Tabular WIQL | Within 2–3% of oracle reward | Up to 70% reduction in transmissions (Jonah et al., 3 Jan 2026) |
| Federated learning client selection | Tabular WIQL | Low index MSE | Within 3–7% of Whittle-optimal, 45% wallclock time reduction (Li et al., 17 Sep 2025) |
| Edge caching | Linear FA WIQL | Finite-time convergence guarantee | Within 1–2% of model-based Whittle (Xiong et al., 2022) |
Numerical studies confirm that WIQL and its deep variants outperform naive Q-learning by leveraging the structure of Whittle-index policies: they converge faster, require less memory, and maintain near-optimal rewards even in large-scale, heterogeneous, or resource-constrained regimes (Relaño et al., 2024, Xiong et al., 2023, Li et al., 17 Sep 2025, Jonah et al., 3 Jan 2026).
7. Limitations and Extensions
WIQL's theoretical guarantees crucially depend on arm-wise indexability. Extensions to non-indexable arms or arms with continuous state/action spaces require alternative index learning or actor-critic hybridizations (Relaño et al., 2024). For deep WIQL, convergence may be to local minima; good exploration coverage and buffer diversity are important. For rare or infrequently visited states, index estimation may be slow unless forced by explicit state resets or targeted exploration (Kakarapalli et al., 2024, Mittal et al., 2024). Empirical studies suggest that meta-learning across arm families, functional parameter sharing, and variance-reduction actor-critic approaches are promising for scaling further (Relaño et al., 2024).
Major extensions include:
- Incorporation of variance-reducing actor-critic components.
- Online adaptation to non-stationary or continuous dynamics.
- Large-scale parallelization across arms with shared or conditional index networks.
- Meta-learning warm-starts for families of similar arms.
WIQL thus provides a unified, scalable, and theoretically grounded framework for model-free learning of near-optimal index policies in RMABPs with broad applicability to resource allocation, wireless scheduling, smart sensing, and federated learning (Relaño et al., 2024, Xiong et al., 2023, Li et al., 17 Sep 2025, Jonah et al., 3 Jan 2026).