Q-Learning Slice Admission Control in 5G

Updated 16 January 2026

The paper introduces a Q-learning framework that formulates slice admission as a Markov decision process to optimize resource allocation.
It employs both tabular and deep Q-learning techniques to adaptively manage multi-dimensional resource constraints in dynamic network conditions.
Experimental analyses demonstrate that this method outperforms baseline strategies, achieving significantly higher utility and robust quality of service.

Q-learning-based slice admission control is an approach for optimally admitting network slices in 5G and beyond (B5G) wireless systems under dynamic resource constraints. Slice admission refers to the selective acceptance or rejection of service requests for different virtual network slices, subject to multidimensional resource constraints (e.g., spectrum, CPU, power) and service-level objectives such as throughput, latency, and priority. By framing admission control as a Markov decision process (MDP) and solving it with Q-learning—either in tabular or deep (function-approximate) variant—controllers learn policies that maximize long-term system utility or profit, adaptively allocating resources to stochastic demand while accounting for future resource release and uncertainties such as fading, spectrum sharing, and traffic bursts. Empirical studies consistently demonstrate that Q-learning outperforms myopic, greedy, or first-come-first-served baseline policies across diverse network scenarios (Shi et al., 2020, Bakhshi et al., 2021, Jacoby et al., 9 Jan 2026, Tao et al., 2023).

1. Fundamental Model Formulations

Q-learning-based slice admission is formalized by specifying the system's state, the possible actions at each decision epoch, the reward structure, and the objective function:

State Space ( $s_t$ ): The controller's observation encodes available resources and active or pending slice requests. Typical state vectors include:
- Free resource blocks $F(t)$ (e.g., frequency-time slices),
- Residual computational capacity $P^{c}(t)$ ,
- Transmit power budget $P^{T}(t)$ ,
- Active requests $A(t) = \{(w_{ij}, d_{ij}, p^{c}_{ij}, \tau_{ij}, \ell_{ij})\}$ , giving slice priority, minimum data rate, CPU demand, deadline, and remaining lifetime (Shi et al., 2020).
Action Space ( $a_t$ ): For each pending request, admission ( $x_{ij}(t) \in \{0,1\}$ ) and resource allocation are decided, subject to instantaneous constraints:

$\sum_{(i,j) \in A(t)} F_{ij} x_{ij}(t) \leq F(t), \quad \sum_{(i,j) \in A(t)} P^{c}_{ij} x_{ij}(t) \leq P^{c}(t), \quad \sum_{(i,j) \in A(t)} P^{T}_{ij} x_{ij}(t) \leq P^{T}(t)$

Reward Function ( $r_t$ ): Usually the weighted sum utility of admitted slices per slot,

$r_t = \sum_{(i,j) \in A(t)} w_{ij} x_{ij}(t)$

Profit-based formulations may subtract federation cost or penalties for missed deadlines as in multi-domain settings.

Objective: Maximization of cumulative discounted return:

$\max_{\pi} \mathbb{E} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right]$

or, for average profit, an undiscounted sum over episodes (Bakhshi et al., 2021).

This framework naturally extends to more complex architectures, such as semi-Markov models for queuing and digital-twin accelerated deep RL approaches (Tao et al., 2023). Slice requests, arrivals, and departures are typically modeled as Poisson processes; capacity constraints and priority weights are parameterized for each service class.

2. Q-Learning Algorithms and Extensions

Classical tabular Q-learning stores $Q(s,a)$ values for every state-action pair and iteratively applies the Bellman update:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t,a_t) ]$

Key algorithmic elements across representative studies include:

Exploration: $\epsilon$ -greedy with initial $\epsilon \approx 1$ , decaying to $\epsilon_{min}$ (e.g., $0.1$), promotes broad search then exploitation.
Resource Discretization: To control table size, resource spaces are quantized into manageable levels (e.g., $11$ frequency bands, $50$ CPU levels).
Episodes and Convergence: Per-episode step counts $N_{episodes} \sim 100$ –200 are standard; convergence is assessed via policy performance or Q-table stability.
Function Approximation: For large state/action grids, deep Q-learning (DQN) architectures, such as dueling DQN, are deployed. State value $V(s;\theta)$ and action advantage $A(s,a;\theta)$ branches efficiently separate state and action assessment (Tao et al., 2023).
Digital Twin Acceleration (DT): A supervised neural net policy $\pi_{DT}(a|s;\theta_{DT})$ is pre-trained to imitate baseline strategies, initializing DQN exploration around safe defaults and improving early training stability (Tao et al., 2023).

Proactive controllers, such as AWaRe-SAC (Jacoby et al., 9 Jan 2026), fuse Q-learning with auxiliary prediction modules to anticipate link capacity variations due to fading or interference. Predictive features and related penalty terms enter both state and Q-reward computation, better aligning admissions with future resource constraints.

3. Integration of Uncertainty and Prediction

Q-learning-based admission control is particularly adept in environments where resource capacity is time-varying or unpredictable. Advanced frameworks incorporate:

Predictive State Augmentation: State vectors extended with capacity forecasts (e.g., $c_f$ , expected “safe” slots) based on sequence-to-sequence RNN predictors (AttIRNN) processing received signal level (RSL) histories (Jacoby et al., 9 Jan 2026).
Penalty Mechanisms: Immediate reward is penalized if forecasted capacity risks exceed demand within $H$ -step horizons, deterring risky admissions.
Scenario Distributions: Future capacities are mapped into discrete modulation/coding levels with transition probabilities, enabling forward-looking admission strategies.
Online Adaptation: Arrival and departure event-driven admission supports quick response to spectrum blocking, rain-induced outages, or bursty incumbent interference.

Table: Revenue and QoS outcomes under capacity uncertainty (Jacoby et al., 9 Jan 2026)

Algorithm	Long-term Revenue	QoS-Violation Rate
Predictive Q-Learning	$2-3\times$ baseline	$<5\%$
Reactive Q-Learning	1× baseline	$20$– $30\%$
Greedy (static)	1× baseline	$20$– $30\%$

This demonstrates that predictive Q-learning closes $\sim 80\%$ of the performance gap to offline oracles with foresight and delivers superior QoS even under adverse weather or spectrum events.

4. Practical Implementation and Performance Analysis

Q-learning-based admission controllers have been validated on real and synthetic network slices under high-load, multi-domain, spectrum sharing, and capacity uncertainty scenarios.

Resource Allocation Efficiency: Q-learning consistently achieves $24$– $36\%$ higher utility relative to myopic, first-come-first-served, and random policies ((Shi et al., 2020), Table 1).
Scalability: Controllers maintain sublinear utility growth with increase in user equipment (UE) count, sustaining near-optimal performance under high demand (Shi et al., 2020).
Convergence and Robustness: Tabular and deep Q-learning approach optimality within $3$– $5\%$ gap but require careful tuning of discount factors and exploration rates. R-learning (average-reward RL) outperforms Q-learning in terms of optimality gap robustness (Bakhshi et al., 2021).
Digital Twin Enhancement: DT-assisted DQN converges to $40\%$ higher resource utilization compared to direct DQN, with halved convergence time (Tao et al., 2023).
Computational Overhead: Implementation remains real-time feasible with resource quantization and efficient NN architectures; per-decision updates are $O(|A|)$ , Q-table entries remain $<10^6$ for reasonable quantization.

Table: Comparative utility for three UEs ($1000$ slots) (Shi et al., 2020)

Algorithm	Utility	Utility Gain vs Baseline
Q-Learning	1807	—
Myopic	1456	+24.1%
FCFS	1416	+27.6%
Random	1334	+35.5%

5. Strengths, Limitations, and Tuning Guidelines

Principal strengths:

Model-free RL enables learning in presence of unknown arrival/departure rates and undisclosed channel statistics (Bakhshi et al., 2021).
Explicit future planning via discount factor $\gamma$ avoids myopic behavior and leverages scheduled resource release.

Limitations:

Convergence speeds and optimality gaps depend critically on choice of $\gamma$ ; too small values risk greedy bias, too large slow down learning.
Table-based Q-learning scales poorly with system dimension, requiring function approximation (deep RL) for high-dimensional multi-slice networks.

Tuning guidelines:

Initial $\alpha$ , $\epsilon$ high (e.g., $0.9$) with exponential decay to stabilize learning and guarantee near-optimal exploration.
$\gamma$ in $[0.7, 0.95]$ balances immediate vs future reward, to be validated via dynamic programming benchmarks.
For very large or continuous state spaces, deep Q-learning (dueling DQN, prioritized replay) or digital twin pretraining is recommended.

Potential improvements include hybrid schemes combining table lookup for low-traffic states and deep RL for dense resource slices, hierarchical RL for decomposing admission and routing decisions, and predictive RL using external forecasts of capacity and demand.

6. Applications and Future Research Directions

Q-learning-based slice admission is deployed across several 5G/B5G domains:

RAN Slicing: Maximizing utility and meeting priority requirements amid dynamic user arrivals (Shi et al., 2020).
Multi-Domain Federation: Joint local/federated resource allocation for service orchestrators, optimizing cross-domain profit (Bakhshi et al., 2021).
mmWave backhaul and transport: Admission control under link attenuation and forecast-driven QoS, leveraging RNN-based predictors (Jacoby et al., 9 Jan 2026).
Online Sliced Networking: Resource utilization acceleration via digital twin-enhanced DRL, supporting rapid convergence and robust deployment (Tao et al., 2023).

Open challenges span scalable RL architectures for massive slice heterogeneity, integration of continual learning under time-varying traffic and resource landscapes, decentralized multi-agent RL for federated slice control, and the formal analysis of policy optimality under operational constraints and uncertainties.

This body of work collectively establishes Q-learning as a premier methodology for slice admission control, delivering scalable, long-horizon-oriented, and robust admission policies adaptable to diverse wireless and networking environments.