Hierarchical RL in O-RAN Networks
- The paper demonstrates how hierarchical RL decomposes complex network optimization into nested MDPs across non-RT and near-RT controllers to enhance scalability.
- It employs meta-learning and federated DRL to accelerate adaptation, reduce control overhead, and improve real-time network performance.
- LLM-augmented guidance is integrated to refine scheduling and resource allocation, achieving measurable gains in throughput and latency.
Hierarchical reinforcement learning (HRL) has emerged as a principal methodology for scalable and adaptive control within the Open Radio Access Network (O-RAN) framework. O-RAN disaggregates radio access network functions and introduces multi-layer intelligent control through RAN Intelligent Controller (RIC) modules at distinct time scales and abstraction levels. HRL leverages this architectural hierarchy by decomposing complex network optimization (e.g., slicing, traffic steering, resource allocation) into nested Markov Decision Processes (MDPs), each solved by specialized RL agents aligned with the O-RAN split between non-real-time (non-RT, >1s) and near-real-time (near-RT, 10ms–1s) controllers. Recent research extends traditional HRL by integrating meta-learning, federated DRL, and auxiliary LLM guidance to significantly improve real-time adaptability, robustness, and network scalability in O-RAN environments.
1. O-RAN and the Motivation for Hierarchical RL
O-RAN's open architecture comprises the non-RT RIC for global policy, intent translation, and cross-domain orchestration, and the near-RT RIC for fast, local resource allocation and control. This natural separation of time scales and control domains forms the basis for HRL adoption. The non-RT RIC typically serves as the high-level agent ("manager" or "meta-controller"), setting global slice shares, intent goals, or strategic guidance based on aggregated KPIs and operator intent. The near-RT RIC implements low-level "worker" agents—often as xApps—responsible for scheduling, power allocation, traffic steering, or RAN slicing at DU or cell granularity (Lotfi et al., 8 Dec 2025, Habib et al., 2023, Tsampazi et al., 2023, Rezazadeh et al., 2022, Habib et al., 2024, Bao et al., 25 Apr 2025).
Significant motivations for HRL in O-RAN include:
- Decomposition reduces MDP dimensionality and enables scalable learning across hundreds of slices and thousands of UEs (Lotfi et al., 8 Dec 2025, Rezazadeh et al., 2022).
- Explicit time-scale separation aligns RL decision periods with O-RAN control loop latencies (Bao et al., 25 Apr 2025).
- Hierarchy mitigates policy conflicts and enables distinct objectives and reward signals for global versus local optimization (Tsampazi et al., 2023, Habib et al., 2023).
2. Canonical Hierarchical RL Architectures in O-RAN
O-RAN HRL is typically deployed as a two-level agent structure, tightly mapping to the RIC hierarchy.
2.1 Two-Level Agent Structure
| Level | Deployment | Agent Type | Typical Decision | Decision Interval |
|---|---|---|---|---|
| High (Meta) | Non-RT RIC (rApp/xApp) | DQN, PPO, MAML, LLM | Intent, slicing, guidance | Seconds–minutes |
| Low (Controller) | Near-RT RIC (xApp) | DDPG, DQN, PPO | Scheduling, allocation, steering | ms–seconds |
- High-Level Controller (Meta/Manager): Operating at the non-RT RIC, this agent allocates resource budgets (e.g., physical resource blocks, power splits), selects slice-level KPI targets, or generates strategic guidance for subordinate agents. States typically comprise slice-level QoS metrics, traffic statistics, and previous allocations (Lotfi et al., 8 Dec 2025, Habib et al., 2024, Bao et al., 25 Apr 2025), while actions encode resource splits, KPI goals, or orchestration directives.
- Low-Level Controllers (Workers): Multiple near-RT agents, often one per slice per DU or cell, perform fine-grained RB scheduling, RAT selection, or traffic steering. Their states include per-UE or cell-local QoS, load metrics, and meta-level guidance. Actions span RB assignments, scheduler profile selection, or flow steering (Lotfi et al., 8 Dec 2025, Tsampazi et al., 2023, Habib et al., 2024).
2.2 MDP Formulations
Each level solves its own MDP:
- High-Level: state = slice-level KPIs and user counts; action = RB/power split vector (Lotfi et al., 8 Dec 2025).
- Low-Level: state = per-UE QoS; action = RB assignments (Lotfi et al., 8 Dec 2025).
Hierarchical reward assignment reflects the decoupling: meta-agents receive extrinsic rewards aggregated over worker performance; workers optimize intrinsic, per-step objectives (Habib et al., 2023, Habib et al., 2024).
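The two-level structure above can be illustrated with a minimal control-loop sketch. The policies here are trivial stand-ins for the DQN/PPO (meta) and DDPG/DQN (worker) agents, and all sizes, periods, and function names are illustrative assumptions rather than values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SLICES, RB_TOTAL = 3, 50   # illustrative sizes (hypothetical)
META_PERIOD = 100            # fast steps per slow (non-RT) decision

def meta_policy(slice_kpis):
    """High-level agent: map slice-level KPIs to an RB split across slices
    (stand-in for a trained DQN/PPO meta-controller)."""
    demand = np.maximum(slice_kpis, 1e-6)
    return np.floor(RB_TOTAL * demand / demand.sum()).astype(int)

def worker_policy(rb_budget, ue_queues):
    """Low-level agent: assign the slice's RB budget to its UEs
    (stand-in for a trained DDPG/DQN worker xApp)."""
    order = np.argsort(-ue_queues)           # serve largest queues first
    grant = np.zeros_like(ue_queues)
    for ue in order[:rb_budget]:
        grant[ue] = 1
    return grant

slice_kpis = rng.uniform(0.2, 1.0, N_SLICES)
for t in range(300):
    if t % META_PERIOD == 0:                 # slow loop: non-RT RIC (seconds-minutes)
        rb_split = meta_policy(slice_kpis)
    for s in range(N_SLICES):                # fast loop: near-RT RIC xApps (ms-seconds)
        ue_queues = rng.poisson(5, size=8).astype(float)
        worker_policy(rb_split[s], ue_queues)
```

The key structural point is the nesting: the slow loop fixes a budget that constrains many iterations of the fast loop, mirroring the non-RT/near-RT split.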
3. Algorithmic Advances: Meta-Learning, Federated DRL, and LLM-Augmented Hierarchy
3.1 Meta-Hierarchical RL
Meta-HRL in O-RAN further accelerates adaptation by sharing transferable meta-knowledge among DU-local HRL agents using a MAML-style objective (Lotfi et al., 8 Dec 2025):
- Each two-level DU HRL instance is treated as a meta-task with its own policy parameters.
- The meta-model aggregates per-task meta-gradients with weights proportional to each task's TD-error variance, prioritizing harder tasks.
This variance-weighted aggregation yields sublinear convergence and bounded regret for the two-level process (Lotfi et al., 8 Dec 2025).
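A minimal numpy sketch of a variance-weighted outer update; the exact objective, learning rates, and weighting scheme in the cited work may differ, so treat this as an illustrative MAML-style template:

```python
import numpy as np

def meta_update(theta, task_grads, td_errors, beta=0.1):
    """MAML-style outer update: aggregate per-task meta-gradients with
    weights proportional to each task's TD-error variance, so harder
    (higher-variance) tasks contribute more. Illustrative sketch only."""
    variances = np.array([np.var(d) for d in td_errors])
    w = variances / variances.sum()          # normalized priority weights
    agg = sum(wi * g for wi, g in zip(w, task_grads))
    return theta - beta * agg, w
```

In a full implementation, `task_grads` would be the gradients of each DU-local HRL instance's adapted loss after one or more inner-loop steps.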
3.2 Federated Hierarchical RL
The federated HRL approach distributes DRL agents per slice and per BS (local DRL agents), aggregating models in the non-RT RIC via traffic-aware clustering (DTW + DBSCAN) and weighted averaging to derive specialized federated policies (Rezazadeh et al., 2022). This reduces control-plane overhead by more than 85%, accelerates convergence (roughly 25% fewer episodes), and keeps latency-violation rates below 1% for URLLC under heavy load.
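The two building blocks of this aggregation, a DTW distance over traffic traces and per-cluster weighted parameter averaging, can be sketched as follows. The DBSCAN step (run over the pairwise DTW distance matrix) is omitted, and the function names and weighting choice are illustrative assumptions:

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between
    two traffic traces; used to cluster BSs with similar load patterns."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def federated_average(models, weights):
    """FedAvg-style weighted parameter averaging within one traffic cluster
    (weights could be, e.g., per-agent traffic volume -- an assumption here)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return sum(wi * m for wi, m in zip(w, models))
```

Clustering before averaging is what makes the federated policies "specialized": only agents facing similar traffic dynamics share parameters.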
3.3 LLM-Augmented Hierarchical RL
LLM-hRIC injects an LLM-empowered non-RT RIC as a strategic guidance generator for near-RT RL agents (Bao et al., 25 Apr 2025). The LLM, e.g., Llama-3.1-8B-Instruct, outputs a per-SBS guidance vector based on global KPIs and network summaries. Near-RT RL agents (DDPG) blend this guidance with the local policy via scheduled mixing:
- Initial phase: actions follow the LLM guidance (mixing weight λ = 1).
- Blending phase: actions are a convex combination of the guidance and the local DDPG action, with λ annealed toward zero.
- Final phase: the local DDPG policy acts alone (λ = 0).
LLM-hRIC achieves 2× faster convergence and 10% higher throughput compared to pure RL under diverse traffic and partitioning (Bao et al., 25 Apr 2025).
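One way to realize the scheduled mixing is a piecewise-linear guidance weight; the phase lengths and the linear decay below are assumptions for illustration, not parameters reported in the paper:

```python
import numpy as np

def mixing_weight(step, warmup=1000, anneal=5000):
    """Guidance weight schedule: 1.0 during warm-up (follow the LLM),
    linear decay during blending, 0.0 afterwards (pure local policy).
    Phase lengths are hypothetical."""
    if step < warmup:
        return 1.0
    if step < warmup + anneal:
        return 1.0 - (step - warmup) / anneal
    return 0.0

def blended_action(a_llm, a_ddpg, step):
    """Convex combination of LLM guidance and local DDPG action."""
    lam = mixing_weight(step)
    return lam * np.asarray(a_llm) + (1.0 - lam) * np.asarray(a_ddpg)
```

Early training thus leans on global LLM context while the local agent is untrained, then hands control over as the DDPG policy matures.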
4. Practical Realizations and Implementation
4.1 Prototyped xApps and System Integration
Hierarchical RL controllers are realized as dockerized xApps on O-RAN-compliant emulation testbeds (e.g., Colosseum) (Tsampazi et al., 2023). High-level slicing xApps set long-term PRB shares or intent, while low-level scheduling xApps assign profiles (RR/PF/WF) per slice. By enforcing disjoint control dimensions (slice shares vs. schedulers) and respecting O-RAN RIC interface timescales (A1 for guidance, E2SM-KPM/RC for metrics and control), hierarchical deployment avoids action conflicts and meets real-time requirements.
4.2 Best Practices and Lessons
- Reward normalization by KPI dynamic range prevents unfair bias toward slices with larger-scale KPIs (Tsampazi et al., 2023).
- Timescale separation (slow meta/fast controller) is essential for stability and adaptation to sudden load changes.
- Autoencoder-based state compression enables tractable MDPs for large-scale testbeds (Tsampazi et al., 2023).
- Embedding both slicing and scheduling control in a single flat xApp can reach higher single-slice extremes but starves competing slices; the hierarchical split yields Pareto-efficient resource use (Tsampazi et al., 2023).
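The reward-normalization practice in the first bullet can be sketched as per-KPI min-max scaling; the operating bounds `kpi_min`/`kpi_max` are operator-chosen assumptions, not values from the cited testbed:

```python
import numpy as np

def normalized_reward(kpis, kpi_min, kpi_max):
    """Scale each KPI by its dynamic range so large-magnitude KPIs
    (e.g., eMBB throughput in Mbps) don't dominate small ones
    (e.g., URLLC latency in ms) in a summed reward."""
    kpis = np.asarray(kpis, dtype=float)
    span = np.asarray(kpi_max, dtype=float) - np.asarray(kpi_min, dtype=float)
    return np.clip((kpis - kpi_min) / np.where(span > 0, span, 1.0), 0.0, 1.0)
```

A summed reward over such normalized terms weights each slice's KPI equally regardless of its native units.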
5. Quantitative Performance and Scalability
Hierarchical RL frameworks deliver consistent performance enhancements across throughput, latency, fairness, and robustness metrics:
| Method & Metric | Improvement | Source |
|---|---|---|
| Meta-HRL efficiency | +19.8% over DDPG baseline; +4% over uniform MAML-HRL | (Lotfi et al., 8 Dec 2025) |
| Adaptation speed | −40% fewer adaptation shots | (Lotfi et al., 8 Dec 2025) |
| Fairness (Jain's index) | 0.91 → 0.96 | (Lotfi et al., 8 Dec 2025) |
| Throughput (h-DQN traffic steer) | 52.1 Mbps (+15.5% vs. heuristic, +6.5% vs. flat DQN) | (Habib et al., 2024) |
| Delay (h-DQN traffic steer) | 20.4 ms (−28% vs. heuristic, −59% vs. flat DQN) | (Habib et al., 2024) |
| Energy efficiency | +37.9% (HRL vs. non-ML cell-sleep) | (Habib et al., 2023) |
| Robustness to traffic surge | <5% performance drop (Meta-HRL, 50% traffic increase) | (Lotfi et al., 8 Dec 2025) |
| Scalability | <2% normalized reward loss, meta-update latency < 10 ms | (Lotfi et al., 8 Dec 2025) |
Ablation studies confirm the importance of adaptive meta-weighting (weights derived from TD-error variance), which improves normalized reward and reduces the number of adaptation shots required (Lotfi et al., 8 Dec 2025).
6. Theoretical Guarantees
HRL methods achieve sublinear convergence rates and bounded regret through theoretical analysis of multi-timescale actor-critic updates, provided prerequisites such as L-smoothness and bounded variance of gradient estimates hold (Lotfi et al., 8 Dec 2025). The hierarchical structure decouples exploration, stabilizes credit assignment, and yields tractable regret bounds.
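As a generic template (the precise constants, step sizes, and assumptions in the cited analysis may differ), such sublinear-regret statements typically take the form:

```latex
% Generic sublinear-regret template under L-smoothness and
% bounded-variance gradient estimates (illustrative, not the paper's exact bound)
\mathbb{E}\big[\mathrm{Regret}(T)\big]
  \;=\; \sum_{t=1}^{T} \Big( J(\pi^{*}) - \mathbb{E}\big[J(\pi_t)\big] \Big)
  \;=\; \mathcal{O}\big(\sqrt{T}\big)
```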
Moreover, federated clustering and adaptive meta-learning ensure that distributed HRL converges efficiently across heterogeneous and non-stationary traffic scenarios (Rezazadeh et al., 2022, Lotfi et al., 8 Dec 2025).
7. Limitations and Research Challenges
Several open challenges remain:
- LLM-guided frameworks incur inference latency (>100 ms), limiting ultra-fast adaptation (Bao et al., 25 Apr 2025).
- Goal abstraction and hierarchy design remain partially manual; automated discovery (e.g., options learning) is an open research area (Habib et al., 2024).
- Real-time, distributed meta-RL and federated strategies require further study for deployment at full O-RAN scale and in presence of domain shifts or adversarial behavior (Rezazadeh et al., 2022).
- Domain-specific LLM tuning, multi-modal prompt engineering, and efficient RL-LLM co-training protocols are needed to fully exploit global context while respecting latency constraints (Bao et al., 25 Apr 2025).
In summary, hierarchical RL—augmented by meta-learning, federated optimization, and cross-domain LLM guidance—enables scalable, robust, and adaptive resource management for O-RAN. These frameworks map naturally to O-RAN's multi-layer RIC architecture, provide substantial gains in efficiency, latency, and fairness, and establish theoretical and empirical benchmarks for real-time intelligent control in open, multi-slice wireless networks (Lotfi et al., 8 Dec 2025, Rezazadeh et al., 2022, Tsampazi et al., 2023, Bao et al., 25 Apr 2025, Habib et al., 2023, Habib et al., 2024).