Hierarchical RL in O-RAN Networks
- The paper demonstrates how hierarchical RL decomposes complex network optimization into nested MDPs across non-RT and near-RT controllers to enhance scalability.
- It employs meta-learning and federated DRL to accelerate adaptation, reduce control overhead, and improve real-time network performance.
- LLM-augmented guidance is integrated to refine scheduling and resource allocation, achieving measurable gains in throughput and latency.
Hierarchical reinforcement learning (HRL) has emerged as a principal methodology for scalable and adaptive control within the Open Radio Access Network (O-RAN) framework. O-RAN disaggregates radio access network functions and introduces multi-layer intelligent control through RAN Intelligent Controller (RIC) modules at distinct time scales and abstraction levels. HRL leverages this architectural hierarchy by decomposing complex network optimization (e.g., slicing, traffic steering, resource allocation) into nested Markov Decision Processes (MDPs), each solved by specialized RL agents aligned with the O-RAN split between non-real-time (non-RT, >1s) and near-real-time (near-RT, 10ms–1s) controllers. Recent research extends traditional HRL by integrating meta-learning, federated DRL, and auxiliary LLM guidance to significantly improve real-time adaptability, robustness, and network scalability in O-RAN environments.
1. O-RAN and the Motivation for Hierarchical RL
O-RAN's open architecture comprises the non-RT RIC for global policy, intent translation, and cross-domain orchestration, and the near-RT RIC for fast, local resource allocation and control. This natural separation of time scales and control domains forms the basis for HRL adoption. The non-RT RIC typically serves as the high-level agent ("manager" or "meta-controller"), setting global slice shares, intent goals, or strategic guidance based on aggregated KPIs and operator intent. The near-RT RIC implements low-level "worker" agents—often as xApps—responsible for scheduling, power allocation, traffic steering, or RAN slicing at DU or cell granularity (Lotfi et al., 8 Dec 2025, Habib et al., 2023, Tsampazi et al., 2023, Rezazadeh et al., 2022, Habib et al., 2024, Bao et al., 25 Apr 2025).
Significant motivations for HRL in O-RAN include:
- Decomposition reduces MDP dimensionality and enables scalable learning across hundreds of slices and thousands of UEs (Lotfi et al., 8 Dec 2025, Rezazadeh et al., 2022).
- Explicit time-scale separation aligns RL decision periods with O-RAN control loop latencies (Bao et al., 25 Apr 2025).
- Hierarchy mitigates policy conflicts and enables distinct objectives and reward signals for global versus local optimization (Tsampazi et al., 2023, Habib et al., 2023).
2. Canonical Hierarchical RL Architectures in O-RAN
O-RAN HRL is typically deployed as a two-level agent structure, tightly mapping to the RIC hierarchy.
2.1 Two-Level Agent Structure
| Level | Deployment | Agent Type | Typical Decision | Decision Interval |
|---|---|---|---|---|
| High (Meta) | Non-RT RIC (rApp/xApp) | DQN, PPO, MAML, LLM | Intent, slicing, guidance | Seconds–minutes |
| Low (Controller) | Near-RT RIC (xApp) | DDPG, DQN, PPO | Scheduling, allocation, steering | ms–seconds |
- High-Level Controller (Meta/Manager): Operating at the non-RT RIC, this agent allocates resource budgets (e.g., physical resource blocks, power splits), selects slice-level KPI targets, or generates strategic guidance for subordinate agents. States typically comprise slice-level QoS metrics, traffic statistics, and previous allocations (Lotfi et al., 8 Dec 2025, Habib et al., 2024, Bao et al., 25 Apr 2025), while actions encode resource splits, KPI goals, or orchestration directives.
- Low-Level Controllers (Workers): Multiple near-RT agents, often one per slice per DU or cell, perform fine-grained RB scheduling, RAT selection, or traffic steering. Their states include per-UE or cell-local QoS, load metrics, and meta-level guidance. Actions span RB assignments, scheduler profile selection, or flow steering (Lotfi et al., 8 Dec 2025, Tsampazi et al., 2023, Habib et al., 2024).
2.2 MDP Formulations
Each level solves its own MDP:
- High-Level: state = slice-level KPIs and user counts; action = RB/power split vector (Lotfi et al., 8 Dec 2025).
- Low-Level: state = per-UE QoS; action = RB assignments (Lotfi et al., 8 Dec 2025).
Hierarchical reward assignment reflects the decoupling: meta-agents receive extrinsic rewards aggregated over worker performance; workers optimize intrinsic, per-step objectives (Habib et al., 2023, Habib et al., 2024).
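The two-level structure above can be illustrated with a minimal control-loop sketch. The policies here are trivial stand-ins for the DQN/PPO (meta) and DDPG/DQN (worker) agents, and all sizes, periods, and function names are illustrative assumptions rather than values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SLICES, RB_TOTAL = 3, 50   # illustrative sizes (hypothetical)
META_PERIOD = 100            # fast steps per slow (non-RT) decision

def meta_policy(slice_kpis):
    """High-level agent: map slice-level KPIs to an RB split across slices
    (stand-in for a trained DQN/PPO meta-controller)."""
    demand = np.maximum(slice_kpis, 1e-6)
    return np.floor(RB_TOTAL * demand / demand.sum()).astype(int)

def worker_policy(rb_budget, ue_queues):
    """Low-level agent: assign the slice's RB budget to its UEs
    (stand-in for a trained DDPG/DQN worker xApp)."""
    order = np.argsort(-ue_queues)           # serve largest queues first
    grant = np.zeros_like(ue_queues)
    for ue in order[:rb_budget]:
        grant[ue] = 1
    return grant

slice_kpis = rng.uniform(0.2, 1.0, N_SLICES)
for t in range(300):
    if t % META_PERIOD == 0:                 # slow loop: non-RT RIC (seconds-minutes)
        rb_split = meta_policy(slice_kpis)
    for s in range(N_SLICES):                # fast loop: near-RT RIC xApps (ms-seconds)
        ue_queues = rng.poisson(5, size=8).astype(float)
        worker_policy(rb_split[s], ue_queues)
```

The key structural point is the nesting: the slow loop fixes a budget that constrains many iterations of the fast loop, mirroring the non-RT/near-RT split.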
3. Algorithmic Advances: Meta-Learning, Federated DRL, and LLM-Augmented Hierarchy
3.1 Meta-Hierarchical RL
Meta-HRL in O-RAN further accelerates adaptation by sharing transferable meta-knowledge among DU-local HRL agents using a MAML-style objective (Lotfi et al., 8 Dec 2025):
- Each two-level DU HRL instance is treated as a meta-task with its own policy parameters.
- The meta-model aggregates per-task meta-gradients with weights proportional to each task's TD-error variance, prioritizing harder tasks.
This variance-weighted aggregation yields sublinear convergence and bounded regret for the two-level process (Lotfi et al., 8 Dec 2025).
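A minimal numpy sketch of a variance-weighted outer update; the exact objective, learning rates, and weighting scheme in the cited work may differ, so treat this as an illustrative MAML-style template:

```python
import numpy as np

def meta_update(theta, task_grads, td_errors, beta=0.1):
    """MAML-style outer update: aggregate per-task meta-gradients with
    weights proportional to each task's TD-error variance, so harder
    (higher-variance) tasks contribute more. Illustrative sketch only."""
    variances = np.array([np.var(d) for d in td_errors])
    w = variances / variances.sum()          # normalized priority weights
    agg = sum(wi * g for wi, g in zip(w, task_grads))
    return theta - beta * agg, w
```

In a full implementation, `task_grads` would be the gradients of each DU-local HRL instance's adapted loss after one or more inner-loop steps.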
3.2 Federated Hierarchical RL
The federated HRL approach distributes DRL agents per slice and per BS (local DRL agents), aggregating models in the non-RT RIC via traffic-aware clustering (DTW + DBSCAN) and weighted averaging to derive specialized federated policies (Rezazadeh et al., 2022). This reduces control-plane overhead by more than 85%, accelerates convergence (roughly 25% fewer episodes), and keeps latency-violation rates below 1% for URLLC under heavy load.
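The two building blocks of this aggregation, a DTW distance over traffic traces and per-cluster weighted parameter averaging, can be sketched as follows. The DBSCAN step (run over the pairwise DTW distance matrix) is omitted, and the function names and weighting choice are illustrative assumptions:

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between
    two traffic traces; used to cluster BSs with similar load patterns."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def federated_average(models, weights):
    """FedAvg-style weighted parameter averaging within one traffic cluster
    (weights could be, e.g., per-agent traffic volume -- an assumption here)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return sum(wi * m for wi, m in zip(w, models))
```

Clustering before averaging is what makes the federated policies "specialized": only agents facing similar traffic dynamics share parameters.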
3.3 LLM-Augmented Hierarchical RL
LLM-hRIC injects an LLM-empowered non-RT RIC as a strategic guidance generator for near-RT RL agents (Bao et al., 25 Apr 2025). The LLM, e.g., Llama-3.1-8B-Instruct, outputs a per-SBS guidance vector based on global KPIs and network summaries. Near-RT RL agents (DDPG) blend this guidance with the local policy via scheduled mixing:
- Initial phase: actions follow the LLM guidance (mixing weight λ = 1).
- Blending phase: actions are a convex combination of the guidance and the local DDPG action, with λ annealed toward zero.
- Final phase: the local DDPG policy acts alone (λ = 0).
LLM-hRIC achieves 2× faster convergence and 10% higher throughput compared to pure RL under diverse traffic and partitioning (Bao et al., 25 Apr 2025).
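One way to realize the scheduled mixing is a piecewise-linear guidance weight; the phase lengths and the linear decay below are assumptions for illustration, not parameters reported in the paper:

```python
import numpy as np

def mixing_weight(step, warmup=1000, anneal=5000):
    """Guidance weight schedule: 1.0 during warm-up (follow the LLM),
    linear decay during blending, 0.0 afterwards (pure local policy).
    Phase lengths are hypothetical."""
    if step < warmup:
        return 1.0
    if step < warmup + anneal:
        return 1.0 - (step - warmup) / anneal
    return 0.0

def blended_action(a_llm, a_ddpg, step):
    """Convex combination of LLM guidance and local DDPG action."""
    lam = mixing_weight(step)
    return lam * np.asarray(a_llm) + (1.0 - lam) * np.asarray(a_ddpg)
```

Early training thus leans on global LLM context while the local agent is untrained, then hands control over as the DDPG policy matures.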
4. Practical Realizations and Implementation
4.1 Prototyped xApps and System Integration
Hierarchical RL controllers are realized as dockerized xApps on O-RAN-compliant emulation testbeds (e.g., Colosseum) (Tsampazi et al., 2023). High-level slicing xApps set long-term PRB shares or intent, while low-level scheduling xApps assign profiles (RR/PF/WF) per slice. By enforcing disjoint control dimensions (slice shares vs. schedulers) and respecting O-RAN RIC interface timescales (A1 for guidance, E2SM-KPM/RC for metrics and control), hierarchical deployment avoids action conflicts and meets real-time requirements.
4.2 Best Practices and Lessons
- Reward normalization by KPI dynamic range prevents unfair bias toward slices with larger-scale KPIs (Tsampazi et al., 2023).
- Timescale separation (slow meta/fast controller) is essential for stability and adaptation to sudden load changes.
- Autoencoder-based state compression enables tractable MDPs for large-scale testbeds (Tsampazi et al., 2023).
- Embedding both slicing and scheduling control in a single flat xApp can reach higher single-slice extremes but starves competing slices; the hierarchical split yields Pareto-efficient resource use (Tsampazi et al., 2023).
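The reward-normalization practice in the first bullet can be sketched as per-KPI min-max scaling; the operating bounds `kpi_min`/`kpi_max` are operator-chosen assumptions, not values from the cited testbed:

```python
import numpy as np

def normalized_reward(kpis, kpi_min, kpi_max):
    """Scale each KPI by its dynamic range so large-magnitude KPIs
    (e.g., eMBB throughput in Mbps) don't dominate small ones
    (e.g., URLLC latency in ms) in a summed reward."""
    kpis = np.asarray(kpis, dtype=float)
    span = np.asarray(kpi_max, dtype=float) - np.asarray(kpi_min, dtype=float)
    return np.clip((kpis - kpi_min) / np.where(span > 0, span, 1.0), 0.0, 1.0)
```

A summed reward over such normalized terms weights each slice's KPI equally regardless of its native units.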
5. Quantitative Performance and Scalability
Hierarchical RL frameworks deliver consistent performance enhancements across throughput, latency, fairness, and robustness metrics:
| Method & Metric | Improvement | Source |
|---|---|---|
| Meta-HRL efficiency | +19.8% over DDPG baseline; +4% over uniform MAML-HRL | (Lotfi et al., 8 Dec 2025) |
| Adaptation speed | −40% fewer adaptation shots | (Lotfi et al., 8 Dec 2025) |
| Fairness (Jain's index) | 0.91 → 0.96 | (Lotfi et al., 8 Dec 2025) |
| Throughput (h-DQN traffic steer) | 52.1 Mbps (+15.5% vs. heuristic, +6.5% vs. flat DQN) | (Habib et al., 2024) |
| Delay (h-DQN traffic steer) | 20.4 ms (−28% vs. heuristic, −59% vs. flat DQN) | (Habib et al., 2024) |
| Energy efficiency | +37.9% (HRL vs. non-ML cell-sleep) | (Habib et al., 2023) |
| Robustness to traffic surge | <5% performance drop (Meta-HRL, 50% traffic increase) | (Lotfi et al., 8 Dec 2025) |
| Scalability | <2% normalized reward loss, meta-update latency < 10 ms | (Lotfi et al., 8 Dec 2025) |
Ablation studies confirm the importance of adaptive meta-weighting (weights derived from TD-error variance), which improves normalized reward and reduces the number of adaptation shots required (Lotfi et al., 8 Dec 2025).
6. Theoretical Guarantees
HRL methods achieve sublinear convergence rates and bounded regret through theoretical analysis of multi-timescale actor-critic updates, provided prerequisites such as L-smoothness and bounded variance of gradient estimates hold (Lotfi et al., 8 Dec 2025). The hierarchical structure decouples exploration, stabilizes credit assignment, and yields tractable regret bounds.
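As a generic template (the precise constants, step sizes, and assumptions in the cited analysis may differ), such sublinear-regret statements typically take the form:

```latex
% Generic sublinear-regret template under L-smoothness and
% bounded-variance gradient estimates (illustrative, not the paper's exact bound)
\mathbb{E}\big[\mathrm{Regret}(T)\big]
  \;=\; \sum_{t=1}^{T} \Big( J(\pi^{*}) - \mathbb{E}\big[J(\pi_t)\big] \Big)
  \;=\; \mathcal{O}\big(\sqrt{T}\big)
```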
Moreover, federated clustering and adaptive meta-learning ensure that distributed HRL converges efficiently across heterogeneous and non-stationary traffic scenarios (Rezazadeh et al., 2022, Lotfi et al., 8 Dec 2025).
7. Limitations and Research Challenges
Several open challenges remain:
- LLM-guided frameworks incur inference latency (>100 ms), limiting ultra-fast adaptation (Bao et al., 25 Apr 2025).
- Goal abstraction and hierarchy design remain partially manual; automated discovery (e.g., options learning) is an open research area (Habib et al., 2024).
- Real-time, distributed meta-RL and federated strategies require further study for deployment at full O-RAN scale and in presence of domain shifts or adversarial behavior (Rezazadeh et al., 2022).
- Domain-specific LLM tuning, multi-modal prompt engineering, and efficient RL-LLM co-training protocols are needed to fully exploit global context while respecting latency constraints (Bao et al., 25 Apr 2025).
In summary, hierarchical RL—augmented by meta-learning, federated optimization, and cross-domain LLM guidance—enables scalable, robust, and adaptive resource management for O-RAN. These frameworks map naturally to O-RAN's multi-layer RIC architecture, provide substantial gains in efficiency, latency, and fairness, and establish theoretical and empirical benchmarks for real-time intelligent control in open, multi-slice wireless networks (Lotfi et al., 8 Dec 2025, Rezazadeh et al., 2022, Tsampazi et al., 2023, Bao et al., 25 Apr 2025, Habib et al., 2023, Habib et al., 2024).