Average-Reward Reinforcement Learning
- Average-reward reinforcement learning is a framework that optimizes the long-run average reward by treating all rewards equally across time steps.
- Recent algorithmic advances like reference-advantage decomposition and RVI Q-learning provide rigorous statistical guarantees and near-optimal regret bounds.
- Extensions to risk-aware methods, deep function approximation, and temporal abstraction broaden its real-world applications in robotics, network management, and more.
Average-reward reinforcement learning (AR-RL) is a foundational paradigm for sequential decision-making under uncertainty, focusing on the optimization of the long-term average reward rate in Markov decision processes (MDPs). Unlike discounted-reward RL, which prioritizes short-term returns through a discount factor, AR-RL treats rewards at all time steps equally, making it the preferred criterion for continuing control, steady-state system optimization, and settings without an intrinsic notion of discounting. This article surveys the theoretical foundations, algorithmic methodologies, statistical guarantees, and current research trends, with an emphasis on rigorous developments and open problems.
1. Formal Foundations of the Average-Reward Criterion
AR-RL considers an infinite-horizon MDP $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition kernel, and $r$ the reward function. For a stationary policy $\pi$, the long-run average reward (gain) is defined as

$$\rho^{\pi}(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t) \,\middle|\, s_0 = s\right].$$

Under mild assumptions such as weakly communicating or unichain structure, $\rho^{\pi}$ is independent of the starting state.
The bias (or relative value) function is

$$h^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - \rho^{\pi}\big) \,\middle|\, s_0 = s\right],$$

and satisfies the Poisson (Bellman) equation

$$h^{\pi}(s) + \rho^{\pi} = r(s, \pi(s)) + \sum_{s' \in \mathcal{S}} P(s' \mid s, \pi(s))\, h^{\pi}(s').$$
The optimal average reward is $\rho^{*} = \sup_{\pi} \rho^{\pi}$, and an optimal policy $\pi^{*}$ achieves this value. The bias span $\mathrm{sp}(h^{*}) = \max_{s} h^{*}(s) - \min_{s} h^{*}(s)$ plays a central role in the complexity of average-reward MDPs (Zhang et al., 2023).
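For a small chain with known dynamics, the gain and bias of a fixed policy can be computed exactly by solving the Poisson equation as a linear system. The sketch below uses a hypothetical 3-state Markov reward process (all numbers illustrative) and pins down the bias's free additive constant with the normalization $\mu^{\top} h = 0$, where $\mu$ is the stationary distribution.

```python
import numpy as np

# Hypothetical 3-state Markov reward process induced by a fixed policy:
# P is the transition matrix, r the per-state expected reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
r = np.array([1.0, 0.0, 2.0])
n = P.shape[0]

# Stationary distribution mu: solve mu P = mu with sum(mu) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu = np.linalg.lstsq(A, b, rcond=None)[0]

# Gain: rho = mu . r (state-independent in this unichain example).
rho = mu @ r

# Bias h: solve the Poisson equation h + rho = r + P h,
# fixing the additive constant via mu . h = 0.
A2 = np.vstack([np.eye(n) - P, mu])
b2 = np.concatenate([r - rho, [0.0]])
h = np.linalg.lstsq(A2, b2, rcond=None)[0]

print("gain:", rho)
print("bias:", h, "span:", np.max(h) - np.min(h))
```

The bias span printed at the end is exactly the hardness parameter $\mathrm{sp}(h^{\pi})$ discussed above, here for the evaluated policy rather than the optimal one.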
2. Algorithmic Principles and Model-Free AR-RL Methods
Contemporary AR-RL algorithms aim to solve for both the optimal gain and the bias or corresponding Q-functions, using only sampled interactions with the environment.
2.1 Reference-Advantage Decomposition and UCB-AVG
Recent advances leverage a reference-advantage decomposition to reduce AR-MDPs to a family of nearly undiscounted MDPs (discount factor $\gamma \to 1$), enabling variance reduction and efficient confidence computation. The UCB-AVG algorithm (Zhang et al., 2023) structures learning into epochs, uses sparse graphs to estimate bias differences, and projects value functions into polytopes constrained by the bias span. This yields a regret bound of $\widetilde{O}\big(\mathrm{sp}(h^{*})\sqrt{T}\big)$, up to polynomial factors in $S$ and $A$, with near-optimal dependence on the horizon $T$ and the fundamental hardness parameter $\mathrm{sp}(h^{*})$.
2.2 Relative Value Iteration (RVI) Q-Learning
RVI Q-learning and its variants generalize Q-learning by simultaneously updating state-action values and a running estimate of the average reward. The update for Q-values and gain (average reward) in the tabular setting is

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \Big( r_t - f(Q_t) + \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big),$$

where $f(Q_t)$ is a reference function serving as the gain estimate, e.g., $f(Q) = Q(s_{\mathrm{ref}}, a_{\mathrm{ref}})$ for a fixed reference state-action pair.
Convergence to the optimal Q-function and gain $\rho^{*}$ holds under standard conditions (diminishing step sizes, sufficient exploration, ergodicity) (Bakhshi et al., 2021, Wang et al., 2023, Yu et al., 2024, Yu et al., 5 Dec 2025).
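A minimal tabular RVI Q-learning loop illustrates the update, here on a hypothetical two-state, two-action unichain MDP (transition probabilities and rewards are illustrative, not taken from the cited works). The gain estimate is the reference value $f(Q) = Q(s_{\mathrm{ref}}, a_{\mathrm{ref}})$, which converges to $\rho^{*}$ at the fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP: P[s, a] is the next-state distribution, R[s, a] the
# expected reward (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

Q = np.zeros((2, 2))
s = 0
ref = (0, 0)  # reference state-action pair defining the gain estimate f(Q)

for t in range(1, 200_000):
    # Epsilon-greedy behavior policy with fixed epsilon.
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = int(rng.choice(2, p=P[s, a]))
    alpha = 1.0 / (1.0 + t * 1e-3)   # diminishing step size
    rho_hat = Q[ref]                 # RVI gain estimate f(Q) = Q(s_ref, a_ref)
    Q[s, a] += alpha * (R[s, a] - rho_hat + np.max(Q[s_next]) - Q[s, a])
    s = s_next

print("estimated optimal gain:", Q[ref])
```

At the fixed point, $Q$ solves the average-reward optimality equation up to an additive constant pinned down by the reference pair, so the printed value approximates $\rho^{*}$.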
2.3 Function Approximation and Deep AR-RL
For large or continuous state-action spaces, AR-RL is extended to function approximation using neural networks and feature-based representations. Techniques like implicit TD($\lambda$) (Kim et al., 7 Oct 2025) provide stable updates, and actor-critic methods with average-reward critics have been developed.
Notably, algorithms such as Average Reward Off-policy Soft Actor Critic (ARO-SAC) (Yang et al., 12 Jan 2025) and entropy-regularized variants (ASAC/ASQL) adapt expected Bellman equations and policy improvement to the average-reward and entropy-regularized settings, sometimes leveraging eigenvector-based frameworks (Adamczyk et al., 15 Jan 2025).
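The critic underlying such methods can be sketched as differential semi-gradient TD(0) with linear features, shown below for policy evaluation on a hypothetical two-state chain (this is a generic average-reward critic sketch, not the ARO-SAC algorithm itself; all numbers are illustrative). A single differential TD error drives both the value weights and the running gain estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Markov reward process induced by a fixed policy.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
r = np.array([1.0, 3.0])

def phi(s):
    """One-hot feature map for the linear critic v_hat(s) = w . phi(s)."""
    x = np.zeros(2)
    x[s] = 1.0
    return x

w = np.zeros(2)            # critic weights
rho_hat = 0.0              # running estimate of the average reward
alpha, eta = 0.01, 0.001   # critic and gain step sizes
s = 0

for _ in range(200_000):
    s_next = int(rng.choice(2, p=P[s]))
    # Differential TD error: reward minus gain estimate plus value difference.
    delta = r[s] - rho_hat + w @ phi(s_next) - w @ phi(s)
    rho_hat += eta * delta                 # gain update
    w += alpha * delta * phi(s)            # semi-gradient critic update
    s = s_next

print("estimated average reward:", rho_hat)
```

With one-hot features this reduces to the tabular differential TD(0) algorithm; with richer feature maps or a neural network the same update structure yields the average-reward critics used in deep AR-RL.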
3. Statistical Guarantees and Regret/Sample Complexity
The statistical understanding of AR-RL focuses on minimizing regret or achieving a policy whose gain is within $\varepsilon$ of optimal.
- For finite-space, model-free online AR-RL, regret of order $\widetilde{O}(\sqrt{T})$ is achievable, matching known lower bounds in $T$ up to polynomial factors in $S$, $A$ (Zhang et al., 2023).
- In the presence of generative models (simulators), sample complexity bounds of the form $\widetilde{O}\big(S A\, \mathrm{sp}(h^{*})\, \varepsilon^{-2}\big)$ are proven, and precise characterizations of the statistical hardness via mixing times or bias span are established (Zhang et al., 2023, Chen et al., 15 May 2025).
- In continuous spaces, adaptive discretization and zooming algorithms achieve regret rates scaling as $\widetilde{O}\big(T^{(d+1)/(d+2)}\big)$, where $d$ captures both the cover dimension and a zooming dimension reflecting problem-specific geometry (Kar et al., 2024).
Distributional robustness under KL, total variation, and $\chi^2$-divergence uncertainty sets is also treated, with matching sample complexity under uniform ergodicity (Chen et al., 15 May 2025).
4. Extensions: Risk, Temporal Abstraction, and Specification
4.1 Risk-Aware and Subtask-Driven AR-RL
The Reward-Extended Differential (RED) framework demonstrates that the structural property of AR-MDPs—where a single TD error drives both value and gain estimates—extends to simultaneous, online optimization of multiple scalar subtasks, including conditional value-at-risk (CVaR) and quantiles, without state augmentation (Rojas et al., 2024).
4.2 Options and Semi-Markov Models
AR-RL extends naturally to the options framework (temporal abstraction) and semi-Markov decision processes (SMDPs). Generalized RVI Q-learning algorithms with length-normalized updates converge under mild assumptions, supporting both inter- and intra-option methods (Wan et al., 2021, Yu et al., 2024, Yu et al., 5 Dec 2025).
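A single length-normalized inter-option update can be sketched as follows; the function name and the toy numbers are hypothetical, and the gain estimate follows the reference convention $f(Q) = Q[\mathrm{ref}]$ from RVI Q-learning. The key difference from the primitive-action case is that the gain is charged once per elapsed primitive step of the option.

```python
import numpy as np

def smdp_rvi_update(Q, s, o, cum_reward, duration, s_next, alpha, ref):
    """One inter-option RVI Q-learning update for an SMDP (sketch).

    cum_reward: total reward accumulated while option o ran;
    duration:   number of primitive steps the option took;
    ref:        reference state-option pair whose value serves as
                the gain estimate f(Q) = Q[ref].
    The term -Q[ref] * duration is the length normalization: the
    estimated average reward is subtracted once per primitive step.
    """
    rho_hat = Q[ref]
    target = cum_reward - rho_hat * duration + np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# Illustrative call: option 1 ran for 3 steps from state 0, earning 4.0 total.
Q = np.zeros((3, 2))
Q = smdp_rvi_update(Q, s=0, o=1, cum_reward=4.0, duration=3,
                    s_next=2, alpha=0.1, ref=(0, 0))
```

When every option is a one-step primitive action (duration 1), this reduces exactly to the tabular RVI Q-learning update of Section 2.2.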
4.3 Formal Specification and Mean-Payoff Objectives
AR-RL provides the proper formalism for optimizing mean-payoff and ω-regular (infinite-trace) specifications in continuing tasks. Recent work introduces model-free reductions from absolute liveness ω-regular specifications to average-reward MDPs, yielding the first practical RL methods for maximizing satisfaction probability in the continuing (non-episodic) regime and enabling lexicographic multi-objective AR-RL (Kazemi et al., 21 May 2025). Temporal-logic-based reward shaping compatible with AR-RL also enables safe and efficient knowledge-driven learning (Jiang et al., 2020).
5. Trust Region, Policy Improvement, and Performance Bounds
The lack of contraction for the average-reward Bellman operator precludes direct application of discounted-RL trust-region theory. However, recent advances establish performance-difference and trust-region bounds for AR-RL using tools such as Kemeny's constant, policy divergence measures, and average-reward-specific advantage functions (Ma et al., 2021, Zhang et al., 2021). Algorithms such as Average Policy Optimization (APO) and Average-Reward TRPO (ATRPO) adapt trust-region updates for monotonic policy improvement in the AR setting.
These results yield monotonic-improvement guarantees of the form

$$\rho(\pi') - \rho(\pi) \;\ge\; L_{\pi}(\pi') \;-\; 2(\kappa - 1)\,\epsilon\; \mathbb{E}_{s \sim d_{\pi}}\!\big[D_{\mathrm{TV}}\big(\pi'(\cdot \mid s)\,\|\,\pi(\cdot \mid s)\big)\big],$$

where $L_{\pi}(\pi') = \mathbb{E}_{s \sim d_{\pi},\, a \sim \pi'}[A^{\pi}(s,a)]$ is a surrogate objective, $\epsilon = \max_{s} \big|\mathbb{E}_{a \sim \pi'}[A^{\pi}(s,a)]\big|$ bounds the worst-case advantage, and $\kappa$ is Kemeny's constant (Zhang et al., 2021, Ma et al., 2021).
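For a chain with known dynamics, Kemeny's constant can be computed directly from the deviation matrix. The snippet below (toy transition matrix, illustrative only) uses one common convention, $\kappa = \sum_{j} \mu_j m_{ij}$ with mean hitting times $m_{ij}$ ($m_{ii} = 0$), a quantity independent of the start state $i$ and equal to $\mathrm{trace}(Z) - 1$ for $Z = (I - P + \mathbf{1}\mu^{\top})^{-1}$; some papers use the convention that is larger by exactly 1.

```python
import numpy as np

# Toy ergodic transition matrix (illustrative only).
P = np.array([[0.5, 0.5],
              [0.25, 0.75]])
n = P.shape[0]

# Stationary distribution mu: solve mu P = mu with sum(mu) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu = np.linalg.lstsq(A, b, rcond=None)[0]

# Fundamental matrix Z and Kemeny's constant (hitting-time convention).
Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), mu))
kappa = np.trace(Z) - 1.0
print("Kemeny's constant:", kappa)
```

Smaller $\kappa$ means faster mixing, which tightens the trust-region penalty term in the bound above and permits larger policy updates.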
6. Practical Applications and Empirical Trends
AR-RL aligns naturally with tasks in wireless resource management, network admission control, robotics, and large-scale infrastructure systems, where maximization of steady-state throughput or long-run performance underpins operational objectives (Yang et al., 12 Jan 2025, Bakhshi et al., 2021). Empirical studies verify that AR-RL algorithms (e.g., ARO-SAC) outperform their discounted counterparts on long-horizon or non-episodic tasks, often by significant margins (e.g., ≈15% higher throughput in wireless resource management (Yang et al., 12 Jan 2025)). Moreover, techniques such as entropy regularization and eigenvector-based policy evaluation have enabled robust and scalable average-reward learning in classic control domains (Adamczyk et al., 15 Jan 2025).
7. Open Problems and Future Directions
Research continues on:
- Tightening regret and sample complexity bounds, including sharp dependence on bias span, mixing time, or geometric structure (Zhang et al., 2023, Kar et al., 2024).
- Extension to function approximation and generalization: translating model-free, variance-reduced and bias-controlled techniques to deep RL (Zhang et al., 2023, Adamczyk et al., 15 Jan 2025).
- Algorithmic advances in risk-aware AR-RL for general coherent risk measures such as CVaR and quantile criteria (Rojas et al., 2024).
- Robust AR-RL under broader uncertainty and adversarial conditions, including sample-efficient, distributionally robust approaches (Wang et al., 2023, Chen et al., 15 May 2025).
- Potential-based reward shaping, specification-driven reward design, and integration with formal methods (Jiang et al., 2020, Kazemi et al., 21 May 2025).
- Efficient planning and learning in high-dimensional and metric state–action spaces, potentially leveraging structure-aware discretization, compression, and manifold learning (Kar et al., 2024).
- Generalization to partial observability, multi-agent systems, and settings with weaker communication or mixing assumptions.
In summary, average-reward reinforcement learning provides a mathematically rigorous and operationally natural framework for sequential control in continuing environments. Recent theoretical and algorithmic progress offers efficient, robust, and scalable methods, matching or approaching optimal complexity, and enabling principled solutions to a growing range of long-horizon, specification-driven, and risk-sensitive tasks.