
Discounted Risk-Sensitive Objectives

Updated 18 January 2026
  • Discounted risk-sensitive objective setting is a framework for optimizing Markov decision processes by integrating nonlinear risk measures to account for variability and rare catastrophic outcomes.
  • It employs risk measures such as entropic risk and CVaR within nonlinear Bellman equations and augmented state spaces to ensure time-consistent and robust policy design.
  • This approach enables safer decision-making in domains like finance, control, and reinforcement learning by balancing expected performance with explicit risk aversion under uncertainty.

Discounted risk-sensitive objective setting concerns the optimization of dynamic decision processes (typically Markov decision processes, MDPs) in which agents not only discount rewards or costs over time, but also explicitly modulate sensitivity to stochastic uncertainty or rare catastrophic outcomes. The standard risk-neutral framework, which minimizes or maximizes expected discounted return, is generalized by introducing nonlinear (convex or coherent) risk measures—most importantly, the entropic risk (exponential utility) and Conditional Value-at-Risk (CVaR). Discounted risk-sensitive objectives are foundational in robust control, finance, safe reinforcement learning, and operations research.

1. Problem Formulation and Risk-Sensitive Criteria

Consider an infinite-horizon, discrete- or continuous-time MDP $\mathcal{M}=(\mathcal{X},\mathcal{A},P,C,\gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition kernel, $C(x,a)$ the one-step cost, and $\gamma\in[0,1)$ the discount factor. Under a (possibly history-dependent) policy $\mu$, the total discounted cost is

$$Z(\mu) = \sum_{t=0}^{\infty} \gamma^{t} C(x_t, a_t).$$

Rather than evaluating policies only through $\mathbb{E}[Z(\mu)]$, risk-sensitive settings use nonlinear functionals $\rho$ that penalize variability or tail events. The principal measures are:

  • Entropic risk: For parameter $\theta \neq 0$,

$$\rho_\theta(Z) = \frac{1}{\theta}\ln \mathbb{E}\left[e^{\theta Z}\right].$$

This is associated with exponential utility and constant absolute risk aversion. As $\theta \to 0$, it reduces to the risk-neutral case.

  • Conditional Value-at-Risk (CVaR): For confidence level $\alpha \in (0,1]$,

$$\mathrm{CVaR}_\alpha(Z) = \min_{w\in\mathbb{R}}\left\{ w + \frac{1}{\alpha}\,\mathbb{E}\left[(Z-w)^+\right]\right\}.$$

  • Optimized Certainty Equivalent (OCE): Generalizes both prior cases, defined for a concave utility function $u$ as

$$\rho_u(Z) = \sup_{\eta\in\mathbb{R}} \left\{\eta + \mathbb{E}\left[u(Z-\eta)\right]\right\}.$$

Entropic risk and CVaR are recovered for exponential and piecewise-linear forms of $u$, respectively (Bäuerle et al., 2023).

Discounting in this context remains standard: future costs or rewards are geometrically downweighted by $\gamma$, but the discounted sum is now nested within a nonlinear risk functional. The sketch below illustrates these measures numerically.
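The following minimal Python sketch estimates the entropic risk and CVaR of a simulated discounted cost $Z$ from Monte Carlo samples. The gamma-distributed cost model is a hypothetical stand-in for rollouts of a fixed policy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Monte Carlo samples of the discounted cost Z(mu);
# in practice these would come from rolling out a policy in the MDP.
Z = rng.gamma(shape=2.0, scale=1.0, size=100_000)

def entropic_risk(z, theta):
    """rho_theta(Z) = (1/theta) * ln E[exp(theta * Z)]."""
    return np.log(np.mean(np.exp(theta * z))) / theta

def cvar(z, alpha):
    """Rockafellar-Uryasev form: min_w { w + E[(Z - w)^+] / alpha }.
    For samples, the minimizer w* is the (1 - alpha)-quantile of Z."""
    w = np.quantile(z, 1.0 - alpha)
    return w + np.mean(np.maximum(z - w, 0.0)) / alpha

print(np.mean(Z))               # risk-neutral baseline E[Z]
print(entropic_risk(Z, 0.5))    # > E[Z]: penalizes variability
print(entropic_risk(Z, 1e-4))   # ~ E[Z]: theta -> 0 recovers the mean
print(cvar(Z, 0.1))             # mean of the worst 10% of outcomes
```

Both measures exceed the mean for $\theta > 0$ and $\alpha < 1$, and collapse to it in the limits $\theta \to 0$ and $\alpha \to 1$, matching the risk-neutral case.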

2. Dynamic Programming and Bellman Equations

The discounted risk-sensitive objective leads to nonlinear Bellman equations, determined by the choice of risk measure. For entropic risk, the value function satisfies

$$V(x) = \inf_{a\in \mathcal{A}} \left[ C(x,a) + \frac{\gamma}{\theta} \ln \mathbb{E}\left[e^{\theta V(x')}\right] \right],$$

which can be recast as a fixed point in terms of transformations of $V$ (see the multiplicative Bellman recursion for exponential-of-integral cost functionals) (Bäuerle et al., 11 Jan 2026, Shen et al., 2011, Bäuerle et al., 2023). For CVaR objectives, state augmentation is typically necessary. The optimal value function, parameterized by confidence level $y$, satisfies

$$T[V](x,y) = \min_{a\in\mathcal{A}} \left\{ C(x,a) + \gamma \max_{\xi\in \mathcal{U}(y,\, P(\cdot|x,a))} \sum_{x'} P(x'|x,a)\,\xi(x')\,V\big(x',\, y\,\xi(x')\big) \right\},$$

with the risk envelope $\mathcal{U}(y, P) = \{\xi : 0 \leq \xi(x') \leq 1/y,\ \sum_{x'} \xi(x')P(x') = 1\}$ (Chow et al., 2015). For general OCE risk measures, the Bellman operator retains monotonicity and shift invariance, and under suitable compactness/continuity assumptions it remains a strict contraction in the sup-norm (Bäuerle et al., 2023, Bhabak et al., 2021). Existence and uniqueness of the fixed point (and hence of the optimal value function and policy) are thereby guaranteed.

Value and policy iteration—augmented to operate in the expanded state or on nonlinear expectations—yield globally optimal policies. For entropic and CVaR risk, augmented spaces track either the effective risk-sensitivity parameter (discounted for time) or the tail mass. For CVaR, optimal policies can require non-Markovian, history-dependent structure, but augmentation in the “confidence state” restores Markovian optimality (Chow et al., 2015).
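As an illustration of the entropic Bellman recursion above, here is a minimal value-iteration sketch on a randomly generated tabular MDP; all model quantities are hypothetical. Because the operator is a sup-norm contraction for $\gamma < 1$, the iteration converges to the unique fixed point:

```python
import numpy as np

# Hypothetical tabular MDP: nX states, nA actions, random kernel and costs.
nX, nA, gamma, theta = 5, 2, 0.9, 0.5
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nX), size=(nX, nA))  # P[x, a, :] = P(. | x, a)
C = rng.uniform(0.0, 1.0, size=(nX, nA))       # one-step costs C(x, a)

V = np.zeros(nX)
for _ in range(1000):
    # Entropic Bellman operator:
    # Q(x,a) = C(x,a) + (gamma/theta) * ln E_{x'~P(.|x,a)}[exp(theta V(x'))]
    Q = C + (gamma / theta) * np.log(P @ np.exp(theta * V))
    V_next = Q.min(axis=1)
    if np.max(np.abs(V_next - V)) < 1e-12:
        break
    V = V_next

print(V)                 # risk-sensitive optimal values
print(Q.argmin(axis=1))  # greedy policy w.r.t. the converged Q
```

Replacing the log-expectation term with a plain expectation recovers standard risk-neutral value iteration, which is exactly the $\theta \to 0$ limit.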

3. Robustness, Duality, and the Role of Risk Parameters

A central insight is the equivalence between risk-sensitive control and robust control under model ambiguity. For discounted cost, the CVaR criterion admits a dual representation,
$$\mathrm{CVaR}_\alpha(Z) = \sup_{Q \ll P,\ 0 \leq \frac{dQ}{dP} \leq 1/\alpha} \mathbb{E}_Q[Z],$$
i.e., as a worst-case expected cost under density perturbations (subject to a budget). Thus minimizing discounted CVaR is equivalent to guarding against adversarial environment perturbations within a specified envelope (Chow et al., 2015). This duality generalizes to ambiguity sets defined by statistical divergences (e.g., Kullback-Leibler or Rényi entropy), yielding "soft-robust" MDPs where both risk aversion (to rare outcomes) and epistemic uncertainty (model misspecification) are jointly mitigated (Hau et al., 2022, Ni et al., 2024).
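The duality can be checked numerically on a small discrete distribution: the Rockafellar-Uryasev primal and a greedy dual solution, which loads density $1/\alpha$ onto the worst outcomes, coincide. The values below are illustrative, not taken from any cited paper:

```python
import numpy as np

# Discrete cost distribution (hypothetical values and probabilities).
z = np.array([0.0, 1.0, 2.0, 10.0])
p = np.array([0.4, 0.3, 0.2, 0.1])
alpha = 0.25

# Primal (Rockafellar-Uryasev): min_w { w + E[(Z - w)^+] / alpha }
ws = np.linspace(z.min(), z.max(), 10_001)
primal = (ws + (np.maximum(z[None, :] - ws[:, None], 0.0) @ p) / alpha).min()

# Dual: sup over densities xi = dQ/dP with 0 <= xi <= 1/alpha, E_P[xi] = 1;
# the maximizer greedily puts weight 1/alpha on the largest-cost outcomes.
order = np.argsort(-z)
xi, budget = np.zeros_like(p), 1.0
for i in order:
    xi[i] = min(1.0 / alpha, budget / p[i])
    budget -= xi[i] * p[i]
    if budget <= 0:
        break
dual = np.sum(p * xi * z)

print(primal, dual)  # both equal CVaR_alpha(Z) = 5.2 here
```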

The discount factor $\gamma$ controls the effective planning horizon; larger $\gamma$ aligns the discounted problem with the average-cost setting and stabilizes the Bellman recursions (Bäuerle et al., 11 Jan 2026). The risk parameter ($\theta$ for entropic risk, $\alpha$ for CVaR) defines the trade-off between performance in expectation and control of tail events; smaller $\alpha$ (or larger $\theta$) enforces greater caution at the cost of higher expected costs (Bäuerle et al., 2023, Chow et al., 2015, Ni et al., 2024). Rigorous convergence and bias bounds for approximate value iteration under discretization (e.g., linear interpolation over confidence levels) are available (Chow et al., 2015).

4. Algorithmic Approaches and Computational Schemes

A spectrum of computational frameworks exists for discounted risk-sensitive objectives:

  • State augmentation and value iteration: For CVaR, the augmented state $(x, y)$ yields a higher-dimensional but contractive Bellman operator. Discretizing $y$ and linearly interpolating $y\,V(x, y)$ leads to practical algorithms with quantifiable error (Chow et al., 2015, Ni et al., 2024); a runnable toy version appears after this list:

        For k = 0, 1, ... until convergence:
            For each (x, y_i) on the grid:
                V_{k+1}(x, y_i) = min_a { C(x,a) + γ max_{ξ ∈ U(y_i, P(·|x,a))} Σ_{x'} P(x'|x,a) ξ(x') V_k(x', y_i ξ(x')) }
            Update the interpolants I_x[V_{k+1}]
  • Gradient-based and actor-free methods: Convex parameterizations (e.g., partially input-convex neural networks, PICNNs) enable exact solutions for risk-sensitive Q-functions, allowing gradient-descent-based global optimization over action spaces subject to CVaR constraints (Zhang et al., 2023).
  • Policy gradient and actor-critic methods: For differentiable policies, a likelihood-ratio or SPSA estimator for the policy gradient of entropic, variance-based, or CVaR-based discounted objectives is feasible. Multi-timescale stochastic-approximation (critic, actor, dual) methods are effective for constrained settings (e.g., variance-constrained discounted return) and can guarantee local convergence under suitable assumptions (A. et al., 2014, A. et al., 2018); a toy sketch appears at the end of this section.
  • Iterated risk measures (IRMs): For time-consistency in dynamic risk measures, IRMs apply risk recursively at each stage, ensuring Bellman recursion validity even when DEU or EUD exhibit time-inconsistency (Osogami, 2012).
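Below is a runnable toy version of the augmented-state value iteration from the first bullet, for a hypothetical two-state MDP. With two successor states the inner maximization over the risk envelope reduces to a one-dimensional search, done here by brute-force gridding, and $V$ is interpolated directly over the $y$-grid; both are crude stand-ins for the interpolation-based linear program of Chow et al. (2015):

```python
import numpy as np

# Hypothetical two-state, two-action MDP.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[x, a, x']
C = np.array([[1.0, 0.2], [0.0, 0.5]])     # C[x, a]
ys = np.linspace(0.05, 1.0, 20)            # confidence grid (y = 0 avoided)

V = np.zeros((2, len(ys)))                 # V[x, i] approximates V(x, ys[i])
for _ in range(200):
    V_new = np.empty_like(V)
    for x in range(2):
        for iy, y in enumerate(ys):
            best = np.inf
            for a in range(2):
                p0, p1 = P[x, a]
                # Parameterize xi = (t, (1 - t*p0)/p1); both entries must
                # lie in [0, 1/y] to stay inside the risk envelope U(y, P).
                t = np.linspace(0.0, 1.0 / y, 101)
                x1 = (1.0 - t * p0) / p1
                ok = (x1 >= 0.0) & (x1 <= 1.0 / y)
                obj = (p0 * t * np.interp(t * y, ys, V[0])
                       + p1 * x1 * np.interp(x1 * y, ys, V[1]))
                best = min(best, C[x, a] + gamma * obj[ok].max())
            V_new[x, iy] = best
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V[:, 0], V[:, -1])  # risk-averse (small y) vs risk-neutral (y = 1)
```

At $y = 1$ the only feasible perturbation is $\xi \equiv 1$, so that column reproduces standard risk-neutral value iteration, while small $y$ prices the tail.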

For continuous-time, jump, or semi-Markov models, risk-sensitive discounted objectives require solution of infinite-dimensional Hamilton–Jacobi–Bellman equations or equivalent integral equations in the relevant function spaces, but the underlying dynamic programming structure persists (Pal et al., 2021, Bhabak et al., 2021, Ghosh et al., 2016).
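To complement the policy-gradient bullet above, here is a toy sketch of a likelihood-ratio gradient for a CVaR objective, with the Rockafellar-Uryasev dual variable $w$ updated on a faster timescale than the policy. The two-action cost model is hypothetical and stands in for the discounted cost of a full MDP:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1
theta = np.zeros(2)   # softmax policy parameters over two actions
w = 0.0               # dual variable: tracks the (1 - alpha)-quantile of Z

def sample_cost(a):
    # Action 0: safe (low variance); action 1: cheaper on average, heavy tail.
    return rng.normal(1.0, 0.1) if a == 0 else rng.normal(0.8, 2.0)

for k in range(1, 50_001):
    p = np.exp(theta - theta.max()); p /= p.sum()
    a = rng.choice(2, p=p)
    Z = sample_cost(a)                      # stand-in for the discounted cost
    # Stochastic (sub)gradients of J(theta, w) = w + E[(Z - w)^+] / alpha:
    w -= (0.5 / k**0.6) * (1.0 - (Z > w) / alpha)               # fast timescale
    score = np.eye(2)[a] - p                                    # grad log pi(a)
    theta -= (0.05 / k**0.8) * score * max(Z - w, 0.0) / alpha  # slow timescale

p = np.exp(theta - theta.max()); p /= p.sum()
print(p)  # should concentrate on the low-variance action 0
```

A full actor-critic version would add a critic (and, in constrained settings, a second dual variable) on its own timescale; the sketch keeps only the actor and the CVaR dual variable for brevity.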

5. Structure and Properties of Optimal Policies

Optimal policies for discounted risk-sensitive objectives typically:

  • Exhibit threshold or barrier structures when utility is of exponential or power form, but the thresholds (and hence, actions) become strongly time- or history-dependent due to risk sensitivity (Bäuerle et al., 2013).
  • May lack full stationarity: For general values of $\gamma$, optimal discounted risk-sensitive policies are non-stationary, but ultimate stationarity (steady-state behavior in the limit) can emerge under vanishing risk aversion or high discount, connecting the discounted risk-sensitive and long-run average formulations (policy "turnpike" phenomena) (Bäuerle et al., 11 Jan 2026).
  • Are globally optimal within the augmented state-policy framework, but can be approximated to any fixed tolerance by truncation (finite horizon) or discretization (confidence grid, value grid), with rigorous bounds (Chow et al., 2015, M et al., 2022).

Robustness to parameter perturbations (in risk aversion or discount factor) is quantifiable, and convergence to moment-optimal (risk-neutral) solutions as $\theta \to 0$, $\alpha \to 1$ can be established. Moreover, in robust risk-sensitive RL, policies that control for both expectation and tail risk inevitably become more conservative, preferring safer trajectories over risky, high-reward shortcuts as uncertainty budgets increase (Ni et al., 2024).

6. Practical Considerations and Applications

Effective setup of a discounted risk-sensitive objective depends critically on tuning $\gamma$ and the risk parameter ($\theta$, $\alpha$), choosing discrete grids with appropriate resolution, and selecting risk measures appropriate for the application domain:

  • Engineering and finance: Risk-sensitive and robust discounted objectives provide a formal basis for safety-constrained reinforcement learning, fault-tolerant control, and financial systems requiring tail-risk regulation (Zhang et al., 2023, Ni et al., 2024).
  • Approximate value-iteration algorithms with error guarantees are preferable when continuous state or confidence spaces must be discretized.
  • When computational cost is a concern, actor-free convex schemes or specialized augmented-state linear programs provide tractable alternatives in finite MDPs (Zhang et al., 2023, M et al., 2022).
  • Empirical studies on navigation or control (e.g., stochastic water-tank, gridworlds with obstacles) consistently show that risk sensitivity and model ambiguity jointly induce policies that avoid rare but catastrophic events, at the expense of longer or lower-reward trajectories (Zhang et al., 2023, Ni et al., 2024).

7. Extensions and Open Questions

Recent research has advanced discounted risk-sensitive objective setting through:

  • Fully characterizing time-consistent dynamic risk measures beyond exponential and CVaR-type objectives, including Markovian and non-Markovian risk functionals (Osogami, 2012).
  • Providing regret bounds for risk-sensitive RL (including CVaR objectives) in tabular and function-approximation regimes (Bastani et al., 2022).
  • Extending methodologies to partially observable MDPs and semi-Markov models via state augmentation, ensuring that risk-sensitive discounted objectives remain tractable and optimal policies exist (Bäuerle et al., 2015, Bhabak et al., 2021).

Persistent challenges include scalable solution methods for high-dimensional/continuous systems, ensuring stability under parameter perturbations (especially in vanishing-discount or heavy-tailed payoff regimes), and reconciling risk-averse learning with deep function approximation. The interplay between risk-sensitivity and robustness, and the design of risk-aware benchmarks, remain prominent active directions (Chow et al., 2015, Bäuerle et al., 11 Jan 2026, Bäuerle et al., 2023).
