
Non-Cumulative Objectives in Decision Making

Updated 8 February 2026
  • Non-cumulative objectives are evaluation criteria defined by non-additive functionals over entire reward trajectories, emphasizing global performance rather than simple summation.
  • They extend classical reinforcement learning by replacing additive rewards with operators that capture bottleneck, maximum, or harmonic metrics via generalized Bellman equations.
  • Applications span network routing, risk-sensitive control, and scientometrics, where state augmentation and surrogate optimization enable PAC-learnability and global optimality.

A non-cumulative objective is a formal criterion for evaluating policies, algorithms, or agents in sequential decision problems whose value is not a sum (or discounted sum) of instantaneous rewards, but rather a more general functional of the entire sequence of rewards or events. Such objectives characterize many critical problems in reinforcement learning, optimal control, online learning, supervised classification, and scientometrics, where evaluating a system's performance through cumulative metrics is inadequate or fundamentally misaligned with ultimate goals.

1. Formal Definitions and Taxonomies

A non-cumulative objective is specified by a mapping from a trajectory (sequence of rewards, states, or events) to a real-valued score, which is generally not decomposable as a simple sum:

  • Standard cumulative objective: $u_\mathrm{cum} = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$
  • General non-cumulative objective: $u = f(r_1, r_2, \ldots)$, where $f$ is not a sum.

Canonical forms include:

  • Bottleneck or min objective: $u = \min_t r_t$
  • Maximum-reward objective: $u = \max_t r_t$
  • Harmonic mean (fixed horizon): $u = \frac{1}{\sum_{t=1}^T 1/r_t}$
  • Event indicator (micro-objective): $u = 1\{\text{event occurs within } T \text{ steps}\}$
  • General path-dependent functionals: $u = f(\{(s_t, a_t, r_t)\}_{t=1}^T)$
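As a quick illustration, these functionals can be evaluated directly on a sample reward trajectory (the numbers below are hypothetical):

```python
# Hypothetical reward trajectory used to illustrate the canonical
# non-cumulative objectives listed above.
rewards = [4.0, 1.0, 3.0, 2.0]

u_min = min(rewards)                           # bottleneck / min objective
u_max = max(rewards)                           # maximum-reward objective
u_harm = 1.0 / sum(1.0 / r for r in rewards)   # harmonic form, u = 1 / sum(1/r_t)
u_event = int(any(r >= 3.0 for r in rewards))  # indicator: a reward >= 3 occurs

print(u_min, u_max, u_harm, u_event)
```

None of these scores is a (discounted) sum of the rewards, which is exactly what makes them non-cumulative.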

Non-decomposable objectives in supervised learning and bandit frameworks arise similarly whenever a metric of interest, such as the $F_\beta$ score, AUCPR, min-recall, or the area under cost curves, cannot be written as an average over individual examples but instead depends on the confusion matrix, ranking, or aggregated empirical distribution (Eban et al., 2016, Ramasubramanian et al., 2024).

2. Theoretical Foundation and Bellman Generalization

Classical reinforcement learning methods rely on cumulative objectives because additive decomposability yields the Bellman equation and, with it, tractable dynamic programming. Non-cumulative objectives break this structure and demand generalized approaches:

  • Generalized Bellman equation (stateless operator):

$$Q^*(s,a) = \mathbb{E}\big[\,g\big(r(s,a),\,\gamma \max_{a'} Q^*(s',a')\big)\,\big|\,s,a\,\big]$$

where $g$ recursively combines the immediate reward $r$ with the downstream statistic $x$ in place of addition, chosen to match the functional $f$ (Cui et al., 2023).

  • Examples:
    • For $f(\vec{r}) = \min_t r_t$: $g(r, x) = \min(r, x)$.
    • For $f(\vec{r}) = \max_t r_t$: $g(r, x) = \max(r, x)$.
    • For the harmonic mean: $g(r, x) = 1/(1/r + 1/x)$ (for $r > 0$).
  • Finite-horizon reduction via state augmentation: Any non-cumulative $f$ admitting recursive state summarization can be encoded by augmenting the MDP state with auxiliary variables $x_t$ that propagate sufficient statistics, so that standard RL algorithms optimize the original objective (Nägele et al., 2024).
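To make the generalized update concrete, the following sketch runs value iteration with $g(r, x) = \min(r, x)$ on a tiny deterministic MDP; the MDP itself (states, actions, rewards) is invented for illustration:

```python
# Generalized value iteration for the bottleneck (min-reward) objective
# on a hypothetical deterministic MDP. Only g(r, x) = min(r, x) follows
# the text; the transition table is made up for illustration.
import math

# transitions[s][a] = (next_state, reward); state 2 is terminal.
transitions = {
    0: {"a": (1, 5.0), "b": (2, 2.0)},
    1: {"a": (2, 3.0)},
}
gamma = 1.0  # undiscounted bottleneck objective

Q = {(s, a): 0.0 for s in transitions for a in transitions[s]}

def value(s):
    # Max over actions; terminal states contribute +inf so that
    # min(r, x) reduces to r on the final transition.
    if s not in transitions:
        return math.inf
    return max(Q[(s, a)] for a in transitions[s])

for _ in range(50):  # iterate the generalized Bellman operator to its fixed point
    for (s, a) in Q:
        s_next, r = transitions[s][a]
        Q[(s, a)] = min(r, gamma * value(s_next))

print(Q)
```

At the fixed point, the greedy policy at state 0 chooses action "a" (bottleneck value 3.0 versus 2.0), the globally optimal bottleneck path, consistent with the monotonicity result for deterministic MDPs below.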

3. Sufficient Conditions and Convergence Guarantees

The extension of RL algorithms to non-cumulative objectives is underpinned by strong theoretical guarantees under appropriate conditions:

  • Contraction and uniqueness: If $g$ is Lipschitz in its second argument, the generalized Bellman operator is a $\gamma$-contraction, so value iteration converges to a unique fixed point (Cui et al., 2023).
  • Monotonicity in deterministic MDPs: If $g(a,\cdot)$ is monotone non-decreasing and transitions and rewards are deterministic, the greedy policy derived from the fixed point is globally optimal for the true non-cumulative return (Cui et al., 2023).
  • Sample and computational complexity: For non-cumulative objectives that are uniformly continuous or computable as functions of the reward path, PAC-learnability is preserved; i.e., $\epsilon$-optimality can be guaranteed with polynomial sample and computation requirements (Yang et al., 2023).
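The contraction claim admits a short estimate. Assuming $g$ is $L$-Lipschitz in its second argument (with $L \le 1$ for $\min$, $\max$, and the harmonic combiner on positive rewards), the generalized Bellman operator $\mathcal{T}$ satisfies

```latex
\begin{aligned}
\big|(\mathcal{T}Q_1)(s,a) - (\mathcal{T}Q_2)(s,a)\big|
  &\le \mathbb{E}\Big[\,L\,\gamma\,\big|\max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a')\big| \;\Big|\; s,a\,\Big] \\
  &\le L\,\gamma\,\lVert Q_1 - Q_2\rVert_\infty ,
\end{aligned}
```

so for $L\gamma < 1$ the operator is a sup-norm contraction and the Banach fixed-point theorem yields the unique $Q^*$.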

4. Algorithmic Approaches

Several frameworks have been developed for learning and planning under non-cumulative objectives, each adapting established paradigms:

In Reinforcement Learning

  • Generalized Value Iteration / Q-learning: Directly replace the additive update with the corresponding operator $g$ (Cui et al., 2023).

    # Pseudocode for generalized Q-learning: the usual additive TD target
    # r + gamma * max_a' Q(s', a') is replaced by g(r, gamma * max_a' Q(s', a')).
    for t in range(T):
        delta = g(r_t, gamma * max_a(Q[s_next, a])) - Q[s, a]
        Q[s, a] += alpha * delta
  • Finite-Horizon State Augmentation: Augment the state as $s_t = (\tilde{s}_t, x_t)$, updating $x_{t+1}$ recursively so that the cumulative reward sequence encodes $f$; then apply standard RL (Nägele et al., 2024).
  • Micro-Objective RL: Define task-specific Bernoulli micro-objectives and use Bellman-like or actor–critic updates for each; aggregate via partial order or scalarization (Li et al., 2019).
  • Non-Markovian Aggregation: For multiple objectives with distinct discount factors, augment the MDP state with a vector of cumulative discount products, rendering the process Markovian in the expanded space (Pitis, 2023).
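The augmentation idea is easiest to see for the max-reward objective: carry the running maximum $m_t$ in the auxiliary state and emit the telescoping reward $r'_t = \max(m_{t-1}, r_t) - m_{t-1}$, whose plain sum equals $\max_t r_t$. A minimal sketch (assuming nonnegative rewards so the running maximum can start at zero):

```python
def augmented_step(m, r):
    # Running-max update for the auxiliary state variable, plus the
    # telescoping shaped reward whose undiscounted sum equals max_t r_t.
    m_new = max(m, r)
    return m_new, m_new - m

rewards = [1.0, 4.0, 2.0, 3.0]   # hypothetical trajectory
m, total = 0.0, 0.0              # m = 0 assumes nonnegative rewards
for r in rewards:
    m, shaped = augmented_step(m, r)
    total += shaped

print(total, max(rewards))
```

Because the shaped rewards are additive, any standard cumulative RL algorithm run on the augmented process optimizes the original max objective.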

In Supervised and Bandit Learning

  • Surrogate optimization for non-decomposable metrics: Construct convex or non-convex surrogates that bound non-decomposable metrics from below (e.g., $F_\beta$, AUCPR) and apply mini-batch SGD or saddle-point optimization (Eban et al., 2016, Ramasubramanian et al., 2024).
  • Selective Mixup Fine-Tuning (SelMix): Approximate the metric's functional gradient with respect to class-pair mixup directions, then optimize the mixup policy to maximize expected metric gain (Ramasubramanian et al., 2024).
  • EDPM-UCB for bandits: When the objective is a function $\widehat{R}(\widehat{F}^\pi_T)$ of the empirical reward law, use stability and smoothness conditions to derive UCB-type algorithms with regret guarantees (Cassel et al., 2018).
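A quick numerical check of non-decomposability: $F_1$ computed on pooled data is not the mean of per-batch $F_1$ scores, which is why plain example-averaged SGD does not directly optimize it. The confusion counts below are hypothetical:

```python
def f1(tp, fp, fn):
    # F1 from confusion counts; algebraically equals 2*tp / (2*tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (tp, fp, fn) confusion counts for two mini-batches
batch1, batch2 = (9, 1, 0), (1, 0, 9)

pooled = f1(9 + 1, 1 + 0, 0 + 9)            # F1 on the pooled counts
averaged = (f1(*batch1) + f1(*batch2)) / 2  # mean of per-batch F1

print(pooled, averaged)
```

The two values disagree substantially, so the metric depends on the aggregated confusion matrix rather than any per-example average; surrogate-based methods target the pooled quantity directly.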

5. Representative Objective Classes and Practical Applications

Non-cumulative objectives arise in a diverse array of domains:

| Domain | Example Non-Cumulative Objective | Reference |
|---|---|---|
| RL/control | Bottleneck/minimum, max, harmonic mean | (Cui et al., 2023, Nägele et al., 2024) |
| Bandits | Conditional value-at-risk, Sharpe ratio | (Cassel et al., 2018) |
| ML classification | $F_\beta$, AUCPR, min-recall | (Eban et al., 2016, Ramasubramanian et al., 2024) |
| Citation metrics | Citation acceleration $W(t)$ | (Wilson et al., 2021) |

Practical applications include bottleneck routing in networks, risk-sensitive control, optimization of non-decomposable classification metrics, and citation-impact measurement, as summarized in the table above.

6. Limitations, Open Problems, and Future Directions

Despite these advances, non-cumulative objectives present substantial modeling, algorithmic, and theoretical challenges:

  • State representation complexity: Some non-cumulative functionals require the augmented state to track history or summary statistics whose size grows with time; classifying the objectives $f$ that admit finite, fixed-dimensional summaries remains open (Nägele et al., 2024).
  • Learning in stochastic environments: Some optimality guarantees depend on deterministic transitions; monotonicity and exchangeability properties must be enforced or new distributional RL techniques developed for broader generality (Cui et al., 2023).
  • Optimization stability and surrogate tightness: Surrogate-based methods for non-decomposable losses must balance tractability and fidelity to the original metric (Eban et al., 2016, Ramasubramanian et al., 2024).
  • Discovery and selection of objectives: Automated approaches to finding compact sets of micro-objectives or selecting relevant event-indicators are largely undeveloped (Li et al., 2019).
  • Normalization and generalization: For citation and scientometric indices, non-cumulative metrics face difficulty in cross-field comparison and sensitivity to short-term fluctuations (Wilson et al., 2021).
  • Aggregation and intertemporal agency: Normatively sound aggregation over objectives with differing time horizons fundamentally imposes non-Markovian, path-dependent reward structures; the minimal state augmentation approach addresses dynamic consistency but introduces additional computational complexity (Pitis, 2023).

Emerging trends include distributional and partially observable RL for non-cumulative objectives, hybrid actor–critic methods tailored to sequence-based metrics, and the design of objectives ensuring PAC-learnability through uniform continuity or computability (Yang et al., 2023, Nägele et al., 2024).

7. Summary Table: Algorithmic Treatments of Non-Cumulative Objectives

| Approach | Key Criterion | Guarantee/Property | Papers |
|---|---|---|---|
| Generalized Bellman update | Lipschitz/monotone $g$ | Contraction/global optimality | (Cui et al., 2023) |
| State augmentation (RL) | Finite recursive summary | Reduces to standard MDP | (Nägele et al., 2024) |
| PAC-learnability conditions | Uniform continuity/computability | Finite sample/comp. bounds | (Yang et al., 2023) |
| Micro-objective RL | Structured event-indicators | Arbitrary event probabilities | (Li et al., 2019) |
| Surrogate optim. (non-dec.) | Convex bounds on metrics | SGD/saddle-point convergence | (Eban et al., 2016, Ramasubramanian et al., 2024) |
| Non-Markovian aggregation | State-space expansion | Pareto-consistency, dynamic consistency | (Pitis, 2023) |
| EDPM-UCB (bandits) | Stability & smoothness | $O(\log T/T)$ regret | (Cassel et al., 2018) |

Non-cumulative objectives form a broad, foundational class in modern sequential decision-making, supervised learning, and evaluation science. Their algorithmic treatment increasingly relies on problem-specific operator generalization, state augmentation to recover Markovian structure, surrogate-based optimization, and principled considerations regarding learnability and tractability. Together, these advances systematically extend the reach of learning and planning methods beyond the traditional paradigm of cumulative, decomposable reward.
