
Goal-Conditioned Value Formulation in RL

Updated 10 February 2026
  • GCVF is a mathematical framework in reinforcement learning that extends traditional value functions by explicitly conditioning on goal states, enabling multi-goal learning and planning.
  • The formulation integrates classical Bellman recursions with advanced methodologies like bilinear networks, contrastive learning, and density estimation to optimize goal-reaching policies.
  • Applications span multi-goal robotic manipulation, high-dimensional navigation, and robust planning under uncertainty, demonstrating improved sample efficiency and real-time feasibility.

A Goal-Conditioned Value Formulation (GCVF) is a mathematical framework in reinforcement learning (RL) and optimal control that generalizes classical state-action value functions to include explicit conditioning on goal states or task descriptors. GCVFs provide the backbone for sample-efficient multi-goal learning, robust planning, and transfer in high-dimensional control systems. The formalism, its theoretical properties, and its algorithmic realizations have become central to areas such as multi-goal RL, offline RL, planning under uncertainty, and hierarchical control.

1. Formal Definition and Bellman Equations

In a goal-conditioned Markov decision process (MDP), the traditional value functions are extended to include an explicit goal input $g \in \mathcal{G}$. For state $s \in \mathcal{S}$, action $a \in \mathcal{A}$, and goal $g$, the goal-conditioned reward $r_g(s,a)$ encodes the agent's progress relative to $g$. The canonical GCVFs are defined as:

  • Goal-Conditioned State Value:

$$V^\pi(s, g) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_g(s_t, a_t) \mid s_0 = s\right]$$

  • Goal-Conditioned Action-Value:

$$Q^\pi(s, a, g) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_g(s_t, a_t) \mid s_0 = s, a_0 = a\right]$$

The Bellman recursion for the optimal GCVF is:

$$Q^*(s, a, g) = r_g(s, a) + \gamma \int P(s' \mid s, a) \max_{a'} Q^*(s', a', g) \, ds'$$

$$V^*(s, g) = \max_{a} Q^*(s, a, g)$$

Goal-conditioned policies are given by $\pi^*(s, g) = \arg\max_a Q^*(s, a, g)$ (Lawrence et al., 6 Dec 2025, Lawrence et al., 10 Feb 2025).

This formulation admits reward functions ranging from dense (e.g., quadratic penalties) to sparse (e.g., an indicator of reaching $g$). For indicator rewards, $V^\pi(s,g)$ is proportional to the discounted probability of reaching $g$ from $s$ (Schroecker et al., 2020, Giammarino et al., 12 Dec 2025, Bagatella et al., 2023).
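As a concrete illustration of these recursions, here is a minimal tabular goal-conditioned Q-learning sketch on a toy 1-D chain with a sparse indicator reward. The environment, hyperparameters, and training loop are illustrative assumptions, not taken from the cited works:

```python
import random

# Toy 1-D chain (hypothetical): states 0..N-1, actions move left/right,
# sparse goal-conditioned reward r_g(s, a) = 1 iff the next state equals g.
N, GAMMA, ALPHA = 8, 0.9, 0.5
ACTIONS = (-1, +1)

def step(s, a):
    return min(max(s + a, 0), N - 1)

# Q[s][a_idx][g]: the goal-conditioned action-value table Q(s, a, g).
Q = [[[0.0] * N for _ in ACTIONS] for _ in range(N)]

random.seed(0)
for _ in range(20000):
    s, g = random.randrange(N), random.randrange(N)
    a_idx = random.randrange(len(ACTIONS))
    s2 = step(s, ACTIONS[a_idx])
    r = 1.0 if s2 == g else 0.0
    # Goal-conditioned Bellman backup; the goal state is absorbing.
    boot = 0.0 if s2 == g else max(Q[s2][0][g], Q[s2][1][g])
    Q[s][a_idx][g] += ALPHA * (r + GAMMA * boot - Q[s][a_idx][g])

def policy(s, g):  # pi*(s, g) = argmax_a Q*(s, a, g)
    return ACTIONS[0] if Q[s][0][g] > Q[s][1][g] else ACTIONS[1]

V = lambda s, g: max(Q[s][0][g], Q[s][1][g])  # V*(s, g) = max_a Q*(s, a, g)
# With the indicator reward, V*(s, g) converges to gamma^(|s - g| - 1),
# the discounted probability (here, certainty) of reaching g.
print(policy(2, 6))  # prints 1: move right, toward the goal
```

Note how a single table is shared across all goals: the same transitions train $Q(s,a,g)$ for every $g$, which is the basic mechanism behind multi-goal sample reuse.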

2. Theoretical Properties and Quasimetric Structure

The optimal GCVF for sparse goal-reaching tasks induces a quasimetric structure over the state-goal space. For the reward $R(s,g) = -1$ if $s \neq g$ and $R(g,g) = 0$, $V^*(s, g)$ is minus the minimal time-to-goal, i.e., $d^*(s,g) = -V^*(s,g) \ge 0$, satisfying:

  • Nonnegativity: $d^*(s,g) \ge 0$, $d^*(g,g) = 0$
  • Triangle inequality: $d^*(s_1, s_3) \le d^*(s_1, s_2) + d^*(s_2, s_3)$

A quasimetric need not be symmetric: in general $d^*(s,g) \neq d^*(g,s)$. In continuous time, the shortest-path GCVF solves a Hamilton-Jacobi-Bellman (HJB) or Eikonal partial differential equation:

$$\|\nabla_x d^*(x, g)\|_2 = 1, \qquad d^*(g, g) = 0$$

This structure underpins recent PDE-regularized and quasimetric RL approaches (Giammarino et al., 12 Dec 2025). For partially observed systems, GCVFs operate over belief states, coupling estimation and control in the dual control regime (Lawrence et al., 6 Dec 2025).
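The quasimetric properties can be checked numerically on a toy directed graph (a hypothetical example, not from the cited works): shortest-path times satisfy identity and the triangle inequality, but asymmetric dynamics break symmetry:

```python
# Minimal check that minimal time-to-goal is a quasimetric: shortest-path
# distances on a directed graph satisfy d(g, g) = 0 and the triangle
# inequality, but need not be symmetric.
INF = float("inf")

# Hypothetical one-way dynamics: edge 0->1 has no reverse, so d(0,1) != d(1,0).
edges = {0: [1, 2], 1: [2], 2: [0]}
nodes = sorted(edges)

# Floyd-Warshall all-pairs shortest paths with unit step costs.
d = {(i, j): (0 if i == j else (1 if j in edges[i] else INF))
     for i in nodes for j in nodes}
for k in nodes:
    for i in nodes:
        for j in nodes:
            d[i, j] = min(d[i, j], d[i, k] + d[k, j])

assert all(d[i, i] == 0 for i in nodes)                      # d(g, g) = 0
assert all(d[i, j] <= d[i, k] + d[k, j]
           for i in nodes for j in nodes for k in nodes)      # triangle ineq.
print(d[0, 1], d[1, 0])  # prints: 1 2 -- asymmetric, hence quasi-metric
```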

3. Algorithmic Realizations

GCVF parameterizations and learning schemes span value-based, density-based, metric-based, and contrastive approaches:

  • Bilinear Value Networks: Employ a low-rank factorization $Q(s,a,g) = f(s,a)^\top \varphi(s,g)$, separating goal-agnostic dynamics from goal-specific geometry. This yields improved sample efficiency and transfer compared to monolithic networks. Input grouping and the rank $d$ crucially affect expressivity and generalization; $d = 8$–$16$ is typically effective (Hong et al., 2022).
  • Contrastive Learning: Contrastive RL equates the (log-)score $\phi(s,a) \cdot \psi(g)$, computed via neural encoders, with $\log Q_g^\pi(s,a)$, training via InfoNCE losses on (state, action, goal) tuples. This directly estimates future-state occupancy measures and is competitive or superior even without auxiliary tasks or data augmentation, including in image-based or offline RL settings (Eysenbach et al., 2022).
  • Value Density Estimation: Normalizing-flow models $F_\Phi(g \mid s, a)$ estimate the discounted future-state density or occupancy $F_\gamma^\pi$, enabling unbiased GCVF learning in stochastic domains and providing a foundation for robust goal-reaching and imitation learning (Schroecker et al., 2020).
  • Quasimetric and PDE Constraints: Enforce the value structure via triangle inequalities and Eikonal constraints, leading to trajectory-free, sample-based offline algorithms. Hierarchically factorized quasimetric models are used to handle high-dimensional problems (Giammarino et al., 12 Dec 2025).
  • Model-Based Offline Planning: Value-landscape artifacts, i.e., spurious local and global maxima arising from estimation error, are corrected via model-based planning and graph-based value aggregation, ensuring zero-shot goal-reaching after unsupervised exploration (Bagatella et al., 2023).
  • Robust and Hierarchical MPC Integration: In robust RL-MPC pipelines, the global GCVF (trained offline) acts as a terminal cost in scenario-based MPC, uniting statistical learning and robust planning for safe, goal-driven control (Lawrence et al., 10 Feb 2025, Morita et al., 2024).
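To make the bilinear parameterization concrete, here is a minimal sketch of the readout $Q(s,a,g) = f(s,a)^\top \varphi(s,g)$. The linear feature branches and scalar inputs are illustrative assumptions; the cited work (Hong et al., 2022) uses deep networks for both branches:

```python
import random

# Toy sketch of the bilinear readout Q(s, a, g) = f(s, a)^T phi(s, g).
D = 8  # rank of the factorization; d = 8..16 is reported as effective
random.seed(0)
W_f = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(D)]
W_p = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(D)]

def f(s, a):    # goal-agnostic branch: features of (s, a)
    return [ws * s + wa * a for ws, wa in W_f]

def phi(s, g):  # goal-specific branch: features of (s, g)
    return [ws * s + wg * g for ws, wg in W_p]

def Q(s, a, g):  # rank-D inner product: an O(D) readout per goal
    return sum(u * v for u, v in zip(f(s, a), phi(s, g)))

print(round(Q(0.5, 1.0, 2.0), 4))
```

Because the goal enters only through $\varphi$, the $f(s,a)$ features are shared across all goals, which is one source of the reported sample-efficiency and transfer gains.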

4. Connections to Classical Control, Dual Control, and Robustness

The GCVF reveals a precise distinction between classical, dense (e.g., quadratic) control objectives and probabilistic goal-conditioned rewards. Minimizing a quadratic arrival cost $c(s,a) = \|s - g\|_M^2$ approximates maximizing the probability of reaching $g$ only locally, near $g$; globally there is a significant optimality gap, and for long-range, highly sparse tasks the GCVF formulation is strictly superior (Lawrence et al., 6 Dec 2025).
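A toy illustration of this gap (a hypothetical one-way ring, not drawn from the cited paper): greedy descent on the quadratic cost stalls next to the goal, while the goal-conditioned value, here the exact $\gamma^d$ under the true wrap-around distance, drives the agent all the way there:

```python
# Hypothetical one-way ring: from each state you may stay or move forward
# (mod N); from s = 1 the goal g = 0 is reachable only by wrapping around.
N, GAMMA = 5, 0.9

def step(s, move):
    return (s + move) % N

def quad_score(s, g):   # classical dense proxy: higher = lower ||s - g||^2
    return -(s - g) ** 2

def gcvf_score(s, g):   # exact GCVF under an indicator reward: gamma^d
    return GAMMA ** ((g - s) % N)

def greedy_rollout(score, s, g, steps=10):
    for _ in range(steps):
        if s == g:
            break
        # pick the action whose successor the score function prefers
        s = max((step(s, 0), step(s, 1)), key=lambda s2: score(s2, g))
    return s

s_quad = greedy_rollout(quad_score, 1, 0)  # stays put: cost 1 beats cost 4
s_gcvf = greedy_rollout(gcvf_score, 1, 0)  # wraps around and reaches g
print(s_quad, s_gcvf)  # prints: 1 0
```

The quadratic proxy is a fine local certificate but has no notion of the dynamics' reachability structure; the GCVF encodes it by construction.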

In partially observed domains, belief-state GCVFs formalize the dual control paradigm: the expected reward decomposes into a control term (the posterior probability of attaining $g$ after a new observation) and an estimation term (informativity about the state), forcing active information acquisition during goal-reaching (Lawrence et al., 6 Dec 2025).

Robust GCVF extensions employ scenario trees for both exploration and online MPC, ensuring that the learned value functions and policies generalize across modeled uncertainties. Scenario-averaged Bellman operators and scenario-based trajectory optimizations are used to guarantee safety and performance under uncertainty (Lawrence et al., 10 Feb 2025).

5. Hierarchical and Multitask Extensions

Hierarchical GCVF architectures are necessary in high-dimensional or complex systems where single-level value fitting is infeasible. The decomposition proceeds as follows:

  • High-level: A quasimetric $d^h(z_0, z_1)$ is fitted in a low-dimensional latent or spatial subspace $\mathcal{Z}$. This provides an abstract, geometry-preserving score for subgoal planning.
  • Low-level: Local goal-conditioned values $V^\ell(s, z)$ and policies $\pi^\ell(a \mid s, z)$ are trained using standard temporal-difference methods.
  • Hierarchical planning: The overall GCVF is approximated by minimizing the sum $d^h(\phi(s), z) + V^\ell(s, z)$ over subgoals $z \in \mathcal{Z}$.

Such architectures allow efficient navigation and manipulation in environments with complex, contact-rich, or long-horizon dynamics (Giammarino et al., 12 Dec 2025).
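The two-level recipe can be sketched as a simple subgoal selector. The cost models below are hand-written stand-ins for the learned low-level value $V^\ell$ and high-level quasimetric $d^h$, and all numbers are illustrative:

```python
# Illustrative subgoal selector: reach z with the low-level controller,
# then cover z -> g abstractly with the high-level quasimetric.
def d_low(s, z):        # stand-in local cost-to-subgoal; superlinear, so
    return (s - z) ** 2 # short hops are cheap and long ones are penalized

def d_high(z, g):       # stand-in abstract distance from subgoal to goal
    return abs(z - g)

def pick_subgoal(s, g, subgoals):
    # hierarchical score: local cost plus abstract remaining distance
    return min(subgoals, key=lambda z: d_low(s, z) + d_high(z, g))

print(pick_subgoal(0.0, 10.0, [2.0, 5.0, 12.0]))  # prints: 2.0
```

The selector trades off local reachability against global progress, which is exactly the division of labor between the two levels.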

In multitask and real-time MPC, a single goal-conditioned terminal value network $\hat V_\phi(x, g)$ is trained to provide accurate cost-to-go predictions for a wide range of tasks. This enables sample-efficient, flexible planning in domains where task descriptors (e.g., desired trajectories, velocities, slopes) change rapidly, as exemplified by the hierarchical GCVF-MPC controller for a wheeled biped on varied terrain (Morita et al., 2024).
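A minimal sketch of terminal-value MPC follows. The 1-D dynamics, action set, horizon, and the hand-written stand-in for the trained $\hat V_\phi$ are all assumptions for illustration; the cited controllers use real models and learned networks:

```python
import itertools

# Toy MPC: minimize discounted short-horizon stage costs plus a learned
# cost-to-go estimate -V_hat(x_H, g) at the planning horizon.
GAMMA, H = 0.95, 3
ACTIONS = (-1.0, 0.0, 1.0)

def dynamics(x, u):
    return x + 0.5 * u

def stage_cost(x, g):
    return (x - g) ** 2

def v_hat(x, g):          # stand-in for the trained terminal value network
    return -abs(x - g)    # higher is better; exact form is an assumption

def mpc_action(x, g):
    best_u0, best_score = None, float("inf")
    # exhaustive search over action sequences; real MPC uses a solver
    for seq in itertools.product(ACTIONS, repeat=H):
        xt, cost = x, 0.0
        for t, u in enumerate(seq):
            xt = dynamics(xt, u)
            cost += (GAMMA ** t) * stage_cost(xt, g)
        cost += (GAMMA ** H) * (-v_hat(xt, g))  # terminal cost-to-go
        if cost < best_score:
            best_u0, best_score = seq[0], cost
    return best_u0  # receding horizon: apply only the first action

print(mpc_action(0.0, 2.0))  # prints: 1.0 -- accelerate toward the goal
```

The terminal term is what lets a short horizon $H$ act globally: beyond-horizon consequences are summarized by the goal-conditioned value instead of being rolled out.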

6. Empirical Benchmarks and Comparative Performance

GCVF-based algorithms have been evaluated on multi-goal robotic manipulation (Fetch suite, Shadow Hand), high-dimensional navigation (OGbench, maze-large, AntMaze, HumanoidMaze), and process control (CSTR, double pendulum). Consistent findings include:

  • Bilinear and factorized architectures accelerate goal-conditioned RL in sample-limited settings and improve adaptation to out-of-distribution goals (Hong et al., 2022).
  • Contrastive and density-based formulations outperform HER and traditional actor-critic baselines, particularly in sparse-reward, image-based, or offline RL regimes, achieving higher success rates and robustness (Eysenbach et al., 2022, Schroecker et al., 2020).
  • PDE-constrained and hierarchical approaches establish new state-of-the-art results in long-horizon navigation and manipulation, with markedly better generalization and collision avoidance (Giammarino et al., 12 Dec 2025).
  • Model-based planning and value aggregation correct value artifacts that trap naïve actors and enable substantial gains in zero-shot reachability after unsupervised exploration (Bagatella et al., 2023).

Empirical studies confirm improved real-time feasibility, robustness, and safety when GCVFs are used as terminal costs in scenario-based or hierarchical MPC (Morita et al., 2024, Lawrence et al., 10 Feb 2025).

7. Limitations and Future Extensions

While GCVF frameworks yield strong theoretical and practical advances, several limitations remain:

  • Assumptions: Many GCVF theorems (e.g., Lipschitz, PDE-regularizations) assume isotropic, continuous dynamics, which rarely hold exactly in real, contact-rich robots.
  • Value Approximation Error: Estimation errors lead to spurious local/global artifacts; hierarchical decomposition and value aggregation partially mitigate these effects.
  • Out-of-Distribution Generalization: Value accuracy degrades for goals far outside the training support, requiring adaptation or meta-learning.
  • Computational Costs: Offline data generation and MPC-based backups can be resource-intensive, particularly in complex domains.

Future research directions include: richer goal embeddings (e.g., vision-conditioned goals), attention-based or mixture-of-experts networks to scale to expansive goal-spaces, adaptive prediction horizons in MPC, incorporation of stochasticity or partial observability, and rapid meta-adaptation to new tasks (Morita et al., 2024, Giammarino et al., 12 Dec 2025).


By unifying learning and planning through explicit goal-conditioning, GCVF has provided a principled and flexible foundation for solving long-horizon, multi-goal, and robust RL problems, with documented theoretical guarantees and substantial empirical advances across diverse control domains (Hong et al., 2022, Eysenbach et al., 2022, Lawrence et al., 6 Dec 2025, Giammarino et al., 12 Dec 2025, Schroecker et al., 2020, Bagatella et al., 2023, Lawrence et al., 10 Feb 2025, Morita et al., 2024).
