Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

Published 8 Oct 2025 in cs.LG | (2510.07257v1)

Abstract: Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.


Summary

The paper introduces TTGS, a framework that leverages graph search and subgoal execution to overcome long-horizon decision challenges in offline goal-conditioned reinforcement learning.
It constructs a directed, weighted graph over states from the offline dataset, with edge costs taken from learned value functions or domain-specific distance signals, enabling trajectory stitching without extra training.
Experiments on OGBench show that TTGS improves success rates of several base learners on locomotion tasks, often outperforming more complex planning methods while requiring no additional online interaction.

Abstract

The paper introduces Test-Time Graph Search (TTGS), a planning framework for enhancing goal-conditioned reinforcement learning (GCRL) agents in offline settings. TTGS addresses long-horizon decision-making by building a graph over states in the offline dataset and using a learned value function, or another distance signal, to plan subgoal sequences that stitch trajectories together. Because it runs entirely at inference time, it improves success rates across several locomotion tasks without additional training, supervision, or online interaction.

Introduction

Goal-conditioned reinforcement learning (GCRL) has emerged as a prominent method for training agents to achieve user-specified objectives. By decoupling behavioral policies from specific reward structures, GCRL can use broad, unlabeled data without delicate reward engineering, making it applicable to domains like robotics and autonomous driving. However, offline GCRL struggles in long-horizon scenarios: temporal credit assignment becomes harder and errors accumulate over long rollouts, and the offline setting amplifies both effects.

TTGS addresses these challenges by constructing a weighted graph over dataset states, using either value functions or domain-specific distance signals as edge costs. At inference time, Dijkstra's algorithm is run on this graph to compute a sequence of subgoals, allowing agents to reach distant goals more reliably.

Figure 1: Rollouts from HIQL fail to reach a distant goal.

Methodology

The methodological core of TTGS revolves around three key processes: distance prediction, graph construction, and subgoal execution.

Distance Prediction

Distance prediction is the backbone of TTGS: it estimates how far apart two states are in terms of expected rollout length. For value-based learners, TTGS derives distances directly from the learned goal-conditioned value function, so no handcrafted metric is needed and the method stays compatible with existing GCRL frameworks. Domain-specific distances can be used instead when they are available.
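As a rough illustration of how a value function can be turned into a step-count distance, the sketch below assumes a common sparse-reward convention that this summary does not confirm for TTGS: a reward of -1 per step until the goal is reached and a discount factor gamma, so that V(s, g) = -(1 - gamma^d) / (1 - gamma) for a goal d steps away. The function names, `gamma`, and `max_distance` are hypothetical and chosen only for this example.

```python
import math


def distance_to_value(d, gamma=0.99):
    """Discounted return of paying -1 per step for d steps (assumed reward convention)."""
    return -(1.0 - gamma ** d) / (1.0 - gamma)


def value_to_distance(value, gamma=0.99, max_distance=1000.0):
    """Invert the formula above to recover an approximate step count from a value.

    Values at or below the return of a max_distance-step path are clipped to
    max_distance; values above zero are clipped to a distance of zero.
    """
    inner = 1.0 + (1.0 - gamma) * value
    if inner <= gamma ** max_distance:
        return max_distance
    return max(0.0, math.log(inner) / math.log(gamma))


if __name__ == "__main__":
    v = distance_to_value(25.0)
    print(v, value_to_distance(v))
```

Clipping matters in practice because a learned value function can produce estimates outside the range the formula can invert.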

Graph Construction

TTGS builds a directed graph over states sampled from the dataset. Edge weights are computed from either value-derived distances or domain-specific signals, with a penalty on long jumps so that shortest paths favor hops the policy can realistically execute. The resulting graph reflects trajectories that are feasible given the agent's learned capabilities, and construction stays cheap because only a sample of states is used and distances are fast to evaluate.
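The construction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear form of the long-jump penalty, the parameters `max_step` and `penalty`, and the helper `toy_distance` are all assumptions introduced here, and a real distance function would come from the learned value function or a domain-specific signal.

```python
from typing import Callable, Dict, Sequence, Tuple

State = Tuple[float, ...]
Graph = Dict[int, Dict[int, float]]


def build_graph(
    states: Sequence[State],
    distance_fn: Callable[[State, State], float],
    max_step: float = 2.0,   # hypothetical threshold for a "long jump"
    penalty: float = 10.0,   # hypothetical cost per unit of excess distance
) -> Graph:
    """Build a dense directed graph; edges longer than max_step pay an extra cost."""
    graph: Graph = {i: {} for i in range(len(states))}
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            if i == j:
                continue
            d = distance_fn(s, t)
            excess = max(0.0, d - max_step)
            graph[i][j] = d + penalty * excess
    return graph


def toy_distance(s: State, t: State) -> float:
    """Stand-in distance on 1-D states, used only for this example."""
    return abs(t[0] - s[0])


if __name__ == "__main__":
    states = [(0.0,), (1.0,), (2.0,), (5.0,)]
    for node, edges in build_graph(states, toy_distance).items():
        print(node, edges)
```

With a penalty like this, a direct edge across a large gap stays in the graph but becomes much more expensive than a chain of short hops, which is what steers the search toward stitched paths.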

Subgoal Execution

Once the graph is built, TTGS runs a shortest-path algorithm to obtain a sequence of subgoals, which is fed to a frozen policy for execution. Subgoals are selected adaptively during the rollout: the agent is conditioned on a subgoal it can currently reach rather than on the distant final goal, which mitigates the failure modes of direct long-horizon execution.
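A minimal sketch of this planning step is shown below, using Dijkstra's algorithm over an adjacency-dictionary graph. The subgoal rule in `next_subgoal` (pick the farthest waypoint whose predicted distance from the current state is within a `reach` threshold) is an assumption made for illustration, since this summary does not spell out the exact selection rule; all function and parameter names are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the node sequence from source to target, or []."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def next_subgoal(waypoints, dist_from_current, reach=2.0):
    """Pick the farthest waypoint predicted to be within reach (assumed rule).

    Falls back to the first waypoint if none is within reach.
    """
    chosen = waypoints[0]
    for node in waypoints:
        if dist_from_current(node) <= reach:
            chosen = node
    return chosen


if __name__ == "__main__":
    graph = {0: {1: 1.0, 2: 10.0}, 1: {2: 1.0, 3: 35.0}, 2: {3: 1.0}, 3: {}}
    path = shortest_path(graph, 0, 3)
    print(path)
    print(next_subgoal(path[1:], lambda n: float(n)))
```

In a rollout, `next_subgoal` would be called repeatedly as the agent moves, with the frozen policy conditioned on the returned waypoint instead of the final goal.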

Figure 2: Goal-reaching success rates for QRL, GCIQL, and HIQL with and without TTGS. Distances are predicted from each base agent's learned value function. TTGS consistently improves or preserves performance on locomotion tasks that require trajectory stitching.

Experiments

TTGS was evaluated on the OGBench suite, targeting challenging locomotion tasks that require trajectory stitching. The framework improved or preserved performance across multiple GCRL algorithms (HIQL, GCIQL, QRL), often achieving better results than more complex planning methods such as GAS and CompDiffuser, which require additional training or more sophisticated models.

These results suggest that TTGS can unlock latent capabilities in pre-trained policies using only test-time planning, with no retraining of the base learner.

Figure 3: Ablations of HIQL+TTGS-value.

Limitations

Despite its successes, TTGS introduces mild computational overhead in graph construction and search operations. Furthermore, its reliance on the accuracy of the distance predictor and dataset coverage can occasionally limit performance, particularly in environments where the learned value functions are unreliable.

Conclusion

TTGS enhances offline GCRL agents by turning the available dataset into a planning graph, without modifying the existing training pipeline. It offers a practical solution to long-horizon planning and encourages further work on modular strategies that combine learned components with classical search.

Overall, TTGS is a lightweight way to improve goal-reaching in offline GCRL, and it may apply to other domains that require long-horizon planning.