Reward-Conditioned Trajectory Policy (RCTP)
- RCTP is a reinforcement learning framework that reframes policy learning as a supervised problem conditioned on reward returns.
- It models actions and entire trajectories as conditional distributions, enabling fine-grained control and efficient reuse of optimal and suboptimal data.
- Population-based variants integrate evolutionary operators with supervised training to boost sample efficiency across robotics, RL benchmarks, and LLM tool use.
A Reward-Conditioned Trajectory Policy (RCTP) is a class of policy architectures and algorithms in reinforcement learning (RL) and sequential decision-making that frames the learning and selection of trajectory-generating policies as a supervised learning problem, conditioned explicitly on observed or desired reward returns. Unlike standard RL methods, which maximize expected reward via policy gradients or temporal-difference objectives, RCTP leverages the full space of previously collected trajectories—successful or not—by modeling policy outputs as conditional distributions over actions or trajectories, parameterized by a target reward variable. This reward-conditioning mechanism enables fine-grained control, efficient exploitation of suboptimal data, and, in population-based variants, the use of evolutionary operators for exploration and skill composition. RCTP has been implemented for continuous control in robotics (Akbulut et al., 2020), standard RL benchmarks (Kumar et al., 2019), and even discrete action spaces in multi-turn LLM tool-calling pipelines (Zhong et al., 3 Feb 2026).
1. Probabilistic and Algorithmic Formulation
The canonical RCTP framework operates over trajectory–return tuples $(\tau, R)$. Let $\tau = (x_1, \dots, x_T)$ (a sequence of via-points in robotics) or $\tau = (s_0, a_0, s_1, a_1, \dots)$ (state–action pairs in RL) be a trajectory, and $R(\tau)$ (or the per-timestep return $R_t$) its associated (possibly discounted) cumulative reward. The objective is to learn a conditional policy $\pi_\theta(\tau \mid R)$ or, more granularly, $\pi_\theta(a_t \mid s_t, R)$, such that a trajectory sampled from $\pi_\theta$ given a target return $\hat{R}$ achieves that return in expectation.
In the variational formulation (Akbulut et al., 2020), $\pi(\tau \mid R)$ is factorized via a latent style variable $z$:

$$\pi(\tau \mid R) = \int p_\theta(\tau \mid z, R)\, q_\phi(z \mid C, R)\, dz,$$

where inference over $z$ is facilitated by a variational posterior $q_\phi(z \mid C, R)$, with $C$ a random subset ("context") of points from $\tau$. Training maximizes the ELBO, summed over time:

$$\mathcal{L} = \sum_t \mathbb{E}_{q_\phi}\big[\log p_\theta(\tau_t \mid z, R)\big] - \beta\, \mathrm{KL}\big(q_\phi(z \mid C, R)\,\|\,p(z)\big),$$

with Gaussian $p_\theta(\tau_t \mid z, R)$ and a KL-weighting hyperparameter $\beta$.
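Under the Gaussian assumptions above, both ELBO terms have closed forms. A minimal NumPy sketch of the loss (shapes and names are illustrative, not the authors' code; the prior is taken as a standard normal):

```python
import numpy as np

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under a diagonal Gaussian N(mu, sigma^2)."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def neg_elbo(traj, recon_mu, recon_sigma, post_mu, post_sigma, beta=1.0):
    """Negative ELBO: reconstruction NLL summed over timesteps + beta-weighted KL."""
    recon = gaussian_nll(traj, recon_mu, recon_sigma)
    kl = kl_to_standard_normal(post_mu, post_sigma)
    return recon + beta * kl
```

In practice the reconstruction term would be averaged over Monte Carlo samples of the latent code; a single decoded mean/variance pair is used here to keep the sketch short.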
In the RL-centric approach (Kumar et al., 2019), RCTP can be derived as the solution of a constrained policy search problem: maximize return subject to a KL-divergence constraint against the behavior distribution $p_{\mathrm{b}}$, yielding the optimal tilted distribution

$$p^*(\tau) \propto p_{\mathrm{b}}(\tau)\, \exp\!\big(R(\tau)/\eta\big),$$

with temperature $\eta$, and a supervised learning projection onto the policy class:

$$\theta \leftarrow \arg\max_\theta\; \mathbb{E}_{(s,\,a,\,R) \sim \mathcal{D}}\big[\log \pi_\theta(a \mid s, R)\big],$$

where the data $\mathcal{D}$ is sampled from a replay buffer populated with both optimal and suboptimal samples.
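The projection step is ordinary (optionally exponentially weighted) maximum likelihood over buffer samples. A hedged NumPy sketch for a discrete-action policy, where `logits` stand in for the outputs of a network already conditioned on (state, return):

```python
import numpy as np

def reward_conditioned_nll(logits, actions, returns, eta=None):
    """Supervised projection loss: negative log-likelihood of buffer actions
    under the reward-conditioned policy, optionally weighted by exp(R / eta).

    logits  : (N, A) policy outputs, conditioned on (state, return)
    actions : (N,)   actions recorded in the replay buffer
    returns : (N,)   observed returns, used for optional reweighting
    """
    # log-softmax, computed stably
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(actions)), actions]
    if eta is not None:  # exponential reweighting toward high-return samples
        w = np.exp(returns / eta)
        w = w / w.sum()
        return float((w * nll).sum())
    return float(nll.mean())
```

With `eta=None` this is the unweighted loss; passing a temperature recovers the exponential weighting discussed in the training protocol below.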
2. Architecture and Training Protocols
The architecture of RCTP networks varies by domain, but a consistent theme is explicit reward (or return) input at every decision point.
- In trajectory-centric robotic tasks, RCTP is implemented with a Neural Process backbone (Akbulut et al., 2020). The encoder ingests randomly chosen trajectory points together with the target reward $R$, producing a Gaussian posterior over latent codes $z$. The decoder receives $(z, t, R)$ and outputs the distribution parameters for the trajectory point at each timestep $t$.
- In standard RL settings, policy networks take as input the state $s$, either concatenated with or multiplicatively modulated by (via FiLM, feature-wise linear modulation) the target return $\hat{R}$, and output the parameters of $\pi_\theta(a \mid s, \hat{R})$ (Kumar et al., 2019).
- For LLM policies in multi-turn tool use, the RCTP formalism is realized via fine-tuning a transformer model with a reserved discrete reward token prepended to the dialogue context; action generation is then conditioned at every token-generation step (Zhong et al., 3 Feb 2026).
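The FiLM-style conditioning mentioned above modulates hidden features with a per-feature scale and shift computed from the target return, rather than concatenating the return once at the input. A minimal sketch, where single affine maps stand in for the learned conditioning networks (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def film_layer(h, target_return, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise linear modulation: h' = gamma(R) * h + beta(R).

    h             : (d,) hidden features from the state encoder
    target_return : scalar conditioning variable R
    W_*, b_*      : parameters of the (here, affine) conditioning maps
    """
    r = np.atleast_1d(target_return)
    gamma = W_gamma @ r + b_gamma   # (d,) per-feature scale
    beta = W_beta @ r + b_beta      # (d,) per-feature shift
    return gamma * h + beta

d = 4
h = rng.standard_normal(d)
out = film_layer(h, 1.5,
                 W_gamma=rng.standard_normal((d, 1)), b_gamma=np.ones(d),
                 W_beta=rng.standard_normal((d, 1)), b_beta=np.zeros(d))
```

Because every feature is rescaled by a return-dependent factor, the network cannot trivially ignore the conditioning signal, which is the rationale given for preferring FiLM over concatenation.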
Training is typically supervised: at each iteration, trajectories and their observed returns are sampled from the replay buffer, the conditioning variable ($R$ or the advantage $A$) is provided to the network along with the observed state/action/history, and the negative log-likelihood of the demonstrated action under the network’s output is minimized. The loss may be unweighted or scaled exponentially as $\exp(R/\eta)$ to prioritize higher returns.
3. Population-Based Variational Policy Optimization
A distinct contribution of RCTP is population-based improvement through evolutionary policy search (Akbulut et al., 2020). The procedure alternates between supervised learning and population-based search in the reward-conditioned latent space:
- Population generation: Fix a target reward $\hat{R}$ (typically the maximal observed), sample latent codes $z_1, \dots, z_N$ from the posterior or prior, and decode the corresponding trajectories $\tau_1, \dots, \tau_N$.
- Evolutionary operators:
- Crossover: For each random pair $(\tau_i, \tau_j)$, select a time index $t$ and splice segments from the latent decodings to construct offspring trajectories. This blends sub-trajectories, recombining distinct skills without manual segmentation.
- Mutation: Smooth Gaussian noise is added to each point of the trajectory to inject local exploration.
- Selection: All candidate trajectories are executed (optionally with a tracking controller to ensure feasibility), real returns are measured, and top performers are retained in the buffer for further training and posterior refinement.
- Replay and buffer update: The buffer retains both successful and failed rollouts, enhancing generalization and robust posterior estimation.
This integration of supervised learning, reward-conditioning, and evolutionary search substantially accelerates policy improvement and supports stable learning across challenging tasks.
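The three operators can be sketched directly over trajectory arrays (the latent-space variant works analogously, with codes in place of trajectories; `return_fn` and the trajectory representation are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(traj_a, traj_b):
    """Splice two parent trajectories at a random time index."""
    t = rng.integers(1, len(traj_a))
    return np.concatenate([traj_a[:t], traj_b[t:]], axis=0)

def mutate(traj, sigma=0.01):
    """Inject local exploration via Gaussian noise on every trajectory point."""
    return traj + rng.normal(0.0, sigma, size=traj.shape)

def evolve(population, return_fn, n_pairs=10, n_keep=5, sigma=0.01):
    """One generation: crossover random pairs, mutate, select top performers."""
    offspring = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(population), size=2, replace=False)
        offspring.append(mutate(crossover(population[i], population[j]), sigma))
    candidates = list(population) + offspring
    # "execute" candidates (here: score with return_fn) and keep the best
    ranked = sorted(candidates, key=return_fn, reverse=True)
    return ranked[:n_keep]
```

In the full method the scoring step is a real rollout (optionally through a tracking controller), and retained trajectories are written back to the buffer rather than replacing the population outright.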
4. Practical Implementation and Hyperparameterization
RCTP methods have standardized training and optimization procedures, employing the following typical hyperparameters and workflow (Akbulut et al., 2020, Kumar et al., 2019):
| Component | Typical Value/Setup | Role |
|---|---|---|
| Latent dimension | 16–64 | Expressiveness of trajectory/stylistic space |
| Context points | 5–10 | Information subset for posterior inference |
| Population size | 20 | Breadth of exploration in evolutionary search |
| Crossover pairs | 10 | Number of recombination operations |
| Mutation noise | ≈0.01 | Magnitude of local perturbation |
| KL weight ($\beta$) | 1.0 | Strength of posterior regularization |
| Learning rate | — | Optimization stability & speed |
| Policy network arch. | 3-layer MLP + FiLM (RL), MLPs (NP) | Prevents network from ignoring conditioning |
A generic workflow involves initializing the buffer (with demonstrations or random data), alternately performing supervised training epochs and evolutionary population search and selection, and updating parameters via Adam or similar optimizers. For LLM-based RCTP, the protocol consists of two stages: supervised fine-tuning on reward-labeled mixed-quality data, followed by downstream reward-conditioned RL (Zhong et al., 3 Feb 2026).
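For the LLM variant, the reward-conditioning step amounts to prepending a discrete reward token to the dialogue context before fine-tuning or generation. The bucketing scheme and token strings below are purely illustrative assumptions, not the vocabulary used by Zhong et al.:

```python
def reward_token(ret, threshold=0.5):
    """Map a scalar trajectory return to a discrete reward token.

    The two-bucket scheme and the token strings are illustrative
    assumptions, not the tokens from Zhong et al.
    """
    return "<|reward_high|>" if ret >= threshold else "<|reward_low|>"

def condition_dialogue(dialogue, ret):
    """Prepend the reward token so every subsequently generated token
    is conditioned on the desired behavioral mode."""
    return reward_token(ret) + "\n" + dialogue

prompt = condition_dialogue("User: book a flight\nAssistant:", ret=1.0)
```

At deployment time one simply conditions on the high-reward token to elicit the high-quality behavioral mode.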
5. Variants, Extensions, and Empirical Evaluation
Multiple variants of RCTP exist, differing in the nature of the conditioning variable and policy parameterization (Kumar et al., 2019):
- Return-Conditioned (RCP-R): The conditioning variable is the empirical return-to-go $R_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$.
- Advantage-Conditioned (RCP-A): Condition on the advantage $A_t = R_t - V(s_t)$, computed using a fitted value function $V$. Empirically, RCP-A exhibits faster learning and superior final returns.
- Multiplicative conditioning: “FiLM”-style multiplicative modulation is superior to simple concatenation of state and reward input.
- Weighting schemes: Exponential sample reweighting can accelerate return improvement at the cost of variance.
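The two conditioning variables differ only in what is computed from buffer rewards. A short sketch of both, where `value_fn` is a placeholder for the fitted value function:

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """RCP-R conditioning: discounted return-to-go R_t at every timestep."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def advantages(rewards, states, value_fn, gamma=0.99):
    """RCP-A conditioning: A_t = R_t - V(s_t) under a fitted value function."""
    rtg = returns_to_go(rewards, gamma)
    return rtg - np.array([value_fn(s) for s in states])
```

Swapping one function for the other in the training loop is the entire difference between the RCP-R and RCP-A variants.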
RCTP has demonstrated competitive or superior sample efficiency compared to strong baselines. In continuous control (MuJoCo, LunarLander), RCP-A outperformed TRPO/PPO, and—with exponential weighting—matched the performance of Advantage-Weighted Regression (AWR) (Kumar et al., 2019). In robot trajectory generation, combining population-based search with RCTP provided >5x sample efficiency improvement over REPS and DREPS in sparse-reward settings, and achieved >95% success on real-robot obstacle avoidance with UR10 within 200 trials (Akbulut et al., 2020). In LLM tool-calling, reward-conditioned fine-tuned policies gave large accuracy gains over standard SFT and PPO pipelines, with ablations confirming the necessity of reward-conditioning in both pretraining and RL phases (Zhong et al., 3 Feb 2026).
6. Application Domains and Real-World Considerations
RCTP frameworks have proven adaptable across robotics, control, and LLM domains:
- Robots & Movement Primitives: RCTP enables robots to form complex, high-reward trajectories even in discontinuous or sparse-reward spaces, supporting multimodal trajectory generation and sample-efficient coverage of trajectory manifolds (Akbulut et al., 2020).
- General Policy Learning: By converting RL into regression problems, RCTP facilitates stable, scalable, and off-policy learning, reducing variance and hyperparameter sensitivity (Kumar et al., 2019).
- LLMs and Tool Use: Discrete reward tokens enable LLMs to generate high- or low-quality behavioral modes on demand, ensuring informative group-normalized advantage signals and robust training under sparse-reward multi-turn tool use (Zhong et al., 3 Feb 2026).
Additional practical advantages include robust use of sub-optimal trajectories, efficient reuse of past experience, and the principled integration of exploration via population dynamics and reward-conditioned generative priors.
7. Empirical Insights, Limitations, and Directions
Experiments consistently find that RCTP’s reward conditioning yields tight alignment between conditioned and realized returns, high sample efficiency, and stability superior to baseline policy-gradient and weighted-regression algorithms. Evolutionary operators—especially latent space crossover and trajectory-space mutation—enable rapid expansion of the high-reward region and composition of sub-skills without hand-engineering task decompositions (Akbulut et al., 2020).
A plausible implication is that RCTP can generalize across disparate tasks wherever a supervised regression formulation with reward-side information is viable, and that hybridization with population-based search effectively offsets the limitations of finite demonstration data and the difficulty of credit assignment in RL.
Limitations include lagging ultimate performance behind the most optimized actor-critic methods (e.g., SAC in certain MuJoCo tasks) and potential degradation if buffer sizes or weighting are not appropriately tuned (Kumar et al., 2019). Nevertheless, RCTP’s stability, generality, and interpretability continue to motivate ongoing research and deployment in both academic and applied settings.