
Reinforcement Learning from Execution-Based Rewards

Updated 23 January 2026
  • Reinforcement Learning from Execution-Based Rewards is a paradigm that leverages aggregate, trajectory-level feedback instead of dense per-action signals.
  • It formalizes reward feedback through frameworks like the Bagged Reward MDP and reward automata to ensure policy equivalence with standard RL formulations.
  • Innovative approaches such as Reward Bag Transformers and automata-based inference have demonstrated strong empirical performance in robotics, code synthesis, and long-horizon planning tasks.

Reinforcement learning from execution-based rewards refers to a broad class of RL paradigms in which the learning agent does not receive dense, immediate reward feedback for each action, but rather only receives signals at the level of entire executions, partial trajectories, bags of transitions, or outcomes evaluated on completion. These settings arise naturally in domains where reward attribution is unavailable or prohibitively sparse, including robotics, program synthesis, code generation, optimal execution in finance, and multi-turn interaction tasks. The field has developed novel theoretical foundations, general methodologies, and pragmatic algorithms to overcome the challenges posed by the lack of granular reward supervision.

1. Formal Frameworks for Execution-Based Reward RL

Execution-based rewards are typically formalized by extending or modifying the standard Markov Decision Process (MDP) formulation. A key framework is the Bagged Reward Markov Decision Process (BRMDP), defined as follows (Tang et al., 2024):

  • Standard MDP: $M = (S, A, P, r, \mu)$, where $S$ is the state space, $A$ the action space, $P$ the transition kernel, $r: S \times A \to \mathbb{R}$ the immediate reward (unobservable in this setting), and $\mu$ the initial state distribution.
  • Bag Partition: A trajectory $\tau = \{(s_0, a_0), \ldots, (s_{T-1}, a_{T-1})\}$ is partitioned into a sequence of contiguous segments (bags) $B_{i, n_i}$, via a partition function $G: \tau \mapsto \{B\}$.
  • Bagged Reward Function: $R(B_{i, n_i}) = \sum_{t=i}^{i+n_i-1} r(s_t, a_t)$.
  • BRMDP: $M_B = (S, A, P, R, \mu)$, with $R$ defined only over bags.

The agent never observes the individual per-action rewards $r(s_t, a_t)$, but only the total reward over bags (or entire trajectories). Other approaches formalize outcome-based or non-Markovian reward models using finite-state reward automata (Xu et al., 2020) or reward machines (Parac et al., 2024), which map execution traces to scalar rewards.
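The bag partition and bagged reward function above can be sketched directly. This is a toy, self-contained example (the trajectory and bag sizes are made up); it only shows how per-step rewards, invisible to the agent, collapse into the observable bag-level signal:

```python
from typing import List

def bag_rewards(step_rewards: List[float], bag_sizes: List[int]) -> List[float]:
    """Aggregate unobservable per-step rewards r(s_t, a_t) into bag-level
    rewards R(B_i) by summing over each contiguous segment of the trajectory."""
    assert sum(bag_sizes) == len(step_rewards), "bags must partition the trajectory"
    out, t = [], 0
    for n in bag_sizes:
        out.append(sum(step_rewards[t:t + n]))
        t += n
    return out

# A length-6 trajectory partitioned into bags of sizes 2, 3, 1:
print(bag_rewards([1.0, 0.0, 0.5, 0.5, 1.0, 2.0], [2, 3, 1]))  # [1.0, 2.0, 2.0]
```

The agent sees only the three bag totals; recovering usable per-step signals from them is the credit-assignment problem the rest of the article addresses.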

The principal RL objective remains the maximization of expected cumulative reward, where the expectation is now taken over the sum of bag or trajectory-level rewards.

2. Theoretical Guarantees and Policy Optimality

A foundational theoretical result underpins the feasibility of learning from execution-based rewards: optimal policy equivalence. In the BRMDP framework, the set of optimal policies in the original MDP coincides with that in the BRMDP (Tang et al., 2024). Specifically,

  • Let $\Pi$ be the set of optimal policies for the original MDP, and $\Pi_B$ for the BRMDP.
  • Then, under standard exploration assumptions, $\Pi = \Pi_B$.

The crucial insight is that as long as the sum of redistributed per-step rewards over each bag matches the observed bag reward, any such redistribution yields an equivalent MDP for policy optimization:

$$\forall B_{i, n_i}: \quad \sum_{t=i}^{i+n_i-1} \tilde{r}(s_t, a_t) = R(B_{i, n_i}).$$

Therefore, the RL problem can, in principle, be recast as standard RL using any consistent reward decomposition.
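The consistency condition can be checked concretely. The sketch below uses uniform spreading, which is only one of infinitely many valid redistributions (the data continues the toy trajectory used throughout, not any benchmark):

```python
def uniform_redistribution(bag_reward: float, bag_size: int) -> list:
    """One consistent choice of redistributed reward r~: spread the observed
    bag reward R(B) evenly over the bag's transitions."""
    return [bag_reward / bag_size] * bag_size

observed_bag_rewards = [1.0, 2.0, 2.0]
bag_sizes = [2, 3, 1]
redistributed = [r for R, n in zip(observed_bag_rewards, bag_sizes)
                 for r in uniform_redistribution(R, n)]

# Each bag's redistributed rewards sum back to the observed bag reward,
# so the total trajectory return is preserved exactly.
assert abs(sum(redistributed) - sum(observed_bag_rewards)) < 1e-9
```

Any redistribution satisfying the per-bag sum constraint preserves trajectory returns, and hence the optimal policy set; the redistributions differ only in how well they guide credit assignment during learning.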

A similar guarantee holds in the automaton and reward-machine formalisms. If the non-Markovian reward function is encoded by a finite-state reward automaton, model-based inference (e.g., Angluin's L* algorithm) is guaranteed to recover a minimal automaton that, when combined with the MDP via a product construction, yields a Markovian reward in the augmented state space (Xu et al., 2020). Q-learning on the synchronous product converges to the optimal policy.
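A toy reward machine makes the product construction concrete. The two-state machine and its labels below are hypothetical (not taken from the cited papers); it pays a reward of 1.0 only the first time the label "goal" is observed in an episode, a history-dependent rule that becomes Markovian once the machine state is tracked alongside the environment state:

```python
class RewardMachine:
    """Minimal finite-state reward machine: transitions[(u, label)] gives the
    next machine state and the reward emitted on that transition."""
    def __init__(self):
        self.transitions = {
            (0, "goal"): (1, 1.0),   # first goal visit pays out
            (0, "other"): (0, 0.0),
            (1, "goal"): (1, 0.0),   # repeat visits pay nothing
            (1, "other"): (1, 0.0),
        }

    def step(self, u, label):
        return self.transitions[(u, label)]

rm = RewardMachine()
u, total = 0, 0.0
for label in ["other", "goal", "goal", "other"]:
    # In the product MDP the state is (env_state, u); the reward below is a
    # Markovian function of that augmented state.
    u, r = rm.step(u, label)
    total += r
print(total)  # 1.0
```

Q-learning over the augmented state space $(s, u)$ then proceeds exactly as in a standard MDP, which is why convergence guarantees carry over.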

3. Algorithmic Methodologies for Execution-Based RL

Approaches to RL from execution-based rewards can be classified according to how they recover or infer per-step learning signals.

3.1 Reward Redistribution

The Reward Bag Transformer (RBT) is a parametric approach that learns to redistribute bag-level rewards over individual transitions using a Transformer-based model (Tang et al., 2024). RBT operates as follows:

  • Inputs: A bag of contiguous state-action pairs, $\sigma = [(s_0, a_0), \ldots, (s_{M-1}, a_{M-1})]$.
  • Transformer Encoding: Apply a causal encoder, followed by a bidirectional attention layer, yielding embeddings for each instance.
  • Reward Prediction: A head computes per-instance rewards $\hat{r}_t$ using attention across the bag, ensuring context and temporal nuance are captured.
  • Training Objectives: Minimize bag-level reward prediction loss and, optionally, next-state prediction loss for each transition.
  • The reward model is iteratively updated from rollouts, after which all transitions in the agent's replay buffer are relabeled with the predicted $\hat{r}_t$. Standard RL algorithms (e.g., SAC) are then applied to the relabeled data.

Ablation studies indicate that both bidirectional attention and joint modeling of transition dynamics are critical for accuracy, especially as bag lengths increase.
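The core fit-then-relabel objective can be sketched with a linear stand-in for the reward model (this is not the RBT's Transformer architecture, and all features and rewards below are synthetic): the model is trained so that its predicted per-step rewards sum to the observed bag rewards, then every transition is relabeled with its prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 3
feats = rng.normal(size=(T, d))        # per-step (s, a) features (synthetic)
true_w = np.array([1.0, -0.5, 2.0])    # unobservable ground-truth reward weights
bag_sizes = [2, 3, 1]

# Bag-level design matrix: each row sums the feature rows inside one bag,
# so A @ w gives the model's predicted bag rewards.
rows, t = [], 0
for n in bag_sizes:
    rows.append(feats[t:t + n].sum(axis=0))
    t += n
A = np.stack(rows)
R = A @ true_w                          # observed bag-level rewards

# Fit the reward model to reproduce the bag sums (least squares here,
# a Transformer with bidirectional attention in the actual RBT).
w_hat, *_ = np.linalg.lstsq(A, R, rcond=None)
r_hat = feats @ w_hat                   # relabel each transition with r_hat
assert np.allclose(A @ w_hat, R)        # bag-sum consistency holds
```

After relabeling, an off-the-shelf algorithm such as SAC trains on `(s, a, r_hat, s')` tuples as if the rewards were dense.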

3.2 Learning Reward Automata

In domains with inherently non-Markovian, history-dependent rewards, an alternative is to explicitly infer a finite-state machine representing the reward logic (Xu et al., 2020, Parac et al., 2024). The key steps:

  • Induce a reward automaton via active learning (e.g., L* algorithm based on membership and equivalence queries), or via inductive logic programming techniques even from noisy traces.
  • Encode episodic or trajectory-level feedback into the automaton structure.
  • Construct the synchronous product with the environment, yielding a Markovian RL problem in the expanded state space.

This methodology provides interpretability, can generalize to noisy or partial labeling, and leverages temporal abstraction.

3.3 Outcome-Based, Model-Based, and Adversarial Reward Learning

  • Outcome-driven RL via variational inference directly models the probability of reaching a desired goal state and derives an execution-based reward as $\log p_\psi(g \mid s, a)$, with dense shaping adaptively provided by the learned dynamics (Rudner et al., 2021).
  • Observation-based internal modeling leverages expert trajectories to fit a next-state predictor and defines the reward as the negative prediction error between the model's output and the observed $s_{t+1}$, supplying a shaped signal for exploration and alignment to demonstrated behavior (Kimura et al., 2018).
  • Adversarial reward modeling constructs a binary classifier to discern successful goal states given instructions, training the agent with the classifier's outputs as step rewards, thus decoupling reward provision from the environment (Bahdanau et al., 2018).
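The outcome-driven idea can be illustrated with a toy 1-D stand-in for the learned reachability model (the function below is hypothetical, not the variational model of the cited work): scoring each step by the log-probability of reaching the goal turns a terminal outcome into a dense shaped signal.

```python
import math

def goal_reach_prob(state: float, goal: float) -> float:
    """Stand-in for a learned model p_psi(g | s, a): reachability decays
    exponentially with distance to the goal in this toy 1-D setting."""
    return math.exp(-abs(goal - state))

def outcome_reward(state: float, goal: float) -> float:
    """Execution-based reward log p(g | s, a); here it equals -|goal - state|."""
    return math.log(goal_reach_prob(state, goal))

# Moving toward the goal strictly increases the shaped reward:
assert outcome_reward(2.0, 5.0) < outcome_reward(4.0, 5.0)
```

An adversarial variant replaces `goal_reach_prob` with a trained discriminator over goal states, but the shaping mechanism is the same: the learned model, not the environment, supplies the per-step signal.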

3.4 Domain-Specific Execution-Based Reward Integration

In practical RL for code (e.g., program synthesis or code repair), outcome-based rewards are extracted via test-case execution, while dense and structured feedback is introduced at the level of variable trajectory alignment (Jiang et al., 21 Oct 2025). In long-horizon planning tasks (e.g., multi-step software engineering in dockerized sandboxes), gated reward accumulation aggregates stepwise rewards only when the outcome (terminal execution) meets a predefined threshold, preventing policy collapse due to reward hacking (Sun et al., 14 Aug 2025).
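The gating logic can be sketched as follows (function name and threshold are hypothetical, and this simplifies the cited method to its core idea): stepwise rewards are credited only when the terminal execution outcome clears a threshold, so inflating intermediate rewards cannot pay off on failed executions.

```python
from typing import List

def gated_return(step_rewards: List[float], outcome_score: float,
                 threshold: float = 0.5) -> float:
    """Accumulate stepwise rewards only when the terminal outcome passes the
    gate; otherwise the trajectory earns nothing, blocking reward hacking."""
    if outcome_score >= threshold:
        return outcome_score + sum(step_rewards)
    return 0.0

# Good-looking intermediate steps with a failing final execution earn nothing:
assert gated_return([0.2, 0.3, 0.4], outcome_score=0.1) == 0.0
# With a passing execution, the stepwise rewards are credited in full:
assert gated_return([0.2, 0.3, 0.4], outcome_score=0.9) == 1.8
```

The gate decouples dense shaping (useful for credit assignment) from the verifiable terminal outcome (the only signal the policy is ultimately allowed to exploit).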

4. Empirical Evidence and Comparative Performance

Empirical results across diverse environments demonstrate that principled reward redistribution and reward model learning are essential for robust performance under execution-based feedback.

  • In continuous-control MuJoCo tasks with increasing bag length, direct application of standard RL algorithms to sparse bag rewards results in collapse or instability. RBT maintains near-oracle performance up to large bag sizes (e.g., bag length 500), outperforming baselines such as IRCR, RRD, LIRPG, and HC (Tang et al., 2024).
  • In grid-world and Minecraft tasks with non-Markovian rewards, active automaton inference accelerates convergence (50-90% fewer episodes) over logic-inference and policy-optimization baselines (Xu et al., 2020).
  • Code generation tasks using CodeRL+ show pass@1 rates of 90.9% on HumanEval and large gains across code reasoning and test output tasks, with ablation confirming the critical role of execution-trajectory alignment (Jiang et al., 21 Oct 2025).
  • Gated reward accumulation in long-horizon code editing raises completion rates from 47.6% to 93.8% and modification rates from 19.6% to 22.4%, outperforming direct reward accumulation and avoiding policy collapse (Sun et al., 14 Aug 2025).
  • Internal-model-based reward shaping achieved rapid convergence and higher stability compared to sparse or hand-crafted rewards in pixel-based and continuous-action domains (Kimura et al., 2018).
  • Reward machine inference from noisy traces using probabilistic ILP yields agent performance on par with perfect RM baselines, robust even under high label noise (Parac et al., 2024).

5. Challenges, Limitations, and Open Problems

Despite progress, execution-based reward RL remains challenging:

  • Identifiability and Credit Assignment: Reward redistribution under long or variable-length bags reduces informational granularity and impedes accurate attribution (Tang et al., 2024).
  • Model Dependence: Approaches such as outcome-driven inference, internal modeling, and adversarial reward shaping rely on accurate, generalizable models or discriminators. Sparse expert data, narrow bag coverage, or model misspecification degrade reward quality.
  • Computational Complexity: Transformer-based redistribution and automaton inference (especially under noise) introduce significant computation and data requirements.
  • Exploration: Execution-based signals are inherently sparse—exploration mechanisms must bridge potentially large gaps between observed signals.
  • Assumptions on Bagging: Most methods assume a known or task-defined bag partition function. Joint learning of partition and reward redistribution is an open problem (Tang et al., 2024).
  • Robustness to Noise and Non-IID Feedback: Automaton and reward machine methods have begun addressing this via Bayesian, belief-tracking, and ILP frameworks, but general robustness remains an area of active research (Parac et al., 2024).

6. Emerging Applications and Future Directions

  • Structured Program Synthesis and Repair: Program-level verifiable reward feedback, enhanced with variable-alignment and outcome tracking, is central to RL-driven code assistants (Jiang et al., 21 Oct 2025).
  • Long-Horizon Planning in Interactive and Multi-Agent Systems: Execution-verified and gated reward strategies stabilize learning for complex multi-turn domains such as software repositories and robotics (Sun et al., 14 Aug 2025).
  • Benchmarking and Monitoring: In finance, execution-based RL frameworks closely integrate regulatory-style benchmark comparison and retraining for continuous policy deployment (Pardo et al., 2022).

Anticipated research will focus on joint inference of bag partitions and reward mappings, higher-level temporal abstraction in reward modeling, generalization from sparse or noisy execution feedback, and theoretically grounded approaches to exploration in execution-based reward settings.

7. Summary Table: Key Algorithmic Paradigms

| Approach | Execution-Based Reward Formulation | Core Method | Principal Citation |
|---|---|---|---|
| Reward Redistribution (RBT) | Bag/trajectory-level sum | Transformer model, bidirectional attention | (Tang et al., 2024) |
| Reward Automata / Reward Machines | Non-Markovian execution traces | Automaton induction (L*, ILP), RL on product MDP | (Xu et al., 2020; Parac et al., 2024) |
| Adversarial / Outcome-Driven Reward | Discriminator or probability model over terminal/goal states | Discriminator training, variational inference | (Bahdanau et al., 2018; Rudner et al., 2021) |
| Gated Reward Accumulation | Hierarchical outcome + stepwise critics | Policy gradient with reward gating | (Sun et al., 14 Aug 2025) |
| Execution Semantics Alignment (Code) | Test-case + variable trajectory | On-policy rollouts, dual-advantage policy gradient | (Jiang et al., 21 Oct 2025) |
| Model-Based Internal Reward | Next-state prediction from expert trajectories | RNN/CNN predictive model, negative prediction error | (Kimura et al., 2018) |

This taxonomy illustrates the breadth of algorithmic innovation addressing the fundamental challenge of optimizing policies under execution-based, sparse, and temporally aggregated reward signals.
