
Reward Machines in Reinforcement Learning

Updated 9 February 2026
  • Reward Machines are finite-state automata that encode temporally extended, non-Markovian reward functions to decompose complex tasks.
  • They integrate logical event tracking and potential-based reward shaping to convert sparse rewards into informative, dense signals for efficient learning.
  • Extensions like hierarchical, stochastic, and pushdown reward machines enhance modular policy design and enable scalable multi-agent and automated synthesis applications.

A reward machine (RM) is a finite-state automaton designed to encode non-Markovian or temporally extended reward functions in reinforcement learning (RL). By explicitly representing high-level task structure using abstract states and transitions triggered by logical events or observations, RMs allow RL agents to decompose complex objectives into subproblems, utilize dense and shaped rewards for efficient learning, and facilitate modular policy design and transfer across tasks. The reward machine formalism supports a wide range of extensions—including automatic inference from demonstration or policy traces, stacking into hierarchies, accommodation of numeric or stochastic rewards, and automated synthesis via language or planning—which position it as a central tool for the specification, learning, and decomposition of structured RL tasks (Icarte et al., 2020, Camacho et al., 2020, Varricchione et al., 9 Aug 2025, Castanyer et al., 16 Oct 2025, Levina et al., 2024, Hu et al., 2021).

1. Formal Definition and Semantics

A reward machine is conventionally formulated as a Mealy-style finite-state machine. Given an underlying MDP $M = \langle S, s_0, A, T, r, \gamma \rangle$, an RM is specified as

$$\mathcal{R} = (U, u_0, \Sigma, \delta, \rho)$$

where

  • $U$ is a finite set of abstract or automaton states (subtasks, task stages, or logical modes)
  • $u_0 \in U$ is the initial RM state
  • $\Sigma$ is a finite alphabet of observations, typically subsets of atomic propositions ($2^{AP}$), labels, or symbolic events extracted from the MDP state
  • $\delta: U \times \Sigma \to U$ is the deterministic state transition function
  • $\rho: U \times \Sigma \to \mathbb{R}$ (or $\mathbb{R}^n$ in the multi-agent setting) is the reward-output function

On each environment transition $(s, a, s')$, the RM observes a label $\sigma = L(s, a, s') \in \Sigma$ (where $L$ is a labeling function), updates $u' = \delta(u, \sigma)$, and emits reward $r = \rho(u, \sigma)$. This effectively augments the MDP with a product state space $S \times U$, transforming a non-Markovian reward structure into a Markovian one over the joint state, and enabling standard RL algorithms to be applied directly (Camacho et al., 2020, Icarte et al., 2020, Levina et al., 2024).
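The update-and-emit cycle above can be sketched with a minimal dictionary-encoded RM. The "coffee then office" task and its propositions are illustrative, not drawn from any particular paper:

```python
from dataclasses import dataclass

# A minimal Mealy-style reward machine (U, u0, Sigma, delta, rho):
# delta maps (u, sigma) -> u', rho maps (u, sigma) -> reward.

@dataclass
class RewardMachine:
    u0: str
    delta: dict   # (u, sigma) -> next RM state
    rho: dict     # (u, sigma) -> emitted reward
    u: str = None

    def reset(self):
        self.u = self.u0
        return self.u

    def step(self, sigma):
        """Consume one label, update the RM state, emit a reward."""
        key = (self.u, sigma)
        u_next = self.delta.get(key, self.u)   # self-loop on unlisted labels
        r = self.rho.get(key, 0.0)
        self.u = u_next
        return u_next, r

# Task: first observe "coffee", then "office" -> reward 1.
rm = RewardMachine(
    u0="u0",
    delta={("u0", "coffee"): "u1", ("u1", "office"): "u2"},
    rho={("u1", "office"): 1.0},
)
rm.reset()
print(rm.step("office"))   # ("u0", 0.0): "office" before "coffee" makes no progress
print(rm.step("coffee"))   # ("u1", 0.0)
print(rm.step("office"))   # ("u2", 1.0): temporal order satisfied
```

The self-loop default on unlisted labels is one common convention; explicit sink states are another.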

RMs can natively handle sparse, temporally extended, and logically complex tasks by tracking progression through abstract states and gating rewards on the satisfaction of both logical and temporal task requirements (Icarte et al., 2020, Furelos-Blanco et al., 2022).

2. Reward Machine Construction and Shaping

Structure and Synthesis

RM states correspond to high-level task abstractions, which may be derived from:

  • Atomic Propositions: Boolean features indicating relevant environment events (e.g., "block-on-goal", "gripper-holding-red") (Camacho et al., 2020).
  • Demonstrations: Trajectories from expert behaviour, mapped to feature traces; abstract planning graphs are constructed via clustering and event extraction, from which RM states and transitions are induced (Baert et al., 2024, Camacho et al., 2020).
  • Temporal Logic: Task specifications given in (sequential) LTL or similar formal languages can be algorithmically compiled into minimal RMs via formula progression and decomposition (Zheng et al., 2021).

Reward Shaping

Potential-based reward shaping is integrated by assigning a potential function $\mathrm{Pot}: U \to \mathbb{R}$ over RM states. The shaped reward is

$$\rho'_{u}(s, a, s') = \rho_{u}(s, a, s') + \gamma \, \mathrm{Pot}(u') - \mathrm{Pot}(u),$$

where $u' = \delta(u, \sigma)$. Optimal "zero-centered" shaping uses $\mathrm{Pot}^*(u) = \gamma^{\mathrm{dist}(u)}$, with $\mathrm{dist}(u)$ the shortest distance in the abstract planning graph to a goal node, to stabilize Q-values and training (Camacho et al., 2020, Icarte et al., 2020).
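A small sketch of this shaping scheme, computing $\mathrm{dist}(u)$ by BFS over the reversed RM graph and then applying the potential-based correction. The three-state RM and its labels are illustrative:

```python
from collections import deque

def rm_potentials(delta, goals, gamma):
    """Pot(u) = gamma ** dist(u), with dist(u) the shortest number of
    RM transitions from u to a goal state (BFS on the reversed graph).
    delta: dict (u, sigma) -> u'; goals: set of goal RM states."""
    preds = {}
    for (u, _), u_next in delta.items():
        preds.setdefault(u_next, set()).add(u)
    dist = {g: 0 for g in goals}
    frontier = deque(goals)
    while frontier:
        v = frontier.popleft()
        for p in preds.get(v, ()):
            if p not in dist:
                dist[p] = dist[v] + 1
                frontier.append(p)
    return {u: gamma ** d for u, d in dist.items()}

def shaped_reward(r, u, u_next, pot, gamma):
    """Potential-based shaping term added to the RM reward r."""
    return r + gamma * pot[u_next] - pot[u]

delta = {("u0", "coffee"): "u1", ("u1", "office"): "u2"}
pot = rm_potentials(delta, goals={"u2"}, gamma=0.9)
print(shaped_reward(0.0, "u0", "u1", pot, 0.9))   # ≈ 0.0: the correction cancels along optimal RM transitions
```

The cancellation along optimal transitions is the "zero-centered" property: only off-path moves receive a negative shaping signal.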

Automated construction and shaping often leverage only expert state sequences, not action labels, and require careful selection of atomic propositions to make goal and non-goal abstract states distinguishable (Camacho et al., 2020, Baert et al., 2024).

3. Learning and Inference with Reward Machines

RL Integration

The canonical approach is to define the product MDP $M' = (S \times U, s_0', A, T', r', \gamma)$, where

  • $T'((s,u), a, (s', u')) = T(s, a, s')$ if $u' = \delta(u, L(s, a, s'))$, and $0$ otherwise
  • $r'((s,u), a, (s', u')) = r(s, a, s') + \rho(u, L(s, a, s'))$

A Q-network or Q-table is maintained over $(s, u)$, and training proceeds as in classical RL. Policy architectures often concatenate visual or state features with a one-hot RM state encoding (Camacho et al., 2020, Icarte et al., 2020). Off-policy techniques such as counterfactual reasoning (CRM) leverage the known RM structure to accelerate learning by updating Q-values for all possible RM states using each transition (Icarte et al., 2020).
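The counterfactual idea can be sketched as follows: since $\delta$ and $\rho$ are known, one real transition labelled $\sigma$ yields a synthetic experience for every RM state. The dictionary RM encoding and labels here are illustrative:

```python
def counterfactual_experiences(rm_states, delta, rho, s, a, s_next, sigma):
    """One environment transition (s, a, s') with label sigma produces a
    learning tuple ((s, u), a, r, (s', u')) for every RM state u, because
    the RM dynamics delta/rho are known exactly."""
    batch = []
    for u in rm_states:
        u_next = delta.get((u, sigma), u)   # self-loop on unlisted labels
        r = rho.get((u, sigma), 0.0)
        batch.append(((s, u), a, r, (s_next, u_next)))
    return batch

delta = {("u0", "coffee"): "u1", ("u1", "office"): "u2"}
rho = {("u1", "office"): 1.0}
# A single real transition labelled "office" updates all three RM states:
for exp in counterfactual_experiences(["u0", "u1", "u2"], delta, rho,
                                      s=3, a="move", s_next=4, sigma="office"):
    print(exp)
```

Each tuple in the batch feeds the same off-policy Q-update as an ordinary transition, which is where the sample-efficiency gain comes from.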

RM Inference and Learning

When the reward function is not provided as an RM, inference is possible from:

  • Counterexamples: Iterative learning methods (JIRP) alternate between policy optimization under the current RM hypothesis and RM minimization from observed reward-label inconsistencies, converging to the true minimal RM under sufficient coverage (Xu et al., 2019).
  • Partially Observable Policies: Prefix-tree policies derived from traces suffice to reconstruct RM structure via SAT-based constraint solving, up to policy equivalence (Shehab et al., 6 Feb 2025).
  • Demonstrations: Feature clustering and event segmentation from demonstrations induce RM transitions without predefined propositions, particularly effective in vision-based robotic manipulation (Baert et al., 2024).

Theoretical bounds establish sufficient episode length and counterexample coverage for minimality and correctness, with practical heuristics for polynomial-time inference in large domains (Xu et al., 2019, Shehab et al., 6 Feb 2025).
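The counterexample test driving this iterative loop can be sketched directly: a labelled trace refutes the current hypothesis when the rewards it predicts disagree with those observed. The dictionary RM encoding and labels are illustrative:

```python
def is_counterexample(rm_hypothesis, trace):
    """Replay a labelled trace through the hypothesis RM and compare
    predicted rewards against observed ones.
    rm_hypothesis: (u0, delta, rho); trace: list of (sigma, observed_reward)."""
    u0, delta, rho = rm_hypothesis
    u = u0
    for sigma, observed in trace:
        predicted = rho.get((u, sigma), 0.0)
        if predicted != observed:
            return True                      # hypothesis must be revised
        u = delta.get((u, sigma), u)         # self-loop on unlisted labels
    return False

hyp = ("u0",
       {("u0", "coffee"): "u1", ("u1", "office"): "u2"},
       {("u1", "office"): 1.0})
print(is_counterexample(hyp, [("coffee", 0.0), ("office", 1.0)]))  # False: trace is consistent
print(is_counterexample(hyp, [("office", 1.0)]))                   # True: reward observed where none predicted
```

In JIRP, accumulated counterexamples constrain a SAT or ILP search for the smallest RM consistent with all traces seen so far.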

4. Extensions: Numeric, Stochastic, and Hierarchical Reward Machines

Numeric and Stochastic Extensions

  • Numeric Reward Machines extend classical RMs to directly handle real-valued features, enabling dense reward shaping for inherently numeric tasks (e.g., distance-to-goal). Hybrid (numeric-Boolean) RMs use Boolean proxies for numerical changes (e.g., "distance-decreased") (Levina et al., 2024).
  • Stochastic Reward Machines (SRMs) further generalize RMs to model rewards as random variables, parameterized by output-distribution families and inferred from noisy data via constraint-based learning (Corazza et al., 16 Oct 2025).

These extensions preserve the Markovization property in the product MDP and inherit RM-facilitated sample efficiency.
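A hybrid numeric-Boolean RM output can be sketched as follows: the RM state gates which real-valued feature is rewarded, while Boolean proxies such as "distance-decreased" drive transitions. The feature names and thresholds are illustrative assumptions, not from any cited paper:

```python
def numeric_rm_reward(u, features):
    """Reward depends on the active subtask u and real-valued features,
    rather than on Boolean propositions alone."""
    if u == "reach":     # subtask 1: approach the object (dense progress signal)
        return features["prev_dist"] - features["dist"]
    if u == "grasp":     # subtask 2: close the gripper on it
        return 1.0 if features["holding"] else 0.0
    return 0.0

def numeric_rm_transition(u, features):
    """Transitions fire on Boolean proxies derived from numeric features."""
    if u == "reach" and features["dist"] < 0.05:
        return "grasp"
    if u == "grasp" and features["holding"]:
        return "done"
    return u

f = {"prev_dist": 0.50, "dist": 0.40, "holding": False}
print(numeric_rm_reward("reach", f))       # ≈ 0.1 of progress toward the object
print(numeric_rm_transition("reach", f))   # still "reach": not yet within 0.05
```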

Hierarchical and Pushdown Reward Machines

  • Hierarchies of Reward Machines (HRMs): By allowing RMs to invoke other RMs as subroutines, HRMs enable the modular decomposition of deeply structured tasks. Learning and policy optimization leverages the options framework, matching the expressiveness of flat RMs with potentially exponential gains in compactness and sample efficiency (Furelos-Blanco et al., 2022, Zheng et al., 2024).
  • Pushdown Reward Machines (pdRMs): By equipping RMs with a (possibly truncated) stack, pdRMs enable the specification of deterministic context-free reward languages, admitting tasks involving nested or retracing patterns that cannot be expressed with regular-language RMs (Varricchione et al., 9 Aug 2025). Policy space complexity grows exponentially with stack depth, but in many domains only top-$k$ stack access suffices.
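A toy sketch of the pushdown idea under top-$k$ truncated stack access: "enter" events push, matching "exit" events pop, and reward fires only when the nesting fully unwinds, a balanced-bracket pattern no regular-language RM can express. Labels and the retracing task are illustrative:

```python
class PushdownRM:
    """Toy pdRM: rewards properly nested enter/exit sequences.
    Only the top-k stack symbols are consulted (truncated access)."""

    def __init__(self, k=3):
        self.k = k
        self.stack = []

    def step(self, sigma):
        if sigma.startswith("enter:"):
            self.stack.append(sigma.split(":")[1])
            return 0.0
        if sigma.startswith("exit:"):
            room = sigma.split(":")[1]
            top = self.stack[-self.k:]          # truncated top-k view
            if top and top[-1] == room:         # exit matches most recent enter
                self.stack.pop()
                return 1.0 if not self.stack else 0.0   # reward on full unwinding
        return 0.0

pdrm = PushdownRM()
rewards = [pdrm.step(e) for e in ["enter:a", "enter:b", "exit:b", "exit:a"]]
print(rewards)   # [0.0, 0.0, 0.0, 1.0]: reward only on balanced completion
```

With unbounded nesting depth a finite-state RM would need one state per depth and so cannot recognize this language exactly; the stack handles arbitrary depth with a fixed machine.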

5. Applications: Single-Agent, Multi-Agent, and Automated RM Synthesis

Single-Agent and Vision-Based RL

In vision-based robotic manipulation, RMs constructed from demonstration induce dense reward signals and guide learning over high-level subgoals, drastically improving convergence speed and policy quality compared to purely end-to-end RL (Camacho et al., 2020, Baert et al., 2024). RMs also support curriculum-guided self-paced RL, enabling rapid learning of long-horizon tasks by focusing curricula and value updates on subtask-relevant context dimensions (Koprulu et al., 2023).

Multi-Agent RL and Decentralization

In cooperative multi-agent environments, team-level RMs are decomposed (via projection and bisimulation) into local RMs under provable correctness conditions, empowering fully decentralized policy optimization while guaranteeing completion of the global task. Value bounds and sample efficiency improvements of up to an order of magnitude over centralized or independent baselines are reported (Neary et al., 2020, Hu et al., 2021, Zheng et al., 2024). Hierarchical RM structures enable efficient multi-agent learning of temporally and logically entangled subtasks (Zheng et al., 2024, Furelos-Blanco et al., 2022).
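The projection step can be sketched as a quotient construction: team-RM transitions on events outside an agent's alphabet merge their endpoint states, leaving a local RM over that agent's own events. This is a simplified sketch of the idea (the cited works add bisimulation checks and event synchronization); the two-agent event names are illustrative:

```python
def project_rm(delta, local_events):
    """Project a team RM transition table onto one agent's event set:
    states connected by non-local transitions are merged (union-find),
    and only transitions on local events survive."""
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            x = parent[x]
        return x

    # Merge endpoints of every transition the agent cannot observe.
    for (u, sigma), u_next in delta.items():
        if sigma not in local_events:
            parent[find(u)] = find(u_next)

    # Keep local-event transitions between merged state classes.
    return {(find(u), sigma): find(u_next)
            for (u, sigma), u_next in delta.items()
            if sigma in local_events}

team_delta = {("u0", "a_button"): "u1",   # agent A presses a button
              ("u1", "b_door"): "u2"}     # agent B then opens a door
print(project_rm(team_delta, {"a_button"}))
# Agent A's local view: a single transition into the merged terminal class.
```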

Automated Synthesis and Foundation Model Integration

  • Automated RM Synthesis: ARM-FM (Castanyer et al., 16 Oct 2025) employs foundation models to generate RMs from natural-language task descriptions, with generation, self-critique, and verification in multiple rounds. Event labeling functions and language-aligned state embeddings are produced, supporting sample-efficient compositional RL and zero-shot transfer.
  • Maximally Permissive RMs: Synthesis from AI planning (e.g., STRIPS domains) using the full set of partial-order plans generates RMs that encode all valid high-level behaviors, provably permitting optimal policies and avoiding the over-constraint of single-plan RMs (Varricchione et al., 2024).

Table: Representative Extensions and Synthesis Methods

| Extension/Method | Core Feature | Reference |
|---|---|---|
| Numeric Reward Machines | Real-valued features and hybrid labels | (Levina et al., 2024) |
| Stochastic Reward Machines | Output distributions (CDFs), noise handling | (Corazza et al., 16 Oct 2025) |
| Pushdown Reward Machines | Stack memory, deterministic context-free languages | (Varricchione et al., 9 Aug 2025) |
| Hierarchical RMs (HRM/MAHRM) | Compositional, options-based HRL | (Furelos-Blanco et al., 2022, Zheng et al., 2024) |
| Automated Synthesis (ARM-FM) | Foundation models, language-based generation | (Castanyer et al., 16 Oct 2025) |
| Planning-based Synthesis (MPRM) | Partial-order, maximally permissive plans | (Varricchione et al., 2024) |

6. Theoretical Properties, Sample Efficiency, and Limitations

Theoretical Properties

The product construction renders the reward Markovian: an optimal policy over the joint state $(s, u)$ is optimal for the original non-Markovian task (Icarte et al., 2020). Shaping with potentials defined over RM states is potential-based and therefore preserves the set of optimal policies (Camacho et al., 2020), and counterexample-driven inference (JIRP) provably converges to a minimal equivalent RM given sufficient episode length and trace coverage (Xu et al., 2019).

Sample Efficiency and Empirical Findings

Empirical studies report order-of-magnitude speedups in learning and performance, especially in long-horizon, sparse, or compositional RL domains (Camacho et al., 2020, Icarte et al., 2020, Baert et al., 2024, Zheng et al., 2021, Furelos-Blanco et al., 2022). Dense reward shaping, counterfactual experience, and accurate progression through abstract states enable rapid policy refinement.

Limitations

  • RM expressiveness is limited to regular reward languages unless extended (e.g., via pdRMs).
  • Inference of minimal RMs is computationally hard in general; practical methods rely on heuristics, approximations, or function approximation in large or continuous domains (Xu et al., 2019, Levina et al., 2024).
  • Symbol grounding in high-dimensional or noisy environments requires abstraction models or auxiliary inference modules, with theoretical and practical limitations under partial observability (Li et al., 2024).

The design of atomic propositions and labeling functions remains a bottleneck, although recent foundations-model-based and unsupervised approaches are mitigating this constraint in practice (Castanyer et al., 16 Oct 2025, Baert et al., 2024).

7. Future Directions and Research Outlook

Principal directions for RM methodology advancement include:

  • Automated RM synthesis from natural language via foundation models, with self-critique and verification (Castanyer et al., 16 Oct 2025).
  • Robust symbol grounding and labeling-function learning in high-dimensional, partially observable environments (Li et al., 2024, Baert et al., 2024).
  • Richer reward languages beyond regular expressiveness, including pushdown, stochastic, and numeric RMs (Varricchione et al., 9 Aug 2025, Corazza et al., 16 Oct 2025, Levina et al., 2024).
  • Scalable hierarchical and multi-agent decomposition with provable correctness guarantees (Furelos-Blanco et al., 2022, Zheng et al., 2024).

Reward machines, through their modular, automaton-based abstraction, are establishing themselves as a central substrate for scalable, interpretable, and theoretically grounded reinforcement learning in temporally and logically elaborate domains.
