Reward Machines in Reinforcement Learning

Updated 18 February 2026
  • Reward Machines are finite-state automata that transform non-Markovian rewards into Markovian equivalents by tracking high-level events and subgoal progress.
  • They enable reward decomposition, automated reward shaping, and transfer learning by breaking down tasks into temporally extended subgoals.
  • Integration with RL via hierarchical decomposition and counterfactual reasoning leads to significant improvements in sample efficiency and multi-agent coordination.

Reward Machines (RMs) are a formal, automata-based abstraction for encoding non-Markovian reward functions in reinforcement learning (RL) via finite-state machines whose transitions and outputs depend on high-level events or propositional atoms. Unlike traditional black-box reward functions, RMs expose the structure of the reward function, support reward decomposition at the level of temporally extended subgoals, and enable substantial improvements in sample efficiency, transfer, and lifelong learning by making the stages and logic of task progress explicit (Icarte et al., 2020, Zheng et al., 2021, Ardon et al., 2023, Ardon et al., 2024). RMs have been generalized to first-order logic, stochastic and probabilistic settings, and hierarchical and multi-agent contexts, and serve as a foundation for transfer, automated reward shaping, multi-agent decomposition, and compositional RL.

1. Formal Structure and Expressive Power

An RM is a tuple $(U, u_0, F, \delta, R)$ with:

  • $U$: finite set of internal RM states (corresponding to formulas, subgoals, or abstract progress markers),
  • $u_0 \in U$: initial RM state,
  • $F \subseteq U$: set of terminal (accepting, success, or failure) RM states,
  • $\delta: U \times \Sigma \to U$: transition function over the current RM state and a high-level label $\ell \in \Sigma = 2^P$ (where $P$ is the set of propositional atoms / high-level events),
  • $R: U \times \Sigma \to \mathbb{R}$: transition-based reward function, e.g., $R(u, \ell) = 1$ if $\delta(u, \ell) \neq u$ and $0$ otherwise, or potential-based shaping variants (Moroi et al., 2020).

RMs augment standard Markov Decision Processes (MDPs) by tracking sufficient automaton state to make any regular history-based reward Markovian. RMs subsume Markovian rewards (a one-state RM) and can encode any reward function over a regular language of event sequences, supporting temporally extended objectives such as sequences, branching, conditional logic, and arbitrary regular expressions, including all properties expressible in co-safe linear temporal logic (LTL) over propositional atoms (Icarte et al., 2020). More expressive classes, such as pushdown reward machines, extend to deterministic context-free reward languages (Varricchione et al., 9 Aug 2025).
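As a concrete illustration, the tuple above can be sketched as a small Python class. The toy "get coffee, then reach office" task, its state names, and the event-set labels are illustrative choices, not drawn from any cited paper:

```python
# Minimal reward machine for a toy "get coffee, then reach office" task.
# States: u0 = nothing done, u1 = has coffee, u2 = delivered (terminal).
class RewardMachine:
    def __init__(self, states, u0, terminal, delta, reward):
        self.states = states      # U: finite set of RM states
        self.u = self.u0 = u0     # current / initial RM state
        self.terminal = terminal  # F ⊆ U: terminal states
        self.delta = delta        # δ: (u, label) -> u'
        self.reward = reward      # R: (u, label) -> float

    def step(self, label):
        """Advance on a high-level label ℓ ⊆ P; return (u', r)."""
        r = self.reward(self.u, label)
        self.u = self.delta(self.u, label)
        return self.u, r

def delta(u, label):
    # Labels are sets of propositions observed at this environment step.
    if u == "u0" and "coffee" in label:
        return "u1"
    if u == "u1" and "office" in label:
        return "u2"
    return u

def reward(u, label):
    # +1 exactly on transitions that change the RM state (subgoal progress),
    # matching R(u, ℓ) = 1 iff δ(u, ℓ) ≠ u from the definition above.
    return 1.0 if delta(u, label) != u else 0.0

rm = RewardMachine({"u0", "u1", "u2"}, "u0", {"u2"}, delta, reward)
print(rm.step({"coffee"}))   # -> ('u1', 1.0)
print(rm.step(set()))        # -> ('u1', 0.0)  no event, no progress
print(rm.step({"office"}))   # -> ('u2', 1.0)  terminal reached
```

Running the RM alongside the environment's state, i.e., acting in the product of the MDP state and the RM state, is what renders this history-dependent reward Markovian.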

2. Synthesis and Construction of Reward Machines

2.1. Construction from Temporal Logic

For tasks specified in LTL or its extensions (e.g., Sequential LTL, SLTL), RMs are synthesized by progression: each RM state corresponds to a "progressed" subformula, and transitions encode how observed event labels update the remaining specification. The available states $U$ form the fixpoint closure under subformula decomposition and progression $\mathrm{prog}(\psi, \ell)$. For $\psi_1 \sim \psi_2$ ("do $\psi_1$ then $\psi_2$"), if $\psi_1$ progresses to true (i.e., is satisfied by the observed labels), the RM moves to $\mathrm{prog}(\psi_2, \ell)$; otherwise it continues progressing $\psi_1$ (Zheng et al., 2021). This supports flexible, automatic RM construction from high-level LTL or SLTL task specifications.
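A drastically simplified sketch of progression, assuming specifications are plain sequences of atomic events (full (S)LTL progression handles the remaining connectives); the event names are illustrative:

```python
# Specs are tuples of atoms, e.g. ("coffee", "office") = "do coffee, then office".
# RM states arise as the progressed suffixes of the original spec.
def prog(spec, label):
    """Progress a tuple-of-atoms spec on an observed label set."""
    if not spec:                 # already satisfied
        return spec
    if spec[0] in label:         # first subgoal achieved: drop it
        return prog(spec[1:], label)
    return spec                  # no progress on this step

def build_rm(spec):
    """Enumerate RM states (suffixes of the spec) and their transitions."""
    states = {spec[i:] for i in range(len(spec) + 1)}
    delta = {}
    for u in states:
        for atom in spec:
            delta[(u, atom)] = prog(u, {atom})
    return states, delta

states, delta = build_rm(("coffee", "office"))
print(sorted(states, key=len))
# -> [(), ('office',), ('coffee', 'office')]
print(delta[(("coffee", "office"), "coffee")])
# -> ('office',)
```

The fixpoint closure is visible here: progressing any reachable state yields another state already in `states`, with the empty tuple playing the role of the accepting state.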

2.2. Learning RMs from Demonstrations

Reward Machines can also be inferred from demonstration traces. Techniques include density-based clustering over high-level features learned from raw observations (e.g., ResNet-50 visual state embeddings clustered with DBSCAN) to identify candidate RM states (prototypes), followed by construction of a minimal automaton that encodes the transitions observed in the demonstration traces (Baert et al., 2024). Without prior knowledge of explicit event labels, unsupervised clustering and inductive logic programming (ILP) frameworks (e.g., ILASP) can synthesize minimal, trace-consistent RMs (Icarte et al., 2021, Parac et al., 2024).
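A crude sketch of the trace-to-automaton step, with hand-written event labels standing in for the learned visual features and clustering of the cited pipelines; the abstraction used here (RM states as sets of events achieved so far) is one simple trace-consistent choice, not the minimal RM in general:

```python
# Toy demonstration traces as sequences of high-level events.
traces = [
    ["coffee", "office"],
    ["coffee", "mail", "office"],
]

def rm_from_traces(traces):
    """Build an automaton whose states are sets of events achieved so far."""
    delta = {}
    u0 = frozenset()
    for trace in traces:
        u = u0
        for event in trace:
            v = u | {event}          # achieving `event` advances the RM
            delta[(u, event)] = v
            u = v
    return u0, delta

u0, delta = rm_from_traces(traces)
n_states = len({v for v in delta.values()} | {u0})
print(n_states)  # -> 5
```

A real learner would additionally merge behaviorally equivalent states to obtain a minimal automaton; here both demonstrations already share the prefix state after "coffee".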

2.3. Automatic Extension and Transfer

Given a growing sequence of tasks, procedures such as ExtendRM($\varphi$) incrementally add to the “memory RM” all new subformulas and transitions needed for a new specification $\varphi$, reusing or composing policies for already-learned sub-RMs (Zheng et al., 2021). This mechanism supports the modular accumulation of reusable skills, state graphs, and Q-functions over the agent’s lifetime.

3. Exploiting RM Structure in Reinforcement Learning

RMs expose information that can be used for:

  • Reward shaping: Design of potential-based shaped rewards $R'(u, \ell) = R(u, \ell) + \gamma \Phi(\delta(u, \ell)) - \Phi(u)$, where $\Phi$ encodes value-to-go in the RM state graph (computed by value iteration); this shaping provably preserves policy optimality (Icarte et al., 2020, Camacho et al., 2020, Azran et al., 2023).
  • Task decomposition: Decomposition of a global RM into options, subtasks, or local RMs enables hierarchical RL, parallel learning of constituent skills, counterfactual off-policy learning, and reuse of learned skills across tasks (Icarte et al., 2020, Zheng et al., 2021).
  • Counterfactual reasoning: For each real experience $(s, u, a, r, s', u')$, synthetic experiences are generated for hypothetical RM states $\bar{u}$, sharing each environment transition across all RM states and thereby greatly improving data efficiency (Icarte et al., 2020).
  • Multi-agent coordination: Team-level RMs can be decomposed into agent-specific RMs via event-set projections and automata equivalence (bisimulation), permitting decentralized learning with strong value function bounds and guaranteed compositionality (Neary et al., 2020).
  • Deep RL integration: RM-state is concatenated with raw MDP inputs (e.g., as one-hot or semantic embedding), allowing policy networks to disentangle subtask planning from low-level control, dramatically accelerating convergence and improving sample efficiency (Camacho et al., 2020, Castanyer et al., 16 Oct 2025). Automatic and language-grounded RM generation pipelines leverage foundation models to translate natural-language task specifications into executable RMs (Castanyer et al., 16 Oct 2025).
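The shaping item above can be made concrete: run value iteration over the RM's own (small, known) transition graph to obtain the potential $\Phi$, then shape each reward. The two-subgoal task, state names, and discount value below are illustrative:

```python
# Potential-based shaping over a toy two-step RM:
#   R'(u, ℓ) = R(u, ℓ) + γ·Φ(δ(u, ℓ)) − Φ(u),  Φ from value iteration.
GAMMA = 0.9
STATES = ["u0", "u1", "u2"]                       # u2 is terminal
DELTA = {("u0", "coffee"): "u1", ("u1", "office"): "u2"}
REWARD = {("u1", "office"): 1.0}                  # reward only at completion
LABELS = ["coffee", "office", "none"]

def vi_potential(n_iters=100):
    """Value iteration on the RM graph, treating labels as 'actions'."""
    phi = {u: 0.0 for u in STATES}
    for _ in range(n_iters):
        for u in STATES:
            if u == "u2":
                continue                          # terminal: Φ stays 0
            phi[u] = max(
                REWARD.get((u, l), 0.0) + GAMMA * phi[DELTA.get((u, l), u)]
                for l in LABELS                   # unlisted labels self-loop
            )
    return phi

phi = vi_potential()                              # Φ(u0)=0.9, Φ(u1)=1.0

def shaped_reward(u, label):
    v = DELTA.get((u, label), u)
    return REWARD.get((u, label), 0.0) + GAMMA * phi[v] - phi[u]

print(shaped_reward("u0", "coffee"))  # 0.0: progress offsets potential drop
print(shaped_reward("u0", "none"))    # negative: stalling is penalized
```

With the exact value-to-go as potential, shaped rewards along the optimal path telescope to zero while off-path steps become negative, which is the dense signal that accelerates learning without changing the optimal policy.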

4. Extensions: Expressiveness, Robustness, and Multi-Agent Generalizations

  • First-order RMs (FORMs) encode transitions using full first-order logic (supporting existential/universal quantifiers over object-centric domains), greatly compacting representations for relational tasks such as “visit all yellow objects” (previously exponential in the number of objects, collapsing to constant size with quantifiers). FORMs support logical ILP-based learning and multi-agent exploitation, where policy decomposition aligns with subtask automaton states (Ardon et al., 2024).
  • Stochastic RMs assign cumulative distribution functions to transitions, supporting non-deterministic, noise-tolerant rewards, with convergence in expectation to optimal policies under an equivalence guarantee (Corazza et al., 16 Oct 2025).
  • Robust learning from noise: Probabilistic ILP and belief-updated reward shaping (PROB-IRM) enable RM learning and exploitation even under noisy or inconsistent event labeling, with Bayesian update of automaton state belief and shaping over belief distributions (Parac et al., 2024).
  • Pushdown RMs and beyond: Pushdown reward machines (pdRMs) use a stack to encode context-free temporal relations, supporting reward specifications outside the regular language class, with well-characterized policy complexity bounds and modular exploitation (Varricchione et al., 9 Aug 2025).
  • Hierarchies and coupling: Hierarchical reward machines (HRMs) and coupled RMs enable the compact representation of extremely long-horizon, compositional, and unordered subtasks, permitting exponential reductions in automaton state space and linear sample efficiency scaling for highly compositional domains (Furelos-Blanco et al., 2022, Zheng et al., 2024, Levina et al., 31 Oct 2025).

5. Applications, Empirical Findings, and Impact

RMs have been demonstrated to produce substantial gains in sample efficiency, transfer learning, and scalability across various domains:

  • Vision-based robotics: In pick-and-place manipulation, the RM reward signal and automaton-state input together yield $10$–$50\times$ improvements in success rate and policy convergence versus unstructured DQN (Camacho et al., 2020, Baert et al., 2024).
  • Lifelong and transfer RL: Modular augmentation and reuse of RMs enable rapid adaptation to new logical-task specifications, outperforming from-scratch RL by exploiting subtask decomposition and shaping (Zheng et al., 2021, Azran et al., 2023). Automatic pre-planning over RM graphs accelerates contextual transfer in deep RL, cutting time-to-threshold by $20$–$40\%$ (Azran et al., 2023).
  • Cooperative and decentralized multi-agent systems: RM-based agent reward decomposition results in an order-of-magnitude faster convergence and improved policy scalability in multi-agent rendezvous, buttons, and navigation benchmarks, outperforming centralized and hierarchical Q-learning baselines (Neary et al., 2020, Ardon et al., 2023, Zheng et al., 2024, Hu et al., 2021). Hierarchical/coupled RMs address tasks that are intractable for flat automata by exploiting subtask synchronization and modular policy learning (Furelos-Blanco et al., 2022, Zheng et al., 2024, Levina et al., 31 Oct 2025).
  • Automated reward design and language grounding: Foundation model pipelines (ARM-FM) automatically generate RMs from natural-language objectives and induce language-grounded state embeddings, supporting multi-task/zero-shot generalization across MiniGrid, Craftium, and MetaWorld robotic domains (Castanyer et al., 16 Oct 2025).
  • Plan synthesis and flexibility: Maximally permissive RMs, synthesized from all partial-order plans, yield strictly higher expected return than single-plan or single-recipe RMs and remove rigidity in plan adherence, at the cost of increased automaton size and planning cost (Varricchione et al., 2024).

6. Limitations, Current Research, and Future Directions

Key limitations include the exponential blow-up in RM state size for domains with many unordered subgoals (mitigated by numeric, agenda, and coupled RMs), the challenge of reward specification and proposition extraction in unstructured domains (partially addressed by learning from demonstration and foundation models), and the computational cost in hierarchical or pushdown automata construction. Ongoing research directions, as evidenced in recent work, focus on:

  • Extending RM expressivity to first-order (and richer) logics for relational and compositional environments (Ardon et al., 2024).
  • Learning RMs robustly under noisy, partial, or ambiguous observation and integrating with visual perception (Parac et al., 2024, Baert et al., 2024).
  • Hierarchical and modular automaton composition for complex, sparse-reward, and long-horizon domains (Furelos-Blanco et al., 2022, Zheng et al., 2024).
  • Automated natural language to reward automaton pipelines and language-conditioned skill transfer (Castanyer et al., 16 Oct 2025).
  • Pushdown and context-free reward formalisms for tasks demanding rich history dependence (Varricchione et al., 9 Aug 2025).

Theoretical interests include formal analysis of sample complexity under various RM generalizations, optimal policy computation over coupled RM products, automated discovery of propositions/events from raw input, and scaling inference in ILP frameworks.
