
Successor Representations in Reinforcement Learning

Updated 11 November 2025
  • Successor Representations (SRs) are defined as the discounted expected visitation counts of states under a fixed policy, serving as a predictive model in RL.
  • SRs enable rapid reward revaluation and efficient transfer across tasks by decoupling value functions from reward structures.
  • Deep and feature-based extensions of SRs scale to high-dimensional spaces, facilitating intrinsic exploration and option discovery.

Successor representations (SRs) form a predictive model of the expected future occupancy of states under a fixed policy, providing a middle ground between model-based and model-free approaches to reinforcement learning (RL). The SR defines for each state (or feature) the discounted expected visitation counts of all possible successors under the current policy, enabling rapid adaptation to changed reward structures and facilitating efficient option discovery, temporal abstraction, exploration, and transfer across goals or tasks.

1. Formal Definition and Mathematical Properties

Let $\mathcal{S}$ denote the state space (finite in the tabular case), $\pi$ a stationary policy, and $\gamma\in[0,1)$ a discount factor. The canonical (tabular) SR $M^\pi: \mathcal{S}\times\mathcal{S}\to\mathbb{R}$ is defined as:

$$M^\pi(s, s') = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t\, \mathbb{1}\{S_t = s'\} \mid S_0 = s\right]$$

where $S_t$ denotes the state at time $t$ under dynamics $p$ and $\pi$, and $\mathbb{1}$ is the indicator function. $M^\pi(s, s')$ quantifies the expected discounted number of times state $s'$ will be visited in the future after starting from $s$.

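Writing $M^\pi$ in matrix form (with $P^\pi$ as the policy-induced one-step transition matrix):

$$M^\pi = \sum_{t=0}^\infty \gamma^t (P^\pi)^t = (I - \gamma P^\pi)^{-1}$$

$M^\pi$ satisfies the Bellman-style fixed-point equation:

$$M^\pi = I + \gamma P^\pi M^\pi$$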

This structure underpins several key uses:

  • The value function for any reward $r$ (vector) is $V^\pi = M^\pi r$.
  • Rapid reward revaluation: If $r$ changes but $P^\pi$ does not, update $V^\pi$ by a single matrix-vector multiplication.
  • Extensions to feature-based versions, where each state $s$ is mapped to $\phi(s)\in\mathbb{R}^d$ and successor features $\psi^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(S_t) \mid S_0 = s\right]$ are used directly for large or continuous spaces.
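
As a minimal sketch of the tabular properties above, the following NumPy snippet computes the closed-form SR for a hypothetical three-state chain, checks the fixed-point equation, and re-evaluates the value function after a reward change. The transition matrix `P`, the discount `gamma`, and both reward vectors are illustrative assumptions, not values from any cited work.

```python
import numpy as np

def successor_representation(P, gamma):
    """Closed-form tabular SR: M = (I - gamma * P)^{-1}."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

# Hypothetical policy-induced transitions: 0 -> 1 -> 2, with state 2 absorbing.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
gamma = 0.9
M = successor_representation(P, gamma)

# Residual of the Bellman-style fixed point M = I + gamma * P @ M.
bellman_residual = np.abs(M - (np.eye(3) + gamma * P @ M)).max()

# Value under an initial reward vector, then revaluation after the reward changes.
r_old = np.array([0.0, 0.0, 1.0])  # reward only in state 2
r_new = np.array([0.0, 1.0, 0.0])  # reward moved to state 1
V_old = M @ r_old
V_new = M @ r_new  # same M, one extra matrix-vector product

print(M)
print(V_old, V_new)
```

Because `M` depends only on the policy and the dynamics, moving the reward from state 2 to state 1 needs no further learning in this sketch; only the product `M @ r_new` is recomputed.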

2. SRs in Deep and Feature-based Architectures

SR theory has been extended to address high-dimensional or continuous state spaces with deep neural architectures. In such cases, SRs are not stored as explicit $|\mathcal{S}|\times|\mathcal{S}|$ matrices, but as feature-based predictors or networks:

  • State embeddings $\phi(s)$ are learned by a CNN or MLP for visual input.
  • Successor feature modules $\psi_\theta(s)$ approximate $\mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(S_t) \mid S_0 = s\right]$.
  • Temporal-difference (TD) learning is employed to iteratively update $\psi_\theta$ with the loss:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left\| \phi(s_t) + \gamma\, \psi_{\theta^-}(s_{t+1}) - \psi_\theta(s_t) \right\|_2^2\right]$$

where $\theta^-$ are parameters from slowly updated target networks.

This structure is used within both DQN-like value learning (Kulkarni et al., 2016) and actor-critic agents (Siriwardhana et al., 2018), often augmented with auxiliary tasks (e.g., next-frame prediction) to stabilize feature learning.
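
The TD update for successor features can be illustrated in its simplest special case: one-hot features, for which the successor features reduce to rows of the tabular SR. The sketch below is an assumption-laden toy, not the architecture of the cited papers: it uses a hypothetical deterministic four-state ring, no neural network, and no separate target network (the bootstrapped target is simply treated as a constant within each step).

```python
import numpy as np

def td_update_sr(M, s, s_next, gamma, alpha):
    """One semi-gradient TD step on the SR row of state s (one-hot features)."""
    phi = np.zeros(M.shape[0])
    phi[s] = 1.0
    target = phi + gamma * M[s_next]  # bootstrapped target, held fixed for this step
    td_error = target - M[s]
    M[s] += alpha * td_error
    return td_error

n_states, gamma, alpha = 4, 0.9, 0.5
M = np.zeros((n_states, n_states))

# Hypothetical environment: a deterministic ring 0 -> 1 -> 2 -> 3 -> 0.
s = 0
for _ in range(2000):
    s_next = (s + 1) % n_states
    td_update_sr(M, s, s_next, gamma, alpha)
    s = s_next

# Compare against the closed-form SR for the same ring.
P = np.roll(np.eye(n_states), 1, axis=1)  # P[i, (i + 1) % n] = 1
M_exact = np.linalg.inv(np.eye(n_states) - gamma * P)
max_abs_error = np.abs(M - M_exact).max()
print(max_abs_error)
```

In a deep variant, `M[s]` would be replaced by a network output and the fixed target by the output of a slowly updated copy of that network; those pieces are omitted here to keep the example self-contained.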

3. SRs and Option/Eigenoption Discovery

The SR exhibits deep connections to proto-value functions (PVFs) and graph Laplacians, enabling principled construction of temporally-extended options ("eigenoptions"):

  • In graph-theoretic terms, the eigenvectors of the SR matrix (corresponding to the largest eigenvalues) encode directions of diffusive information flow, coinciding with the smoothest eigenfunctions of the normalized Laplacian:

$$L = D^{-1/2}(D - A)D^{-1/2}$$

where $D$ is the degree matrix and $A$ the adjacency matrix.

  • Eigenoptions are discovered by:
    1. Collecting SR/feature vectors $\psi(s)$ over a rollout.
    2. Performing eigendecomposition to extract leading eigenvectors.
    3. Defining intrinsic rewards $r^{e}(s, s') = e^\top(\phi(s') - \phi(s))$ per eigenvector $e$.
    4. Training option policies to maximize each $r^{e}$.
    5. Defining option initiation and termination sets using $Q$-values with respect to $r^{e}$.

Empirical evidence (Machado et al., 2017) demonstrates that adding a small number of eigenoptions constructed from SRs sharply reduces diffusion times in navigation tasks and accelerates goal-reaching, even with raw pixel inputs.
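
The eigenoption recipe can be sketched on a toy problem. The example below assumes a hypothetical six-state chain with a uniform random walk in which bumping into either end leaves the state unchanged, which makes the transition matrix (and hence the SR) symmetric. It uses the closed-form SR instead of rollouts, one-hot features, and a greedy one-step rule as a stand-in for a trained option policy; termination when no move has positive intrinsic reward is likewise a simplification.

```python
import numpy as np

n, gamma = 6, 0.9

# Uniform random walk on a chain; moves off either end keep the agent in place.
P = np.zeros((n, n))
for i in range(n):
    P[i, max(i - 1, 0)] += 0.5
    P[i, min(i + 1, n - 1)] += 0.5

M = np.linalg.inv(np.eye(n) - gamma * P)  # tabular SR
M_sym = 0.5 * (M + M.T)                   # remove numerical asymmetry before eigh

# eigh returns eigenvalues in ascending order. The top eigenvector is constant,
# so the second-largest one is the first informative direction.
eigvals, eigvecs = np.linalg.eigh(M_sym)
e = eigvecs[:, -2]

def intrinsic_reward(s, s_next):
    """With one-hot features, e^T (phi(s') - phi(s)) reduces to e[s'] - e[s]."""
    return e[s_next] - e[s]

def greedy_next_state(s):
    """Pick the reachable state with the largest intrinsic reward (staying gives 0)."""
    candidates = [max(s - 1, 0), s, min(s + 1, n - 1)]
    return max(candidates, key=lambda t: intrinsic_reward(s, t))

# Follow the eigenoption from state 2 until no move has positive intrinsic reward.
s = 2
path = [s]
while True:
    t = greedy_next_state(s)
    if t == s:
        break
    s = t
    path.append(s)
print(path)
```

The sign of an eigenvector is arbitrary, so the option may run to either end of the chain; using both `e` and `-e` yields one option per direction.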

4. Successor Features, Universal Successor Representations, and Task Transfer

Beyond simple SRs, the successor feature (SF) or universal successor representation (USR) framework further factorizes value functions:

  • Define features $\phi(s,a)\in\mathbb{R}^d$ and, for a policy $\pi$,

$$\psi^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(S_t, A_t) \mid S_0 = s, A_0 = a\right]$$

  • For reward functions $r(s,a) = \phi(s,a)^\top w$ parameterized by $w$, the $Q$-function is $Q^\pi(s,a) = \psi^\pi(s,a)^\top w$.
  • USRs further generalize $\psi^\pi$ to be goal-conditional, so that adaptation to new rewards/goals is achieved by learning/updating only $w$ or $w_g$, leaving $\psi$ unchanged.

Transfer results:

  • Once the SR/SF $\psi^\pi$ is learned for the environment, adaptation to new tasks with the same dynamics but different $r$ requires only quick adaptation of $w$ (or $w_g$).
  • Empirically, in environments like AI2THOR, task transfer via SR adaptation reduces required learning episodes by an order of magnitude compared to full network retraining (Zhu et al., 2017, Ma et al., 2018).
  • Theoretical bounds show that the transfer loss grows at most in proportion to the distance $\|w - w'\|$ between old and new task weights.
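
A small worked example may help make the factorization concrete. The sketch below assumes a hypothetical MDP with three states and two actions, hand-picked two-dimensional features, and a fixed deterministic policy; none of these come from the cited papers. It computes successor features exactly by solving a linear system over state-action pairs, evaluates the same policy under two reward weight vectors, and recovers the new weights from observed rewards by least squares.

```python
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9

# Hypothetical dynamics: action 0 stays put, action 1 moves right cyclically.
P = np.zeros((n_a, n_s, n_s))
P[0] = np.eye(n_s)
P[1] = np.roll(np.eye(n_s), 1, axis=1)

pi = np.array([1, 1, 0])  # fixed policy: move right in states 0 and 1, stay in state 2

# Features phi(s, a), one row per pair with index s * n_a + a.
# Feature 0 marks state 0 and feature 1 marks state 2 (illustrative choice).
Phi = np.array([[1.0, 0.0], [1.0, 0.0],
                [0.0, 0.0], [0.0, 0.0],
                [0.0, 1.0], [0.0, 1.0]])

# Transition matrix over state-action pairs when pi picks the next action.
P_pi = np.zeros((n_s * n_a, n_s * n_a))
for s in range(n_s):
    for a in range(n_a):
        for s2 in range(n_s):
            P_pi[s * n_a + a, s2 * n_a + pi[s2]] = P[a, s, s2]

# Successor features solve Psi = Phi + gamma * P_pi @ Psi.
Psi = np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi, Phi)

w_old = np.array([0.0, 1.0])  # old task: reward in state 2
w_new = np.array([1.0, 0.0])  # new task: reward in state 0
Q_old = Psi @ w_old
Q_new = Psi @ w_new           # Psi is reused unchanged

# Adapting to the new task only requires estimating w from observed rewards.
r_observed = Phi @ w_new
w_hat = np.linalg.lstsq(Phi, r_observed, rcond=None)[0]
print(Q_old, Q_new, w_hat)
```

Note that `Q_new` evaluates the original policy on the new task; improving the policy for the new task is a separate step that this sketch does not cover.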

5. Exploration and Count-based Intrinsic Reward via the SR

SRs can be used to incentivize exploration without resorting to explicit density models:

  • The $\ell_1$-norm $\|M(s,\cdot)\|_1$ (for tabular SR) or the norm $\|\psi(s)\|$ under learned features quantifies the expected cumulative visitation of state $s$.
  • The (inverse) norm can be used as a count-based exploration bonus:

$$r^{\text{int}}(s) = \frac{\beta}{\|M(s,\cdot)\|_1}$$

where $\beta$ is a tuning parameter.

  • The substochastic SR (SSR) variant analytically relates the SR norm to empirical visitation counts, providing a justification for this bonus (Machado et al., 2018).
  • Algorithms using SR-norm-based bonuses achieve order-of-magnitude gains in sparse-reward tasks and match R-Max/$E^3$-style sample complexity.

In deep RL, SR-derived intrinsic rewards match or outperform more complex density-model approaches (e.g., PixelCNN, CTS, RND) in sparse Atari environments—particularly in the low-sample regime (Machado et al., 2018).
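
The tabular bonus can be sketched as follows, under several assumptions that are not taken from the cited work: the SR is initialized to zero and learned online by TD, the trajectory is a hypothetical hand-written one, and a small `eps` is added to the norm so that never-visited states (whose SR row is still zero) get a large but finite bonus.

```python
import numpy as np

def sr_td_step(M, s, s_next, gamma=0.95, alpha=0.1):
    """One TD update of the SR row for state s, using one-hot features."""
    phi = np.zeros(M.shape[0])
    phi[s] = 1.0
    M[s] += alpha * (phi + gamma * M[s_next] - M[s])

def exploration_bonus(M, beta=1.0, eps=1e-8):
    """Bonus beta / ||M(s, .)||_1 per state; eps (an assumption) avoids division by zero."""
    return beta / (np.abs(M).sum(axis=1) + eps)

n = 4
M = np.zeros((n, n))

# Hypothetical trajectory: many 0 <-> 1 transitions, one pass through state 2,
# and no visit to state 3.
trajectory = [0, 1] * 20 + [2, 0]
for s, s_next in zip(trajectory[:-1], trajectory[1:]):
    sr_td_step(M, s, s_next)

bonus = exploration_bonus(M)
print(bonus)
```

In this toy run the frequently visited states 0 and 1 end up with the smallest bonus, the once-visited state 2 with a larger one, and the unvisited state 3 with the largest, which is the qualitative behavior expected of a count-based bonus.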

6. Advanced Theoretical Extensions and Empirical Results

Recent directions include:

  • Probabilistic/Uncertainty-aware SRs: Kalman Temporal Differences (KTD) for SR give a posterior distribution (mean and covariance) over $M$, capturing uncertainty and covariances. This results in nonlocal updates and partial transition revaluation, matching human credit-assignment in chains (Geerts et al., 2019).
  • Distributional/Partially Observable SRs: Distributional codes for SR enable value computation and policy derivation when state is not directly observable; learning is achieved with biologically plausible local synaptic updates (Vertes et al., 2019).
  • Active Inference and SRs: SRs offer an efficient amortization for Active Inference agents by precomputing $M$ once and enabling instantaneous value reevaluation for new priors or expected free energy objectives. This significantly reduces planning costs in large discrete state spaces (Millidge et al., 2022).
  • Temporal Abstractions (t-SR): The t-SR framework generalizes SRs to temporally extended actions (repeat-elsewhere operators), reducing policy-sampling frequency and accelerating reward revaluation in dynamic environments (Sargent et al., 2022).
  • Exploration Maximizing State Entropy: Conditioning SRs on the explicit past trajectory enables maximizing the entropy of the whole single-episode visitation distribution, systematically driving policies to explore previously unseen states (Jain et al., 2023).

Empirical observations:

  • SR-based bottleneck and option extraction regularly identifies semantically meaningful subgoals (e.g., room doorways) (Kulkarni et al., 2016).
  • Neural-network-based SRs recover place and grid-cell–like representations, supporting the link to hippocampal function and multi-modal cognitive maps (Stoewer et al., 2022, Stoewer et al., 2023).
  • In continual learning, SR-based decomposition allows new predictions (GVFs) to be learned more rapidly as only their one-step predictions need updating; this improves learning speed in both simulation and real-robot datasets (Sherstan et al., 2018).

7. Limitations, Extensions, and Open Problems

  • SRs and SFs require the reward to be (approximately) linear in features; Successor Feature Representations (SFRs) extend to general reward functions by learning a density over successor features and integrating against arbitrary reward models (Reinke et al., 2021).
  • Ensemble SRs mitigate coverage and bootstrapping issues in offline-to-online transfer, increasing robustness when offline datasets are narrow (Wang et al., 2024).
  • Exact tabular inversion scales poorly with state-space size; deep architectures required for large environments introduce new challenges (e.g., feature collapse, off-policy instability).
  • The choice of features $\phi$ is critical for transfer bounds and option expressivity.
  • Real-world transfer and sample-efficient learning using SRs remain areas of active investigation, including approaches for partial observability, hierarchical abstraction without manual option design, and continual/lifelong learning.

SRs thus provide a unifying predictive substrate in RL, supporting efficient credit assignment, hierarchical decomposition via spectral properties, rapid task transfer, and principled forms of exploration. Their connections to representation learning, neuroscience, option theory, and transfer continue to fuel ongoing research and practical deployment.
