
Successor Representation in RL

Updated 15 January 2026
  • Successor Representation (SR) is a predictive state encoding that factorizes the value function into transition dynamics and rewards, enabling rapid value re-evaluation when rewards change.
  • Generalizations like successor features and distributional extensions adapt SR for high-dimensional, partially observable, and temporally extended scenarios.
  • SR-based algorithms span methods from tabular TD updates to neural-network approximations, yielding improved sample efficiency, fast transfer, and robust exploration.

The successor representation (SR) is a predictive state encoding introduced in reinforcement learning to factorize the value function into transition dynamics and reward components. SR forms the foundation for numerous algorithmic advances, theoretical analyses, neuroscientific models, and practical applications in transfer learning, exploration, and temporal abstraction.

1. Definition, Mathematical Formulation, and Core Properties

The SR for a fixed policy $\pi$ in a Markov decision process (MDP) is defined as the expected discounted count of future visits to each state $s'$ given an initial state $s$:

$$M^\pi(s,s') = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \mathbf{1}\{s_t = s'\} \mid s_0 = s\right]$$

with discount factor $\gamma \in [0,1)$. In matrix form, using the policy-induced one-step transition matrix $P^\pi$:

$$M^\pi = \sum_{t=0}^\infty (\gamma P^\pi)^t = (I - \gamma P^\pi)^{-1}$$

The value function factorizes as

$$V^\pi(s) = \sum_{s'} M^\pi(s,s')\, R(s')$$

where $R(s')$ is the immediate reward at $s'$. This separation enables rapid recomputation of values under reward changes without relearning environment dynamics.
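The closed-form expressions above can be checked numerically. The following sketch (NumPy, with a small made-up 4-state chain as the policy-induced transition matrix) computes the SR by matrix inversion and shows that re-evaluating values after a reward change is a single matrix-vector product:

```python
import numpy as np

# Hypothetical 4-state chain under a fixed policy; P is the
# policy-induced one-step transition matrix P^pi (rows sum to 1).
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],  # absorbing terminal state
])
gamma = 0.9

# SR in closed form: M = (I - gamma * P)^{-1}
M = np.linalg.inv(np.eye(4) - gamma * P)

# Value factorization: V = M @ R for any reward vector R
R = np.array([0.0, 0.0, 0.0, 1.0])
V = M @ R

# A reward change only needs a new matrix-vector product, not a new M
R_new = np.array([1.0, 0.0, 0.0, 0.0])
V_new = M @ R_new
```

Both value vectors satisfy the usual Bellman equation $V = R + \gamma P V$, which is exactly the factorization property stated above.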

The SR satisfies a Bellman-like recursion:

$$M(s_t, :) = \phi(s_t)^\top + \gamma\, M(s_{t+1}, :)$$

where $\phi(s)$ is the one-hot state feature. TD-style updates refine SR estimates via

$$\delta_t = \phi(s_t)^\top + \gamma\, M(s_{t+1}, :) - M(s_t, :)$$

$$M(s_t, :) \leftarrow M(s_t, :) + \alpha\, \delta_t$$

for learning rate $\alpha$ (Geerts et al., 2019).
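The TD recursion above is easy to implement in the tabular case. The sketch below (a toy deterministic 2-state cycle, all constants illustrative) applies the row-wise update and converges to the exact SR given by matrix inversion:

```python
import numpy as np

def td_sr_update(M, s, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD update of the SR row for state s.

    M: (n_states, n_states) SR estimate; phi(s) is the one-hot vector.
    """
    phi = np.zeros(M.shape[0])
    phi[s] = 1.0
    delta = phi + gamma * M[s_next] - M[s]   # TD error on the SR row
    M[s] = M[s] + alpha * delta
    return M

# Toy deterministic cycle: 0 -> 1 -> 0 -> ...
n, gamma = 2, 0.9
M = np.eye(n)  # common initialization: M(s, s') = 1{s = s'}
s = 0
for _ in range(5000):
    s_next = 1 - s
    M = td_sr_update(M, s, s_next, alpha=0.1, gamma=gamma)
    s = s_next

# For this cycle the exact SR is (I - gamma * P)^{-1}
P = np.array([[0.0, 1.0], [1.0, 0.0]])
M_exact = np.linalg.inv(np.eye(n) - gamma * P)
```

Because the dynamics are deterministic, the TD iterates contract onto the closed-form SR; with stochastic transitions the same update converges in expectation.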

2. Generalizations: Successor Features, Distributional Extensions, and Temporal Abstraction

Successor Features

High-dimensional and partially observable domains motivate successor features (SFs), which replace the indicator with a feature vector $\phi(s)$:

$$\psi^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t\, \phi(s_{t+1}) \mid s_0 = s\right]$$

Action-conditioned SFs yield

$$q^\pi(s,a) = \psi^\pi(s,a)^\top w$$

for linear rewards $r(s,a) = \phi(s,a)^\top w$ (Machado et al., 2021, Kulkarni et al., 2016).
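The SF decomposition makes task transfer a dot product. A minimal sketch, assuming a small set of already-learned successor features `psi` (random placeholders here) and two hypothetical task weight vectors:

```python
import numpy as np

# Sketch of q(s,a) = psi(s,a)^T w. The psi values stand in for learned
# successor features; shapes and weights are illustrative only.
n_states, n_actions, d = 3, 2, 4
rng = np.random.default_rng(1)
psi = rng.random((n_states, n_actions, d))   # psi^pi(s, a) in R^d

# A linear reward r(s,a) = phi(s,a)^T w is specified entirely by w;
# a new task swaps w without touching psi.
w_task_A = np.array([1.0, 0.0, 0.0, 0.0])
w_task_B = np.array([0.0, 0.0, 1.0, -1.0])

q_A = psi @ w_task_A   # shape (n_states, n_actions)
q_B = psi @ w_task_B   # values for a new task, zero extra samples
```

With `w_task_A` selecting the first feature, `q_A` is just the first slice of `psi`, making the linearity of the decomposition explicit.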

Distributional Successor Measure

The distributional analogue extends SR to a distribution over discounted occupancy measures:

$$\tilde{\mathcal{M}}_\pi(x) = (1-\gamma)\sum_{t=0}^\infty \gamma^t\, \delta_{X_t}$$

Its law $\mathcal{SM}_\pi(x)$ enables zero-shot policy evaluation for arbitrary, risk-sensitive reward criteria (Wiltzer et al., 2024).
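The random occupancy measure above can be approximated by Monte Carlo: each sampled trajectory yields one discounted occupancy vector, and the empirical collection approximates its law. A toy sketch with an assumed 3-state absorbing chain:

```python
import numpy as np

# Each rollout produces one sample of (1-gamma) * sum_t gamma^t * delta_{X_t};
# the set of samples approximates the distributional successor measure.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state
])
gamma, horizon, n_samples = 0.9, 200, 1000
rng = np.random.default_rng(0)

samples = np.zeros((n_samples, 3))
for i in range(n_samples):
    s = 0
    for t in range(horizon):
        samples[i, s] += (1 - gamma) * gamma**t
        s = rng.choice(3, p=P[s])

# The mean of the sampled occupancies recovers (1-gamma) * M^pi(0, .)
mean_occ = samples.mean(axis=0)
```

Averaging the samples recovers the ordinary (expected) SR row, while their spread carries the extra distributional information used for risk-sensitive evaluation.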

Temporally Extended SR

The t-SR framework generalizes SR over action repeats, forming predictive maps for temporally extended primitives:

$$M^{\pi}(s,a,s',j) = \mathbb{E}\left[\sum_{k=0}^{j-1} \gamma^k\, \mathbf{1}\{S_k=s'\} + \gamma^j M^{\pi}(S_j,s') \,\middle|\, S_0=s,\, A_0=a,\, j\right]$$

This reduces the required decision frequency and adapts to reward changes faster than standard model-free methods (Sargent et al., 2022).

3. Algorithms, Computational Methods, and Sample Efficiency

SR supports diverse learning algorithms:

  • Exact computation by matrix inversion: $M^\pi = (I - \gamma P^\pi)^{-1}$.
  • Dynamic programming with fixed-point iterations $M \leftarrow I + \gamma P^\pi M$.
  • Temporal difference (TD) updates in both tabular and function approximation settings (Geerts et al., 2019, Sherstan et al., 2018).
  • Kalman Temporal Differences (KTD-SR, AKF-SR) introduce uncertainty-aware SR via linear-Gaussian state-space models and adaptive Kalman filtering, yielding covariance matrices for stimulus-specific learning rates and uncertainty-guided exploration (Geerts et al., 2019, Malekzadeh et al., 2022).
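The dynamic-programming route listed above avoids the cubic cost of explicit inversion. A minimal sketch of the fixed-point iteration, on an assumed 2-state transition matrix, converging to the same SR:

```python
import numpy as np

# Fixed-point iteration M <- I + gamma * P M; the toy P is illustrative.
P = np.array([
    [0.1, 0.9],
    [0.9, 0.1],
])
gamma = 0.95

M = np.eye(2)
for _ in range(500):
    M = np.eye(2) + gamma * P @ M   # one Bellman backup on the SR

M_exact = np.linalg.inv(np.eye(2) - gamma * P)
```

Each sweep contracts the error by a factor of $\gamma$, so a few hundred iterations suffice even at $\gamma = 0.95$.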

Function approximation approaches learn successor features via neural networks, autoencoders, or factor-graph models, enabling robust online learning, high adaptivity in non-stationary environments, and biologically plausible implementations (Kulkarni et al., 2016, Dzhivelikian et al., 2023).

Table: Algorithmic Approaches in Successor Representation Learning

Algorithm       | Approach                                | Uncertainty representation
----------------|-----------------------------------------|---------------------------
TD-SR           | Point estimate                          | Scalar learning rate
KTD-SR / AKF-SR | Kalman filtering                        | Full covariance matrix
Deep SR (DSR)   | Neural-network function approximation   | None (unless combined with Bayesian layers)
DHTM            | Hebbian learning on a factor graph      | Local cell activations

4. Theoretical Insights, Transfer, and Temporal Abstraction

SR decouples the environment’s predictive model from the reward structure:

  • Fast transfer: changes in $R(s)$ only require recomputing $V^\pi = M^\pi R$, not relearning $M^\pi$ (Carvalho et al., 2024).
  • Policy improvement: Generalized Policy Improvement (GPI) synthesizes policies from previously learned SRs/SFs across tasks (Machado et al., 2021).
  • Option discovery: Spectral analysis of SR matrices yields intrinsic eigenoptions and the option keyboard, which allow construction of combinatorially large option sets without additional environment sampling (Machado et al., 2021).
  • Multi-task transfer: Task clustering with nonparametric Bayesian inference on SRs (BSR, GSR) handles unsignalled reward changes and spatial context remapping (Madarasz, 2019).
  • Temporal abstraction via t-SR, as above, supports bottom-up construction of temporally extended macro-actions (Sargent et al., 2022).
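Generalized Policy Improvement, listed above, has a one-line core: evaluate every stored policy's SFs on the new task and act greedily with respect to their maximum. A sketch with placeholder (random) successor features standing in for previously learned policies:

```python
import numpy as np

# GPI sketch: psi_i are SFs of previously learned policies pi_i; all
# numbers here are illustrative placeholders, not learned quantities.
n_policies, n_states, n_actions, d = 3, 4, 2, 5
rng = np.random.default_rng(2)
psis = rng.random((n_policies, n_states, n_actions, d))  # psi^{pi_i}(s,a)

w_new = rng.standard_normal(d)   # reward weights of the new task

# Q_i(s,a) = psi_i(s,a)^T w_new; GPI takes the max over stored policies
q_all = psis @ w_new                   # (n_policies, n_states, n_actions)
q_gpi = q_all.max(axis=0)              # (n_states, n_actions)
greedy_actions = q_gpi.argmax(axis=1)  # GPI policy: one action per state
```

By construction the GPI value dominates every individual policy's value on the new task, which is the formal guarantee behind SF-based transfer.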

5. SR in Exploration, Partial Observability, and Neuroscience

Exploration

Early in learning, the $\ell_1$-norm of the SR can be used as an intrinsic reward bonus:

$$r_t^{\mathrm{int}}(s) = \beta / \|\Psi(s, \cdot)\|_1$$

This recovers count-based exploration in tabular domains and scales to high-dimensional feature spaces in deep RL (Machado et al., 2018). The substochastic SR (SSR) formalism connects the SR norm directly to empirical state-visit counts during transient learning phases.
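The count-like behavior of the bonus is easy to see in the tabular case: with the SR initialized at zero, the $\ell_1$-norm of a state's SR row grows with visits to that state, so the bonus shrinks. A toy sketch (self-loop on one state; constants illustrative):

```python
import numpy as np

# Sketch of the SR-norm bonus beta / ||Psi(s, .)||_1: as a state is
# visited, its SR row norm grows and the bonus decays, mimicking a
# count-based bonus. Zero init follows the SSR-style analysis.
n, gamma, alpha, beta = 3, 0.9, 0.3, 1.0
Psi = np.zeros((n, n))

def bonus(s):
    norm = np.abs(Psi[s]).sum()
    return beta / norm if norm > 0 else beta  # unvisited states: max bonus

b_before = bonus(0)            # state 0 never visited yet
for _ in range(50):            # visit state 0 repeatedly (self-loop)
    phi = np.eye(n)[0]
    Psi[0] += alpha * (phi + gamma * Psi[0] - Psi[0])
b_after = bonus(0)
```

As the SR row converges toward its fixed-point norm $1/(1-\gamma)$, the bonus decays toward $\beta(1-\gamma)$, so frequently visited states stop attracting the agent.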

Partial Observability

Distributional successor features with distributed distributional codes enable SR-based value prediction under sensory noise and uncertain beliefs in POMDPs, matching the performance of latent full state access (Vertes et al., 2019). Local Hebbian updates and factor-graph implementations such as DHTM replace RNNs and HMMs for fast, robust online predictive modeling (Dzhivelikian et al., 2023).

Neuroscientific Model

SR provides a quantitative account of hippocampal predictive maps:

  • CA1 place cell firing rates mirror SR matrix entries, encoding future expected state transitions (predictive map theory) (Lee, 2020).
  • SR learning in CA1 can be mapped to heterosynaptic plasticity driven by coincident input from CA3 (“current state”) and EC (“next state”), matching observed phenomena such as rate ramps, sharp-wave ripple replay, and rapid place field remapping after goal relocation (Lee, 2020).
  • Grid cell periodicity corresponds to eigenvectors of the SR matrix (Stoewer et al., 2022, Seabrook et al., 2024), with formal connection to slow feature analysis (SFA) eigenbasis in gridworlds (Seabrook et al., 2024).

6. Advanced Extensions: Distributional Evaluation, General Reward Functions, and Future Directions

  • Distributional successor measures (DSM) enable zero-shot, risk-sensitive policy evaluation via learned generative models over discounted occupancy distributions (Wiltzer et al., 2024).
  • Successor Feature Representations (SFR) generalize SFs by encoding cumulative densities over successor features, supporting general (possibly nonlinear) reward functionals $R(\phi)$:

$$Q^\pi(s,a) = \int_\Phi R(\phi)\, \xi^\pi(s,a,\phi)\, d\phi$$

and admitting TD-style convergence and GPI transfer guarantees (Reinke et al., 2021).
  • Temporal abstraction, option discovery, and hierarchical composition are being combined with deep learning, universal value function approximation, and matrix-valued KFs to scale SR to high-dimensional, continuous, and multi-agent domains (Malekzadeh et al., 2022, Salimibeni et al., 2021).

7. Empirical Applications and Benchmark Performance

SR-based algorithms are empirically validated across tabular, continuous control, robotics, Atari, and multi-agent benchmarks:

  • Sample-efficient continual learning of GVFs for constructivist agents; rapid revaluation after reward changes; up to 50× faster learning than direct TD (Sherstan et al., 2018).
  • Offline reward inference (SR-Reward) matches true-reward RL and behavioral cloning on D4RL tasks, with negative sampling for conservative bias against out-of-distribution estimation (Azad et al., 2025).
  • Uncertainty-aware SR via Kalman TD (KTD-SR, AKF-SR, MAK-SR) outperforms deep RL and model-free algorithms in convergence speed, stability, and adaptation to changing rewards, including variants for multi-agent settings (Malekzadeh et al., 2022, Salimibeni et al., 2021).
  • Deep SR-based exploration achieves state-of-the-art in hard exploration environments; SSR-based intrinsic reward matches PAC-MDP sample complexity (Machado et al., 2018).
  • Neural-network SR models recover grid-like and place-cell firing fields in space and in abstract language domains, paralleling entorhinal-hippocampal system structure; SFA and SR are mathematically equivalent in temporal coherence and occupancy principles (Stoewer et al., 2022, Seabrook et al., 2024).
