Successor Representation in RL
- Successor Representation (SR) is a predictive state encoding that factorizes the value function into transition dynamics and rewards, enabling rapid value re-evaluation when rewards change.
- Generalizations like successor features and distributional extensions adapt SR for high-dimensional, partially observable, and temporally extended scenarios.
- SR-based algorithms range from tabular TD updates to deep neural network approximations, yielding improved sample efficiency, fast transfer, and robust exploration.
The successor representation (SR) is a predictive state encoding introduced in reinforcement learning to factorize the value function into transition dynamics and reward components. SR forms the foundation for numerous algorithmic advances, theoretical analyses, neuroscientific models, and practical applications in transfer learning, exploration, and temporal abstraction.
1. Definition, Mathematical Formulation, and Core Properties
The SR for a fixed policy $\pi$ in a Markov decision process (MDP) is defined as the expected discounted count of future visits to each state $s'$ given an initial state $s$:

$$M^\pi(s, s') = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}(s_t = s') \,\middle|\, s_0 = s\right],$$

with discount factor $\gamma \in [0, 1)$. In matrix form, using the policy-induced one-step transition matrix $T^\pi$:

$$M^\pi = \sum_{t=0}^{\infty} \gamma^t (T^\pi)^t = (I - \gamma T^\pi)^{-1}.$$

The value function is factorized as

$$V^\pi(s) = \sum_{s'} M^\pi(s, s')\, r(s'),$$

where $r(s')$ is the immediate reward at $s'$. This separation enables rapid recomputation of values under reward changes without relearning environment dynamics.
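As a concrete illustration, the closed-form SR of a small MDP can be computed with NumPy and reused to revalue the policy when rewards change (a minimal sketch; the 4-state chain and its rewards are invented for illustration):

```python
import numpy as np

gamma = 0.9
# Policy-induced transition matrix T^pi for a hypothetical 4-state chain
# (mostly move right, absorbing final state).
T = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0, 1.0],
])

# SR in closed form: M = (I - gamma * T)^{-1}
M = np.linalg.inv(np.eye(4) - gamma * T)

# The value function factorizes as V = M r.
r = np.array([0.0, 0.0, 0.0, 1.0])
V = M @ r

# After a reward change, only a matrix-vector product is needed; M is reused.
r_new = np.array([1.0, 0.0, 0.0, 0.0])
V_new = M @ r_new
```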
The SR satisfies a Bellman-like recursion

$$M^\pi(s, s') = \mathbb{1}(s = s') + \gamma\, \mathbb{E}_\pi\!\left[M^\pi(s_{t+1}, s') \mid s_t = s\right],$$

where $\mathbb{1}(s = s')$ is the one-hot state feature. TD-style updates further refine SR estimates via

$$\hat M(s_t, s') \leftarrow \hat M(s_t, s') + \alpha\left[\mathbb{1}(s_t = s') + \gamma \hat M(s_{t+1}, s') - \hat M(s_t, s')\right]$$

for learning rate $\alpha$ (Geerts et al., 2019).
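The TD-style update takes only a few lines for a tabular agent (a sketch; uniform random transitions stand in for a real environment):

```python
import numpy as np

n_states, gamma, alpha = 5, 0.95, 0.1
M_hat = np.zeros((n_states, n_states))  # SR estimate, one row per state

rng = np.random.default_rng(0)
s = 0
for _ in range(20000):
    s_next = rng.integers(0, n_states)  # placeholder transition sampler
    onehot = np.eye(n_states)[s]
    # TD-SR: M(s,.) <- M(s,.) + alpha [1(s=.) + gamma M(s',.) - M(s,.)]
    M_hat[s] += alpha * (onehot + gamma * M_hat[s_next] - M_hat[s])
    s = s_next
```

Under these uniform dynamics the estimate converges toward $(I - \gamma T)^{-1}$ with $T$ the uniform transition matrix, so each row sums to roughly $1/(1-\gamma)$.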
2. Generalizations: Successor Features, Distributional Extensions, and Temporal Abstraction
Successor Features
High-dimensional and partially observable domains motivate successor features (SFs), replacing the indicator $\mathbb{1}(s_t = s')$ with a feature vector $\phi(s_t)$:

$$\psi^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\middle|\, s_0 = s\right].$$

Action-conditioned SFs allow

$$Q^\pi(s, a) = \psi^\pi(s, a)^\top \mathbf{w}$$

for linear reward $r(s) = \phi(s)^\top \mathbf{w}$ (Machado et al., 2021, Kulkarni et al., 2016).
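The linear-reward decomposition means a new task only requires swapping the weight vector while the SFs are reused; a minimal sketch with hypothetical (randomly initialized) features and SFs:

```python
import numpy as np

n_states, n_actions, d = 6, 2, 3
rng = np.random.default_rng(1)

# Hypothetical learned quantities: features phi(s) and SFs psi(s, a).
phi = rng.random((n_states, d))             # phi(s): d-dim state features
psi = rng.random((n_states, n_actions, d))  # psi(s, a): successor features

# Linear reward r(s) = phi(s) . w  =>  Q(s, a) = psi(s, a) . w
w = np.array([1.0, -0.5, 0.2])
Q = psi @ w  # shape (n_states, n_actions)

# New task: swap in new reward weights; psi is reused unchanged.
w_new = np.array([0.0, 2.0, 1.0])
Q_new = psi @ w_new
```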
Distributional Successor Measure
The distributional analogue extends SR to a distribution over discounted occupancy measures: the random occupancy measure

$$M^\pi = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \delta_{s_t}$$

is treated as a trajectory-level random variable. Its law enables zero-shot policy evaluation for arbitrary, risk-sensitive reward criteria (Wiltzer et al., 2024).
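The random occupancy measure can be illustrated by naive Monte Carlo sampling (a sketch with a toy chain; the actual method learns a generative model of this law rather than sampling it directly):

```python
import numpy as np

gamma, n_states, horizon = 0.9, 4, 200
T = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])  # toy policy-induced dynamics
rng = np.random.default_rng(2)

def sample_occupancy(s0):
    """One draw of the random measure (1 - gamma) * sum_t gamma^t * delta_{s_t}."""
    occ = np.zeros(n_states)
    s = s0
    for t in range(horizon):
        occ[s] += (1 - gamma) * gamma**t
        s = rng.choice(n_states, p=T[s])
    return occ

samples = np.array([sample_occupancy(0) for _ in range(500)])
# The mean over draws recovers the normalized SR row for s0 = 0.
mean_occ = samples.mean(axis=0)
```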
Temporally Extended SR
The t-SR framework generalizes SR over action repeats, forming predictive maps for temporally extended primitives in which each primitive action is held fixed for several steps before the next decision. This reduces the required decision frequency and adapts faster to reward changes than standard model-free methods (Sargent et al., 2022).
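One illustrative reading (an assumption for this sketch, not the paper's exact formulation): if the policy's one-step dynamics are followed for $\tau$ primitive steps per decision, the SR over macro-steps uses the compounded transition matrix and effective discount $\gamma^\tau$:

```python
import numpy as np

gamma, tau = 0.95, 4
T = np.array([[0.2, 0.8], [0.6, 0.4]])  # toy one-step dynamics under the policy

# Within each macro-step the agent follows T for tau primitive steps,
# so decisions occur under T^tau with effective discount gamma^tau.
T_tau = np.linalg.matrix_power(T, tau)
M_tau = np.linalg.inv(np.eye(2) - gamma**tau * T_tau)
```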
3. Algorithms, Computational Methods, and Sample Efficiency
SR supports diverse learning algorithms:
- Exact computation by matrix inversion $M^\pi = (I - \gamma T^\pi)^{-1}$.
- Dynamic programming with fixed-point iterations $M_{k+1} = I + \gamma T^\pi M_k$.
- Temporal difference (TD) updates in both tabular and function approximation settings (Geerts et al., 2019, Sherstan et al., 2018).
- Kalman Temporal Differences (KTD-SR, AKF-SR) introduce uncertainty-aware SR via linear-Gaussian state-space models and adaptive Kalman filtering, yielding covariance matrices for stimulus-specific learning rates and uncertainty-guided exploration (Geerts et al., 2019, Malekzadeh et al., 2022).
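Of these routes, the dynamic-programming one is the simplest to verify numerically: iterating the fixed-point equation converges to the matrix inverse (a sketch with toy dynamics):

```python
import numpy as np

gamma = 0.9
T = np.array([[0.9, 0.1], [0.2, 0.8]])  # toy policy-induced transitions
M = np.zeros((2, 2))
for _ in range(500):
    # Fixed-point iteration M_{k+1} = I + gamma * T * M_k
    M = np.eye(2) + gamma * T @ M

M_exact = np.linalg.inv(np.eye(2) - gamma * T)
```

The iteration contracts at rate $\gamma$, so 500 steps are far more than enough here.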
Function approximation approaches learn successor features via neural networks, autoencoders, or factor-graph models, enabling robust online learning, strong adaptivity in non-stationary environments, and biologically plausible implementations (Kulkarni et al., 2016, Dzhivelikian et al., 2023).
Table: Algorithmic Approaches in Successor Representation Learning
| Algorithm | Approach | Uncertainty representation |
|---|---|---|
| TD-SR | Point estimate | Scalar |
| KTD-SR / AKF-SR | Covariance | Full matrix |
| Deep SR (DSR) | NN function | None (unless combined with Bayesian layers) |
| DHTM | Hebbian, factor graph | Local cell activations |
4. Theoretical Insights, Transfer, and Temporal Abstraction
SR decouples the environment’s predictive model from the reward structure:
- Fast transfer: changes in the reward vector $r$ only require recomputing $V^\pi = M^\pi r$, not relearning $M^\pi$ (Carvalho et al., 2024).
- Policy improvement: Generalized Policy Improvement (GPI) synthesizes policies from previously learned SRs/SFs across tasks (Machado et al., 2021).
- Option discovery: Spectral analysis of SR matrices yields intrinsic eigenoptions and the option keyboard, which allow construction of combinatorially large option sets without additional environment sampling (Machado et al., 2021).
- Multi-task transfer: Task clustering with nonparametric Bayesian inference on SRs (BSR, GSR) handles unsignalled reward changes and spatial context remapping (Madarasz, 2019).
- Temporal abstraction via t-SR, as above, supports bottom-up construction of temporally extended macro-actions (Sargent et al., 2022).
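Generalized Policy Improvement over a library of SFs amounts to taking, per state-action pair, the max over the values each stored policy's SFs imply for the new reward weights (a sketch with hypothetical, randomly initialized SFs):

```python
import numpy as np

rng = np.random.default_rng(3)
n_policies, n_states, n_actions, d = 3, 5, 2, 4

# psi_lib[i] holds the successor features of previously learned policy i.
psi_lib = rng.random((n_policies, n_states, n_actions, d))
w_new = rng.standard_normal(d)  # reward weights of the new task

# GPI: Q_gpi(s, a) = max_i psi_i(s, a) . w_new, then act greedily.
Q_all = psi_lib @ w_new      # (n_policies, n_states, n_actions)
Q_gpi = Q_all.max(axis=0)    # (n_states, n_actions)
greedy = Q_gpi.argmax(axis=1)
```

The resulting policy is guaranteed to be no worse than any policy in the library on the new task, without further environment interaction.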
5. SR in Exploration, Partial Observability, and Neuroscience
Exploration
Early in learning, the $\ell_1$-norm of the SR can be used as an intrinsic reward bonus:

$$r^{\text{int}}(s_t) = \frac{\beta}{\|\hat M(s_t, \cdot)\|_1}.$$

This recovers count-based exploration in tabular domains and scales to high-dimensional feature spaces in deep RL (Machado et al., 2018). The substochastic SR (SSR) formalism connects the SR norm directly to empirical state-visit counts during transient learning phases.
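A tabular version of the SR-norm bonus can be sketched as follows ($\beta$ and the uniform-transition environment are placeholders; the norm of a state's SR row grows with visitation, so the bonus shrinks for familiar states):

```python
import numpy as np

n_states, gamma, alpha, beta = 4, 0.9, 0.2, 0.5
M_hat = np.zeros((n_states, n_states))

def intrinsic_bonus(s):
    # Early in learning ||M(s,.)||_1 is small for rarely visited states,
    # so the bonus is large there, mimicking a count-based bonus.
    norm = np.abs(M_hat[s]).sum()
    return beta / max(norm, 1e-8)

rng = np.random.default_rng(4)
s = 0
for _ in range(1000):
    s_next = rng.integers(0, n_states)  # placeholder transitions
    M_hat[s] += alpha * (np.eye(n_states)[s] + gamma * M_hat[s_next] - M_hat[s])
    s = s_next
```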
Partial Observability
Distributional successor features with distributed distributional codes enable SR-based value prediction under sensory noise and uncertain beliefs in POMDPs, matching the performance of latent full state access (Vertes et al., 2019). Local Hebbian updates and factor-graph implementations such as DHTM replace RNNs and HMMs for fast, robust online predictive modeling (Dzhivelikian et al., 2023).
Neuroscientific Model
SR provides a quantitative account of hippocampal predictive maps:
- CA1 place cell firing rates mirror SR matrix entries, encoding future expected state transitions (predictive map theory) (Lee, 2020).
- SR learning in CA1 can be mapped to heterosynaptic plasticity driven by coincident input from CA3 (“current state”) and EC (“next state”), matching observed phenomena such as rate ramps, sharp-wave ripple replay, and rapid place field remapping after goal relocation (Lee, 2020).
- Grid cell periodicity corresponds to eigenvectors of the SR matrix (Stoewer et al., 2022, Seabrook et al., 2024), with formal connection to slow feature analysis (SFA) eigenbasis in gridworlds (Seabrook et al., 2024).
6. Advanced Extensions: Distributional Evaluation, General Reward Functions, and Future Directions
- Distributional successor measures (DSM) enable zero-shot, risk-sensitive policy evaluation via learned generative models over discounted occupancy distributions (Wiltzer et al., 2024).
- Successor Feature Representations (SFR) generalize SFs by encoding cumulative densities over successor features, supporting a general (possibly nonlinear) reward functional $R$ with $r(s) = R(\phi(s))$:

  $$Q^\pi(s, a) = \int R(\phi)\, \psi^\pi(s, a, \phi)\, d\phi,$$

  where $\psi^\pi(s, a, \phi)$ is the discounted cumulative density of future features; SFR admits TD-style convergence and GPI transfer guarantees (Reinke et al., 2021).
- Temporal abstraction, option discovery, and hierarchical composition are being combined with deep learning, universal value function approximation, and matrix-valued KFs to scale SR to high-dimensional, continuous, and multi-agent domains (Malekzadeh et al., 2022, Salimibeni et al., 2021).
7. Empirical Applications and Benchmark Performance
SR-based algorithms are empirically validated across tabular, continuous control, robotics, Atari, and multi-agent benchmarks:
- Sample-efficient continual learning of GVFs for constructivist agents; rapid revaluation after reward changes; up to 50× faster learning than direct TD (Sherstan et al., 2018).
- Offline reward inference (SR-Reward) matches true-reward RL and behavioral cloning on D4RL tasks, with negative sampling for conservative bias against out-of-distribution estimation (Azad et al., 4 Jan 2025).
- Uncertainty-aware SR via Kalman TD (KTD-SR, AKF-SR, MAK-SR) outperforms deep RL and model-free algorithms in convergence speed, stability, and adaptation to changing rewards, including variants for multi-agent settings (Malekzadeh et al., 2022, Salimibeni et al., 2021).
- Deep SR-based exploration achieves state-of-the-art in hard exploration environments; SSR-based intrinsic reward matches PAC-MDP sample complexity (Machado et al., 2018).
- Neural-network SR models recover grid-like and place-cell firing fields in space and in abstract language domains, paralleling entorhinal-hippocampal system structure; SFA and SR are mathematically equivalent in temporal coherence and occupancy principles (Stoewer et al., 2022, Seabrook et al., 2024).
References
- (Geerts et al., 2019) Probabilistic Successor Representations with Kalman Temporal Differences
- (Sherstan et al., 2018) Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation
- (Seabrook et al., 2024) What is the relation between Slow Feature Analysis and the Successor Representation?
- (Madarasz, 2019) Better transfer learning with inferred successor maps
- (Stoewer et al., 2022) Neural Network based Successor Representations of Space and Language
- (Machado et al., 2021) Temporal Abstraction in Reinforcement Learning with the Successor Representation
- (Moskovitz et al., 2021) A First-Occupancy Representation for Reinforcement Learning
- (Dzhivelikian et al., 2023) Learning Successor Features with Distributed Hebbian Temporal Memory
- (Machado et al., 2018) Count-Based Exploration with the Successor Representation
- (Sargent et al., 2022) Temporally Extended Successor Representations
- (Carvalho et al., 2024) Predictive representations: building blocks of intelligence
- (Wiltzer et al., 2024) A Distributional Analogue to the Successor Representation
- (Millidge et al., 2022) Successor Representation Active Inference
- (Lee, 2020) Toward the biological model of the hippocampus as the successor representation agent
- (Vertes et al., 2019) A neurally plausible model learns successor representations in partially observable environments
- (Azad et al., 4 Jan 2025) SR-Reward: Taking The Path More Traveled
- (Kulkarni et al., 2016) Deep Successor Reinforcement Learning
- (Malekzadeh et al., 2022) AKF-SR: Adaptive Kalman Filtering-based Successor Representation
- (Salimibeni et al., 2021) Multi-Agent Reinforcement Learning via Adaptive Kalman Temporal Difference and Successor Representation
- (Reinke et al., 2021) Successor Feature Representations