
Hindsight Experience Replay

Updated 12 February 2026
  • Hindsight Experience Replay is a goal-relabeled experience replay technique that converts failed episodes into synthetic successes to enhance learning in sparse reward environments.
  • It augments off-policy RL algorithms by redefining achieved states as alternative goals, thus creating an implicit curriculum that improves sample efficiency.
  • Variants such as model-based, curiosity-driven, and multi-criteria HER further refine the approach to mitigate bias and boost performance in complex robotic and navigation tasks.

Hindsight Experience Replay (HER) is a goal-relabeled experience replay technique devised to address the challenge of extremely sparse rewards in multi-goal reinforcement learning. By retrospectively modifying episode data to pretend the agent was pursuing goals that were actually achieved, HER ensures effective reward propagation even when the agent never experiences success with the original goal. Since its introduction, HER has become foundational for sparse-reward robotic manipulation, navigation, and more generally, multi-goal RL. This article provides a comprehensive overview of HER’s conceptual foundation, algorithmic structure, theoretical properties, variants, empirical performance, and limitations, referencing major advances and key technical refinements from the literature.

1. Motivation and Conceptual Foundations

Sparse-reward multi-goal RL tasks are characterized by the agent receiving a non-trivial reward only upon exactly achieving the goal—otherwise reward remains zero (or negative). In these settings, standard off-policy RL methods struggle due to the exponential improbability of stumbling upon the goal by random exploration. HER, first introduced by Andrychowicz et al. (Andrychowicz et al., 2017), leverages the fact that even failed trajectories typically realize some state achievements. By relabeling these actual achieved states as alternative (hindsight) goals for the purposes of replay and learning, HER turns failures (with respect to the original goal) into synthetic "successes," providing informative reward signals and dramatically enhancing sample efficiency.

The defining principle of HER is the construction of an implicit curriculum: as the agent learns to reach "accidentally achieved" goals, its competence generalizes, eventually enabling successful pursuit of the intended goals as well. This provides a mechanism for endogenous transfer and propagation of sparse reward, without shaped feedback or demonstrations.

2. Algorithmic Structure and Formalization

HER operates as an augmentation around standard off-policy RL algorithms (e.g., DDPG, SAC, TD3). At each episode, the agent interacts with the environment under a desired goal $g$, collecting a trajectory $\tau = \left( (s_0, a_0, s_1), \dots, (s_{T-1}, a_{T-1}, s_T) \right)$. The original transitions $(s_t, a_t, s_{t+1}, g)$ are stored as usual. HER then applies a goal-relabelling strategy (typically "future": randomly selecting $k$ future states as hindsight goals), generating alternative transitions by replacing $g$ with $g'$, where $g'$ is an achieved state later in the trajectory. The reward is recomputed as $r'(s_t, a_t, g')$ based on proximity to $g'$ (e.g., zero if $s_{t+1}$ achieves $g'$, negative otherwise).
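The sparse reward above reduces to a simple distance test. A minimal sketch, assuming Fetch-style positional goals and an illustrative success threshold (real environments define their own success test):

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Sparse multi-goal reward: 0 on success, -1 otherwise.

    threshold is an illustrative tolerance, not a prescribed value;
    environments typically test Euclidean distance on object position.
    """
    d = np.linalg.norm(np.asarray(achieved_goal, dtype=float)
                       - np.asarray(desired_goal, dtype=float))
    return 0.0 if d < threshold else -1.0
```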

This procedure is best articulated by the pseudocode and equations in (Andrychowicz et al., 2017), with HER storing $k$ additional hindsight transitions per original one, and training proceeding via standard off-policy Bellman backups over the enlarged replay buffer. The probability of sampling an original vs. hindsight transition is $1/(1+k)$ vs. $k/(1+k)$, respectively.
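The relabeling step can be sketched as follows; a minimal illustration of the "future" strategy, assuming next-states double as achieved goals and a sparse 0/−1 reward (both simplifications for clarity):

```python
import random

def her_relabel(trajectory, goal, k=4):
    """Store each transition once with the original goal and k times with
    hindsight goals drawn from later achieved states ("future" strategy).

    trajectory: list of (state, action, next_state) tuples; next_state is
    treated as the achieved goal of the step (a simplifying assumption).
    """
    def reward(next_state, g):
        return 0.0 if next_state == g else -1.0

    buffer = []
    T = len(trajectory)
    for t, (s, a, s_next) in enumerate(trajectory):
        buffer.append((s, a, s_next, goal, reward(s_next, goal)))
        for _ in range(k):
            # pick an achieved state from step t or later as hindsight goal
            g_prime = trajectory[random.randint(t, T - 1)][2]
            buffer.append((s, a, s_next, g_prime, reward(s_next, g_prime)))
    return buffer
```

Sampling uniformly from the enlarged buffer then yields original vs. hindsight transitions in exactly the 1/(1+k) vs. k/(1+k) proportions.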

Extensions of HER to on-policy methods are also established: PPO-HER relabels episodes post hoc and recomputes log-probabilities and advantages for the relabeled goals, resulting in significant acceleration even when the on-policy assumption is violated, provided the policy maintains sufficient entropy (Crowder et al., 2024).
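A hedged sketch of the post-hoc recomputation step, assuming a goal-conditioned `policy(state, goal)` that returns a discrete action distribution (names and interface are illustrative, not the paper's code):

```python
import numpy as np

def relabeled_log_probs(policy, states, actions, new_goal):
    """Recompute per-step log-probabilities of the stored actions under a
    relabeled goal, as needed before forming PPO ratios and advantages."""
    logps = []
    for s, a in zip(states, actions):
        probs = policy(s, new_goal)          # action distribution for (s, g')
        logps.append(float(np.log(probs[a])))
    return logps
```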

3. Theoretical Properties and Bias

A central theoretical trade-off in HER arises from the mismatch between the likelihood of trajectories under the original and relabeled goals. In deterministic environments, this is benign, but in stochastic domains HER introduces "hindsight bias" by underrepresenting bad outcomes under relabeled goals, leading to overoptimistic value estimates and risk underestimation (Schramm et al., 2022). Several works address this:

  • USHER uses importance sampling to construct an unbiased Bellman target, weighting each update by the ratio of the relabeling mixture probability over the effective sampling density at the next state, guaranteeing convergence to the true $Q^*$ in the infinite-data limit (Schramm et al., 2022).
  • ARCHER counters HER's optimism by up-weighting hindsight rewards relative to real ones, correcting overconfidence and improving sample efficiency. Selection of the trade-off parameters $\lambda_r, \lambda_h$ for reward scaling is crucial for stable performance (Lanka et al., 2018).
  • Goal prioritization techniques (e.g., HGR, EBP, IBS-HER) rank hindsight goals by TD error, trajectory energy, or instructiveness, ensuring that relabelings focus on informative, transferable, or high-variance experiences (Luu et al., 2021, Zhao et al., 2018, Manela et al., 2019).
  • Filtering out misleading relabelings—cases where the goal was already achieved in the preceding state—mitigates Q-function overestimation due to degenerate transitions (Manela et al., 2019).
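As one concrete example of these corrections, an ARCHER-style target simply rescales hindsight rewards relative to real ones; a minimal sketch with illustrative parameter values (the actual trade-off weights are tuned per task):

```python
def archer_target(reward, next_q, done, is_hindsight,
                  gamma=0.98, lam_r=1.0, lam_h=1.5):
    """Bellman target with ARCHER-style reward scaling.

    Hindsight transitions get weight lam_h > lam_r, so their (typically
    negative) sparse rewards are amplified, countering HER's optimism bias.
    gamma, lam_r, lam_h are illustrative values, not the paper's settings.
    """
    scale = lam_h if is_hindsight else lam_r
    return scale * reward + gamma * next_q * (1.0 - done)
```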

4. Extensions and Variants

HER's architecture has catalyzed a wide array of variants and enhancements:

  • Model-based relabeling and imagination: Model-based HER (MHER), Imaginary HER (I-HER), and MRHER learn a forward model to imagine virtual future goals or roll out synthetic trajectories, enabling relabels vastly beyond the visited-state distribution, and substantially improving sample complexity (Yang et al., 2021, McCarthy et al., 2021, Huang et al., 2023).
  • Curiosity-driven exploration: Combining HER with curiosity-based intrinsic rewards, wherein novelty or forward-model prediction error rewards drive exploration, accelerates learning in ultra-sparse domains (notably multi-block stacking) and facilitates deeper policy coverage (Lanier et al., 2019, McCarthy et al., 2021).
  • Multi-criteria HER: In tasks with composite goals (e.g., multi-object), relabeling each subgoal independently yields a combinatorial diversity of hindsight goals, thereby greatly enriching the replay buffer (Lanier et al., 2019).
  • Diversity-driven replay: DTGSH samples episodes and transitions maximizing goal diversity via determinantal point processes (DPP), demonstrating that maximizing the span of achieved/relabeled goals is highly effective in high-dimensional manipulation tasks (Dai et al., 2021).
  • Maximum entropy formulations: SHER introduces entropy-regularization in the actor loss, stabilizing HER in high-dimensional, multi-modal action spaces and boosting reproducibility and final performance (He et al., 2020). MEHER and related works frame the optimal hindsight relabeling rate as maximizing the entropy of the success/failure signal in the replay buffer, yielding new data-centric hyperparameters for optimal learning (Crowder et al., 2024).
  • Language and symbolic goals: Learning a mapping from end-states to language instructions enables HER to operate without a manually defined relabeling function in instruction-following environments; HIGhER builds this capacity using a hindsight instruction generator (Cideron et al., 2019).
  • On-policy and search-based algorithms: PPO-HER and Adaptable HER demonstrate that the core idea of relabeling episodes remains beneficial in on-policy RL (PPO, AlphaZero-style MCTS), provided relabelings are handled with correct likelihood calculations (Crowder et al., 2024, Vazaios et al., 5 Nov 2025).
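For the multi-criteria case, independent relabeling of each subgoal amounts to a Cartesian product over per-object achieved goals; a hypothetical helper illustrating the combinatorial enrichment, not the paper's implementation:

```python
from itertools import product

def multi_criteria_goals(achieved_subgoals):
    """achieved_subgoals: one list of achieved goals per object/criterion.

    Returns every composite hindsight goal obtainable by relabeling each
    subgoal independently, enriching the replay buffer combinatorially.
    """
    return [tuple(combo) for combo in product(*achieved_subgoals)]
```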

5. Empirical Performance and Applications

HER and its variants have broad empirical validation:

  • In standard sparse-reward robotic tasks (pushing, sliding, pick-and-place), HER is essential for tractable learning; ablations show that DDPG without HER is incapable of attaining non-trivial success rates given reasonable sample budgets (Andrychowicz et al., 2017).
  • Table: Final Success Rates after 1M steps (from (Andrychowicz et al., 2017)):

| Task           | DDPG (no HER) | DDPG + HER |
|----------------|---------------|------------|
| Pushing        | 1.7%          | 98.5%      |
| Sliding        | 0.5%          | 62.3%      |
| Pick-and-Place | 2.1%          | 64.8%      |

  • Model-based variants achieve 5–30× reductions in real-environment sample complexity on the Fetch suite and custom environments, with MRHER and I-HER reaching over 90% success with an order of magnitude fewer environment steps than vanilla HER (McCarthy et al., 2021, Huang et al., 2023).
  • In sequential object manipulation, MRHER demonstrates 13–14% faster convergence over prior model-based and relay HER methods (Huang et al., 2023).
  • Hindsight goal prioritization and diversity-based relabeling robustly increase sample efficiency by 2–5× across robotic tasks (Luu et al., 2021, Dai et al., 2021).
  • In composite or language-driven tasks, HER with multi-criteria or hindsight-generated goals is the only viable approach that can solve deep sparse-reward objectives without human demonstrations (Lanier et al., 2019, Cideron et al., 2019).
  • On-policy methods: PPO-HER outperforms both vanilla PPO and off-policy SAC on predator-prey tasks and converges in significantly fewer episodes (Crowder et al., 2024).

6. Practical Recommendations and Limitations

Implementation details for effective HER include:

  • Number of hindsight relabels per transition ($k$): common settings are $k = 4$–$8$; tune per environment (Andrychowicz et al., 2017).
  • Selection strategy: "future" and "final" strategies are standard; multi-criteria or diversity-based approaches are preferred for complex/generative goals (Dai et al., 2021, Lanier et al., 2019).
  • Off-policy base: HER is agnostic to the base RL method; plug it into DDPG, SAC, or TD3. Maximum-entropy variants (SAC, SHER) are recommended for stability in high-dimensional control (He et al., 2020).
  • Bias mitigation: in stochastic or safety-critical domains, deploy USHER or ARCHER for provably unbiased or more robust value estimation (Schramm et al., 2022, Lanka et al., 2018).
  • Advanced domains: for tasks with multi-modal goals, model-based rollouts (MHER, I-HER, MRHER) and curriculum structures (CDMC-HER) further accelerate progress (Yang et al., 2021, McCarthy et al., 2021, Lanier et al., 2019).
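The recommendations above can be collected into a starting configuration; all values are illustrative defaults to tune per environment, not prescribed settings:

```python
her_config = {
    "base_algorithm": "SAC",        # any off-policy learner: DDPG, TD3, SAC
    "relabel_strategy": "future",   # or "final"; diversity/multi-criteria for complex goals
    "k": 4,                         # hindsight relabels per transition (4-8 typical)
    "replay_buffer_size": 1_000_000,
    "gamma": 0.98,
    "bias_correction": None,        # e.g., "USHER" or "ARCHER" for stochastic domains
}
```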

Limitations:

  • HER is less effective in environments where the achieved-goal space is stationary or trivially reachable under random policies. Goal relabeling admits diminishing returns as the agent achieves high coverage.
  • Model-based approaches (I-HER, MHER, MRHER) require accurate dynamics models; excessive rollout length can incur compounding error (Huang et al., 2023).
  • Computational cost for prioritization methods (HGR, DTGSH) and multi-criteria relabeling can be substantial, particularly in high-dimensional or long-horizon settings (Luu et al., 2021, Dai et al., 2021).
  • In highly stochastic domains, HER may introduce harmful bias unless explicitly mitigated by importance weighting or reward rescaling (Schramm et al., 2022, Lanka et al., 2018).
  • On-policy variants gain from HER primarily when the policy maintains high entropy and the marginal likelihood remains non-vanishing—PPO-HER may fail as entropy collapses late in training (Crowder et al., 2024).

7. Outlook and Future Directions

Hindsight Experience Replay continues to serve as a foundational tool in modern RL for challenging, sparse-reward, multi-goal settings. Future research directions include:

  • Unified bias-corrected and adaptive prioritization methods that combine the strengths of USHER, goal instructiveness, trajectory energy, and diversity-driven sampling.
  • Advanced model-based relabeling techniques leveraging uncertainty estimation, robust dynamics, and learned goal embeddings, extending HER to complex, partially observed, or high-dimensional spaces.
  • Automation of subgoal or instruction generation for HER, bridging the gap to language-guided RL and complex, open-ended task spaces (Cideron et al., 2019).
  • Theoretical analysis of entropy-constrained, relabeling-optimal replay strategies, maximizing information transfer while guaranteeing convergence and robustness across RL architectures.
  • Exploration of HER in large-scale search, combinatorial, and planning domains, especially via integrations with Monte Carlo Tree Search, parameterized policy hierarchies, and differentiable program synthesis (Vazaios et al., 5 Nov 2025).

HER and its variants remain an active area of methodological advancement and are now indispensable components in the RL toolkit for addressing sparse rewards, enabling tractability, and accelerating high-dimensional multi-goal policy learning.
