Hindsight Goal Relabeling with LoRA Finetuning

Updated 30 December 2025
  • The paper introduces a reward-free, online adaptation technique that employs hindsight relabeling with LoRA finetuning to transform recent rollouts into self-supervised training data.
  • It achieves rapid adaptation by updating only low-rank adapter parameters, enabling substantial performance gains in tasks like plug-in insertion within minutes.
  • Empirical results demonstrate that the method improves success rates from around 30% to 90% in as little as 15 minutes, underscoring its efficiency for long-horizon robotic tasks.

Hindsight goal relabeling with LoRA-based finetuning is a reward-free online adaptation technique for goal-conditioned policy learning in robotic manipulation. It converts recent rollouts directly into self-supervised training data via hindsight-style goal relabeling, followed by rapid behavioral cloning updates restricted to low-rank adapter (LoRA) parameters. This approach enables real-robot policies to self-improve autonomously on long-horizon, out-of-distribution tasks, without reliance on external reward signals or human feedback. The method has been concretely instantiated in the Act2Goal framework, where it yields substantial gains in manipulation policy performance within minutes (Zhou et al., 29 Dec 2025).

1. Hindsight Goal Relabeling Procedure

During online deployment, the policy $\pi_\theta$, comprising frozen base weights and trainable LoRA adapters, executes under a user-specified goal $g$. At each timestep $t$, the agent observes the current RGB-D image $o_t$ and proprioceptive state $c_{p,t}$, executes $a_t = \pi_\theta(o_t, c_{p,t}, g)$, and then records the resulting observation $o_{t+1}$. Each transition $(o_t, c_{p,t}, a_t, o_{t+1})$ is stored in an in-memory replay buffer $\mathbb{B}$ of capacity $N$ (typically $N = 20$).

Upon filling this buffer, an alternate-goal sampling procedure relabels each transition: the new goal $g' \leftarrow o_{t+1}$ frames the task as achieving the observed outcome. The relabeled dataset $D_{\text{relabel}}$ therefore consists of tuples $(s_t, c_{p,t}, g', a_t)$, with $s_t \equiv o_t$, turning each experience into a nominal “success” for the policy to imitate.
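A minimal Python sketch of this buffer-and-relabel step, assuming transitions are stored as plain dictionaries; the class and field names are illustrative rather than taken from the paper:

```python
from collections import deque


class ReplayBuffer:
    """Fixed-capacity in-memory buffer of online transitions."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.transitions = deque(maxlen=capacity)

    def add(self, obs, proprio, action, next_obs):
        self.transitions.append(
            {"obs": obs, "proprio": proprio, "action": action, "next_obs": next_obs}
        )

    def is_full(self):
        return len(self.transitions) == self.capacity


def hindsight_relabel(buffer):
    """Relabel each transition's goal with the observation it actually produced."""
    relabeled = []
    for tr in buffer.transitions:
        relabeled.append(
            {
                "state": tr["obs"],        # s_t = o_t
                "proprio": tr["proprio"],  # c_{p,t}
                "goal": tr["next_obs"],    # g' <- o_{t+1}: the achieved outcome becomes the goal
                "action": tr["action"],    # a_t, the action to imitate
            }
        )
    return relabeled
```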

2. Fine-tuning Objective and Behavioral Cloning

Finetuning is restricted to only the LoRA adapter parameters $\phi$; the base model $\theta_{\text{base}}$ is held fixed. The supervised loss per adaptation round is

$$L(\phi; D_{\text{relabel}}) = \frac{1}{|D_{\text{relabel}}|}\sum_{(s, c_p, g', a) \in D_{\text{relabel}}} \left\| \pi_{\theta_{\text{base}}, \phi}(s, c_p, g') - a \right\|_2^2 + \lambda_{\text{reg}}\|\phi\|_2^2,$$

where $\lambda_{\text{reg}}$ is a small weight decay penalty (e.g., $1\times10^{-2}$). The policy thus undergoes $K$ epochs (with $K = 10$ typical) of supervised regression on $D_{\text{relabel}}$ using behavioral cloning, leveraging all relabeled transitions.
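A hedged PyTorch sketch of this objective; the `policy` callable, batch layout, and tensor shapes are assumptions for illustration:

```python
import torch


def bc_loss(policy, batch, lambda_reg=1e-2, lora_params=None):
    """Mean squared behavioral-cloning loss over relabeled transitions,
    plus an explicit L2 penalty on the LoRA parameters phi."""
    pred_actions = policy(batch["state"], batch["proprio"], batch["goal"])
    # Per-sample squared L2 error to the recorded action, averaged over the batch.
    mse = ((pred_actions - batch["action"]) ** 2).sum(dim=-1).mean()
    reg = sum((p ** 2).sum() for p in lora_params) if lora_params else 0.0
    return mse + lambda_reg * reg
```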

3. LoRA Adapter Architecture and Injection

Rather than editing the multi-billion-parameter base model directly, LoRA adapters are injected into every linear layer $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ within cross-attention or MLP modules. Each adapted weight is computed as

$$W' = W_0 + BA,$$

with $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$, where $r \ll \min(d_{\text{in}}, d_{\text{out}})$ (e.g., $r = 64$). This compression reduces per-layer adaptation complexity by $10$-$30\times$, as only $r(d_{\text{in}} + d_{\text{out}})$ parameters per layer are updated.
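A minimal PyTorch sketch of a LoRA-wrapped linear layer realizing $W' = W_0 + BA$; the module name and the conventional initialization (small random $A$, zero $B$, so the initial update is zero) are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer W_0 plus a trainable low-rank update BA."""

    def __init__(self, base_linear: nn.Linear, rank: int = 64):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # W_0 (and its bias) stay frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: d_out x r, zero init
        # Only r * (d_in + d_out) parameters per layer are trainable.

    def forward(self, x):
        # Equivalent to x @ (W_0 + B A)^T + bias.
        return self.base(x) + x @ self.A.T @ self.B.T
```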

4. Gradient-Based Update and Optimization

The adaptation process employs straightforward gradient-based optimization (e.g., AdamW), updating $\phi = \{A_l, B_l\}_{l=1}^{L}$ for all LoRA layers. With learning rate $\eta \approx 1\times10^{-4}$ and weight decay $\lambda_{\text{reg}}$, parameters are advanced by

$$A_l \leftarrow A_l - \eta\frac{\partial L}{\partial A_l}, \qquad B_l \leftarrow B_l - \eta\frac{\partial L}{\partial B_l}.$$

The forward pass utilizes $W' = W_0 + BA$ with $W_0$ frozen, ensuring that adaptation remains computationally lightweight and tractable in a real-robot setting.
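A sketch of the corresponding optimizer setup, reusing the hypothetical `LoRALinear` and `bc_loss` helpers from the earlier sketches; here the $\ell_2$ penalty is delegated to AdamW's decoupled weight decay rather than added to the loss:

```python
import torch

# `policy` is assumed to be the goal-conditioned policy with LoRALinear layers injected.
# Only the low-rank parameters phi = {A_l, B_l} still require gradients.
lora_params = [p for p in policy.parameters() if p.requires_grad]

optimizer = torch.optim.AdamW(lora_params, lr=1e-4, weight_decay=1e-2)


def adaptation_step(batch):
    optimizer.zero_grad()
    # Weight decay is applied by AdamW, so the explicit L2 term is disabled here.
    loss = bc_loss(policy, batch, lambda_reg=0.0, lora_params=None)
    loss.backward()   # gradients flow only into A_l, B_l; W_0 is frozen
    optimizer.step()
    return loss.item()
```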

5. Complete Online Adaptation Workflow

The following sequence operationalizes the adaptation:

  1. Initialization: All LoRA adapters $\phi$ are zero-initialized; the replay buffer $\mathbb{B}$ is empty.
  2. Experience Collection: The deployed policy gathers $T_{\text{rollout}}$ transitions, storing $(s_t, c_{p,t}, a_t, s_{t+1})$ in $\mathbb{B}$.
  3. Hindsight Relabeling: Once $|\mathbb{B}| = N$, generate $D_{\text{relabel}}$ by relabeling each transition’s goal $g' \leftarrow s_{t+1}$.
  4. Behavioral Cloning Finetuning: For $K$ epochs and minibatch size $B_s$ (typically $B_s = 8$-$16$ per GPU), optimize $L(\phi; D_{\text{relabel}})$ as detailed above.
  5. Buffer Reset: Clear $\mathbb{B}$ and repeat until task success or timeout.

A typical adaptation “round” (collection + $10$ epochs of finetuning) requires $\approx 5$ minutes on consumer-grade hardware (RTX 4090), making the loop suitable for real-robot deployment (Zhou et al., 29 Dec 2025).
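Putting the steps together, a high-level sketch of the adaptation loop above, reusing the hypothetical `ReplayBuffer`, `hindsight_relabel`, and `adaptation_step` helpers; the environment interface and the `make_minibatches` batching helper are assumed for illustration:

```python
def online_adaptation(env, policy, goal, num_rounds=3, capacity=20, epochs=10, batch_size=8):
    """One deployment session: collect, relabel, finetune, reset, repeat."""
    buffer = ReplayBuffer(capacity=capacity)
    obs, proprio = env.reset()
    for _ in range(num_rounds):
        # 1-2. Collect experience under the user-specified goal until the buffer is full.
        while not buffer.is_full():
            action = policy.act(obs, proprio, goal)
            next_obs, next_proprio = env.step(action)
            buffer.add(obs, proprio, action, next_obs)
            obs, proprio = next_obs, next_proprio
        # 3. Hindsight relabeling: every transition becomes a nominal success.
        dataset = hindsight_relabel(buffer)
        # 4. Behavioral-cloning finetuning of the LoRA parameters only.
        for _ in range(epochs):
            for batch in make_minibatches(dataset, batch_size):  # hypothetical batching helper
                adaptation_step(batch)
        # 5. Reset the buffer and continue until task success or timeout.
        buffer.transitions.clear()
```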

6. Hyperparameters and Implementation

Empirically optimized hyperparameters driving the approach include:

| Parameter | Value | Notes |
| --- | --- | --- |
| Replay buffer $N$ | $20$ transitions | Sufficient for rapid relabeling/adaptation |
| LoRA rank $r$ | $64$ | Per linear layer |
| Epochs $K$ | $10$ | Per adaptation round |
| Minibatch $B_s$ | $8$-$16$ per GPU | Varies by hardware |
| Learning rate $\eta$ | $1 \times 10^{-4}$ | AdamW optimizer |
| Weight decay $\lambda_{\text{reg}}$ | $1 \times 10^{-2}$ | Regularization |
| Relabel ratio | $100\%$ of transitions | Comprehensive use |

Each adaptation round (rollout, relabeling, and finetuning) amounts to $\approx 5$ minutes, confirming feasibility for online adjustment.
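These settings could be gathered into a single configuration object; a minimal sketch with illustrative field names:

```python
from dataclasses import dataclass


@dataclass
class AdaptationConfig:
    buffer_capacity: int = 20      # replay buffer size N
    lora_rank: int = 64            # r, per linear layer
    epochs: int = 10               # K, per adaptation round
    batch_size: int = 8            # B_s, 8-16 per GPU depending on hardware
    learning_rate: float = 1e-4    # eta, for AdamW
    weight_decay: float = 1e-2     # lambda_reg
    relabel_ratio: float = 1.0     # relabel 100% of transitions
```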

7. Empirical Results and Significance

Application of hindsight goal relabeling with LoRA-based finetuning demonstrates rapid policy improvement in real-world and simulated settings. In the Act2Goal “Plug-In” insertion task, the online success rate increases as follows:

| Adaptation Round | Success Rate | Cumulative Time |
| --- | --- | --- |
| $0$ (pre-finetune) | $0.30$ | $0$ min |
| $1$ | $0.55$ | $\approx 5$ min |
| $2$ | $0.75$ | $\approx 10$ min |
| $3$ | $0.90$ | $\approx 15$ min |

On the “Move Can” task (simulation, Robotwin 2.0 hard mode), analogous jumps occur: $0.13 \rightarrow 0.45 \rightarrow 0.70 \rightarrow 0.90$ over three rounds. These results substantiate that self-supervised adaptation with LoRA is sufficiently rapid and compute-efficient for deployment in long-horizon robotic tasks, with performance improvements from approximately $30\%$ to $90\%$ success rates within $10$-$15$ minutes.

Hindsight relabeling provides densely informative supervision irrespective of task outcome, and LoRA adaptation mitigates both the overfitting risk and the resource demands associated with updating larger models. The overall methodology transforms every observed trajectory, successful or not, into actionable learning signal, thereby eliminating the need for explicit reward engineering or human-in-the-loop correction (Zhou et al., 29 Dec 2025).
