Hindsight Goal Relabeling with LoRA Finetuning

Updated 30 December 2025
  • The paper introduces a reward-free, online adaptation technique that employs hindsight relabeling with LoRA finetuning to transform recent rollouts into self-supervised training data.
  • It achieves rapid adaptation by updating only low-rank adapter parameters, enabling substantial performance gains in tasks like plug-in insertion within minutes.
  • Empirical results demonstrate that the method improves success rates from around 30% to 90% in as little as 15 minutes, underscoring its efficiency for long-horizon robotic tasks.

Hindsight goal relabeling with LoRA-based finetuning is a reward-free online adaptation technique for goal-conditioned policy learning in robotic manipulation. It converts recent rollouts directly into self-supervised training data via hindsight-style goal relabeling, followed by rapid behavioral cloning updates restricted to low-rank adapter (LoRA) parameters. This approach enables real-robot policies to self-improve autonomously on long-horizon, out-of-distribution tasks, without reliance on external reward signals or human feedback. The method has been concretely instantiated in the Act2Goal framework, where it yields substantial gains in manipulation policy performance within minutes (Zhou et al., 29 Dec 2025).

1. Hindsight Goal Relabeling Procedure

During online deployment, the policy $\pi_\theta$, comprising frozen base weights and trainable LoRA adapters, executes under a user-specified goal $g$. At each timestep $t$, the agent observes the current RGB-D image $o_t$ and proprioceptive state $c_{p,t}$, executes $a_t = \pi_\theta(o_t, c_{p,t}, g)$, and then records the resulting observation $o_{t+1}$. Each transition $(o_t, c_{p,t}, a_t, o_{t+1})$ is stored in an in-memory replay buffer $\mathbb{B}$ of capacity $N$ (typically $N = 20$).

Upon filling this buffer, an alternate-goal sampling procedure relabels each transition: the new goal $g' \leftarrow o_{t+1}$ frames the task as achieving the observed outcome. The relabeled dataset $D_{\text{relabel}}$ therefore consists of tuples $(s_t, c_{p,t}, g', a_t)$, with $s_t \equiv o_t$, turning each experience into a nominal “success” for the policy to imitate.
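A minimal Python sketch of this buffer-and-relabel step, assuming transitions are stored as plain dictionaries; the class and field names are illustrative rather than taken from the paper:

```python
from collections import deque


class ReplayBuffer:
    """Fixed-capacity in-memory buffer of online transitions."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.transitions = deque(maxlen=capacity)

    def add(self, obs, proprio, action, next_obs):
        self.transitions.append(
            {"obs": obs, "proprio": proprio, "action": action, "next_obs": next_obs}
        )

    def is_full(self):
        return len(self.transitions) == self.capacity


def hindsight_relabel(buffer):
    """Relabel each transition's goal with the observation it actually produced."""
    relabeled = []
    for tr in buffer.transitions:
        relabeled.append(
            {
                "state": tr["obs"],        # s_t = o_t
                "proprio": tr["proprio"],  # c_{p,t}
                "goal": tr["next_obs"],    # g' <- o_{t+1}: the achieved outcome becomes the goal
                "action": tr["action"],    # a_t, the action to imitate
            }
        )
    return relabeled
```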

2. Fine-tuning Objective and Behavioral Cloning

Finetuning is restricted to only the LoRA adapter parameters $\phi$; the base model $\theta_{\text{base}}$ is held fixed. The supervised loss per adaptation round is

$$L(\phi; D_{\text{relabel}}) = \frac{1}{|D_{\text{relabel}}|}\sum_{(s, c_p, g', a) \in D_{\text{relabel}}} \left\| \pi_{\theta_{\text{base}}, \phi}(s, c_p, g') - a \right\|_2^2 + \lambda_{\text{reg}}\|\phi\|_2^2,$$

where $\lambda_{\text{reg}}$ is a small weight decay penalty (e.g., $1\times10^{-2}$). The policy thus undergoes $K$ epochs (with $K = 10$ typical) of supervised regression on $D_{\text{relabel}}$ using behavioral cloning, leveraging all relabeled transitions.
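A hedged PyTorch sketch of this objective; the `policy` callable, batch layout, and tensor shapes are assumptions for illustration:

```python
import torch


def bc_loss(policy, batch, lambda_reg=1e-2, lora_params=None):
    """Mean squared behavioral-cloning loss over relabeled transitions,
    plus an explicit L2 penalty on the LoRA parameters phi."""
    pred_actions = policy(batch["state"], batch["proprio"], batch["goal"])
    # Per-sample squared L2 error to the recorded action, averaged over the batch.
    mse = ((pred_actions - batch["action"]) ** 2).sum(dim=-1).mean()
    reg = sum((p ** 2).sum() for p in lora_params) if lora_params else 0.0
    return mse + lambda_reg * reg
```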

3. LoRA Adapter Architecture and Injection

Rather than editing the multi-billion-parameter base model directly, LoRA adapters are injected into every linear layer $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ within cross-attention or MLP modules. Each adapted weight is computed as

$$W' = W_0 + BA,$$

with $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$, where $r \ll \min(d_{\text{in}}, d_{\text{out}})$ (e.g., $r = 64$). This compression reduces per-layer adaptation complexity by $10$-$30\times$, as only $r(d_{\text{in}} + d_{\text{out}})$ parameters per layer are updated.
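A minimal PyTorch sketch of a LoRA-wrapped linear layer realizing $W' = W_0 + BA$; the module name and the conventional initialization (small random $A$, zero $B$, so the initial update is zero) are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer W_0 plus a trainable low-rank update BA."""

    def __init__(self, base_linear: nn.Linear, rank: int = 64):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # W_0 (and its bias) stay frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: d_out x r, zero init
        # Only r * (d_in + d_out) parameters per layer are trainable.

    def forward(self, x):
        # Equivalent to x @ (W_0 + B A)^T + bias.
        return self.base(x) + x @ self.A.T @ self.B.T
```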

4. Gradient-Based Update and Optimization

The adaptation process employs straightforward gradient-based optimization (e.g., AdamW), updating $\phi = \{A_l, B_l\}_{l=1}^{L}$ for all LoRA layers. With learning rate $\eta \approx 1\times10^{-4}$ and weight decay $\lambda_{\text{reg}}$, parameters are advanced by

$$A_l \leftarrow A_l - \eta\frac{\partial L}{\partial A_l}, \qquad B_l \leftarrow B_l - \eta\frac{\partial L}{\partial B_l}.$$

The forward pass utilizes $W' = W_0 + BA$ with $W_0$ frozen, ensuring that adaptation remains computationally lightweight and tractable in a real-robot setting.
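A sketch of the corresponding optimizer setup, reusing the hypothetical `LoRALinear` and `bc_loss` helpers from the earlier sketches; here the $\ell_2$ penalty is delegated to AdamW's decoupled weight decay rather than added to the loss:

```python
import torch

# `policy` is assumed to be the goal-conditioned policy with LoRALinear layers injected.
# Only the low-rank parameters phi = {A_l, B_l} still require gradients.
lora_params = [p for p in policy.parameters() if p.requires_grad]

optimizer = torch.optim.AdamW(lora_params, lr=1e-4, weight_decay=1e-2)


def adaptation_step(batch):
    optimizer.zero_grad()
    # Weight decay is applied by AdamW, so the explicit L2 term is disabled here.
    loss = bc_loss(policy, batch, lambda_reg=0.0, lora_params=None)
    loss.backward()   # gradients flow only into A_l, B_l; W_0 is frozen
    optimizer.step()
    return loss.item()
```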

5. Complete Online Adaptation Workflow

The following sequence operationalizes the adaptation:

  1. Initialization: All LoRA adapters $\phi$ are zero-initialized; the replay buffer $\mathbb{B}$ is empty.
  2. Experience Collection: The deployed policy gathers $T_{\text{rollout}}$ transitions, storing $(s_t, c_{p,t}, a_t, s_{t+1})$ in $\mathbb{B}$.
  3. Hindsight Relabeling: Once $|\mathbb{B}| = N$, generate $D_{\text{relabel}}$ by relabeling each transition’s goal $g' \leftarrow s_{t+1}$.
  4. Behavioral Cloning Finetuning: For $K$ epochs and minibatch size $B_s$ (typically $B_s = 8$-$16$ per GPU), optimize $L(\phi; D_{\text{relabel}})$ as detailed above.
  5. Buffer Reset: Clear $\mathbb{B}$ and repeat until task success or timeout.

A typical adaptation “round” (collection + $10$ epochs of finetuning) requires $\approx 5$ minutes on consumer-grade hardware (RTX 4090), making the loop suitable for real-robot deployment (Zhou et al., 29 Dec 2025).
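Putting the steps together, a high-level sketch of the adaptation loop above, reusing the hypothetical `ReplayBuffer`, `hindsight_relabel`, and `adaptation_step` helpers; the environment interface and the `make_minibatches` batching helper are assumed for illustration:

```python
def online_adaptation(env, policy, goal, num_rounds=3, capacity=20, epochs=10, batch_size=8):
    """One deployment session: collect, relabel, finetune, reset, repeat."""
    buffer = ReplayBuffer(capacity=capacity)
    obs, proprio = env.reset()
    for _ in range(num_rounds):
        # 1-2. Collect experience under the user-specified goal until the buffer is full.
        while not buffer.is_full():
            action = policy.act(obs, proprio, goal)
            next_obs, next_proprio = env.step(action)
            buffer.add(obs, proprio, action, next_obs)
            obs, proprio = next_obs, next_proprio
        # 3. Hindsight relabeling: every transition becomes a nominal success.
        dataset = hindsight_relabel(buffer)
        # 4. Behavioral-cloning finetuning of the LoRA parameters only.
        for _ in range(epochs):
            for batch in make_minibatches(dataset, batch_size):  # hypothetical batching helper
                adaptation_step(batch)
        # 5. Reset the buffer and continue until task success or timeout.
        buffer.transitions.clear()
```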

6. Hyperparameters and Implementation

Empirically optimized hyperparameters driving the approach include:

| Parameter | Value | Notes |
| --- | --- | --- |
| Replay buffer $N$ | $20$ transitions | Sufficient for rapid relabeling/adaptation |
| LoRA rank $r$ | $64$ | Per linear layer |
| Epochs $K$ | $10$ | Per adaptation round |
| Minibatch $B_s$ | $8$-$16$ per GPU | Varies by hardware |
| Learning rate $\eta$ | $1 \times 10^{-4}$ | AdamW optimizer |
| Weight decay $\lambda_{\text{reg}}$ | $1 \times 10^{-2}$ | Regularization |
| Relabel ratio | $100\%$ of transitions | Comprehensive use |

Each adaptation round (rollout, relabeling, and finetuning) amounts to $\approx 5$ minutes, confirming feasibility for online adjustment.
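These settings could be gathered into a single configuration object; a minimal sketch with illustrative field names:

```python
from dataclasses import dataclass


@dataclass
class AdaptationConfig:
    buffer_capacity: int = 20      # replay buffer size N
    lora_rank: int = 64            # r, per linear layer
    epochs: int = 10               # K, per adaptation round
    batch_size: int = 8            # B_s, 8-16 per GPU depending on hardware
    learning_rate: float = 1e-4    # eta, for AdamW
    weight_decay: float = 1e-2     # lambda_reg
    relabel_ratio: float = 1.0     # relabel 100% of transitions
```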

7. Empirical Results and Significance

Application of hindsight goal relabeling with LoRA-based finetuning demonstrates rapid policy improvement in real-world and simulated settings. In the Act2Goal “Plug-In” insertion task, the online success rate increases as follows:

| Adaptation Round | Success Rate | Cumulative Time |
| --- | --- | --- |
| $0$ (pre-finetune) | $0.30$ | $0$ min |
| $1$ | $0.55$ | $\approx 5$ min |
| $2$ | $0.75$ | $\approx 10$ min |
| $3$ | $0.90$ | $\approx 15$ min |

On the “Move Can” task (simulation, Robotwin 2.0 hard mode), analogous jumps occur: $0.13 \rightarrow 0.45 \rightarrow 0.70 \rightarrow 0.90$ over three rounds. These results substantiate that self-supervised adaptation with LoRA is sufficiently rapid and compute-efficient for deployment in long-horizon robotic tasks, with performance improvements from approximately $30\%$ to $90\%$ success rates within $10$-$15$ minutes.

Hindsight relabeling provides densely informative supervision irrespective of task outcome, and LoRA adaptation mitigates both the overfitting risk and the resource demands associated with updating larger models. The overall methodology transforms every observed trajectory, successful or not, into actionable learning signal, thereby eliminating the need for explicit reward engineering or human-in-the-loop correction (Zhou et al., 29 Dec 2025).
