
Steering Your Diffusion Policy with Latent Space Reinforcement Learning

Published 18 Jun 2025 in cs.RO and cs.LG | (2506.15799v2)

Abstract: Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.

Summary

  • The paper proposes DSRL, which steers diffusion policies by modifying the latent-noise distribution with reinforcement learning, enabling efficient policy adaptation.
  • The method is sample-efficient and requires only black-box access, avoiding any modification of the base policy's weights while leveraging noise aliasing to improve learning.
  • Experiments on simulated and real-world robotic tasks show that DSRL significantly outperforms behavioral cloning baselines and prior diffusion-policy RL methods.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

Introduction

The paper "Steering Your Diffusion Policy with Latent Space Reinforcement Learning" (2506.15799) addresses the challenge of adapting behavioral cloning (BC) policies for robotic control learned from human demonstrations. Traditional BC approaches do not use experience gathered during deployment to refine policy behavior; improving them requires further demonstrations, which are costly and time-consuming. Reinforcement learning (RL) offers autonomous online policy improvement but typically demands many samples. The authors propose an efficient approach, Diffusion Steering via Reinforcement Learning (DSRL), that exploits the latent-noise space of diffusion policies for fast, effective policy adaptation (Figure 1).

Figure 1: The DSRL framework modifies the initial noise distribution with an RL-trained policy to enhance real-world policy adaptation.

Methodology

Diffusion Steering Approach

In contrast to traditional methods that adjust policy weights, DSRL modifies the initial noise distribution used by the diffusion policy's sampler. Because a deterministic sampler maps each initial noise vector to a unique action, the choice of input noise determines the policy's behavior. A latent-noise-space policy, trained via RL, selects this noise to steer the frozen diffusion policy toward desired actions. Crucially, DSRL operates with only black-box access and never modifies the underlying policy weights, which makes it practical to deploy and even enables adaptation through API access (Figure 2).
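To make the black-box interface concrete, here is a minimal sketch (illustrative names only, not the paper's implementation) in which a steering policy chooses the latent noise that a frozen diffusion policy then decodes into an action:

```python
import numpy as np

class FrozenDiffusionPolicy:
    """Stand-in for a black-box BC diffusion policy mapping (state, noise) -> action.

    The 'denoiser' here is a fixed random linear map for illustration; the real
    policy would run its full deterministic (e.g., DDIM-style) reverse process.
    """
    def __init__(self, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Ws = rng.standard_normal((action_dim, state_dim)) * 0.1
        self.Wn = rng.standard_normal((action_dim, action_dim)) * 0.1

    def decode(self, state, noise):
        # Deterministic map g(s, w): same state + same noise -> same action.
        return np.tanh(self.Ws @ state + self.Wn @ noise)

class LatentNoisePolicy:
    """The steering policy trained with RL: outputs the initial noise w."""
    def __init__(self, state_dim, noise_dim, seed=1):
        rng = np.random.default_rng(seed)
        self.theta = rng.standard_normal((noise_dim, state_dim)) * 0.01

    def act(self, state):
        return self.theta @ state  # in practice, the mean of a Gaussian policy

def steered_action(base, steer, state):
    w = steer.act(state)          # RL-chosen latent noise
    return base.decode(state, w)  # black-box call; base weights untouched
```

The key property is that `decode` is deterministic, so the RL agent's choice of `w` fully controls the resulting action while the base policy's weights stay frozen.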

Figure 2: DSRL's adaptation capabilities demonstrated on OpenAI Gym tasks.

Latent-Noise Space Optimization

DSRL formalizes latent-noise-space optimization by recasting the control problem as a transformed MDP whose action space is the latent-noise space W: the steering policy outputs a noise vector w, the frozen diffusion policy decodes it into an action a = g(s, w), and the environment transitions as usual. Standard RL algorithms can then optimize the steering policy directly. This circumvents the computational challenges of back-propagating through multi-step denoising and enables sample-efficient training of the auxiliary latent-space policy.
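The transformed MDP can be sketched as a thin environment wrapper, assuming a Gym-style `env` with `reset()`/`step()` and a decoder callable `decode(s, w) = g(s, w)` exposed by the frozen policy (both hypothetical names, not the paper's code):

```python
import numpy as np

class NoiseSpaceEnv:
    """Wraps an action-space MDP into the latent-noise MDP used for steering.

    The RL agent's 'action' is the noise w; the wrapper decodes it through the
    frozen diffusion policy g(s, w) and steps the underlying environment.
    """
    def __init__(self, env, decode, noise_dim):
        self.env, self.decode, self.noise_dim = env, decode, noise_dim
        self._state = None

    def reset(self):
        self._state = self.env.reset()
        return self._state

    def step(self, w):
        assert w.shape == (self.noise_dim,)
        action = self.decode(self._state, w)                # a = g(s, w)
        next_state, reward, done, info = self.env.step(action)
        self._state = next_state
        return next_state, reward, done, info
```

Because the wrapper only re-labels actions, any off-the-shelf RL algorithm can be run on it unchanged, with no gradients flowing through the denoiser.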

Noise Aliasing Strategy

DSRL integrates a noise-aliasing strategy that improves learning efficiency. The approach exploits the fact that many different latent-noise values decode to similar actions, so an action-space critic can supervise a latent-space critic, reducing exploration requirements. The actor-critic framework incorporates these dynamics, offering a robust pathway to both offline and online learning (Figure 3).
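One way to read the aliasing idea, sketched under the assumption that an action-space critic `q_a` and the decoder `decode` are available as callables (a hypothetical interface, not the paper's code): sampled noises are "aliased" to their decoded actions, and the action-space critic's values become regression targets for a latent-noise critic.

```python
import numpy as np

def noise_aliasing_targets(q_a, decode, states, noise_dim,
                           samples_per_state=8, seed=0):
    """Build regression targets for a latent-noise critic Q_W(s, w) from an
    action-space critic Q_A(s, a), via a = g(s, w) with w ~ N(0, I).

    Returns (inputs, targets): pairs (s, w) and the values Q_A(s, g(s, w))
    that would supervise Q_W in a distillation step.
    """
    rng = np.random.default_rng(seed)
    inputs, targets = [], []
    for s in states:
        for _ in range(samples_per_state):
            w = rng.standard_normal(noise_dim)   # w ~ N(0, I)
            a = decode(s, w)                     # alias w to its decoded action
            inputs.append((s, w))
            targets.append(q_a(s, a))            # Q_W(s, w) <- Q_A(s, a)
    return inputs, np.array(targets)
```

Because the targets are computed with forward passes only, no gradient ever flows through the multi-step denoising chain.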

Figure 3: DSRL significantly improves the performance of a multi-task BC policy.

Experimental Results

Simulated Benchmarks

DSRL was evaluated on OpenAI Gym environments and the Robomimic benchmark, demonstrating significantly better sample efficiency than existing methods. DSRL efficiently steered diffusion policies toward optimal behavior, outperforming prior diffusion-policy RL methods such as DIPO and IDQL (Figure 4).

Figure 4: DSRL enables successful adaptation of diffusion policies in Robomimic tasks.

Real-World Applications

In practical settings, DSRL delivered substantial improvements on robotic tasks, refining both single-task and multi-task policies with limited online interaction. Its adaptation of policies trained on the Bridge V2 dataset with the WidowX robot, and of the π0 generalist policy, highlights its real-world applicability (Figure 5).

Figure 5: Real-world demonstration of DSRL adaptation capabilities.

Implications and Future Directions

DSRL's framework presents a practical approach for real-world robotic policy adaptation, offering significant improvements in sample efficiency and ease of implementation. Future work could explore DSRL's application beyond robotics, such as in image generation or protein modeling, leveraging its latent-space optimization potential.

Additionally, investigations into the theoretical underpinnings of diffusion steering could provide insights into the expressivity and structure of noise space, enhancing understanding and application across domains.

Conclusion

The paper introduces DSRL as a novel approach for adapting BC-trained diffusion policies using latent-space RL. DSRL showcases robust improvements in sample efficiency for policy refinement, making it a promising tool for autonomous robotic learning and potentially other domains where diffusion models are prevalent.

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, structured to guide future research.

  • Steerability conditions: Formal characterization of when varying the initial noise w can reliably steer a diffusion/flow policy to desirable actions. What properties of the denoiser, dataset, or task ensure that the mapping w ↦ a = g(s, w) has sufficient coverage of high-value actions?
  • Reachability limits: Criteria and diagnostics to detect when optimal or necessary actions lie outside the image of g(s, ·) for a given state, making DSRL fundamentally incapable of reaching them without weight finetuning or residuals.
  • Deterministic vs. stochastic sampling: Dependence of DSRL on DDIM/flow (deterministic) sampling. Can DSRL handle DDPM-style stochastic sampling (non-zero σ_t), or does it require determinism for stability and sample efficiency?
  • Sensitivity to diffusion hyperparameters: Systematic study of how denoising schedules, number of reverse steps, guidance weights, and architectural choices affect steerability, stability, and performance.
  • Latent dimensionality and action chunking: Effects of the latent-noise dimension (especially for high-dimensional, chunked actions such as in π0) on exploration difficulty, credit assignment, and scaling of sample complexity.
  • Exploration in latent space: Analysis and methods for efficient exploration in W (e.g., structured priors, intrinsic rewards, or optimism) to avoid myopic or mode-collapsed steering when high-reward regions of W are rare.
  • Noise aliasing theory: Theoretical justification and failure modes of the “noise aliasing” distillation (mapping Q_A(s, a) to Q_W(s, w) via a = g(s, w) with w ~ N(0, I)). How much bias does this introduce, and how does it affect reaching high-value regions unlikely under N(0, I)?
  • Offline learning conservatism: Formal guarantees (or counterexamples) that DSRL-NA remains conservative in offline settings. Under what conditions does steering push decoded actions out of the behavior distribution, and how can deviation be controlled (e.g., KL regularization toward N(0, I))?
  • Inverse mapping from actions to noise: Methods to find w such that g(s, w) ≈ a for given offline (s, a) pairs, potentially enabling stronger use of offline datasets than sampling w ~ N(0, I).
  • Safety and constraints: Mechanisms for safe exploration and constraint satisfaction when steering (e.g., action-space safety filters, constrained RL in W), especially in real-world robotics where risky decoded actions are possible.
  • API/practical access constraints: Many deployed diffusion/flow policy APIs do not expose the initial noise. What interfaces or wrappers are required to enable DSRL in practice, and how restrictive is this requirement?
  • Compute and latency budget: Quantitative analysis of control-loop latency and compute overhead from denoising and aliasing-based distillation. How many forward passes per control step are needed, and how does this impact high-frequency control?
  • Comparative finetuning baselines: Missing head-to-head comparisons with gradient-based finetuning of diffusion/flow policies on real robots, and with strong residual RL baselines under identical conditions and budgets.
  • Task interference in generalist policies: When steering a multi-task or generalist policy, does the learned latent policy degrade performance on other tasks or generalization? How can steering be conditioned or regularized to avoid catastrophic interference?
  • Reward specification/feedback: The experiments rely on sparse binary rewards; how can DSRL incorporate preference-based feedback, learned reward models, or language-based rewards to scale to open-world deployment?
  • Long-horizon temporal structure: Strategies for coordinating w across long horizons and action chunks (e.g., temporal abstractions, sequence-level latent policies, or planning in W). How does temporal correlation of w affect stability?
  • Robustness and generalization: Sensitivity to observation noise, domain shifts, and sim-to-real gaps. Are DSRL policies brittle to slight changes in scene layout, dynamics, or camera viewpoints?
  • Optimization landscape in W: Empirical/analytical study of the (non)smoothness and multimodality of w ↦ g(s, w) and its impact on policy-gradient variance, convergence, and stability.
  • Offline-to-online transition: Why did standard offline-to-online baselines fail while DSRL succeeded? Ablations to isolate which DSRL components (e.g., aliasing, initialization from base rollouts) drive the gains.
  • Hybrid methods: Potential benefits of combining steering with small-weight finetuning or residual policies when steerability is insufficient. When and how should DSRL switch to or blend with these alternatives?
  • Interpretability and diagnostics: Tools to visualize and diagnose how w manipulates the action manifold, detect out-of-support decoding, and monitor safety margins during online adaptation.
  • Theoretical sample complexity: Lack of convergence/sample-efficiency analysis for RL in the latent-noise MDP. How does performance scale with the dimension of W, task complexity, and the calibration of g?
  • Robustness to denoiser miscalibration: Effects of approximation error in g(s, w) (e.g., under-trained or miscalibrated diffusion/flow policies) on DSRL performance and reliability, especially when s is out-of-distribution relative to the demonstrations.
  • Deployment considerations: Persistence and portability of learned steering policies across robots, scenes, and firmware updates; memory/compute footprint; and whether steering learned for one task negatively biases future tasks.
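Several of the gaps above hinge on whether the reverse process is deterministic. As a point of reference (standard diffusion-sampling updates, not code from this paper), the DDIM step is a deterministic function of the current iterate, while a DDPM-style step injects fresh noise at every step, so the initial noise w no longer uniquely determines the decoded action:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev):
    # Deterministic reverse step: x_{t-1} is a function of x_t alone, so the
    # final sample is a function g(s, w) of the initial noise w.
    x0_hat = (x_t - np.sqrt(1 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * x0_hat + np.sqrt(1 - alpha_prev) * eps_pred

def ddpm_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t, rng):
    # Stochastic reverse step: fresh Gaussian noise enters at each step,
    # breaking the deterministic mapping from initial noise to final sample.
    x0_hat = (x_t - np.sqrt(1 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    mean = (np.sqrt(alpha_prev) * x0_hat
            + np.sqrt(max(1 - alpha_prev - sigma_t**2, 0.0)) * eps_pred)
    return mean + sigma_t * rng.standard_normal(x_t.shape)
```

Setting σ_t = 0 recovers the deterministic DDIM update, which is the regime where latent-noise steering is well defined.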
