Reactive Diffusion Policy for Robotic Control
- Reactive Diffusion Policy (RDP) is a control paradigm that leverages conditional denoising diffusion for immediate, closed-loop, multi-modal robotic action generation.
- It integrates innovations like noise-relaying buffers, real-time iteration, and hierarchical architectures to drastically reduce inference latency and boost response rates.
- Empirical evaluations show improvements including an 18% higher success rate and high-frequency replanning (up to 50–100 Hz) in dynamic, contact-rich tasks.
A Reactive Diffusion Policy (RDP) is a control paradigm for robotic visuomotor and multi-modal policy learning that leverages conditional denoising diffusion processes to generate actions in a highly responsive, closed-loop manner. Unlike classical diffusion policies, which suffer from high latency due to multi-step stochastic denoising required for action generation and may lack reactivity during action chunk execution, RDPs are architected to ensure immediate adaptation to the most recent observations. By integrating algorithmic innovations such as noise-relaying buffers, real-time iteration schemes, fast-slow hierarchical architectures, and explicit modality fusion, RDPs enable low-latency, robust, and multimodal action generation suitable for contact-rich, dynamic, and real-time tasks.
1. Mathematical Foundations
Reactive Diffusion Policies generalize the denoising diffusion probabilistic model (DDPM) framework to the closed-loop, real-time control of robotic systems. The canonical RDP involves the forward noising of a ground-truth action or action sequence and a learned reverse denoising process, conditioned on the observation at each control step.
Forward (Noising) Process
Given a clean action vector or action sequence $a^0$, the forward diffusion process is defined as

$$q(a^k \mid a^{k-1}) = \mathcal{N}\!\left(a^k;\ \sqrt{1-\beta_k}\, a^{k-1},\ \beta_k I\right),$$

with a pre-specified variance schedule $\{\beta_k\}_{k=1}^{K}$. Equivalently, with $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$,

$$a^k = \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

applies to each element or sequence chunk as appropriate (Chen et al., 18 Feb 2025; Chi et al., 2023).
Reverse (Denoising) Process
The learned model approximates the reverse distribution,

$$p_\theta(a^{k-1} \mid a^k, o) = \mathcal{N}\!\left(a^{k-1};\ \mu_\theta(a^k, k, o),\ \Sigma_k\right),$$

where $\mu_\theta$ is parameterized through noise prediction via $\epsilon_\theta$:

$$\mu_\theta(a^k, k, o) = \frac{1}{\sqrt{1-\beta_k}}\left(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\right);$$

$\Sigma_k$ is often set to $\beta_k I$ (Chen et al., 18 Feb 2025).
The training objective typically uses mean-squared error denoising score matching,

$$\mathcal{L}(\theta) = \mathbb{E}_{a^0,\, k,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(a^k, k, o) \right\|^2\right],$$

where $a^k = \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon$ is a noise-corrupted version of the demonstrator action(s) (Chen et al., 18 Feb 2025; Chi et al., 2023).
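The forward noising step and the denoising score-matching loss can be sketched directly in NumPy. The linear beta schedule, step count, and action dimension below are illustrative assumptions, not values from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50                                # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, K)    # assumed linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)  # \bar{alpha}_k = prod_{i<=k} (1 - beta_i)

def forward_noise(a0, k, eps):
    """Closed-form forward process: a^k = sqrt(abar_k) a^0 + sqrt(1-abar_k) eps."""
    ab = alpha_bars[k]
    return np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps

def denoising_loss(eps_pred, eps):
    """MSE denoising score-matching objective on the predicted noise."""
    return float(np.mean((eps_pred - eps) ** 2))

a0 = rng.normal(size=(7,))            # toy clean 7-DoF action
eps = rng.normal(size=a0.shape)
ak = forward_noise(a0, k=25, eps=eps) # noise-corrupted action at level k=25
```

A perfect noise predictor would drive the loss to zero; in practice $\epsilon_\theta$ is a conditional network trained over random $(a^0, k, \epsilon)$ draws.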
2. Real-Time Inference and Algorithmic Mechanisms
RDPs achieve their hallmark reactivity by replacing the typical “long-horizon, multi-step” denoising with specialized mechanisms enabling immediate per-step closed-loop control.
Noise-Relaying Buffer and Sequential Denoising (RNR-DP)
Responsive Noise-Relaying Diffusion Policy (RNR-DP) maintains a sliding buffer of $N$ actions at incrementally decreasing noise levels. At each environment step:
- The denoiser is applied once to the entire buffer (length $N$), producing a fully denoised head action and denoised intermediate states.
- The head (clean) action is executed; the buffer is shifted, and a new high-noise sample is appended at the tail.
- This requires a single network evaluation per control step, i.e., $1$ NFE (network forward evaluation) per action, compared to $K/H$ for a vanilla diffusion policy with chunk size $H$ and $K$ denoising steps (Chen et al., 18 Feb 2025).
Initialization employs a “laddering” phase, running several pre-steps so that the buffer’s noise-level distribution matches that seen during training.
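The buffer mechanics above can be sketched as follows; the one-step "denoiser" is a placeholder for a single forward pass of the trained network, and the buffer size is an assumed toy value:

```python
import numpy as np

rng = np.random.default_rng(1)
N, ACT_DIM = 8, 7                     # assumed buffer length and action dim

def denoise_one_step(buffer, levels, obs):
    # Placeholder denoiser: each entry moves one noise level toward clean.
    # A real RNR-DP applies the network once to the whole buffer.
    return buffer * 0.9, np.maximum(levels - 1, 0)

def rnr_dp_step(buffer, levels, obs):
    """One control step: denoise the buffer once, execute the head, relay noise."""
    buffer, levels = denoise_one_step(buffer, levels, obs)
    head_action = buffer[0]                   # fully denoised head action
    buffer = np.roll(buffer, -1, axis=0)      # shift the buffer forward
    buffer[-1] = rng.normal(size=ACT_DIM)     # append a fresh high-noise sample
    levels = np.roll(levels, -1)
    levels[-1] = N - 1                        # tail sits at the highest noise level
    return head_action, buffer, levels

# "Laddering" warm-up: pre-steps build a strictly decreasing noise ladder
# so the buffer matches the distribution seen during training.
buffer = rng.normal(size=(N, ACT_DIM))
levels = np.full(N, N - 1)
for _ in range(N - 1):
    _, buffer, levels = rnr_dp_step(buffer, levels, obs=None)

action, buffer, levels = rnr_dp_step(buffer, levels, obs=None)
```

After warm-up the buffer holds one entry per noise level, and each control step costs exactly one network evaluation.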
Real-Time Iteration (RTI) Scheme
The RTI-DP scheme amortizes multi-step denoising across control steps by “warm-starting” the diffusion chain with the tail of the last predicted trajectory. At each time step, the input chunk is obtained by shifting the previous chunk, dropping the executed action, and duplicating the last element. Only a small number $m$ of denoising steps is then performed. Contractivity guarantees (given Lipschitz score networks) show that a small $m$ suffices for accurate recovery (Duan et al., 7 Aug 2025).
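The warm-start shift can be sketched as below; the contractive one-step `denoise` is a stand-in for the pre-trained policy's reverse step, and the chunk shape is assumed:

```python
import numpy as np

def rti_warm_start(prev_chunk):
    """Shift out the executed action and duplicate the last element."""
    chunk = np.empty_like(prev_chunk)
    chunk[:-1] = prev_chunk[1:]
    chunk[-1] = prev_chunk[-1]
    return chunk

def denoise(chunk, obs):
    # Placeholder contractive reverse step (stands in for the score network).
    return 0.5 * chunk

def rti_dp_step(prev_chunk, obs, m=2):
    """One control step running only m << K denoising iterations."""
    chunk = rti_warm_start(prev_chunk)
    for _ in range(m):
        chunk = denoise(chunk, obs)
    return chunk

prev = np.arange(12.0).reshape(6, 2)  # toy previous 6-step, 2-DoF chunk
chunk = rti_dp_step(prev, obs=None)
```

Because the warm start is already close to the new solution, the contraction argument lets $m$ be far smaller than the full denoising budget $K$.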
Action Queues and High-Frequency Replanning (RA-DP)
RA-DP trains a policy to denoise actions at mixed noise levels within an “action queue.” At inference, the queue is updated at each timestep:
- The buffer’s entries are each denoised by one step, outputting a clean action at the head.
- The queue is shifted; a new high-noise action is appended.
- Action-specific differentiable guidance terms (e.g., obstacle avoidance) can be applied at test time without retraining, by backpropagating through the denoising step (Ye et al., 6 Mar 2025).
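The queue update with test-time guidance can be sketched as follows; the one-step denoiser, the quadratic "obstacle" cost, and all sizes are illustrative assumptions rather than the RA-DP implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
Q, ACT_DIM = 6, 2                     # assumed queue length and action dim
obstacle = np.array([1.0, 1.0])       # assumed obstacle position in action space

def guidance_grad(a, weight=0.1):
    """Repulsive guidance term pushing actions away from the obstacle."""
    return weight * (a - obstacle)

def radp_step(queue, obs, guided=True):
    """One control step: denoise all entries once, pop the clean head."""
    queue = 0.8 * queue               # placeholder one-step denoiser
    if guided:
        queue = queue + guidance_grad(queue)  # training-free, test-time guidance
    action = queue[0]                 # clean action at the head
    queue = np.roll(queue, -1, axis=0)
    queue[-1] = rng.normal(size=ACT_DIM)      # fresh high-noise tail entry
    return action, queue

queue = rng.normal(size=(Q, ACT_DIM))
for _ in range(Q):                    # warm up the mixed-noise-level queue
    action, queue = radp_step(queue, obs=None)
```

The guidance term is applied purely at inference, which is what makes constraints like obstacle avoidance pluggable without retraining.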
3. Hierarchical and Multi-Modal Extensions
Slow-Fast Hierarchies
Several RDP variants (e.g., “RDP” in (Xue et al., 4 Mar 2025), ImplicitRDP (Chen et al., 11 Dec 2025)) implement a two-level policy:
- Slow (high-level) diffusion policy: Samples coarse action chunks or latent codes via DDPM at a low rate (1–2 Hz).
- Fast (low-level) corrective module: Takes the current chunk/latent and high-frequency feedback (tactile/force), producing real-time action corrections at 20–30 Hz.
The slow policy operates on compressed latent representations; the fast policy employs a decoder (e.g., asymmetric tokenizer GRU) that learns to integrate the fixed latent trajectory with the latest touch or force features (Xue et al., 4 Mar 2025). In ImplicitRDP, a structural slow–fast architecture uses causal attention to directly fuse asynchronous modalities (vision, proprio, force) within a single Transformer, where attention masking and virtual-target regularization mitigate “modality collapse” and ensure closed-loop force adjustability (Chen et al., 11 Dec 2025).
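The two-rate control loop can be sketched as below; both modules, the rate ratio, and the compliance gain are toy stand-ins for the learned slow diffusion policy and fast corrective decoder:

```python
import numpy as np

rng = np.random.default_rng(3)
SLOW_PERIOD = 20                      # assumed fast steps per slow update (e.g. 20 Hz / 1 Hz)

def slow_policy(obs):
    """Stand-in for the high-level DDPM: sample a coarse latent chunk."""
    return rng.normal(size=(8,))

def fast_policy(latent, force, gain=0.05):
    """Stand-in for the low-level corrective decoder (e.g. a GRU head)."""
    correction = -gain * force        # comply with the measured contact force
    return latent[:3] + correction    # assumed 3-DoF Cartesian action

latent = slow_policy(obs=None)
actions = []
for t in range(60):                   # three slow cycles of closed-loop control
    if t % SLOW_PERIOD == 0:
        latent = slow_policy(obs=None)       # slow refresh of the latent plan
    force = rng.normal(size=(3,))            # high-frequency force feedback
    actions.append(fast_policy(latent, force))
```

The key structural point is that force feedback enters every fast step, while the expensive diffusion sampling runs only once per slow period.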
Virtual-Target and Representation Regularization
ImplicitRDP introduces virtual-target-based representation regularization (VRR), in which force signals are mapped into the same Cartesian space as actions via a compliance model. Predicting the resulting virtual target provides a strong learning signal for contact-rich events, guiding the network to utilize force feedback appropriately and overcoming the tendency of vision-only models to ignore high-frequency contacts (Chen et al., 11 Dec 2025).
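The compliance mapping behind the virtual target can be sketched as follows; the diagonal compliance matrix and the auxiliary-loss form are illustrative assumptions, not the ImplicitRDP parameterization:

```python
import numpy as np

C = np.diag([0.01, 0.01, 0.02])       # assumed compliance matrix (m/N per axis)

def virtual_target(x, f):
    """Virtual Cartesian target implied by current pose x and contact force f."""
    return x + C @ f

def vrr_loss(pred_target, x, f):
    """Auxiliary MSE loss encouraging the network to explain force feedback."""
    return float(np.mean((pred_target - virtual_target(x, f)) ** 2))

x = np.zeros(3)
f = np.array([10.0, 0.0, -5.0])       # e.g. 10 N push in x, 5 N pull in z
target = virtual_target(x, f)         # displaced along the force direction
```

Because the target lives in the same space as the predicted actions, the auxiliary loss directly rewards representations that carry force information.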
4. Empirical Evaluation and Performance
Table: Representative Policy Benchmarking
| Framework | Task Domain | Success Rate / Acceleration | Key Feature |
|---|---|---|---|
| RNR-DP (Chen et al., 18 Feb 2025) | ManiSkill/Adroit | +18% success vs. DP on response-sensitive tasks | Noise-relaying buffer, true “single-step” inference |
| RTI-DP (Duan et al., 7 Aug 2025) | Robomimic | 25 ms per step (vs. <800 ms for DP); 0.90+ success | Real-time iteration, contractivity analysis, scaling discrete actions |
| RA-DP (Ye et al., 6 Mar 2025) | MetaWorld/Real | 3.6–5x replanning frequency, +2.9–10.3 pp SR | Training-free guidance compatible action-queue, obstacle avoidance |
| RDP (SF) (Xue et al., 4 Mar 2025) | Visual-Tactile | 0.90/0.95 “All” score on peeling w/ force | Slow-fast hierarchy, tactile fusion, asymmetric tokenizer |
| ImplicitRDP (Chen et al., 11 Dec 2025) | Contact-rich | 18/20 box flipping vs 0/20 vision-only | End-to-end vision-force fusion, SSL, VRR |
| SRPO (Chen et al., 2023) | D4RL/MuJoCo | 25–1000x faster sampling, SOTA score | Score regularization, no denoising at test time |
RDPs consistently yield both superior task success and order-of-magnitude inference acceleration across diverse robotic domains. Notably, RNR-DP achieves higher success over baseline DP on dynamic manipulation, and RA-DP supports 50–100 Hz replanning on physical robots with dynamic obstacles (Chen et al., 18 Feb 2025, Ye et al., 6 Mar 2025). Hierarchical and slow-fast variants (RDP (Xue et al., 4 Mar 2025), ImplicitRDP (Chen et al., 11 Dec 2025)) outperform vision-only policies, particularly on contact-rich manipulation, with reported gains up to 0.95 score or full trial completion.
5. Implementation and Practical Considerations
Buffer and Action Queue Size
Empirical ablations identify an optimal buffer capacity or action-queue length for each method, with stable performance across a broad range of sizes. Policies relying only on linear or purely random diffusion schedules markedly underperform hybrid schedules (Chen et al., 18 Feb 2025).
Inference Complexity and Latency
All RDP mechanisms achieve near real-time control (20–100 Hz) on standard hardware (e.g., NVIDIA A100). Methods such as RTI-DP and RA-DP eliminate the need for retraining or distillation and are natively compatible with pre-trained diffusion policies (Ye et al., 6 Mar 2025, Duan et al., 7 Aug 2025).
Multi-Modal Fusion and Training
Structural slow–fast learning and regularization (e.g., VRR in ImplicitRDP) are critical for fusing asynchronous or high-frequency modalities (force, tactile, proprio) and preventing the network from ignoring crucial feedback (modality collapse). This results in robust adaptation to contacts and perturbations, crucial for dexterous manipulation (Chen et al., 11 Dec 2025, Xue et al., 4 Mar 2025).
6. Limitations and Future Directions
Documented limitations include memory overhead for maintaining buffers or queues, initial laddering warm-up cost, and as yet incomplete theoretical characterization of diversity preservation in multi-modal behaviors (Chen et al., 18 Feb 2025). A promising direction is hierarchical buffering for very high-dimensional controllers and integration with temporal abstractions to allow strategic as well as reflexive actions (Chen et al., 18 Feb 2025). Extensions to vision-language and hybrid planning-reactive architectures are anticipated.
A plausible implication is that RDP frameworks, especially those supporting guidance signals (RA-DP), provide a practical path to “plug-and-play” constraint adaptation in real-time, without architectural retraining. This property, alongside demonstrated quantitative improvements in latency-sensitive and contact-rich settings, positions Reactive Diffusion Policies as a central paradigm for forthcoming generations of closed-loop, multimodal robot control.