- The paper provides a comprehensive survey of how diffusion models robustly capture complex, multi-modal distributions in robotic manipulation, spanning grasp generation, trajectory planning, and visual data augmentation.
- It details the mathematical foundations of score-based and denoising diffusion models and describes improvements such as faster sampling techniques and robot-specific conditioning.
- The survey reviews evaluations of these methods in both simulation and real-world settings, and discusses challenges in generalizability and sampling speed that are critical for practical robotic applications.
This survey provides a comprehensive review of the application of diffusion models (DMs) in robotic manipulation, covering grasp learning, trajectory planning, and data augmentation (2504.08438). It highlights the advantages of DMs, such as their ability to model complex, multi-modal distributions robustly in high-dimensional spaces, often outperforming alternatives such as Gaussian mixture models (GMMs), energy-based models (EBMs), and generative adversarial networks (GANs). The survey notes a significant increase in DM applications in robotics since 2022.
Fundamentals of Diffusion Models:
The paper first introduces the mathematical foundations of two primary DM frameworks:
- Score-Based DMs (SMLD/NCSN): Learn the score (the gradient of the log-density) of data perturbed at multiple noise levels, then reverse the noise-adding (forward) process using annealed Langevin dynamics.
- Denoising Diffusion Probabilistic Models (DDPM): Train a network to predict the noise added during the forward process and use this prediction to iteratively denoise samples in the reverse process.
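The DDPM training objective described above can be sketched in a few lines. This is a minimal NumPy illustration, not the survey's code: `toy_eps_model` is a stand-in for the learned noise-prediction network, and the linear beta schedule is one common choice of noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule: beta_t rises from 1e-4 to 0.02 over T steps;
# alpha_bar_t is the cumulative product of (1 - beta_t).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def toy_eps_model(x_t, t):
    # Stand-in for the learned noise predictor eps_theta(x_t, t);
    # in practice this is a U-Net, transformer, or MLP.
    return np.zeros_like(x_t)

def ddpm_training_loss(x0):
    """One DDPM training step: sample a timestep and noise, form the
    noisy sample x_t in closed form, and regress the network output
    onto the injected noise."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - toy_eps_model(x_t, t)) ** 2)

x0 = rng.standard_normal(8)   # e.g. a flattened action chunk
loss = ddpm_training_loss(x0)
```

At inference, the same noise predictor is applied iteratively to denoise a Gaussian sample back toward the data distribution.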
It then discusses key improvements for addressing the slow sampling speed inherent in DMs, along with adaptations specific to robotics:
- Faster Sampling: Techniques like Denoising Diffusion Implicit Models (DDIM) (Song et al., 2020), SDE/ODE formulations (Song et al., 2020), and specialized solvers (e.g., DPM-Solver (Lu et al., 2022)) allow for fewer sampling steps, often using deterministic processes. Non-uniform step sizes and learned noise schedules (iDDPM (Nichol et al., 2021)) further enhance sample quality and speed.
- Robotics Adaptations: Conditioning the denoising process on robot observations (proprioception, visual data like images/point clouds, language instructions) is crucial. Handling temporally correlated data like trajectories often involves predicting subsequences using receding horizon control.
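The receding horizon pattern mentioned above can be sketched as follows. This is an illustrative skeleton, assuming a stand-in `sample_action_chunk` in place of a real conditioned reverse-diffusion pass and a placeholder environment step.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action_chunk(obs, horizon=16, action_dim=2):
    # Stand-in for one reverse-diffusion pass that denoises a whole
    # action subsequence conditioned on the current observation.
    return rng.standard_normal((horizon, action_dim))

def receding_horizon_rollout(obs, n_steps=12, horizon=16, n_exec=4):
    """Predict a `horizon`-step action chunk, execute only the first
    `n_exec` actions, then replan from the updated observation."""
    executed = []
    while len(executed) < n_steps:
        chunk = sample_action_chunk(obs, horizon)
        for a in chunk[:n_exec]:
            obs = obs + 0.01 * a   # placeholder environment step
            executed.append(a)
            if len(executed) == n_steps:
                break
    return np.array(executed)

actions = receding_horizon_rollout(np.zeros(2))
```

Executing only a prefix of each predicted chunk keeps the policy reactive to new observations while still exploiting the temporal consistency of the diffused subsequence.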
Architectures for Robotic Manipulation:
Three main architectures are used for the denoising network in robotic DMs:
- CNNs (Temporal U-Nets): Adapted from image generation U-Nets (Ronneberger et al., 2015), using 1D temporal convolutions. Conditioning is often done via FiLM layers (Perez et al., 2017). They are generally robust but can cause over-smoothing. (Fraser et al., 2022, Chi et al., 2023)
- Transformers: Process sequences of observations, actions, and time steps as tokens, using attention mechanisms for conditioning. They excel at long-range dependencies but can be hyperparameter-sensitive and resource-intensive. (Chi et al., 2023, Peebles et al., 2022)
- MLPs: Often used in RL settings, computationally efficient but may struggle with high-dimensional (visual) inputs unless combined with encoders. (Wang et al., 2022)
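FiLM conditioning, the mechanism named for the U-Net variant above, is simple to state: a small network maps the conditioning vector to per-channel scale and shift parameters that modulate the feature map. A minimal NumPy sketch (the linear maps `w_gamma`/`w_beta` stand in for the learned conditioning network):

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """FiLM: map the conditioning vector (e.g. an observation
    embedding) to per-channel scale gamma and shift beta, then
    modulate the 1D temporal feature map channel-wise."""
    gamma = cond @ w_gamma   # (channels,)
    beta = cond @ w_beta     # (channels,)
    # features: (channels, length) temporal feature map
    return gamma[:, None] * features + beta[:, None]

rng = np.random.default_rng(0)
cond = rng.standard_normal(32)            # observation embedding
w_gamma = rng.standard_normal((32, 64))
w_beta = rng.standard_normal((32, 64))
h = rng.standard_normal((64, 16))         # 64 channels, 16 time steps
out = film(h, cond, w_gamma, w_beta)
```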
The number of sampling steps is a critical trade-off between speed and quality; DDIM is commonly run with 5-10 inference steps, down from the 50-100 steps used during training.
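The DDIM speed-up works by sampling on a subsampled timestep grid with a deterministic update. A minimal NumPy sketch, with a toy stand-in for the trained noise predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_model(x_t, t):
    # Stand-in for the trained noise predictor eps_theta(x_t, t).
    return 0.1 * x_t

def ddim_sample(shape, n_steps=10):
    """Deterministic DDIM sampling on a subsampled timestep grid:
    each update predicts x0 from the current noisy sample and jumps
    directly to the previous grid point, so far fewer steps than the
    training schedule are needed."""
    ts = np.linspace(T - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(shape)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1 - alpha_bars[t_prev]) * eps
    return x

sample = ddim_sample((8,), n_steps=10)
```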
Applications in Robotic Manipulation:
- Trajectory Generation:
- Imitation Learning (IL): DMs generate smooth, multi-modal trajectories conditioned on observations. Key aspects include:
- Action Representation: Predicting end-effector poses (task space), joint angles (joint space), or even image sequences (image space). Receding horizon control is common.
- Visual Input: Using 2D images or increasingly 3D data (point clouds, embeddings) for better geometric understanding. (2403.0395, Pozanco et al., 2024)
- Long-Horizon/Multi-Task: Addressed via hierarchical planning, skill learning (often using DMs for low-level skills), or integrating Vision-Language-Action models (VLAs) where DMs refine VLA outputs or generate actions directly. (Chernobai, 2023, Liu et al., 2024)
- Constrained Planning: Achieved through classifier guidance, where a separately trained model steers the diffusion process (Dhariwal and Nichol, 2021), or classifier-free guidance, where conditioning is integrated into the DM itself (Ho and Salimans, 2022).
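Classifier-free guidance reduces to a single combination rule at sampling time: the model is queried with and without the condition, and the two noise predictions are extrapolated. A one-function sketch (the inputs stand in for the two network outputs):

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: combine conditional and unconditional
    noise predictions. w > 1 strengthens adherence to the condition
    (e.g. a goal or constraint); w = 1 recovers plain conditioning."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_c = rng.standard_normal(8)   # eps_theta(x_t, t, condition)
eps_u = rng.standard_normal(8)   # eps_theta(x_t, t, null condition)
guided = cfg_eps(eps_c, eps_u, w=3.0)
```

Training drops the condition at random so a single network provides both predictions.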
- Offline Reinforcement Learning (RL): Integrates rewards into the DM framework:
- Guidance: Using reward models to guide sampling (Diffuser (Janner et al., 2022)).
- Conditioning: Directly conditioning the DM on returns (Decision Diffuser (Ajay et al., 2022)).
- Q-Learning Integration: Modifying the DM loss with a critic (Diffusion-QL (Wang et al., 2022)).
Offline RL can leverage suboptimal data better than IL but requires careful tuning and often relies on state inputs rather than raw visual data.
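The Q-learning integration above amounts to adding a critic term to the diffusion behavior-cloning loss. A hedged NumPy sketch of a Diffusion-QL-style objective, with the critic, policy sample, and BC loss all stubbed out:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, action):
    # Stand-in for a learned critic Q(s, a).
    return -np.sum((action - state) ** 2)

def diffusion_ql_loss(state, action, bc_loss, alpha=1.0):
    """Diffusion-QL-style objective: the standard DDPM behavior-cloning
    loss on dataset actions, plus a term pushing actions sampled from
    the diffusion policy toward high critic values."""
    # Stand-in for an action sampled from the diffusion policy.
    sampled_action = action + 0.1 * rng.standard_normal(action.shape)
    return bc_loss - alpha * q_value(state, sampled_action)

state = np.zeros(4)
action = rng.standard_normal(4)
loss = diffusion_ql_loss(state, action, bc_loss=0.5)
```

The weight `alpha` trades off staying close to the dataset distribution against maximizing the critic.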
- Robotic Grasp Generation:
- SE(3) Diffusion: Directly generating 6-DoF grasp poses, addressing the non-Euclidean nature of SE(3) via EBMs on Lie groups (Urain et al., 2022), flow matching (Lipman et al., 2022, Akramov et al., 2024), or ensuring SE(3)-equivariance (Zhang et al., 2023). This applies to both parallel jaw and dexterous grasps (Garrouste et al., 2023).
- Latent Diffusion: Performing diffusion in a latent space learned by a VAE (Thainá-Batista et al., 2023).
- Affordance/Task-Driven: Using language guidance (Lee et al., 2024), learning pre-grasp manipulations (Wu et al., 2024), synthesizing human-object interactions (HOI) (Okamoto et al., 2024), or diffusing object poses for rearrangement (Yang et al., 2022).
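The flow matching objective mentioned for grasp generation trains a velocity field along an interpolation path instead of a noise predictor. A minimal Euclidean NumPy sketch (SE(3) variants instead operate on the pose manifold; `toy_velocity_model` stands in for the learned vector field):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_velocity_model(x_t, t):
    # Stand-in for the learned vector field v_theta(x_t, t).
    return np.zeros_like(x_t)

def flow_matching_loss(x1):
    """Conditional flow matching with a linear (optimal-transport)
    path: interpolate between a noise sample x0 and a data sample x1,
    and regress the model's velocity onto the constant target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((toy_velocity_model(x_t, t) - target) ** 2)

x1 = rng.standard_normal(6)   # e.g. a flattened grasp pose parameterization
loss = flow_matching_loss(x1)
```

At inference, integrating the learned velocity field from noise to data replaces iterative denoising, typically with far fewer steps.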
- Visual Data Augmentation:
- Scaling Data: Using pretrained DMs (like Stable Diffusion (Rombach et al., 2021)) for inpainting to change textures, objects, or backgrounds, increasing dataset size and diversity for IL/RL (Chen et al., 2023).
- Sensor Data Reconstruction: Completing partial point clouds or images from sensors using DM-based inpainting, sometimes combined with view planning (Prajnanaswaroopa, 2023, Hassouna et al., 2024).
- Object Rearrangement: Generating target scene arrangements from language prompts using text-to-image DMs, often combined with other models like LLMs or NeRFs (Ydrefors et al., 2022, Kapelyukh et al., 2022).
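The augmentation recipe above has a simple structure: mask task-irrelevant regions, inpaint them under varied prompts, and keep the paired action label unchanged. A skeletal sketch in which `diffusion_inpaint` is a stub for a pretrained inpainting model (e.g. a Stable Diffusion inpainting pipeline), so only the wiring is shown:

```python
import numpy as np

def diffusion_inpaint(image, mask, prompt):
    # Stand-in for a pretrained text-conditioned inpainting model;
    # here it just fills the masked region with noise so the
    # pipeline is runnable.
    out = image.copy()
    out[mask] = np.random.default_rng(0).uniform(size=int(mask.sum()))
    return out

def augment_demo(image, action, background_mask, prompts):
    """DM-based visual augmentation: inpaint task-irrelevant regions
    (background, distractor textures) under different prompts while
    leaving the paired action label untouched."""
    return [(diffusion_inpaint(image, background_mask, p), action)
            for p in prompts]

img = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[:2, :] = True   # treat the top rows as background
pairs = augment_demo(img, np.array([0.1, -0.2]),
                     mask, ["wooden table", "metal tray"])
```

Because only task-irrelevant pixels change, each augmented image can reuse the original demonstration's action labels.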
Experiments and Benchmarks:
The survey lists common benchmarks (CALVIN, RLBench, D4RL Kitchen, Meta-World) and baselines (SE(3)-Diffusion Policy, Diffuser, Diffusion Policy, 3D Diffusion Policy). Most methods are evaluated in simulation, and many are also tested on real robots, though real-robot deployment typically requires training on real data or sim-to-real transfer techniques.
Conclusion, Limitations, and Outlook:
DMs excel at modeling multi-modal distributions and handling high-dimensional data, making them powerful tools for robotic manipulation. Key limitations remain:
- Generalizability: Performance often depends heavily on training data quality and diversity (covariate shift in IL, distribution shift in offline RL). Data augmentation helps but has limits. VLAs offer promise but need refinement.
- Sampling Speed: Iterative sampling remains a bottleneck for real-time control, despite improvements like DDIM. Faster samplers need more investigation in robotics contexts.
Future directions include exploring faster sampling methods, improving generalizability through continual learning and foundation model integration, enhancing robustness in complex/occluded scenes, and leveraging semantic reasoning capabilities.