SIM(3)-Equivariant Diffusion Policy
- The paper introduces a robot imitation learning method that combines SIM(3)-equivariant networks with diffusion models to ensure transformation-consistency and robust generalization.
- The methodology leverages canonicalization, a conditional SO(3)-equivariant U-Net, and denoising diffusion processes to efficiently predict control actions in 3D manipulation.
- Empirical results demonstrate significant data efficiency and enhanced generalization across diverse simulated and real-world robotic tasks.
A SIM(3)-equivariant diffusion policy is a robot imitation learning approach that combines SIM(3)-equivariant neural networks with diffusion models, ensuring policies that are equivariant under the group of 3D similarity transformations (rotation, translation, isotropic scaling). The approach, exemplified by EquiBot, is motivated by the need for robust, data-efficient generalization across real-world environments and objects, where tasks and visual scenes may vary in pose, location, and scale. By guaranteeing transformation-consistent policy outputs and leveraging the stochastic, multi-modal benefits of diffusion policy learning, this method establishes new benchmarks for generalizability and data efficiency in mobile and manipulation robotics (Yang et al., 2024).
1. SIM(3) Group Action and Representations
The SIM(3) group consists of all similarity transformations of $\mathbb{R}^3$, formalized as $g = (R, t, s)$ where $R \in SO(3)$ (rotation), $t \in \mathbb{R}^3$ (translation), and $s \in \mathbb{R}_{>0}$ (isotropic scale). Its action is defined as follows:
- On a point $x \in \mathbb{R}^3$: $g \cdot x = sRx + t$
- On a direction $v \in \mathbb{R}^3$: $g \cdot v = Rv$
- On a scalar $c$: $g \cdot c = c$ (scalars are invariant)
- On a point cloud $P \in \mathbb{R}^{N \times 3}$: $g \cdot P = sPR^\top + t$ (row-wise)
This formalism provides the foundation for constructing neural policy architectures whose outputs and intermediate representations transform according to SIM(3) group actions, ensuring consistent behavior under environment or task reparameterization.
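The group action above can be checked numerically. The following NumPy sketch (function names such as `apply_sim3` are illustrative, not from the paper) applies a random SIM(3) transform to points and directions:

```python
# Toy sketch of the SIM(3) action on points, directions, and point clouds.
import numpy as np

def random_rotation(rng):
    """Draw a rotation matrix in SO(3) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs for uniqueness
    if np.linalg.det(q) < 0:          # ensure det = +1 (proper rotation)
        q[:, 0] *= -1
    return q

def apply_sim3(R, t, s, points):
    """Act on a point cloud row-wise: x -> s * R x + t."""
    return s * points @ R.T + t

rng = np.random.default_rng(0)
R, t, s = random_rotation(rng), rng.normal(size=3), 2.5

P = rng.normal(size=(5, 3))           # a small point cloud
Pg = apply_sim3(R, t, s, P)

# Directions transform by R only; scalars are invariant.
v = rng.normal(size=3)
v_g = R @ v

# Sanity check: the difference of two points picks up both scale and rotation,
# while the translation cancels.
d = P[1] - P[0]
d_g = Pg[1] - Pg[0]
assert np.allclose(d_g, s * (R @ d))
```

Note that point *differences* transform by $sR$, whereas unit directions transform by $R$ alone, matching the case split in the list above.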
2. Network Architecture and Equivariance by Construction
EquiBot implements SIM(3)-equivariance through explicit architectural choices layered over the “Diffusion Policy” structure [Chi et al., RSS 2023]:
- SIM(3)-Equivariant Encoder: Modeled after PointNet++, the encoder takes as input a segmented point cloud $P$ and produces latent codes $(z_{\mathrm{eq}}, z_{\mathrm{inv}}, \bar{x}, s)$: rotation-equivariant features, invariant features, the centroid, and the scale, respectively. The encoding satisfies $f(g \cdot P) = g \cdot f(P)$ by design.
- Canonicalization: Downstream processing is performed in a canonical frame—input vectors are centered by $\bar{x}$ and scaled by $s$: $x' = (x - \bar{x}) / s$.
- Conditional U-Net (SO(3)-Equivariant): The U-Net core processes noisy action sequences and conditioning features within the canonical frame using:
- VecConv1D: applies the same 1D convolution to every Cartesian component of each 3D-vector channel (treating the three components as a batch), so temporal convolution commutes with rotation,
- VecLinear [Deng et al., ICCV 2021]: replaces linear layers with vector-neuron layers,
- FiLM modulations applied in a vector-equivariant fashion.
- De-canonicalization: Action predictions from the U-Net are mapped back to global coordinates:
- For positions: $a = s\,a' + \bar{x}$, i.e., rescaled by $s$ and offset by the centroid to recover absolute positions.
- For directions: $a = a'$, since directions are unaffected by centering and scaling.
- For scalars: $a = a'$, as scalars are SIM(3)-invariant.
This construction enables strict layer-wise equivariance under SIM(3), a property directly verified through ablation with alternative encoders or by removing equivariant construction (Yang et al., 2024).
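A minimal numerical sketch of this canonicalize → equivariant core → de-canonicalize pipeline, assuming a vector-neuron-style linear layer as the core (names and shapes are ours, not the released implementation), shows that transforming the input transforms the output identically:

```python
# The vector-neuron linear map mixes channels of a (C, 3) feature without
# touching the 3D axis and has no bias, so it commutes with rotation;
# centering and scaling handle translation and scale.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))           # vector-neuron weights: 4 -> 8 channels

def vec_linear(z):
    """VecLinear-style map on (C, 3) features: mix channels, keep the 3D axis."""
    return W @ z                       # (8, 3); bias-free => SO(3)-equivariant

def policy(points):
    """Canonicalize, apply the equivariant core, de-canonicalize positions."""
    centroid = points.mean(axis=0)
    scale = np.linalg.norm(points - centroid, axis=1).mean()
    z = (points[:4] - centroid) / scale        # toy canonical-frame features
    out = vec_linear(z)                        # SO(3)-equivariant core
    return scale * out + centroid              # back to the world frame

# Equivariance check: transforming the input transforms the output the same way.
P = rng.normal(size=(16, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1
t, s = rng.normal(size=3), 3.0

out_then_transform = s * policy(P) @ R.T + t
transform_then_out = policy(s * P @ R.T + t)
assert np.allclose(out_then_transform, transform_then_out)
```

The rotation never appears inside `policy`: it passes through the bias-free linear map, which is the sense in which equivariance holds "by construction."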
3. Diffusion Policy Formulation
The SIM(3)-equivariant policy is realized through a Denoising Diffusion Probabilistic Model (DDPM) [Ho et al., NeurIPS 2020]:
- Forward Process: Action sequences are progressively noised, $q(a^k \mid a^{k-1}) = \mathcal{N}\big(a^k;\ \sqrt{1-\beta_k}\,a^{k-1},\ \beta_k I\big)$, with cumulative schedule $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$.
- Reverse (Denoising) Process: Learned Gaussian reverse steps $p_\theta(a^{k-1} \mid a^k, o) = \mathcal{N}\big(a^{k-1};\ \mu_\theta(a^k, k, o),\ \sigma_k^2 I\big)$, where $\mu_\theta(a^k, k, o) = \frac{1}{\sqrt{\alpha_k}}\Big(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\Big)$ with $\alpha_k = 1 - \beta_k$.
- Implementation Update: $a^{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\Big) + \sigma_k z$, $\ z \sim \mathcal{N}(0, I)$.
The network architecture and canonicalization ensure all conditioning and sampling steps respect SIM(3) symmetry.
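The reverse-process update can be sketched as follows, with a stand-in noise predictor in place of the trained, observation-conditioned network:

```python
# Hedged sketch of DDPM reverse sampling for an action chunk; the schedule
# values and the eps_theta stand-in are illustrative, not the paper's.
import numpy as np

K = 50
betas = np.linspace(1e-4, 0.02, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(a_k, k):
    """Placeholder for the learned noise predictor eps_theta(a^k, k, o)."""
    return a_k * 0.1                   # illustrative only

def ddpm_sample(shape, rng):
    a = rng.normal(size=shape)         # a^K ~ N(0, I)
    for k in range(K - 1, -1, -1):
        eps = eps_theta(a, k)
        mean = (a - betas[k] / np.sqrt(1 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        z = rng.normal(size=shape) if k > 0 else 0.0   # no noise at the last step
        a = mean + np.sqrt(betas[k]) * z               # sigma_k^2 = beta_k variant
    return a

rng = np.random.default_rng(2)
action = ddpm_sample((16, 3), rng)     # e.g. a 16-step, 3-DoF action chunk
```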
4. Conditioning, Equivariance Guarantees, and Policy Output Behavior
The noise-prediction network $\epsilon_\theta$ is constructed to be SIM(3)-equivariant through the use of canonicalized inputs and SO(3)-equivariant U-Net blocks. Because the isotropic Gaussian prior is itself rotation-invariant, and all learned reverse diffusion steps are group-equivariant, the final conditional distribution is equivariant under SIM(3) group actions: $p_\theta(g \cdot a \mid g \cdot o) = p_\theta(a \mid o)$ for all $g \in \mathrm{SIM}(3)$.
In practical terms, this enforces that for any rotation, translation, or isotropic scale transformation of the scene or objects, the policy output—i.e., the sampled action trajectory—undergoes the corresponding transformation.
5. Training Objective
Training uses the standard “simplified” DDPM noise-prediction loss: $\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{a^0, \epsilon, k}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_k}\,a^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ o)\|^2\big]$, with $\epsilon \sim \mathcal{N}(0, I)$ and $k$ sampled uniformly from $\{1, \dots, K\}$.
As the core imitation learning task involves learning from demonstration actions $a^0$, this loss suffices without additional behavioral cloning (BC) objectives. However, an optional BC term can be included, e.g., an $\ell_2$ error between the denoised action estimate and the demonstration, $\mathcal{L}_{\mathrm{BC}} = \|\hat{a}^0 - a^0\|^2$, with the total loss written as $\mathcal{L} = \mathcal{L}_{\mathrm{DDPM}} + \lambda\,\mathcal{L}_{\mathrm{BC}}$ for potential tradeoff tuning.
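One training step of the noise-prediction objective might look like the following NumPy sketch (a real implementation would use an autodiff framework and the actual observation-conditioned network):

```python
# Simplified DDPM loss on one minibatch of demonstration action chunks.
import numpy as np

rng = np.random.default_rng(3)
K = 50
betas = np.linspace(1e-4, 0.02, K)     # illustrative noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(a_k, k):
    """Stand-in noise predictor; the trained network also conditions on o."""
    return a_k * 0.1

a0 = rng.normal(size=(32, 16, 3))      # batch of demo action chunks
k = rng.integers(0, K, size=32)        # per-sample diffusion step
eps = rng.normal(size=a0.shape)        # target noise

ab = alpha_bars[k][:, None, None]      # broadcast schedule over chunk dims
a_k = np.sqrt(ab) * a0 + np.sqrt(1 - ab) * eps   # forward (noising) process
loss = np.mean((eps - eps_theta(a_k, k)) ** 2)   # simplified DDPM loss
```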
6. Implementation Details, Hyperparameters, and Optimization
The approach employs the following key technical settings:
- Encoder: PointNet++-style SIM(3)-equivariant, 4 abstraction layers, hidden dimension 128.
- Conditional U-Net: Channel widths aligned to [Chi et al. 2023] with VecConv1D and VecLinear substituting for standard 1D convolutional/linear layers.
- Diffusion schedule: DDPM sampling in simulation; DDIM sampling on the real robot, which uses fewer denoising steps for inference efficiency.
- Point cloud sampling: 1024 points per frame (256/512 for Robomimic tasks).
- Policy horizon: Observation: 2 steps; Prediction: 16 steps; Execution batch: 8 steps.
- Normalization: All vector inputs divided by mean scene/action scale; scalar features z-score normalized.
- Optimization: Adam optimizer, batch size 32. Training: 2000 epochs (simulation), 1000 epochs (real data).
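For illustration, the listed hyperparameters can be collected into a single config object (field names are ours, not the released code's; values follow the list above):

```python
# Illustrative configuration mirroring the hyperparameter list.
from dataclasses import dataclass

@dataclass
class EquiBotConfig:
    encoder_layers: int = 4       # set-abstraction layers in the encoder
    hidden_dim: int = 128
    num_points: int = 1024        # 256/512 for Robomimic tasks
    obs_horizon: int = 2          # observation steps
    pred_horizon: int = 16        # predicted action steps
    exec_horizon: int = 8         # steps executed per prediction
    batch_size: int = 32
    epochs_sim: int = 2000
    epochs_real: int = 1000

cfg = EquiBotConfig()
```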
7. Experimental Setup, Baselines, and Empirical Findings
Experiments cover six simulated manipulation tasks (e.g., Cloth Folding, Object Covering, Box Closing, Push T, Robomimic Can/Square) and six real-world mobile manipulation tasks (e.g., Push Chair, Luggage Packing/Closing, Laundry Door Closing, Bimanual Folding, Bimanual Make Bed), with demonstrations parsed as point cloud and finger-pose sequences.
- Evaluation Metrics: Mean final reward or success rate (simulation: averaged over 10 episodes and 3 seeds; real world: 10 trials with novel objects and poses).
- Baselines: Vanilla Diffusion Policy (DP), DP with data augmentation (DP+Aug), and EquivAct (SIM(3)-equivariant, no diffusion).
- Generalization: Investigated across original, rotated+rescaled, rotated+nonuniform scale, and large position shift setups.
- Data Efficiency: Measured on Robomimic with 25/50/100 demonstration splits.
- Ablations: Encoder replaced with DP3 encoder [Ze et al., RSS '24] with and without equivariant wrapper; combined results indicate the necessity of full SIM(3) equivariance with diffusion for optimal performance.
The method substantially reduced the number of demonstrations required and generalized robustly to novel object scales, poses, and scenes, both in simulation and on real robotic hardware.
Summary Table: Core Components and Operations
| Component | Equivariance Mechanism | Role in Policy |
|---|---|---|
| Encoder (PointNet++-style) | SIM(3)-equivariant layers | Extracts canonical features |
| Canonicalization | Center, scale normalization | Inputs to canonical frame |
| Conditional U-Net | SO(3)-equivariant blocks | Denoising/noise-prediction |
| De-canonicalization | Restore original frame | Transforms outputs globally |
| Diffusion Model | DDPM/Reverse equivariance | Stochastic policy learning |
EquiBot establishes the first closed-loop, SIM(3)-equivariant diffusion policy for 3D robotic manipulation that provides, by construction, equivariance under similarity transformations, eliminates the need for explicit data augmentation, achieves substantial data efficiency, and generalizes robustly across diverse manipulation contexts (Yang et al., 2024).