SIM(3)-Equivariant Diffusion Policy
- The paper introduces a robot imitation learning method that combines SIM(3)-equivariant networks with diffusion models to ensure transformation-consistency and robust generalization.
- The methodology leverages canonicalization, a conditional SO(3)-equivariant U-Net, and denoising diffusion processes to efficiently predict control actions in 3D manipulation.
- Empirical results demonstrate significant data efficiency and enhanced generalization across diverse simulated and real-world robotic tasks.
A SIM(3)-equivariant diffusion policy is a robot imitation learning approach that combines SIM(3)-equivariant neural networks with diffusion models, ensuring policies that are equivariant under the group of 3D similarity transformations (rotation, translation, isotropic scaling). The approach, exemplified by EquiBot, is motivated by the need for robust, data-efficient generalization across real-world environments and objects, where tasks and visual scenes may vary in pose, location, and scale. By guaranteeing transformation-consistent policy outputs and leveraging the stochastic, multi-modal benefits of diffusion policy learning, this method establishes new benchmarks for generalizability and data efficiency in mobile and manipulation robotics (Yang et al., 2024).
1. SIM(3) Group Action and Representations
The SIM(3) group consists of all similarity transformations of $\mathbb{R}^3$, formalized as $g = (R, t, s)$ where $R \in SO(3)$ (rotation), $t \in \mathbb{R}^3$ (translation), and $s \in \mathbb{R}_{>0}$ (isotropic scale). Its action is defined as follows:
- On a point $x \in \mathbb{R}^3$: $g \cdot x = sRx + t$
- On a direction $v \in \mathbb{R}^3$: $g \cdot v = Rv$
- On a scalar $c$: $g \cdot c = c$ (scalars are invariant)
- On a point cloud $P \in \mathbb{R}^{N \times 3}$: $g \cdot P = sPR^\top + t$ (row-wise)
This formalism provides the foundation for constructing neural policy architectures whose outputs and intermediate representations transform according to SIM(3) group actions, ensuring consistent behavior under environment or task reparameterization.
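The group action above can be checked numerically. The following NumPy sketch (function names such as `apply_sim3` are illustrative, not from the paper) applies a random SIM(3) transform to points and directions:

```python
# Toy sketch of the SIM(3) action on points, directions, and point clouds.
import numpy as np

def random_rotation(rng):
    """Draw a rotation matrix in SO(3) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs for uniqueness
    if np.linalg.det(q) < 0:          # ensure det = +1 (proper rotation)
        q[:, 0] *= -1
    return q

def apply_sim3(R, t, s, points):
    """Act on a point cloud row-wise: x -> s * R x + t."""
    return s * points @ R.T + t

rng = np.random.default_rng(0)
R, t, s = random_rotation(rng), rng.normal(size=3), 2.5

P = rng.normal(size=(5, 3))           # a small point cloud
Pg = apply_sim3(R, t, s, P)

# Directions transform by R only; scalars are invariant.
v = rng.normal(size=3)
v_g = R @ v

# Sanity check: the difference of two points picks up both scale and rotation,
# while the translation cancels.
d = P[1] - P[0]
d_g = Pg[1] - Pg[0]
assert np.allclose(d_g, s * (R @ d))
```

Note that point *differences* transform by $sR$, whereas unit directions transform by $R$ alone, matching the case split in the list above.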
2. Network Architecture and Equivariance by Construction
EquiBot implements SIM(3)-equivariance through explicit architectural choices layered over the “Diffusion Policy” structure [Chi et al., RSS 2023]:
- SIM(3)-Equivariant Encoder: Modeled after PointNet++, the encoder takes as input a segmented point cloud $P$ and produces latent codes $(z_{\mathrm{eq}}, z_{\mathrm{inv}}, \bar{x}, s)$: rotation-equivariant features, invariant features, the centroid, and the scale, respectively. The encoding satisfies $f(g \cdot P) = g \cdot f(P)$ by design.
- Canonicalization: Downstream processing is performed in a canonical frame—input vectors are centered by $\bar{x}$ and scaled by $s$: $x' = (x - \bar{x}) / s$.
- Conditional U-Net (SO(3)-Equivariant): The U-Net core processes noisy action sequences and conditioning features within the canonical frame using:
- VecConv1D: applies the same 1D convolution to every Cartesian component of each 3D-vector channel (treating the three components as a batch), so temporal convolution commutes with rotation,
- VecLinear [Deng et al., ICCV 2021]: replaces linear layers with vector-neuron layers,
- FiLM modulations applied in a vector-equivariant fashion.
- De-canonicalization: Action predictions from the U-Net are mapped back to global coordinates:
- For positions: $a = s\,a' + \bar{x}$, i.e., rescaled by $s$ and offset by the centroid to recover absolute positions.
- For directions: $a = a'$, since directions are unaffected by centering and scaling.
- For scalars: $a = a'$, as scalars are SIM(3)-invariant.
This construction enables strict layer-wise equivariance under SIM(3), a property directly verified through ablation with alternative encoders or by removing equivariant construction (Yang et al., 2024).
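A minimal numerical sketch of this canonicalize → equivariant core → de-canonicalize pipeline, assuming a vector-neuron-style linear layer as the core (names and shapes are ours, not the released implementation), shows that transforming the input transforms the output identically:

```python
# The vector-neuron linear map mixes channels of a (C, 3) feature without
# touching the 3D axis and has no bias, so it commutes with rotation;
# centering and scaling handle translation and scale.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))           # vector-neuron weights: 4 -> 8 channels

def vec_linear(z):
    """VecLinear-style map on (C, 3) features: mix channels, keep the 3D axis."""
    return W @ z                       # (8, 3); bias-free => SO(3)-equivariant

def policy(points):
    """Canonicalize, apply the equivariant core, de-canonicalize positions."""
    centroid = points.mean(axis=0)
    scale = np.linalg.norm(points - centroid, axis=1).mean()
    z = (points[:4] - centroid) / scale        # toy canonical-frame features
    out = vec_linear(z)                        # SO(3)-equivariant core
    return scale * out + centroid              # back to the world frame

# Equivariance check: transforming the input transforms the output the same way.
P = rng.normal(size=(16, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1
t, s = rng.normal(size=3), 3.0

out_then_transform = s * policy(P) @ R.T + t
transform_then_out = policy(s * P @ R.T + t)
assert np.allclose(out_then_transform, transform_then_out)
```

The rotation never appears inside `policy`: it passes through the bias-free linear map, which is the sense in which equivariance holds "by construction."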
3. Diffusion Policy Formulation
The SIM(3)-equivariant policy is realized through a Denoising Diffusion Probabilistic Model (DDPM) [Ho et al., NeurIPS 2020]:
- Forward Process: Action sequences are progressively noised, $q(a^k \mid a^{k-1}) = \mathcal{N}\big(a^k;\ \sqrt{1-\beta_k}\,a^{k-1},\ \beta_k I\big)$, with cumulative schedule $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$.
- Reverse (Denoising) Process: Learned Gaussian reverse steps $p_\theta(a^{k-1} \mid a^k, o) = \mathcal{N}\big(a^{k-1};\ \mu_\theta(a^k, k, o),\ \sigma_k^2 I\big)$, where $\mu_\theta(a^k, k, o) = \frac{1}{\sqrt{\alpha_k}}\Big(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\Big)$ with $\alpha_k = 1 - \beta_k$.
- Implementation Update: $a^{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(a^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^k, k, o)\Big) + \sigma_k z$, $\ z \sim \mathcal{N}(0, I)$.
The network architecture and canonicalization ensure all conditioning and sampling steps respect SIM(3) symmetry.
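The reverse-process update can be sketched as follows, with a stand-in noise predictor in place of the trained, observation-conditioned network:

```python
# Hedged sketch of DDPM reverse sampling for an action chunk; the schedule
# values and the eps_theta stand-in are illustrative, not the paper's.
import numpy as np

K = 50
betas = np.linspace(1e-4, 0.02, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(a_k, k):
    """Placeholder for the learned noise predictor eps_theta(a^k, k, o)."""
    return a_k * 0.1                   # illustrative only

def ddpm_sample(shape, rng):
    a = rng.normal(size=shape)         # a^K ~ N(0, I)
    for k in range(K - 1, -1, -1):
        eps = eps_theta(a, k)
        mean = (a - betas[k] / np.sqrt(1 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        z = rng.normal(size=shape) if k > 0 else 0.0   # no noise at the last step
        a = mean + np.sqrt(betas[k]) * z               # sigma_k^2 = beta_k variant
    return a

rng = np.random.default_rng(2)
action = ddpm_sample((16, 3), rng)     # e.g. a 16-step, 3-DoF action chunk
```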
4. Conditioning, Equivariance Guarantees, and Policy Output Behavior
The noise-prediction network $\epsilon_\theta$ is constructed to be SIM(3)-equivariant through the use of canonicalized inputs and SO(3)-equivariant U-Net blocks. Because the isotropic Gaussian prior is itself rotation-invariant, and all learned reverse diffusion steps are group-equivariant, the final conditional distribution is equivariant under SIM(3) group actions: $p_\theta(g \cdot a \mid g \cdot o) = p_\theta(a \mid o)$ for all $g \in \mathrm{SIM}(3)$.
In practical terms, this enforces that for any rotation, translation, or isotropic scale transformation of the scene or objects, the policy output—i.e., the sampled action trajectory—undergoes the corresponding transformation.
5. Training Objective
Training uses the standard “simplified” DDPM noise-prediction loss: $\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{a^0, \epsilon, k}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_k}\,a^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ o)\|^2\big]$, with $\epsilon \sim \mathcal{N}(0, I)$ and $k$ sampled uniformly from $\{1, \dots, K\}$.
As the core imitation learning task involves learning from demonstration actions $a^0$, this loss suffices without additional behavioral cloning (BC) objectives. However, an optional BC term can be included, e.g., an $\ell_2$ error between the denoised action estimate and the demonstration, $\mathcal{L}_{\mathrm{BC}} = \|\hat{a}^0 - a^0\|^2$, with the total loss written as $\mathcal{L} = \mathcal{L}_{\mathrm{DDPM}} + \lambda\,\mathcal{L}_{\mathrm{BC}}$ for potential tradeoff tuning.
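One training step of the noise-prediction objective might look like the following NumPy sketch (a real implementation would use an autodiff framework and the actual observation-conditioned network):

```python
# Simplified DDPM loss on one minibatch of demonstration action chunks.
import numpy as np

rng = np.random.default_rng(3)
K = 50
betas = np.linspace(1e-4, 0.02, K)     # illustrative noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(a_k, k):
    """Stand-in noise predictor; the trained network also conditions on o."""
    return a_k * 0.1

a0 = rng.normal(size=(32, 16, 3))      # batch of demo action chunks
k = rng.integers(0, K, size=32)        # per-sample diffusion step
eps = rng.normal(size=a0.shape)        # target noise

ab = alpha_bars[k][:, None, None]      # broadcast schedule over chunk dims
a_k = np.sqrt(ab) * a0 + np.sqrt(1 - ab) * eps   # forward (noising) process
loss = np.mean((eps - eps_theta(a_k, k)) ** 2)   # simplified DDPM loss
```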
6. Implementation Details, Hyperparameters, and Optimization
The approach employs the following key technical settings:
- Encoder: PointNet++-style SIM(3)-equivariant, 4 abstraction layers, hidden dimension 128.
- Conditional U-Net: Channel widths aligned to [Chi et al. 2023] with VecConv1D and VecLinear substituting for standard 1D convolutional/linear layers.
- Diffusion schedule: DDPM sampling in simulation; DDIM sampling on the real robot, which uses fewer denoising steps for inference efficiency.
- Point cloud sampling: 1024 points per frame (256/512 for Robomimic tasks).
- Policy horizon: Observation: 2 steps; Prediction: 16 steps; Execution batch: 8 steps.
- Normalization: All vector inputs divided by mean scene/action scale; scalar features z-score normalized.
- Optimization: Adam optimizer, batch size 32. Training: 2000 epochs (simulation), 1000 epochs (real data).
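For illustration, the listed hyperparameters can be collected into a single config object (field names are ours, not the released code's; values follow the list above):

```python
# Illustrative configuration mirroring the hyperparameter list.
from dataclasses import dataclass

@dataclass
class EquiBotConfig:
    encoder_layers: int = 4       # set-abstraction layers in the encoder
    hidden_dim: int = 128
    num_points: int = 1024        # 256/512 for Robomimic tasks
    obs_horizon: int = 2          # observation steps
    pred_horizon: int = 16        # predicted action steps
    exec_horizon: int = 8         # steps executed per prediction
    batch_size: int = 32
    epochs_sim: int = 2000
    epochs_real: int = 1000

cfg = EquiBotConfig()
```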
7. Experimental Setup, Baselines, and Empirical Findings
Experiments cover six simulated manipulation tasks (e.g., Cloth Folding, Object Covering, Box Closing, Push T, Robomimic Can/Square) and six real-world mobile manipulation tasks (e.g., Push Chair, Luggage Packing/Closing, Laundry Door Closing, Bimanual Folding, Bimanual Make Bed), with demonstrations parsed as point cloud and finger-pose sequences.
- Evaluation Metrics: Mean final reward or success rate (simulation: averaged over 10 episodes and 3 seeds; real world: 10 trials with novel objects and poses).
- Baselines: Vanilla Diffusion Policy (DP), DP with data augmentation (DP+Aug), and EquivAct (SIM(3)-equivariant, no diffusion).
- Generalization: Investigated across original, rotated+rescaled, rotated+nonuniform scale, and large position shift setups.
- Data Efficiency: Measured on Robomimic with 25/50/100 demonstration splits.
- Ablations: Encoder replaced with DP3 encoder [Ze et al., RSS '24] with and without equivariant wrapper; combined results indicate the necessity of full SIM(3) equivariance with diffusion for optimal performance.
The method substantially reduced the number of demonstrations required and generalized robustly to novel object scales, poses, and scenes, both in simulation and on real robotic hardware.
Summary Table: Core Components and Operations
| Component | Equivariance Mechanism | Role in Policy |
|---|---|---|
| Encoder (PointNet++-style) | SIM(3)-equivariant layers | Extracts canonical features |
| Canonicalization | Center, scale normalization | Inputs to canonical frame |
| Conditional U-Net | SO(3)-equivariant blocks | Denoising/noise-prediction |
| De-canonicalization | Restore original frame | Transforms outputs globally |
| Diffusion Model | DDPM/Reverse equivariance | Stochastic policy learning |
EquiBot establishes the first closed-loop, SIM(3)-equivariant diffusion policy for 3D robotic manipulation that provides, by construction, equivariance under similarity transformations, eliminates the need for explicit data augmentation, achieves substantial data efficiency, and generalizes robustly across diverse manipulation contexts (Yang et al., 2024).