How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning
This article provides a comprehensive, step‑by‑step analysis of Diffusion Policy for robot visuomotor control, covering its motivation, task characteristics, model design, dataset preparation, training pipeline, inference procedure, experimental results, and open research questions.
Introduction
The paper Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (arXiv:2303.04137) represents a robot’s visuomotor policy as a conditional denoising diffusion process and reports an average improvement of 46.9% over prior state‑of‑the‑art methods across 15 manipulation tasks.
Task Characteristics
Multi‑modal action distribution – multiple reasonable actions may exist for the same observation.
Sequential dependency – each action depends on previous actions.
High precision – small deviations can cause failure in tasks such as surgical manipulation.
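The first characteristic is the main reason plain regression struggles. As a minimal sketch with made-up 1-D data (the two modes at -1 and +1 are an illustrative assumption, not taken from the paper), a mean-squared-error fit lands between the modes, on an action no demonstrator chose:

```python
import random
import statistics

random.seed(0)

# Hypothetical demonstrations for ONE observation: half the demonstrators
# steer left (about -1.0), half steer right (about +1.0).
actions = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]

# The single value that minimizes mean squared error is the sample mean.
mse_optimal = statistics.fmean(actions)

# Distance from that regression output to the nearest demonstrated mode.
gap = min(abs(mse_optimal - mode) for mode in (-1.0, 1.0))

print(f"MSE-optimal action: {mse_optimal:+.3f}")
print(f"gap to nearest mode: {gap:.3f}")
```

A generative policy that samples from the action distribution can commit to one mode instead of averaging across them.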
Modeling Approach
Instead of explicit action regression, the method models the conditional score function of the action sequence. By adding Gaussian noise to the target action sequence and training a network to predict this noise, the diffusion process can generate coherent multi‑step actions.
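Written out, the noising step and the training objective take the standard DDPM form. The notation below is an assumption made for this summary and may differ from the paper's symbols:

```latex
% Assumed notation: a^0 is the clean action sequence, o the observation
% features, k the diffusion timestep, \bar{\alpha}_k the cumulative schedule.
a^{k} = \sqrt{\bar{\alpha}_k}\, a^{0} + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{a^{0},\, o,\, k,\, \epsilon}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(a^{k}, k, o\right) \right\rVert^{2} \right]
```

Up to a factor of $-1/\sqrt{1 - \bar{\alpha}_k}$, the predicted noise is an estimate of the score of the noised action distribution, which is why noise prediction and score modeling are two views of the same objective.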
Network Architecture
A 1‑D UNet (ConditionalUnet1d) receives noisy actions, a timestep embedding (SinusoidalPosEmb), and a global conditioning vector that concatenates visual features (from a ResNet‑18 backbone with BatchNorm replaced by GroupNorm) with the low‑dimensional robot state.
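To make the timestep embedding concrete, here is a minimal pure-Python sketch of a sinusoidal embedding. It assumes the common Transformer-style formulation with a base of 10000; the function name is hypothetical and the article's SinusoidalPosEmb may differ in details.

```python
import math

def sinusoidal_pos_emb(timestep, dim):
    """Map a scalar timestep to `dim` features: sines first, then cosines.

    Assumes `dim` is even and at least 4.
    """
    half = dim // 2
    # Frequencies fall geometrically from 1 down to 1/10000.
    scale = math.log(10000.0) / (half - 1)
    freqs = [math.exp(-scale * i) for i in range(half)]
    angles = [timestep * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

emb = sinusoidal_pos_emb(10, 8)
print(len(emb), [round(v, 3) for v in emb])
```

Each timestep gets a distinct, smoothly varying vector, so the network can condition on the noise level without learning a separate embedding per step.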
class ConditionalUnet1d(nn.Module):
    def __init__(self, input_dim, global_cond_dim, ...):
        ...
    def forward(self, sample, timestep, global_cond=None):
        ...
Dataset Construction
The raw dataset stores per‑step images, robot positions, and single‑step actions. To feed the diffusion model, samples are built from obs_horizon past observations and action_horizon future actions to execute, giving a predicted action sequence of length pred_horizon = obs_horizon + action_horizon (a sequence length, not a number of diffusion steps). Windows near the start or end of an episode are padded so that no sample crosses an episode boundary.
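The windowing idea can be sketched in a few lines. This is a simplified illustration, not the article's create_sample_indices: the helper name episode_windows is hypothetical, and it assumes padding is done by repeating the first or last frame of the episode.

```python
def episode_windows(episode_ends, sequence_length, pad_before=0, pad_after=0):
    """Return one list of frame indices per training window.

    `episode_ends` holds the exclusive end index of each episode in the
    flat dataset. Indices are clamped to the current episode, so padded
    positions repeat the edge frame instead of reading a neighbor episode.
    """
    windows = []
    start = 0
    for end in episode_ends:
        first_min = start - pad_before
        first_max = end - sequence_length + pad_after
        for first in range(first_min, first_max + 1):
            window = [
                min(max(i, start), end - 1)
                for i in range(first, first + sequence_length)
            ]
            windows.append(window)
        start = end
    return windows

# Two episodes: frames 0..4 and frames 5..7.
for w in episode_windows([5, 8], sequence_length=4, pad_before=1, pad_after=2):
    print(w)
```

Setting pad_before to obs_horizon - 1 and pad_after to action_horizon - 1 is one plausible choice, so that the first and last frames of an episode can still act as the "current" step.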
import numpy as np

def create_sample_indices(episode_ends, sequence_length, pad_before=0, pad_after=0):
    ...
    return np.array(indices)
Training Pipeline
Training follows the DDPM schedule with a squared‑cosine beta schedule. For each batch, observations are encoded, actions are normalized, random timesteps are sampled, and Gaussian noise is added. The network predicts the noise, and an L2 loss is minimized. An EMA of model weights improves stability.
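The schedule and the noising step can be sketched without a deep learning library. The 0.008 offset and 0.999 cap follow the widely used cosine schedule of Nichol and Dhariwal and are assumptions about the article's setup; all function names here are hypothetical.

```python
import math
import random

def squared_cosine_betas(num_steps, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative schedule, capped at max_beta."""
    def alpha_bar(u):
        return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2
    return [
        min(1.0 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]

def cumulative_alphas(betas):
    """Running product of (1 - beta), i.e. alpha_bar at every timestep."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(action, noise, alpha_bar_t):
    """Forward diffusion: shrink the clean action and mix in Gaussian noise."""
    keep = math.sqrt(alpha_bar_t)
    mix = math.sqrt(1.0 - alpha_bar_t)
    return [keep * a + mix * e for a, e in zip(action, noise)]

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

random.seed(0)
betas = squared_cosine_betas(100)
alpha_bars = cumulative_alphas(betas)

action = [0.2, -0.4, 0.9]                  # hypothetical normalized action
t = random.randrange(len(betas))           # random timestep for this sample
noise = [random.gauss(0.0, 1.0) for _ in action]
noisy_action = add_noise(action, noise, alpha_bars[t])

# A trained network would predict `noise` from (noisy_action, t, observation);
# an all-zero prediction stands in here only to show how the loss is formed.
loss = mse([0.0] * len(action), noise)
print(t, [round(v, 3) for v in noisy_action], round(loss, 3))
```

The cumulative product starts close to 1 (almost no noise) and ends close to 0 (almost pure noise), which is what lets sampling begin from a standard Gaussian.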
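import torch
import torch.nn as nn

# action, timesteps, obs_cond, model, and noise_scheduler come from the surrounding training loop
noise = torch.randn_like(action)  # Gaussian noise with the same shape as the action batch
noisy_action = noise_scheduler.add_noise(action, noise, timesteps)  # forward diffusion
noise_pred = model(noisy_action, timesteps, global_cond=obs_cond)  # predict the added noise
loss = nn.functional.mse_loss(noise_pred, noise)  # L2 loss between predicted and true noise
Inference Procedure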
During deployment, the EMA weights are loaded, the visual encoder extracts features, and the diffusion model generates a full action sequence in 100 DDPM denoising steps (or in fewer steps with a DDIM sampler). Only the last action_horizon actions are executed before replanning.
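What the sampler does at each step can be shown on a toy problem. In this sketch the trained network is replaced by an analytic noise predictor, which is only possible because the toy "dataset" is a single known action sequence; the function names, the horizon values, and the schedule constants are all assumptions. The final slice follows this article's description of executing the last action_horizon actions; other implementations may use a different offset.

```python
import math
import random

def squared_cosine_betas(num_steps, max_beta=0.999):
    def alpha_bar(u):
        return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2
    return [
        min(1.0 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]

def ideal_noise_pred(x, alpha_bar_t, target):
    """Stand-in for the network: exact noise when the data is one known sequence."""
    keep = math.sqrt(alpha_bar_t)
    mix = math.sqrt(1.0 - alpha_bar_t)
    return [(xi - keep * ti) / mix for xi, ti in zip(x, target)]

def ddpm_sample(target, num_steps=100, seed=0):
    rng = random.Random(seed)
    betas = squared_cosine_betas(num_steps)
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in target]        # start from pure noise
    for t in reversed(range(num_steps)):
        beta, abar = betas[t], alpha_bars[t]
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = ideal_noise_pred(x, abar, target)
        # Posterior mean of the previous, less noisy sample.
        coef = beta / math.sqrt(1.0 - abar)
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta) for xi, ei in zip(x, eps)]
        if t > 0:
            # Add posterior noise on every step except the last one.
            sigma = math.sqrt(beta * (1.0 - abar_prev) / (1.0 - abar))
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

obs_horizon, action_horizon = 2, 8                   # hypothetical values
pred_horizon = obs_horizon + action_horizon          # as defined in this article
target = [0.1 * i for i in range(pred_horizon)]      # toy "demonstrated" sequence

plan = ddpm_sample(target)
executed = plan[-action_horizon:]                    # run these, then replan
print([round(a, 3) for a in executed])
```

With a learned network the recovered sequence is a sample from the demonstrated action distribution rather than an exact copy, but the loop structure is the same.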
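import torch

# initialize with random noise, shape (batch, pred_horizon, action_dim)
action = torch.randn(1, pred_horizon, 2)
with torch.no_grad():  # no gradients are needed at inference time
    for t in scheduler.timesteps:  # iterate from the noisiest step to the cleanest
        pred_noise = model(action, t, global_cond=obs_cond)
        action = scheduler.step(pred_noise, t, action).prev_sample  # one denoising step
Experimental Results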
On the Push‑T task, the method achieves a success coverage of 0.19 after a single rollout, with smooth, temporally consistent trajectories. Visualizations show the robot pushing a T‑shaped block to a target region without jitter.
Discussion & Open Questions
Why predict past actions that are never executed? Possible benefits include temporal consistency and higher data reuse.
How does flow‑matching training compare to DDPM? Preliminary experiments show higher loss and unstable trajectories.
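For reference, the two objectives regress different targets. The sketch below uses one common flow-matching convention (a straight-line path with noise at t = 0); this convention and the notation are assumptions, since the article does not state which variant was tried:

```latex
% DDPM (noise prediction), discrete timestep k
\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}\left[ \left\lVert \epsilon - \epsilon_\theta(a^{k}, k, o) \right\rVert^{2} \right]

% Flow matching (velocity prediction), continuous time t \in [0, 1]
a_{t} = (1 - t)\,\epsilon + t\,a^{0}

\mathcal{L}_{\mathrm{FM}} = \mathbb{E}\left[ \left\lVert (a^{0} - \epsilon) - v_\theta(a_{t}, t, o) \right\rVert^{2} \right]
```

Because the regression targets differ (noise in one case, a velocity in the other), raw loss values from the two objectives are not directly comparable, so trajectory quality is the more informative comparison.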
The article also provides full code snippets for the environment, dataset, model, and training loops, enabling reproducibility.