How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning

This article provides a comprehensive, step‑by‑step analysis of Diffusion Policy for robot visuomotor control, covering its motivation, task characteristics, model design, dataset preparation, training pipeline, inference procedure, experimental results, and open research questions.

AI Frontier Lectures

Introduction

The paper "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (arXiv:2303.04137) represents a robot’s visuomotor policy as a conditional denoising diffusion process. The authors report an average improvement of 46.9% over prior state‑of‑the‑art methods across 15 manipulation tasks.

Task Characteristics

Multi‑modal action distribution – multiple reasonable actions may exist for the same observation.

Sequential dependency – each action depends on previous actions.

High precision – small deviations can cause failure in tasks such as surgical manipulation.
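
To illustrate why a multi‑modal action distribution is hard for plain regression, here is a minimal sketch under invented assumptions: a 1‑D action where half of the demonstrations steer left (-1) and half steer right (+1). The data and variable names are hypothetical and are not taken from the paper.

```python
import random

rng = random.Random(0)

# Hypothetical 1-D demonstrations: each one steers left (-1) or right (+1).
demos = [rng.choice([-1.0, 1.0]) for _ in range(1000)]

# The constant prediction that minimizes mean squared error is the sample mean,
# which lands near 0: an action that no demonstration ever took.
mse_optimal = sum(demos) / len(demos)

# Drawing a sample from the demonstrated distribution always returns a valid mode.
sampled = rng.choice(demos)

print(f"MSE-optimal action: {mse_optimal:+.3f}, sampled action: {sampled:+.1f}")
```

A generative policy that samples actions, as a diffusion model does, can commit to one mode instead of averaging across them.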

Modeling Approach

Instead of explicit action regression, the method models the conditional score function of the action sequence. Gaussian noise is added to the target action sequence, and a network is trained to predict that noise given the observations. At inference time, iteratively removing the predicted noise from a random sample yields a coherent multi‑step action sequence.
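
The link between noise prediction and the score function can be written in standard DDPM notation. The symbols below (A^0 for the clean action sequence, O for the observation, k for the diffusion step, and \bar{\alpha}_k for the cumulative noise schedule) are assumptions chosen for this sketch and may differ from the paper's own notation.

```latex
% Forward (noising) process at diffusion step k:
A^k = \sqrt{\bar{\alpha}_k}\, A^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Noise-prediction objective, conditioned on the observation O:
\mathcal{L}(\theta) = \mathbb{E}_{A^0,\, \epsilon,\, k}
  \left\| \epsilon - \epsilon_\theta\!\left(A^k, k, O\right) \right\|_2^2

% A well-trained noise predictor is a rescaled estimate of the conditional score:
\nabla_{A^k} \log p\!\left(A^k \mid O\right)
  \approx -\frac{\epsilon_\theta\!\left(A^k, k, O\right)}{\sqrt{1 - \bar{\alpha}_k}}
```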

Network Architecture

A 1‑D UNet (ConditionalUnet1d) receives the noisy action sequence, a timestep embedding (SinusoidalPosEmb), and a global conditioning vector that concatenates visual features (from a ResNet‑18 backbone with BatchNorm replaced by GroupNorm) with the low‑dimensional robot state.

class ConditionalUnet1d(nn.Module):
    def __init__(self, input_dim, global_cond_dim, ...):
        ...
    def forward(self, sample, timestep, global_cond=None):
        ...
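
The timestep embedding named above (SinusoidalPosEmb) can be sketched without any dependencies. The function below is a hypothetical stand‑in, not the article's module: it assumes the common layout of sines in the first half and cosines in the second half, with frequencies spaced geometrically between 1 and 1/10000.

```python
import math


def sinusoidal_embedding(timestep, dim):
    """Hypothetical helper: sin/cos embedding of a scalar diffusion timestep."""
    if dim % 2 != 0 or dim < 4:
        raise ValueError("dim must be an even number >= 4")
    half_dim = dim // 2
    # Frequencies decay geometrically from 1 down to 1/10000.
    scale = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-scale * i) for i in range(half_dim)]
    angles = [timestep * f for f in freqs]
    # First half holds the sines, second half the cosines.
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


# Embed diffusion step 10 into an 8-dimensional vector.
emb = sinusoidal_embedding(10, 8)
```

Nearby timesteps get similar but distinct vectors, which gives the network a smooth signal for how much noise remains in its input.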

Dataset Construction

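The raw dataset stores per‑step images, robot positions, and single‑step actions. To feed the diffusion model, each training sample is a fixed‑length window described by three parameters: obs_horizon, the number of most recent observations used for conditioning; action_horizon, the number of predicted actions that are actually executed; and pred_horizon, the total number of action steps the model predicts (the length of the action sequence, not the number of diffusion iterations). Because the predicted window starts at the oldest observation, pred_horizon must be at least obs_horizon - 1 + action_horizon. Windows never cross episode boundaries; near the start or end of an episode, the missing steps are padded by repeating the first or last frame.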

def create_sample_indices(episode_ends, sequence_length, pad_before=0, pad_after=0):
    ...
    return np.array(indices)
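
One way such window indices could be computed is sketched below in plain Python. The function name window_indices and the tuple layout are hypothetical choices for this illustration; the article's own create_sample_indices body is elided above and may differ.

```python
def window_indices(episode_ends, sequence_length, pad_before=0, pad_after=0):
    """Hypothetical sketch: enumerate fixed-length windows inside each episode.

    Each entry is (buffer_start, buffer_end, sample_start, sample_end):
    data[buffer_start:buffer_end] is copied into window[sample_start:sample_end],
    and any remaining window slots are meant to be filled by repeating the
    first or last copied frame.
    """
    indices = []
    episode_start = 0
    for episode_end in episode_ends:
        episode_length = episode_end - episode_start
        # A window may begin up to pad_before steps before the episode
        # and may run up to pad_after steps past its end.
        first = -pad_before
        last = episode_length - sequence_length + pad_after
        for offset in range(first, last + 1):
            buffer_start = max(offset, 0) + episode_start
            buffer_end = min(offset + sequence_length, episode_length) + episode_start
            sample_start = buffer_start - (offset + episode_start)
            overshoot = (offset + sequence_length + episode_start) - buffer_end
            sample_end = sequence_length - overshoot
            indices.append((buffer_start, buffer_end, sample_start, sample_end))
        episode_start = episode_end
    return indices


# Two episodes of lengths 5 and 3, stored back to back in one flat buffer.
windows = window_indices([5, 8], sequence_length=4, pad_before=1, pad_after=2)
```

Because the data range is clamped to each episode, no window mixes frames from two demonstrations, even when padding is requested.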

Training Pipeline

Training follows DDPM with a squared‑cosine beta schedule. For each batch, observations are encoded into a conditioning vector, actions are normalized, one random diffusion timestep is sampled per sequence, and Gaussian noise is added to the actions according to that timestep. The network predicts the added noise and is trained with an L2 (mean‑squared‑error) loss. An exponential moving average (EMA) of the model weights improves stability.

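# sample the Gaussian noise that the network must learn to predict
noise = torch.randn_like(action)
# sample one random diffusion timestep per sequence in the batch
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (action.shape[0],), device=action.device).long()
# forward diffusion, noise prediction, and L2 loss
noisy_action = noise_scheduler.add_noise(action, noise, timesteps)
noise_pred = model(noisy_action, timesteps, global_cond=obs_cond)
loss = nn.functional.mse_loss(noise_pred, noise)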
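
To make the squared‑cosine schedule and the noising step concrete, here is a dependency‑free sketch. The helper names, the 0.008 offset, and the 0.999 cap are assumptions based on the commonly used cosine schedule, not values confirmed by the article.

```python
import math
import random


def squared_cosine_betas(num_steps, max_beta=0.999, offset=0.008):
    """Betas whose cumulative signal level follows a squared-cosine curve."""
    def alpha_bar(u):
        return math.cos((u + offset) / (1 + offset) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal left at each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(clean, noise, alpha_bar_t):
    """Forward diffusion: blend a clean action vector with Gaussian noise."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * a + sigma * n for a, n in zip(clean, noise)]


betas = squared_cosine_betas(100)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
action = [0.5, -0.25]  # hypothetical normalized 2-D action
noise = [rng.gauss(0.0, 1.0) for _ in action]

slightly_noisy = add_noise(action, noise, alpha_bars[0])   # first step: mostly signal
mostly_noise = add_noise(action, noise, alpha_bars[-1])    # last step: almost pure noise
```

Early timesteps leave the action nearly intact while late timesteps replace it almost entirely with noise, so random timestep sampling exposes the network to every noise level.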

Inference Procedure

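During deployment, the EMA weights are loaded, the visual encoder extracts features from the latest obs_horizon observations, and the diffusion model denoises a full pred_horizon‑step action sequence using 100 DDPM steps (or fewer steps with a DDIM sampler). Only action_horizon of the predicted actions, starting at the step aligned with the most recent observation, are executed before the policy replans.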

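# initialize the action sequence with Gaussian noise (Push-T actions are 2-D)
action = torch.randn(1, pred_horizon, 2)
# configure the scheduler for the 100 denoising steps
scheduler.set_timesteps(100)
with torch.no_grad():
    for t in scheduler.timesteps:
        # predict the noise, then take one reverse-diffusion step
        pred_noise = model(action, t, global_cond=obs_cond)
        action = scheduler.step(pred_noise, t, action).prev_sample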
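
The receding‑horizon step that follows denoising can be sketched as a simple slice. This assumes the executed chunk starts at index obs_horizon - 1, the step aligned with the most recent observation; the helper name and the horizon values 16, 2, and 8 are illustrative assumptions rather than values stated in the article.

```python
def actions_to_execute(predicted, obs_horizon, action_horizon):
    """Hypothetical helper: pick the slice of a predicted sequence to run."""
    start = obs_horizon - 1  # index aligned with the latest observation
    end = start + action_horizon
    if end > len(predicted):
        raise ValueError("pred_horizon is too short for this slice")
    return predicted[start:end]


pred_horizon, obs_horizon, action_horizon = 16, 2, 8  # assumed example values
predicted = list(range(pred_horizon))  # stand-in for a denoised action sequence
chunk = actions_to_execute(predicted, obs_horizon, action_horizon)
```

After the chunk is executed, the observation buffer is refreshed and a new sequence is denoised from scratch, so the later predicted steps serve only as a lookahead.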

Experimental Results

On the Push‑T task, the method achieves a success coverage of 0.19 after a single rollout, with smooth, temporally consistent trajectories. Visualizations show the robot pushing a T‑shaped block to a target region without jitter.

Discussion & Open Questions

Why predict past actions that are never executed? Possible benefits include temporal consistency and higher data reuse.

How does flow‑matching training compare to DDPM? In the article's preliminary experiments, flow matching gave higher loss and less stable trajectories.

The article also provides full code snippets for the environment, dataset, model, and training loops, enabling reproducibility.

