Model-Based Reinforcement Learning from Raw Video: A Detailed Walkthrough

The article explains how to train robots to learn tasks directly from raw video using model-based reinforcement learning, covering POMDP formulation, CNN auto‑encoders, latent‑space representations, iLQR optimization, and a step‑by‑step pipeline with concrete examples and references.


Why Learn from Raw Video

Visual perception is essential for intelligent decision‑making. Hand‑crafted features are labor‑intensive and do not transfer well to real‑world tasks, so learning directly from raw video is crucial.

Training vs. Testing

During training, a well‑controlled environment can be instrumented to identify target states; that instrumentation is not available when the solution is deployed in the real world.

Example: during training, the robot's left arm grasps a cube, so the cube's pose is known and can be used to pre‑train the arm to move to the target. During testing the robot must rely only on its camera observations.

Partially Observable Markov Decision Process (POMDP)

Raw images reside in a high‑dimensional space where information is entangled. The robot must learn to encode observations into task‑relevant features.

The encoded representation is then used to decide actions.

Model in Latent Space

Encoding images into a low‑dimensional latent space is common in deep learning. The key idea is to reconstruct the image with minimal error, typically using mean‑squared error (MSE) on pixels.

Reference: http://ml.informatik.uni-freiburg.de/former/_media/publications/rieijcnn12.pdf

In reinforcement learning, a CNN auto‑encoder can encode and decode raw images; minimizing the reconstruction error encourages the CNN to extract the key visual features.
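As a concrete illustration, here is a minimal PyTorch sketch of such a CNN auto‑encoder trained with pixel‑wise MSE. The input size (64×64 grayscale) and the 32‑dimensional latent code are illustrative assumptions, not values from the referenced papers.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),        # low-dimensional latent code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),   # 32x32 -> 64x64
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = ConvAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(8, 1, 64, 64)                 # stand-in batch of raw frames
recon, latent = model(images)
loss = nn.functional.mse_loss(recon, images)      # pixel-wise reconstruction error
opt.zero_grad()
loss.backward()
opt.step()
```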

Deep Spatial Auto‑Encoder

Introduced in the paper "Deep Spatial Autoencoders for Visuomotor Learning" (arXiv:1509.06113), the method learns robotic manipulation tasks directly from raw images. The training pipeline consists of five steps:

1. Set a target end‑effector pose.
2. Train an exploratory controller.
3. Learn an image embedding.
4. Provide a goal.
5. Train a final controller to achieve the goal.

Step 1 – Set Target End‑Effector Pose

The target pose is defined by three 3‑D points on the end‑effector, recorded after it has pushed the LEGO block into place.

During pre‑training the arm learns the dynamics to reach the target pose, not how to push the block.
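A minimal sketch of how such a target pose and its error metric might be represented; the coordinates below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical target pose: three 3-D points on the end-effector (meters),
# recorded after the block has been pushed to its goal position.
target_points = np.array([
    [0.52, 0.10, 0.75],
    [0.55, 0.12, 0.75],
    [0.52, 0.14, 0.78],
])  # shape (3, 3): three points x (x, y, z)

def pose_error(points, target=target_points):
    """Mean Euclidean distance between current and target end-effector points."""
    return np.linalg.norm(points - target, axis=1).mean()
```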

Step 2 – Train Exploratory Controller

Given the target pose, an exploratory controller is trained. The PR2 arm has seven degrees of freedom; the controller computes torques for the seven motors based on joint angles, velocities, and end‑effector position.

The policy is initialized randomly, takes safe random actions, and collects trajectories on the robot.

The loss is minimized to bring the arm as close as possible to the target pose with minimal effort.
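A sketch of the trajectory cost this description implies: distance to the target pose plus a control‑effort penalty. The weight lam and the array shapes are illustrative assumptions.

```python
import numpy as np

def trajectory_cost(ee_points, torques, target_points, lam=1e-3):
    """ee_points: (T, 3, 3) end-effector points over T time steps.
    torques:   (T, 7) motor torques for the 7-DoF PR2 arm.
    Returns distance-to-target plus an effort penalty."""
    dist = np.linalg.norm(ee_points - target_points, axis=2).sum()
    effort = lam * np.square(torques).sum()
    return dist + effort
```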

Step 3 – Learn Image Embedding

Images from the refined trajectories are collected, and a CNN auto‑encoder is trained on the reconstruction loss so that its convolutional layers extract feature points at task‑relevant locations.

During training the arm can hold the cube, so the three 3‑D points that define the target pose are known and can serve as supervision. A dense layer maps the extracted feature points to these known points, effectively combining supervised learning with the auto‑encoder.
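A sketch of that supervised head, assuming 16 feature‑point channels; the layer sizes are illustrative, and the combined loss shown in the comment is one plausible design, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Dense layer mapping the 2-D feature points (16 channels -> 32 values,
# an assumed size) to the three known 3-D points (9 values).
pose_head = nn.Linear(16 * 2, 9)

feature_points = torch.rand(8, 32)   # batch of flattened (x, y) feature points
true_points = torch.rand(8, 9)       # known 3-D points (arm holds the cube)
pred_points = pose_head(feature_points)
supervised_loss = nn.functional.mse_loss(pred_points, true_points)
# total_loss = reconstruction_loss + supervised_loss  (weighting is a design choice)
```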

Feature Points

Feature points are the spatial locations of maximal activation in each channel of the final convolutional layer. They correspond to task‑relevant positions, e.g., where the PR2 holds a spatula.

To extract them, a softmax is applied over the pixels of each channel, turning the activation map into a probability distribution over image locations. The feature point for channel c is the expected pixel coordinate under this distribution:

s_{c,ij} = \frac{\exp(a_{c,ij})}{\sum_{i',j'} \exp(a_{c,i'j'})}, \qquad p_c = \sum_{i,j} s_{c,ij}\,(x_i, y_j)

where a_{c,ij} is the activation of channel c at pixel (i, j) and (x_i, y_j) is that pixel's image coordinate.
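In code, this spatial soft‑argmax can be sketched as follows; normalizing coordinates to [-1, 1] is a common convention and an assumption here.

```python
import torch

def spatial_softmax(activations):
    """activations: (B, C, H, W) output of the last conv layer.
    Returns (B, C, 2): the expected (x, y) pixel coordinate per channel."""
    b, c, h, w = activations.shape
    probs = torch.softmax(activations.reshape(b, c, h * w), dim=-1)
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=-1)  # (H*W, 2)
    return probs @ coords  # expectation of position under the softmax

points = spatial_softmax(torch.rand(8, 16, 16, 16))  # -> (8, 16, 2)
```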

Visualizations show two feature points tracking the spatula and a bag, moving from red to yellow to green over time.

Combining State and Feature Points – POMDP

Feature points are concatenated with joint angles and end‑effector positions to form the POMDP state space. Any RL algorithm can then be applied.
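A sketch of assembling that state vector; the dimensions (16 feature points, a 7‑DoF arm) follow the text, while including joint velocities is an assumption based on the controller inputs described earlier.

```python
import numpy as np

feature_points = np.zeros(16 * 2)   # (x, y) per channel from the spatial softmax
joint_angles = np.zeros(7)          # 7-DoF PR2 arm
joint_velocities = np.zeros(7)      # assumed, matching the controller inputs
ee_position = np.zeros(9)           # three 3-D points on the end-effector
state = np.concatenate([feature_points, joint_angles, joint_velocities, ee_position])
```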

The cost function may be written, for example, as

\ell(\mathbf{x}_t, \mathbf{u}_t) = d(\mathbf{p}_t, \mathbf{p}^{\ast})^2 + \lambda\,\lVert \mathbf{u}_t \rVert^2

where d measures the distance between the feature points (or end‑effector points) \mathbf{p}_t and their target positions \mathbf{p}^{\ast}, and the second term penalizes control effort. iLQR is used to plan optimal controls, and dense networks can be trained to imitate the controller's trajectory samples.
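iLQR iteratively linearizes the dynamics and quadratizes this cost around the current trajectory, then solves an LQR problem on that local model. The sketch below shows the core backward pass of that inner LQR step; the matrices are placeholders supplied by the linearization, not values from the paper.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Backward Riccati recursion for dynamics x' = A x + B u and
    cost x'Qx + u'Ru. Returns time-indexed gains K_t with u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so gains[0] applies at t = 0
```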

Overall, the pipeline demonstrates how raw visual data can be encoded, used to learn dynamics, and integrated into reinforcement‑learning policies for robotic manipulation.
