How to Boost Robot Imitation Learning with Cosmos World Model Data Augmentation

This guide demonstrates an end‑to‑end workflow on Alibaba Cloud PAI that uses the Cosmos world model in place of Isaac simulation for robot action data augmentation, covering minimal human demonstrations, prompt‑driven data expansion, rejection sampling, IDM inverse‑dynamics action extraction, imitation‑learning fine‑tuning, and model evaluation.

In the previous Notebook series we introduced "Operation Action Data Augmentation and Imitation Learning based on Isaac Simulation". This article presents a similar pipeline built around the Cosmos world model, covering human demonstration, data augmentation, imitation learning, and model evaluation, without requiring simulation hardware.

Compared with the Isaac‑based approach, the Cosmos‑based solution has the following characteristics:

Human demonstration and data augmentation run entirely on AI compute (CUDA/Tensor cores) without simulation (RT Core) resources.

No need for action labeling; raw video data can be directly used for augmentation.

Data diversity is achieved by adjusting prompts rather than by running a separate augmentation stage.

As a trade‑off, additional steps such as rejection sampling and IDM inverse‑dynamics decoding are required to filter out physically implausible content and to recover the missing action sequences.

A ready‑to‑run example is provided in PAI’s Notebook Gallery:

https://gallery.pai-ml.com/#/preview/deepLearning/cv/isaac_gr00t_wf2

01 Minimal Human Demonstration

Human operators record short videos of pick‑and‑place tasks (e.g., picking a bok choy) without any action labeling. The video and a textual description such as "Use the right hand to pick up green bok choy from the tan table right side to the bottom level of the wire basket" are collected until enough samples (e.g., 100) are obtained for fine‑tuning the Cosmos‑Predict model.
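The recorded clips and their textual descriptions must be paired up before fine‑tuning. Below is a minimal sketch of such a collection step; it assumes a hypothetical layout where each MP4 sits next to a same‑named .txt prompt file, and `DEMO_DIR`, `build_manifest`, and the manifest format are illustrative, not part of the Cosmos tooling:

```python
import json
from pathlib import Path

# Hypothetical layout: each demo clip is an MP4 plus a .txt prompt
# with the same stem, e.g. demos/pick_001.mp4 + demos/pick_001.txt.
DEMO_DIR = Path("demos")
MANIFEST = Path("demo_manifest.json")

def build_manifest(demo_dir: Path) -> list[dict]:
    """Pair every recorded video with its textual task description."""
    entries = []
    for video in sorted(demo_dir.glob("*.mp4")):
        prompt_file = video.with_suffix(".txt")
        if not prompt_file.exists():
            continue  # skip clips that were never annotated
        entries.append({
            "video": str(video),
            "prompt": prompt_file.read_text().strip(),
        })
    return entries

if __name__ == "__main__":
    if DEMO_DIR.exists():
        entries = build_manifest(DEMO_DIR)
        MANIFEST.write_text(json.dumps(entries, indent=2))
        print(f"collected {len(entries)} demonstration pairs")
```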

02 Data Augmentation

The collected videos are used to fine‑tune the Cosmos‑Predict model (e.g., Cosmos-Predict2-2B-Video2World) on a 4‑GPU node:

!torchrun --nproc_per_node=4 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py --experiment=predict2_video2world_training_2b_groot_gr1_480

For larger models (e.g., Cosmos-Predict2-14B-Video2World), the same training can be launched on a 4‑node DLC cluster with 8 GPUs per node, for example by submitting a DLC job through the PAI SDK:

import os
import json
import time
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_credentials.models import Config as CredConfig
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, GetJobRequest

def wait_for_job_to_terminate(client, job_id):
    # Poll the DLC job status until it reaches a terminal state.
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)

def main():
    now = time.localtime()
    # Task name with a timestamp suffix to keep runs distinguishable.
    display_name = f"train_cosmos-predict2_14b_{now.tm_mday}_{now.tm_hour}-{now.tm_min}"
    region_id = os.environ.get("dsw_region")
    workspace_id = os.environ.get('PAI_WORKSPACE_ID')
    image_uri = f"dsw-registry.{region_id}.cr.aliyuncs.com/pai-training-algorithm/isaac-sim:gr00t-dreams-v9"
    ecs_spec = "ecs.gn8v-8x.16xlarge"
    num_gpus = 8
    num_nodes = 4
    config = "cosmos_predict2/configs/base/config.py"
    exp = "predict2_video2world_training_14b_groot_gr1_480"
    # Resource IDs referenced by the job spec below; fill in your own values.
    dataset_id = "<your-dataset-id>"
    vpc_id = "<your-vpc-id>"
    switch_id = "<your-vswitch-id>"
    security_groupid = "<your-security-group-id>"
    credentialsConfig = CredConfig(type='credentials_uri')
    cred = CredClient(credentialsConfig)
    dlc_client = DLCClient(config=Config(credential=cred, region_id=region_id, endpoint=f'pai-dlc.{region_id}.aliyuncs.com'))
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': display_name,
        'JobType': 'PyTorchJob',
        'JobSpecs': [{
            "Type": "Worker",
            "Image": image_uri,
            "PodCount": num_nodes,
            "EcsSpec": ecs_spec,
        }],
        'DataSources': [{"DataSourceId": dataset_id}],
        'UserVpc': {"VpcId": vpc_id, "SwitchId": switch_id, "SecurityGroupId": security_groupid},
        "UserCommand": f"export NVTE_FUSED_ATTN=0 && rm -rf /workspace/cosmos-predict2/checkpoints && rm -rf /workspace/cosmos-predict2/datasets/benchmark_train/gr1 && ln -s /mnt/data/notebook2/checkpoints /workspace/cosmos-predict2/checkpoints && ln -s /mnt/data/notebook2/gr1 /workspace/cosmos-predict2/datasets/benchmark_train/gr1 && cd /workspace/cosmos-predict2 && torchrun --nproc_per_node={num_gpus} --nnodes={num_nodes} --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 -m scripts.train --config={config} --experiment={exp} model.config.fsdp_shard_size=0"
    }))
    job_id = create_job_resp.body.job_id
    wait_for_job_to_terminate(dlc_client, job_id)

if __name__ == '__main__':
    main()
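Once the model is fine‑tuned, data expansion is driven purely by varying the text prompt. The sketch below generates prompt variants from a single template; the template, object names, and placements are illustrative examples, not the actual dataset vocabulary:

```python
import itertools

# Hypothetical prompt template for data expansion; the fillers below
# are illustrative, not taken from the actual training set.
TEMPLATE = "Use the right hand to pick up {obj} from {src} to {dst}."
objects = ["green bok choy", "rubik's cube", "red apple"]
sources = ["the tan table right side",
           "the bottom of the three-tiered wooden shelf"]
targets = ["the bottom level of the wire basket",
           "the top of the three-tiered wooden shelf"]

# Every combination of object, source, and target yields one prompt.
prompts = [
    TEMPLATE.format(obj=o, src=s, dst=d)
    for o, s, d in itertools.product(objects, sources, targets)
]
print(len(prompts))  # 3 * 2 * 2 = 12 prompt variants from one template
```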

03 Rejection Sampling

Multiple candidate videos are generated and scored by Cosmos-Reason1. The highest‑scoring video is kept. Scoring criteria include motion continuity, temporal consistency, physical plausibility, visual quality, and logical scene coherence.

!torchrun --nproc_per_node=4 --master_port=12341 -m examples.video2world_bestofn \
    --model_size 14B --gr00t_variant gr1 \
    --prompt "Use the right hand to pick up rubik's cube from the bottom of the three-tiered wooden shelf to the top of the three-tiered wooden shelf." \
    --input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik's_cube_from_the_bottom_of_the_three-tiered_wooden_shelf_to_the_top_of_the_three-tiered_wooden_shelf..png \
    --num_gpus 2 --num_generations 4 --prompt_prefix "" \
    --disable_guardrail --save_path output/best-of-n-gr00t-gr1
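Conceptually, the best‑of‑N loop generates several candidates per prompt, scores each one, and keeps the argmax. A minimal sketch of the selection logic, with `generate` and `score` as placeholders for Cosmos‑Predict2 inference and Cosmos‑Reason1 scoring:

```python
from typing import Callable

def best_of_n(
    generate: Callable[[str, int], object],   # (prompt, seed) -> candidate video
    score: Callable[[object], float],         # candidate -> quality score
    prompt: str,
    num_generations: int = 4,
) -> tuple[object, float]:
    """Generate N candidates for one prompt and keep the highest-scoring one."""
    best, best_score = None, float("-inf")
    for seed in range(num_generations):
        candidate = generate(prompt, seed)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

In the real pipeline, `score` would aggregate the Cosmos‑Reason1 criteria listed above (motion continuity, temporal consistency, physical plausibility, visual quality, and scene coherence) into a single value.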

04 IDM Inverse‑Dynamics Decoding

Since Cosmos‑Predict outputs videos without explicit action sequences, an Inverse Dynamics Model (IDM) extracts the missing actions:

!PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3 python IDM_dump/dump_idm_actions.py \
    --checkpoint "seonghyeonye/IDM_gr1" \
    --dataset "IDM_dump/data/gr1_unified.data" \
    --output_dir "IDM_dump/data/gr1_unified.data_idm" \
    --num_gpus 4 \
    --video_indices "0 8"
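The core of the IDM step is sliding a learned model over consecutive frame pairs to recover one action per transition. A minimal sketch, where the `idm` callable stands in for the actual `seonghyeonye/IDM_gr1` checkpoint:

```python
from typing import Any, Callable, Sequence

def extract_actions(
    frames: Sequence[Any],
    idm: Callable[[Any, Any], Any],   # (frame_t, frame_t+1) -> action_t
) -> list:
    """Recover the action sequence by applying the IDM to consecutive frames.

    A video with T frames yields T-1 actions, one per transition.
    """
    return [idm(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
```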

For environments with restricted access to Hugging Face, set the mirror endpoint: HF_ENDPOINT=https://hf-mirror.com. Results are stored in Parquet files, which can be inspected with parquet-tools:

!uv pip install parquet-tools
!parquet-tools csv IDM_dump/data/gr1_unified.data_idm/data/chunk-000/episode_000000.parquet

05 Imitation Learning

The augmented dataset is used to fine‑tune the GR00T‑N1 model:

!cd /workspace/GR00T-Dreams && export HF_HOME=/mnt/data/notebook2 && export WANDB_MODE=offline && \
    bash IDM_dump/scripts/finetune/gr1.sh

The training script (gr1.sh) defines a Config dataclass with hyper‑parameters such as batch size, learning rate, LoRA settings, and GPU allocation, then launches a TrainRunner using the LeRobotSingleDataset loader and the GR00T_N1 model.
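The pattern can be illustrated with a stripped‑down sketch; the field names and defaults below are assumptions for illustration, not the actual values in gr1.sh:

```python
from dataclasses import dataclass

# Illustrative sketch of the hyper-parameter dataclass pattern described
# above. Field names and defaults are assumptions, not the gr1.sh values.
@dataclass
class Config:
    dataset_path: str = "IDM_dump/data/gr1_unified.data_idm"
    batch_size: int = 32          # per-GPU batch size
    learning_rate: float = 1e-4
    max_steps: int = 10_000
    lora_rank: int = 16           # LoRA adapter rank
    lora_alpha: int = 32
    num_gpus: int = 8

    def effective_batch_size(self) -> int:
        """Global batch size across all data-parallel workers."""
        return self.batch_size * self.num_gpus

cfg = Config(batch_size=16, num_gpus=4)
print(cfg.effective_batch_size())  # 64
```

Collecting all hyper‑parameters in one dataclass keeps the training entry point declarative: a run is fully described by one Config instance, which the runner then consumes.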

06 Model Evaluation

Evaluation on a real GR1 robot shows that the baseline GR00T‑N1 model achieves 11.2% success on known scenes, while the model fine‑tuned with Cosmos‑augmented data reaches 43.2%. In unknown scenes, success improves from 0% to 28.5% after augmentation.
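As a quick sanity check, the reported success rates translate into the following absolute gains:

```python
# Success rates reported in the evaluation above, tabulated so the
# absolute improvement per scene type can be computed directly.
results = {
    "known scenes":   {"baseline": 0.112, "augmented": 0.432},
    "unknown scenes": {"baseline": 0.0,   "augmented": 0.285},
}

for scene, r in results.items():
    delta = r["augmented"] - r["baseline"]
    print(f"{scene}: +{delta:.1%} absolute improvement")
```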

07 Summary

This best practice demonstrates a complete end‑to‑end pipeline on Alibaba Cloud PAI that leverages the Cosmos world model for robot action data augmentation and imitation learning, covering minimal human demonstration, prompt‑driven augmentation, rejection sampling, IDM inverse‑dynamics action extraction, fine‑tuning, and quantitative model evaluation. Compared with Isaac simulation, the Cosmos‑based approach eliminates the need for simulation hardware, removes manual action labeling, and integrates data enhancement directly via prompt engineering, albeit requiring extra rejection sampling and IDM steps.

Tags: Data Augmentation, AI, Model Evaluation, Robotics, Imitation Learning, Cosmos
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
