How to Boost Robot Imitation Learning with Cosmos World Model Data Augmentation
This guide demonstrates an end‑to‑end workflow on Alibaba Cloud PAI that uses the Cosmos world model in place of Isaac simulation for robot action data augmentation, covering minimal human demonstrations, prompt‑driven data expansion, rejection sampling, IDM inverse‑dynamics action extraction, imitation‑learning fine‑tuning, and model evaluation.
In the previous Notebook series we introduced "Operation Action Data Augmentation and Imitation Learning based on Isaac Simulation". This article presents a similar pipeline that uses the Cosmos world model as the core, covering human demonstration, data augmentation, imitation learning, and model evaluation without requiring simulation hardware.
Compared with the Isaac‑based approach, the Cosmos‑based solution offers:
Human demonstration and data augmentation run entirely on AI compute (CUDA/Tensor cores) without simulation (RT Core) resources.
No need for action labeling; raw video data can be directly used for augmentation.
Data enhancement is achieved by adjusting prompts rather than a separate augmentation stage.
Additional steps, namely rejection sampling and IDM inverse‑dynamics decoding, are required to filter physically implausible content and recover the missing action sequences.
A ready‑to‑run example is provided in PAI’s Notebook Gallery:
https://gallery.pai-ml.com/#/preview/deepLearning/cv/isaac_gr00t_wf2
01 Minimal Human Demonstration
Human operators record short videos of pick‑and‑place tasks (e.g., picking a bok choy) without any action labeling. The video and a textual description such as "Use the right hand to pick up green bok choy from the tan table right side to the bottom level of the wire basket" are collected until enough samples (e.g., 100) are obtained for fine‑tuning the Cosmos‑Predict model.
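Because no action labels are collected, each clip only needs to be paired with its text description. A minimal sketch of how such a collection could be organized (the `build_manifest` helper, file names, and JSON layout are illustrative assumptions, not the pipeline's actual format):

```python
import json


def build_manifest(video_files, prompts):
    """Pair each demonstration video with its text description."""
    manifest = []
    for video in sorted(video_files):
        # Derive a clip id from the file name, e.g. "demos/demo_000.mp4" -> "demo_000".
        stem = video.rsplit("/", 1)[-1].removesuffix(".mp4")
        prompt = prompts.get(stem)
        if prompt is None:
            continue  # skip clips that have no description yet
        manifest.append({"video": video, "prompt": prompt})
    return manifest


prompts = {
    "demo_000": "Use the right hand to pick up green bok choy from the tan "
                "table right side to the bottom level of the wire basket",
}
manifest = build_manifest(["demos/demo_000.mp4", "demos/demo_001.mp4"], prompts)
print(json.dumps(manifest, indent=2))  # only the described clip is kept
```

Collection continues until enough labeled pairs (e.g., 100) exist to fine‑tune Cosmos‑Predict.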
02 Data Augmentation
The collected videos are used to fine‑tune the Cosmos‑Predict model (e.g., Cosmos-Predict2-2B-Video2World) on a 4‑GPU node:
```shell
!torchrun --nproc_per_node=4 --master_port=12341 -m scripts.train \
    --config=cosmos_predict2/configs/base/config.py \
    --experiment=predict2_video2world_training_2b_groot_gr1_480
```

For larger models (e.g., Cosmos-Predict2-14B-Video2World), the same training can be launched on a 4‑node DLC cluster with 8 GPUs per node:
```python
import os
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_credentials.models import Config as CredConfig
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, GetJobRequest


def wait_for_job_to_terminate(client, job_id):
    """Poll the DLC job status every 5 seconds until it terminates."""
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():
    now = time.localtime()
    display_name = f"train_cosmos-predict2_14b_{now.tm_mday}_{now.tm_hour}-{now.tm_min}"  # task name

    region_id = os.environ.get("dsw_region")
    workspace_id = os.environ.get('PAI_WORKSPACE_ID')
    image_uri = f"dsw-registry.{region_id}.cr.aliyuncs.com/pai-training-algorithm/isaac-sim:gr00t-dreams-v9"
    ecs_spec = "ecs.gn8v-8x.16xlarge"
    num_gpus = 8
    num_nodes = 4
    config = "cosmos_predict2/configs/base/config.py"
    exp = "predict2_video2world_training_14b_groot_gr1_480"

    # Fill in the IDs of your own dataset and VPC resources.
    dataset_id = "<your-dataset-id>"
    vpc_id = "<your-vpc-id>"
    switch_id = "<your-vswitch-id>"
    security_groupid = "<your-security-group-id>"

    credentials_config = CredConfig(type='credentials_uri')
    cred = CredClient(credentials_config)
    dlc_client = DLCClient(config=Config(
        credential=cred,
        region_id=region_id,
        endpoint=f'pai-dlc.{region_id}.aliyuncs.com',
    ))

    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': display_name,
        'JobType': 'PyTorchJob',
        'JobSpecs': [{
            "Type": "Worker",
            "Image": image_uri,
            "PodCount": num_nodes,
            "EcsSpec": ecs_spec,
        }],
        'DataSources': [{"DataSourceId": dataset_id}],
        'UserVpc': {"VpcId": vpc_id, "SwitchId": switch_id, "SecurityGroupId": security_groupid},
        "UserCommand": f"export NVTE_FUSED_ATTN=0 && rm -rf /workspace/cosmos-predict2/checkpoints && rm -rf /workspace/cosmos-predict2/datasets/benchmark_train/gr1 && ln -s /mnt/data/notebook2/checkpoints /workspace/cosmos-predict2/checkpoints && ln -s /mnt/data/notebook2/gr1 /workspace/cosmos-predict2/datasets/benchmark_train/gr1 && cd /workspace/cosmos-predict2 && torchrun --nproc_per_node={num_gpus} --nnodes={num_nodes} --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 -m scripts.train --config={config} --experiment={exp} model.config.fsdp_shard_size=0",
    }))
    job_id = create_job_resp.body.job_id
    wait_for_job_to_terminate(dlc_client, job_id)


if __name__ == '__main__':
    main()
```

03 Rejection Sampling
Multiple candidate videos are generated and scored by Cosmos-Reason1. The highest‑scoring video is kept. Scoring criteria include motion continuity, temporal consistency, physical plausibility, visual quality, and logical scene coherence.
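The selection logic behind best‑of‑N sampling is simple to state in code; a minimal sketch of the idea (the `score_video` aggregation, criterion names, and scores are illustrative assumptions — in the pipeline, scoring comes from Cosmos-Reason1):

```python
def score_video(scores):
    """Average the per-criterion scores (each in [0, 1]) into one scalar."""
    criteria = ("motion_continuity", "temporal_consistency",
                "physical_plausibility", "visual_quality", "scene_coherence")
    return sum(scores[c] for c in criteria) / len(criteria)


def best_of_n(candidates):
    """Keep only the highest-scoring generated video."""
    return max(candidates, key=lambda c: score_video(c["scores"]))


# Two hypothetical generations for the same prompt; the first has an
# implausible motion, dragging its physical_plausibility score down.
candidates = [
    {"path": "gen_0.mp4",
     "scores": {"motion_continuity": 0.9, "temporal_consistency": 0.8,
                "physical_plausibility": 0.4, "visual_quality": 0.9,
                "scene_coherence": 0.7}},
    {"path": "gen_1.mp4",
     "scores": {"motion_continuity": 0.8, "temporal_consistency": 0.9,
                "physical_plausibility": 0.9, "visual_quality": 0.8,
                "scene_coherence": 0.8}},
]
print(best_of_n(candidates)["path"])  # gen_1.mp4
```

In the pipeline, this generate‑score‑select loop is handled by the video2world_bestofn example: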
```shell
!torchrun --nproc_per_node=4 --master_port=12341 -m examples.video2world_bestofn \
    --model_size 14B --gr00t_variant gr1 \
    --prompt "Use the right hand to pick up rubik's cube from the bottom of the three-tiered wooden shelf to the top of the three-tiered wooden shelf." \
    --input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik's_cube_from_the_bottom_of_the_three-tiered_wooden_shelf_to_the_top_of_the_three-tiered_wooden_shelf..png \
    --num_gpus 2 --num_generations 4 --prompt_prefix "" \
    --disable_guardrail --save_path output/best-of-n-gr00t-gr1
```

04 IDM Inverse‑Dynamics Decoding
Since Cosmos‑Predict outputs videos without explicit action sequences, an Inverse Dynamics Model (IDM) extracts the missing actions:
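The idea in miniature: an IDM infers the action that explains the transition between consecutive states. A toy sketch with 1‑D positions, where finite differencing stands in for the learned model (everything here is an illustrative assumption — the real IDM operates on video frames):

```python
def recover_actions(positions):
    """Infer per-step actions as the change between consecutive states.

    A real IDM is a learned model over video frames; differencing known
    1-D positions is only a stand-in for the same idea.
    """
    return [round(b - a, 3) for a, b in zip(positions, positions[1:])]


# Gripper x-position over 5 frames of a generated video (made-up numbers).
positions = [0.0, 0.1, 0.3, 0.3, 0.2]
print(recover_actions(positions))  # [0.1, 0.2, 0.0, -0.1]
```

The actual extraction runs the pretrained IDM checkpoint over every generated video: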
```shell
!PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3 python IDM_dump/dump_idm_actions.py \
    --checkpoint "seonghyeonye/IDM_gr1" \
    --dataset "IDM_dump/data/gr1_unified.data" \
    --output_dir "IDM_dump/data/gr1_unified.data_idm" \
    --num_gpus 4 \
    --video_indices "0 8"
```

For environments with restricted access to Hugging Face, set the mirror endpoint `HF_ENDPOINT=https://hf-mirror.com`. The extracted actions are stored in Parquet files, which can be inspected with parquet-tools:
```shell
!uv pip install parquet-tools
!parquet-tools csv IDM_dump/data/gr1_unified.data_idm/data/chunk-000/episode_000000.parquet
```

05 Imitation Learning
The augmented dataset is used to fine‑tune the GR00T‑N1 model:
```shell
!cd /workspace/GR00T-Dreams/
!export HF_HOME=/mnt/data/notebook2 && export WANDB_MODE=offline && \
    bash IDM_dump/scripts/finetune/gr1.sh
```

The training script (gr1.sh) defines a Config dataclass with hyper‑parameters such as batch size, learning rate, LoRA settings, and GPU allocation, then launches a TrainRunner using the LeRobotSingleDataset loader and the GR00T_N1 model.
06 Model Evaluation
Evaluation on a real GR1 robot shows that the baseline GR00T‑N1 model achieves 11.2% success on known scenes, while the model fine‑tuned with Cosmos‑augmented data reaches 43.2%. In unknown scenes, success improves from 0% to 28.5% after augmentation.
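Success rates like these come from repeated physical trials per scene; a minimal sketch of the bookkeeping (the trial outcomes below are invented for illustration and do not reproduce the reported numbers):

```python
def success_rate(outcomes):
    """Fraction of successful trials, as a percentage (1 = success, 0 = failure)."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes over 10 trials in one known scene.
baseline  = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
finetuned = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
print(f"baseline:  {success_rate(baseline):.1f}%")   # 10.0%
print(f"finetuned: {success_rate(finetuned):.1f}%")  # 50.0%
```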
07 Summary
This best practice demonstrates a complete end‑to‑end pipeline on Alibaba Cloud PAI that leverages the Cosmos world model for robot action data augmentation and imitation learning, covering minimal human demonstration, prompt‑driven augmentation, rejection sampling, IDM inverse‑dynamics extraction, fine‑tuning, and quantitative model evaluation. Compared with Isaac simulation, the Cosmos‑based approach eliminates the need for simulation hardware, removes manual action labeling, and integrates data enhancement directly via prompt engineering, albeit at the cost of extra rejection sampling and IDM steps.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.