How to Build an End‑to‑End Hand‑Video to VLA Data Pipeline on Alibaba Cloud PAI with Data‑Juicer

This article details a step‑by‑step, distributed pipeline built on Alibaba Cloud PAI using Data‑Juicer and Ray that transforms raw egocentric hand videos into LeRobot v2.0‑compatible Vision‑Language‑Action (VLA) training data, covering video splitting, frame extraction, camera calibration, 3D hand reconstruction, pose estimation, action captioning, and export, with code snippets, performance numbers, and references.


Background

Vision‑Language‑Action (VLA) models for embodied AI require large‑scale, high‑quality interaction data. Collecting it through tele‑operation is costly and hard to scale, while the Internet offers abundant egocentric hand‑operation videos that could serve as an alternative source. The technical challenge is to convert these raw videos into VLA‑ready training data in a reproducible, scalable way.

Pipeline Overview

The end‑to‑end pipeline runs on Alibaba Cloud PAI using the Data‑Juicer framework and Ray for distributed execution. It transforms first‑person hand videos into the LeRobot v2.0 dataset format through a sequence of Data‑Juicer operators.
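As a mental model, the whole flow can be sketched in a few lines; the manifest path and the use of ray.data.read_json here are illustrative assumptions rather than pipeline code:

import ray

# Connect to the local machine or, on PAI-DLC, to the running Ray cluster.
ray.init()

# One record per raw video; each stage below appends new metadata fields
# (clips, keyframes, calibration, hand poses, camera poses, captions).
ds = ray.data.read_json("videos_manifest.jsonl")  # hypothetical manifest of video paths

# Stages 1-2: video splitting and frame extraction (CPU-bound mappers).
# Stages 3-6: GPU model inference via ds.map_batches(...).
# Stage 7: export of the accumulated metadata to the LeRobot v2.0 layout.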

Stage 1 – Video Splitting

Long videos (from minutes to hours) are cut into overlapping 20‑second clips to bound per‑clip memory usage; the overlap ensures that actions crossing a cut boundary stay intact and can be stitched back together downstream.

import os

from data_juicer.ops.mapper import VideoSplitByDurationMapper

# output_dir, video_key, and clip_key are defined by the surrounding pipeline script.
video_split_op = VideoSplitByDurationMapper(
    split_duration=20,                      # clip length in seconds
    keep_original_sample=False,             # keep only the clips, drop the source video
    save_dir=os.path.join(output_dir, 'clips'),
    video_backend="ffmpeg",
    ffmpeg_extra_args="-movflags frag_keyframe+empty_moov",
    skip_op_error=False,                    # fail fast on corrupted inputs
    batch_mode=True,
    video_key=video_key,                    # input field holding the video paths
    save_field=clip_key,                    # output field for the generated clip paths
    legacy_split_by_text_token=False,
)

Stage 2 – Frame Extraction

All‑keyframe sampling extracts only I‑frames, storing their file paths in the video_frames metadata field for downstream operators. This avoids the storage overhead of extracting every frame.

from data_juicer.ops.mapper import VideoExtractFramesMapper
from data_juicer.utils.constant import MetaKeys

extract_frames_op = VideoExtractFramesMapper(
    frame_sampling_method="all_keyframes",  # sample I-frames only
    output_format='path',                   # store file paths rather than raw arrays
    frame_dir=os.path.join(output_dir, 'frames'),
    frame_field=MetaKeys.video_frames,      # metadata field consumed by later stages
    video_key="videos",
    video_backend='ffmpeg',
    batch_mode=True,
)
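After these two stages, a single dataset record looks roughly like the sketch below; the exact nesting of the metadata container is an assumption for illustration:

# Illustrative record shape after Stages 1-2 (field names follow the
# snippets above; the metadata container key is an assumption).
sample = {
    "videos": ["clips/video_0001_clip_003.mp4"],  # one 20-second clip from Stage 1
    "meta": {
        "video_frames": [                         # I-frame paths from Stage 2
            "frames/clip_003/key_0001.jpg",
            "frames/clip_003/key_0002.jpg",
        ],
    },
}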

Stage 3 – Camera Calibration & Depth (MoGe‑2)

MoGe‑2 is a monocular depth and geometry model that provides precise focal‑length estimates and dense depth maps. The focal length is fed to later stages, and the depth map is reused as a prior for MegaSAM, eliminating a separate depth inference step.

from ray.data import ActorPoolStrategy

from data_juicer.ops.mapper import VideoCameraCalibrationMogeMapper

# ds is the Ray Dataset flowing out of the earlier stages.
ds = ds.map_batches(
    VideoCameraCalibrationMogeMapper,
    fn_constructor_kwargs=dict(
        model_path="Ruicheng/moge-2-vitl",               # MoGe-2 checkpoint
        tag_field_name=MetaKeys.camera_calibration_moge_tags,
        frame_field=MetaKeys.video_frames,
        output_depth=True,                               # depth maps are reused by MegaSAM
        output_points=False,
        output_mask=False,
        batch_mode=True,
        skip_op_error=skip_op_error,                     # set by the surrounding script
    ),
    batch_size=1,
    num_gpus=0.15,                                       # fractional GPU per actor
    compute=ActorPoolStrategy(min_size=1, max_size=10),  # autoscaling actor pool
)
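To see why an accurate focal length matters downstream, consider the pinhole back-projection that turns a depth map into metric 3D points. A minimal numpy sketch (not pipeline code), assuming the principal point sits at the image center:

import numpy as np

def backproject(depth: np.ndarray, f: float) -> np.ndarray:
    """Map an (H, W) depth image to an (H, W, 3) point map in camera coordinates."""
    h, w = depth.shape
    cx, cy = w / 2.0, h / 2.0                       # assumed principal point
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / f                        # pinhole model: X = (u - cx) * Z / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

An error in f scales every back-projected X/Y coordinate, which is why the quality of the MoGe‑2 estimate directly affects the reconstructed hand trajectories.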

Stage 4 – 3D Hand Reconstruction (HaWoR)

HaWoR, a world‑space hand‑motion reconstruction model for egocentric video, recovers the 6‑DoF wrist pose and MANO joint angles of each hand per frame. The focal length estimated by MoGe‑2 improves the reconstruction accuracy.

from data_juicer.ops.mapper import VideoHandReconstructionHaworMapper

ds = ds.map_batches(
    VideoHandReconstructionHaworMapper,
    fn_constructor_kwargs=dict(
        camera_calibration_field=MetaKeys.camera_calibration_moge_tags,  # focal length from Stage 3
        tag_field_name=MetaKeys.hand_reconstruction_hawor_tags,
        mano_right_path='/path/to/MANO_RIGHT.pkl',  # MANO models must be downloaded separately
        mano_left_path='/path/to/MANO_LEFT.pkl',    # (see the MANO link in the references)
        frame_field=MetaKeys.video_frames,
        batch_mode=True,
    ),
    batch_size=1,
    num_gpus=0.1,
    compute=ActorPoolStrategy(min_size=1, max_size=2),
)
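For later coordinate-frame changes it helps to pack each per-frame wrist pose into a 4x4 homogeneous transform. A hypothetical helper, assuming the rotation arrives as an axis-angle vector (the parameterization MANO uses for its global orientation):

import numpy as np
from scipy.spatial.transform import Rotation

def wrist_pose_to_matrix(translation: np.ndarray, axis_angle: np.ndarray) -> np.ndarray:
    """Build a 4x4 camera-frame wrist transform from a 6-DoF pose."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(axis_angle).as_matrix()
    T[:3, 3] = translation
    return T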

Stage 5 – Camera Pose Estimation (MegaSAM)

MegaSAM, built on DROID‑SLAM, estimates metric‑scale camera poses. By replacing its original depth priors with MoGe‑2 outputs, the pipeline gains higher accuracy and avoids loading an extra depth model.

from data_juicer.ops.mapper import VideoCameraPoseMegaSaMMapper

ds = ds.map_batches(
    VideoCameraPoseMegaSaMMapper,
    fn_constructor_kwargs=dict(
        tag_field_name=MetaKeys.video_camera_pose_tags,
        camera_calibration_field=MetaKeys.camera_calibration_moge_tags,  # MoGe-2 depth priors
        batch_mode=True,
    ),
    batch_size=1,
    num_gpus=0.1,
    runtime_env={"env_vars": {"PYTHONPATH": "/opt/megasam-ext"}},  # MegaSAM extension on each worker
    compute=ActorPoolStrategy(min_size=1, max_size=2),
)
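The payoff of this stage is that the Stage 4 hand poses, reconstructed in the moving camera frame, can be lifted into a single world frame. A minimal sketch, assuming both poses are available as 4x4 homogeneous transforms:

import numpy as np

def hand_to_world(T_world_cam: np.ndarray, T_cam_hand: np.ndarray) -> np.ndarray:
    """Compose the camera-to-world pose with the camera-frame wrist pose."""
    return T_world_cam @ T_cam_hand

# Applied per frame, this yields a world-frame wrist trajectory:
# trajectory = [hand_to_world(T_wc, T_ch) for T_wc, T_ch in zip(cam_poses, hand_poses)]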

Stage 6 – Action Captioning

A visual‑language model receives the extracted keyframes together with a hand‑specific prompt and returns a JSON object containing a short reasoning trace and an imperative action description. The operator supports both API calls (e.g., Qwen) and local vLLM inference.

# Simplified definition of the mapper (full code omitted)
from data_juicer.ops.base_op import Mapper

class VideoActionCaptioningMapper(Mapper):
    DEFAULT_SYSTEM_PROMPT = (
        "You are a multimodal expert specializing in video captioning "
        "for egocentric human‑object interaction (HOI) clips."
    )
    DEFAULT_USER_PROMPT_TEMPLATE = """Below are video frames ... describe the specific {hand_type}-hand action..."""
    ...
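Because VLMs often wrap the JSON object in markdown fences or extra text, the mapper needs a tolerant parsing step. A sketch of that step; the "reasoning" and "action" keys follow the prose above and are assumptions:

import json
import re

def parse_caption_response(text: str) -> dict:
    # Grab the outermost {...} span, tolerating fences or surrounding chatter.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model response")
    obj = json.loads(match.group(0))
    return {"reasoning": obj.get("reasoning", ""), "action": obj.get("action", "")}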

Stage 7 – Export to LeRobot v2.0

The final operator writes MP4 video clips, Parquet action/state vectors, and JSONL task descriptions into the directory layout required by the LeRobot v2.0 dataset.

from data_juicer.ops.mapper import ExportToLeRobotMapper

# LEROBOT_OUTPUT_DIR is the target dataset root on shared storage.
export_op = ExportToLeRobotMapper(
    output_dir=LEROBOT_OUTPUT_DIR,
    hand_action_field=MetaKeys.hand_action_tags,  # action vectors from Stages 4-6
    frame_field=MetaKeys.video_frames,
    video_key=clip_key,
    task_description_key="text",                  # captions from Stage 6
    fps=10,
    robot_type="egodex_hand",
    batch_mode=True,
)

# After all clips are written, build the dataset-level metadata files.
ExportToLeRobotMapper.finalize_dataset(
    output_dir=LEROBOT_OUTPUT_DIR,
    fps=10,
    robot_type="egodex_hand",
)
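For orientation, the on-disk result follows the LeRobot v2.0 convention, roughly like the tree below; the chunk numbering and the camera key shown here are assumptions:

LEROBOT_OUTPUT_DIR/
├── meta/
│   ├── info.json                        # fps, robot_type, feature schema
│   ├── tasks.jsonl                      # task descriptions from Stage 6
│   └── episodes.jsonl                   # per-episode lengths and task indices
├── data/
│   └── chunk-000/
│       └── episode_000000.parquet       # action/state vectors
└── videos/
    └── chunk-000/
        └── observation.images.ego/      # camera key (assumed)
            └── episode_000000.mp4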

Distributed Execution on PAI‑DLC

All operators run as native Ray map_batches calls, which gives automatic data sharding, fault tolerance, and near-linear scaling. In the demo, 10,000 egocentric videos (about 25 hours of footage in total) are processed on 32 GN8V GPUs across four machines in 66 minutes, i.e., roughly 23x faster than real time, with GPU utilization stable at around 95%.

[Figure: performance chart]

Conclusion and Outlook

The pipeline constitutes the first publicly available, turnkey solution that converts raw egocentric hand videos into standardized VLA training data. Because environment management, data-format conversion, distributed scheduling, and fault tolerance are encapsulated in Data‑Juicer operators, the whole workflow can be run with a few lines of configuration. Future work will add data‑quality assessment, richer augmentation, and tighter integration with downstream robot‑learning frameworks.

References

LeRobot v2.0 repository: https://github.com/huggingface/lerobot

MANO hand model: https://mano.is.tue.mpg.de/

Data‑Juicer PR #931 (visualization tools): https://github.com/modelscope/data-juicer/pull/931

Written by Alibaba Cloud Big Data AI Platform