How X‑VLA Enables 120‑Minute Unassisted Robot Clothing Folding with a 0.9B Model

The X‑VLA paper introduces a 0.9‑billion‑parameter, fully open‑source embodied model that uses a learnable soft prompt and divide‑and‑conquer encoding to handle heterogeneous robot vision inputs. It completes a 120‑minute autonomous clothing‑folding run and reports state‑of‑the‑art results on benchmarks in five simulation environments.

Data Party THU

Overview

X‑VLA is a general‑purpose cross‑embodiment foundation model for robots that jointly processes heterogeneous visual inputs from a primary viewpoint and auxiliary cameras. It uses a learnable Soft‑Prompt to encode hardware configuration and a divide‑and‑conquer encoding scheme that separates high‑level semantic extraction (via a visual‑language model) from lightweight spatial feedback (via small auxiliary networks). The model’s backbone is a standard Transformer, and action generation is performed with a probabilistic flow‑matching decoder to improve trajectory smoothness and robustness.

Illustration of X‑VLA architecture

Core Method

The Soft‑Prompt learns a continuous representation of robot hardware (e.g., degrees of freedom, camera placements) that decouples task strategy from specific actuators, enabling the same model to adapt across diverse platforms. The primary camera feed is encoded by a high‑capacity visual‑language model to capture semantic cues, while auxiliary views are processed by lightweight networks that provide spatial feedback, optimizing computational resource allocation.
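A minimal sketch of the soft‑prompt idea, assuming one small table of learnable vectors per robot platform that is prepended to the encoded observation tokens. The embodiment names, dimensions, and helper functions below are hypothetical and are not taken from the X‑VLA code; in real training the prompt vectors would be updated by gradient descent, which is omitted here.

```python
import random

EMBED_DIM = 4    # toy token width (assumption, not from the paper)
PROMPT_LEN = 2   # soft-prompt tokens per embodiment (assumption)

def init_soft_prompts(embodiments, seed=0):
    """Create one table of (would-be learnable) prompt vectors per platform."""
    rng = random.Random(seed)
    return {
        name: [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
               for _ in range(PROMPT_LEN)]
        for name in embodiments
    }

def build_input_sequence(soft_prompts, embodiment, obs_tokens):
    """Prepend the embodiment's soft prompt to the observation tokens."""
    return soft_prompts[embodiment] + obs_tokens

# Hypothetical platform names, used only for illustration.
prompts = init_soft_prompts(["single_arm", "bimanual_rig"])
# Stand-in for already-encoded camera and language tokens.
obs = [[0.0] * EMBED_DIM for _ in range(3)]
seq = build_input_sequence(prompts, "single_arm", obs)
```

Because the hardware description lives only in the prompt table, switching platforms in this sketch means selecting a different key while the rest of the sequence construction stays unchanged.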

Action sequences are generated with a flow‑matching decoder, which models robot motions probabilistically rather than deterministically: it learns a velocity field that transports random noise to an action sequence, yielding smoother and more robust trajectories in uncertain environments.
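The sampling side of flow matching can be sketched as follows, assuming the common straight‑line path between noise and action. This is a toy illustration, not the paper's decoder: `oracle_velocity` is a hypothetical stand‑in for a trained network, and in a real policy the velocity would be predicted from observations rather than computed from a known target action.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Regression target for the velocity network on the straight path."""
    return [a - n for n, a in zip(noise, action)]

def oracle_velocity(x, t, action):
    """Stand-in for a trained network: exact velocity toward `action`."""
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, action)]

def sample_action(action, steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in action]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = oracle_velocity(x, t, action)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

demo_action = [0.3, -0.1, 0.5]   # hypothetical 3-DoF action
result = sample_action(demo_action)
```

During training, a network would be fit to `target_velocity` at points produced by `interpolate` for random `t`; at inference only the integration loop is needed.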

Data Pipeline and Pre‑training

Balanced data sampling: custom sampling ensures each modality (vision, language, action) contributes equally during training, preventing bias toward any single source.

Multimodal cleaning and spatio‑temporal alignment: raw robot operation logs are unified into a common task space, and high‑frequency recordings are temporally aligned and resampled to a consistent rate.

Semantic‑action alignment criteria: only samples with clear visual frames, precise language instructions, and a strong correlation to the subsequent actions are retained, so that the training data reflects the causal link between instruction, observation, and behavior.
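Two of the pipeline steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: resampling is shown as plain linear interpolation of a 1‑D signal, and balancing is shown as choosing a data source uniformly before choosing an episode, which is one simple way to stop a large source from dominating. The function names and the toy datasets are hypothetical.

```python
import random
from bisect import bisect_right

def resample_linear(timestamps, values, rate_hz):
    """Linearly interpolate a 1-D signal onto a uniform time grid."""
    period = 1.0 / rate_hz
    n = int((timestamps[-1] - timestamps[0]) / period + 1e-9) + 1
    out = []
    for i in range(n):
        t = timestamps[0] + i * period
        # Index of the right-hand neighbour, clamped to a valid segment.
        j = min(bisect_right(timestamps, t), len(timestamps) - 1)
        j = max(j, 1)
        t0, t1 = timestamps[j - 1], timestamps[j]
        w = (t - t0) / (t1 - t0)
        out.append((1.0 - w) * values[j - 1] + w * values[j])
    return out

def balanced_sample(datasets, n, seed=0):
    """Pick a source uniformly, then an episode uniformly within it."""
    rng = random.Random(seed)
    names = sorted(datasets)
    picks = []
    for _ in range(n):
        name = rng.choice(names)
        picks.append((name, rng.choice(datasets[name])))
    return picks

# A 2 Hz joint reading resampled to 4 Hz.
resampled = resample_linear([0.0, 0.5, 1.0], [0.0, 1.0, 2.0], rate_hz=4)

# Toy sources of very different size (hypothetical).
toy = {"big": list(range(1000)), "small": list(range(10))}
picks = balanced_sample(toy, 2000)
```

With uniform source selection, the 10‑episode source is drawn about as often as the 1000‑episode one, at the cost of repeating its episodes more frequently.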

Fine‑tuning Strategies

Layer‑wise adaptive learning rates: the pretrained visual‑language backbone, the Soft‑Prompt, and the Transformer layers each receive a distinct learning‑rate schedule, preserving pretrained knowledge while allowing rapid adaptation of critical components.

Progressive warm‑up for heterogeneous modules: newly introduced learnable parameters start with a linearly increasing learning rate, stabilizing early training before full‑scale optimization.
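The two strategies above combine naturally into per‑group schedules. The sketch below assumes three parameter groups and a linear warm‑up applied only to the newly introduced one; the group names, learning rates, and step counts are illustrative placeholders, not values reported in the paper.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warm-up from 0 to base_lr, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

# Hypothetical per-group settings (illustrative values only).
PARAM_GROUPS = {
    "vlm_backbone": {"base_lr": 1e-5, "warmup_steps": 0},     # pretrained, small LR
    "transformer":  {"base_lr": 1e-4, "warmup_steps": 0},
    "soft_prompt":  {"base_lr": 1e-3, "warmup_steps": 1000},  # newly added, warmed up
}

def lrs_at(step):
    """Learning rate of every parameter group at a given training step."""
    return {
        name: warmup_lr(step, g["base_lr"], g["warmup_steps"])
        for name, g in PARAM_GROUPS.items()
    }

halfway = lrs_at(500)
```

In a deep‑learning framework the same mapping would typically be passed to the optimizer as parameter groups with a per‑group scheduler; the pure‑Python form here only shows the arithmetic.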

Experimental Results

Scaling‑law curves show linear performance improvement as model size and data volume increase, supporting the scalability of the Soft‑Prompt mechanism and the streamlined Transformer architecture. X‑VLA achieves state‑of‑the‑art results on five benchmark simulators (e.g., LIBERO, SIMPLER) and demonstrates robust real‑world performance, completing an uninterrupted 120‑minute autonomous clothing‑folding task with only 0.9B parameters.

All code, data, and model checkpoints are publicly released at https://github.com/2toinf/X-VLA.git, and the full paper is available at https://arxiv.org/pdf/2510.10274.

Written by Data Party THU, the official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
