How X‑VLA Enables 120‑Minute Unassisted Robot Clothing Folding with a 0.9B Model
The X‑VLA paper introduces a 0.9‑billion‑parameter, fully open‑source embodied model that uses a learnable Soft‑Prompt and divide‑and‑conquer encoding to handle heterogeneous robot vision inputs. It completes a record‑breaking 120‑minute autonomous clothing‑folding task and surpasses prior results on benchmarks across five simulation environments.
Overview
X‑VLA is a general‑purpose cross‑embodiment foundation model for robots that jointly processes heterogeneous visual inputs from a primary viewpoint and auxiliary cameras. It uses a learnable Soft‑Prompt to encode hardware configuration and a divide‑and‑conquer encoding scheme that separates high‑level semantic extraction (via a visual‑language model) from lightweight spatial feedback (via small auxiliary networks). The model’s backbone is a standard Transformer, and action generation is performed with a probabilistic flow‑matching decoder to improve trajectory smoothness and robustness.
Core Method
The Soft‑Prompt learns a continuous representation of robot hardware (e.g., degrees of freedom, camera placements) that decouples task strategy from specific actuators, enabling the same model to adapt across diverse platforms. The primary camera feed is encoded by a high‑capacity visual‑language model to capture semantic cues, while auxiliary views are processed by lightweight networks that provide spatial feedback, optimizing computational resource allocation.
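One way to picture the Soft‑Prompt idea is a small table of learnable vectors, one set per robot setup, that is prepended to the observation tokens before they reach the Transformer. The following is a minimal sketch under that assumption, not the X‑VLA implementation: the class name `SoftPromptTable`, the embodiment names, and the token sizes are all hypothetical, and plain Python floats stand in for trainable parameters.

```python
import random


class SoftPromptTable:
    """Hypothetical store of per-embodiment prompt tokens.

    Each robot setup owns `num_tokens` vectors of width `dim`. In a real
    model these would be trainable parameters updated by backpropagation.
    """

    def __init__(self, embodiments, num_tokens=4, dim=8, seed=0):
        rng = random.Random(seed)
        self.num_tokens = num_tokens
        self.dim = dim
        self.prompts = {
            name: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(num_tokens)]
            for name in embodiments
        }

    def prepend(self, embodiment, tokens):
        """Return the prompt tokens for `embodiment` followed by `tokens`."""
        if any(len(tok) != self.dim for tok in tokens):
            raise ValueError("token width must match prompt width")
        return self.prompts[embodiment] + list(tokens)


# Illustrative embodiment names; the shared backbone would see the same
# observation tokens but a different prefix for each robot.
table = SoftPromptTable(["single_arm_7dof", "dual_arm_14dof"])
obs_tokens = [[0.0] * 8 for _ in range(10)]  # stand-in for fused vision/language tokens
seq = table.prepend("dual_arm_14dof", obs_tokens)
print(len(seq))  # 4 prompt tokens + 10 observation tokens = 14
```

Because only the prefix changes between platforms, the rest of the network can be shared, which is the decoupling the paragraph above describes.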
Action sequences are generated by a flow‑matching decoder, which models robot motions probabilistically rather than deterministically, yielding smoother and more robust trajectories in uncertain environments.
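At inference time, flow matching is commonly sampled by starting from Gaussian noise and integrating a velocity field from t = 0 to t = 1. The sketch below shows that loop with a plain Euler integrator. It is illustrative only: `toy_velocity` is a hypothetical stand-in for the learned network (it uses the closed-form conditional velocity of a straight-line path toward a known target), and the step count and action values are assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the conditional
    velocity toward a known endpoint x1 is (x1 - x_t) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def sample_actions(target, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) toward t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-D action (e.g. an end-effector displacement).
target_action = [0.2, -0.1, 0.05]
action = sample_actions(target_action)
print([round(a, 6) for a in action])
```

With a trained network in place of `toy_velocity`, different noise seeds would yield different but plausible action chunks, which is where the probabilistic behaviour comes from.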
Data Pipeline and Pre‑training
Balanced data sampling: custom sampling ensures each modality (vision, language, action) contributes equally during training, preventing bias toward any single source.
Multimodal cleaning and spatio‑temporal alignment: raw robot operation logs are unified into a common task space, and high‑frequency recordings are temporally aligned and resampled to a consistent rate.
Semantic‑action alignment criteria: only samples with clear visual frames, precise language instructions, and strong correlation to subsequent actions are retained, so the training data reflects a causal link between what the robot sees, what it is told, and what it does.
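Two of the steps above lend themselves to a short sketch: weighting samples so that no single source dominates, and resampling a recording onto a uniform time grid. The code below is an assumption‑laden illustration, not the paper's pipeline. The function names, the inverse‑size weighting rule, the linear interpolation, and the source names are all hypothetical choices.

```python
from bisect import bisect_right


def balanced_weights(source_sizes):
    """Per-sample weights that give every source the same total sampling mass."""
    n = len(source_sizes)
    return {name: 1.0 / (n * size) for name, size in source_sizes.items()}


def resample_linear(times, values, rate_hz):
    """Linearly interpolate a 1-D signal onto a uniform grid at `rate_hz`.

    Assumes `times` is strictly increasing and has at least two entries.
    """
    step = 1.0 / rate_hz
    out = []
    k = 0
    while times[0] + k * step <= times[-1] + 1e-9:
        t = min(times[0] + k * step, times[-1])
        # Index of the segment [times[i], times[i + 1]] that contains t.
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        t0, t1 = times[i], times[i + 1]
        w = (t - t0) / (t1 - t0)
        out.append((t, values[i] + w * (values[i + 1] - values[i])))
        k += 1
    return out


# Hypothetical dataset sizes: the small source gets a larger per-sample weight.
weights = balanced_weights({"lab_arm_logs": 100, "sim_rollouts": 900})

# A joint-angle trace recorded at 2 Hz, resampled to 4 Hz.
resampled = resample_linear([0.0, 0.5, 1.0], [0.0, 1.0, 2.0], rate_hz=4)
print(resampled)
```

The weights can be passed to any weighted sampler; the resampled trace has five evenly spaced points covering the original one‑second window.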
Fine‑tuning Strategies
Layer‑wise adaptive learning rates: the pretrained visual‑language backbone, the Soft‑Prompt, and the Transformer layers each receive their own learning‑rate schedule, preserving pretrained knowledge while allowing rapid adaptation of critical components.
Progressive warm‑up for heterogeneous modules: newly introduced learnable parameters start with a linearly increasing learning rate, stabilizing early training before full‑scale optimization.
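The two strategies above can be combined as per‑group learning rates with an optional linear warm‑up. The sketch below is a hypothetical configuration: the group names, base rates, and warm‑up length are illustrative assumptions, not values from the paper. A base rate of 0 for the backbone would freeze it entirely.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear ramp up to `base_lr` over `warmup_steps`, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


# Hypothetical groups: a small rate protects the pretrained backbone, while
# the newly introduced Soft-Prompt parameters ramp up gradually.
PARAM_GROUPS = {
    "vlm_backbone": {"base_lr": 1e-5, "warmup_steps": 0},
    "soft_prompt": {"base_lr": 1e-4, "warmup_steps": 1000},
    "transformer": {"base_lr": 1e-4, "warmup_steps": 0},
}


def lrs_at(step):
    """Learning rate of every parameter group at a given training step."""
    return {
        name: warmup_lr(step, g["base_lr"], g["warmup_steps"])
        for name, g in PARAM_GROUPS.items()
    }


for step in (0, 499, 1000):
    print(step, lrs_at(step))
```

In a training framework the same dictionary would typically map onto optimizer parameter groups, with `lrs_at` called once per step to update each group's rate.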
Experimental Results
Scaling‑law curves show linear performance improvement as model size and data volume increase, confirming the scalability of the Soft‑Prompt mechanism and the streamlined Transformer architecture. X‑VLA achieves state‑of‑the‑art results on five benchmark simulators (e.g., LIBERO, SIMPLER) and demonstrates robust real‑world performance, completing an uninterrupted 120‑minute autonomous clothing‑folding task with only 0.9B parameters.
All code, data, and model checkpoints are publicly released at https://github.com/2toinf/X-VLA.git, and the full paper is available at https://arxiv.org/pdf/2510.10274.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
