Beyond VLA: How Tactile Sensing Redefines Embodied AI with VTLA

In an IEEE Spectrum interview, robotics veteran Wang Yu argues that the vision‑language‑action (VLA) paradigm lacks the physical feedback needed for reliable manipulation, proposes a vision‑tactile‑language‑action (VTLA) framework, and details the open‑source Daimon‑Infinity tactile dataset and sensor technology that aim to reshape embodied AI.

Machine Heart

Problem with Vision‑Only Robotics

Current VLA (vision‑language‑action) architectures rely solely on visual perception. When a robot attempts tasks such as picking up a glass, harvesting a strawberry, or inserting a wire, vision can locate the object but cannot report force, angle, contact state, or whether the task has been completed. This lack of physical feedback limits stable manipulation in real environments.

Role of Tactile Sensing

Physical feedback (material properties, friction, contact force, deformation) fills the blind spot of vision. Tactile sensing captures contact force, deformation, contact state, slip, texture, and material information, enabling robots to move from recognizing objects to manipulating them.

VTLA Framework

Wang Yu and his team propose VTLA (vision‑tactile‑language‑action), extending VLA by treating tactile perception as a modality equal to vision. The tactile stream is fused with visual, language, and action data, allowing models to learn manipulation policies that incorporate physical interaction signals.
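As a rough illustration of what "a modality equal to vision" can mean in practice, the sketch below concatenates per‑modality feature vectors and feeds them to a single linear policy head. This is a minimal sketch under stated assumptions, not the team's architecture: the `Observation` class, the `fuse` and `linear_policy` functions, the feature sizes, and the weights are all hypothetical, and concatenation is only one of several possible fusion schemes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """One time step of model input (all fields are hypothetical encoder outputs)."""
    vision: List[float]    # visual feature vector
    tactile: List[float]   # tactile feature vector, given the same status as vision
    language: List[float]  # instruction embedding


def fuse(obs: Observation) -> List[float]:
    """Concatenate the modality features into one vector (simple concatenation fusion)."""
    return obs.vision + obs.tactile + obs.language


def linear_policy(features: List[float], weights: List[List[float]]) -> List[float]:
    """Map fused features to an action vector with a single linear layer."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


obs = Observation(vision=[1.0, 0.0], tactile=[0.5], language=[2.0])
weights = [
    [1.0, 0.0, 2.0, 0.0],  # action dimension 0 responds to vision and tactile input
    [0.0, 1.0, 0.0, 0.5],  # action dimension 1 responds to vision and language input
]
action = linear_policy(fuse(obs), weights)
print(action)  # [2.0, 1.0]
```

Because the tactile features enter the same fused vector as the visual ones, a change in contact signal alters the action directly instead of being inferred from pixels.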

Dataset: Daimon‑Infinity

Daimon‑Infinity is an open‑source multimodal dataset containing 10,000 hours of synchronized vision‑tactile‑language‑action recordings.

The data is collected through a global “external” embodied data‑collection network of lightweight devices that can operate in diverse real‑world scenes; the network is described as capable of producing millions of hours of data per year.

The dataset was co‑created with dozens of institutions, including Peking University, Tsinghua University, HKUST, DeepMind, Northwestern University, NUS, China Mobile, and Hiconics.

It is presented as high‑quality, reliable, low‑cost data that includes contact force, deformation, slip, material, and texture alongside visual and language annotations.
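To make "synchronized" concrete, the following sketch pairs a camera frame with the tactile reading closest to it in time. It is an illustration only: the actual Daimon‑Infinity schema is not described in the source, so the `TactileFrame` and `Sample` classes, their field names, the units, and the `nearest` helper are all assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TactileFrame:
    t: float              # timestamp in seconds
    contact_force: float  # assumed unit: newtons
    slip: bool            # whether slip was detected in this frame


@dataclass
class Sample:
    t: float              # camera timestamp in seconds
    image_id: str         # hypothetical reference to the stored camera frame
    instruction: str      # language annotation
    tactile: TactileFrame


def nearest(frames: Sequence[TactileFrame], t: float) -> TactileFrame:
    """Return the tactile frame closest in time to t; frames must be sorted by t."""
    if not frames:
        raise ValueError("no tactile frames")
    times = [f.t for f in frames]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = frames[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda f: abs(f.t - t))


frames = [
    TactileFrame(t=0.00, contact_force=0.0, slip=False),
    TactileFrame(t=0.01, contact_force=0.8, slip=False),
    TactileFrame(t=0.02, contact_force=0.9, slip=True),
]
sample = Sample(
    t=0.012,
    image_id="frame_000001",
    instruction="pick up the glass",
    tactile=nearest(frames, 0.012),
)
print(sample.tactile.t)  # 0.01
```

Nearest‑timestamp matching is a common choice when the tactile stream runs faster than the camera; interpolating between neighbouring tactile frames is an alternative.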

Vision‑Tactile Sensor

The sensor is fingertip‑sized and carries 110,000 sensing units, described as the highest density reported in the industry. It delivers high sampling frequency and bandwidth for real‑time signal processing, and is robust to drift, electromagnetic interference, and environmental factors. Because the sensor converts fingertip surface deformation into visual images, its output can be fed into existing VLA pipelines in the same way as camera frames.
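Since the sensor output is an image, standard image operations apply to it. The sketch below shows one generic technique for vision‑based tactile sensors: subtract a no‑contact reference frame from the current frame and threshold the difference to find where the surface deformed. It is not the Daimon sensor's actual processing; the `contact_mask` and `contact_area` functions, the grayscale format, and the threshold value are assumptions chosen for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale image, pixel values 0-255


def contact_mask(reference: Image, current: Image, threshold: int = 20) -> List[List[bool]]:
    """Mark pixels whose intensity differs from the no-contact frame by more than threshold."""
    same_shape = len(reference) == len(current) and all(
        len(r) == len(c) for r, c in zip(reference, current)
    )
    if not same_shape:
        raise ValueError("reference and current images must have the same shape")
    return [
        [abs(c - r) > threshold for r, c in zip(ref_row, cur_row)]
        for ref_row, cur_row in zip(reference, current)
    ]


def contact_area(mask: List[List[bool]]) -> int:
    """Count the pixels flagged as in contact."""
    return sum(cell for row in mask for cell in row)


reference = [[100, 100, 100],
             [100, 100, 100],
             [100, 100, 100]]
current = [[100, 110, 100],   # small change, below the threshold
           [100, 160, 100],   # large change at the centre: contact
           [100, 100, 100]]

mask = contact_mask(reference, current)
print(contact_area(mask))  # 1
```

A real pipeline would more likely pass the image to a learned encoder, but the reference‑frame difference shows why deformation images carry contact information.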

Single‑Color Vision‑Tactile Technology

Developed to emulate human fingertip skin, the technology retains the advantages of multi‑color approaches while reducing complexity and cost and improving reliability. It captures multidimensional tactile information (force/torque, shape, material, contact state) in a compact form factor.

Commercial “3D” Strategy

Devices → Data → Deployment. The company first builds high‑performance tactile devices, then generates large‑scale multimodal datasets, and finally integrates both into robot models for real‑world deployment. Each component is presented as indispensable for embodied AI progress.

Deployment Scenarios

Example: a humanoid robot in a convenience store must reach into narrow shelves, a task that humans accomplish with delicate movements of three fingers. Tactile feedback is required to judge position, slip, and applied force so that the robot neither drops nor damages the object.
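The trade‑off between dropping and damaging an object can be sketched as a simple feedback rule: tighten the grip when slip is detected, but never beyond a force limit. This is a minimal sketch, not a controller described in the source; the `adjust_grip` function, the step size, and the force limit are hypothetical values.

```python
def adjust_grip(force: float, slip_detected: bool,
                step: float = 0.2, max_force: float = 5.0) -> float:
    """Raise grip force by step when slip is detected, capped at max_force to avoid damage."""
    if slip_detected:
        return min(force + step, max_force)
    return force


force = 1.0  # initial grip force (assumed unit: newtons)
for slip in [False, True, True, False]:
    # Each iteration stands in for one tactile reading during the reach.
    force = adjust_grip(force, slip)
print(round(force, 2))  # 1.4
```

Without the slip signal the controller has no reason to change force, which is the gap a vision‑only pipeline leaves open.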

Future Outlook

Embedding tactile perception into embodied AI is expected to enable humanoid robots that can operate reliably in unstructured environments such as hotels, restaurants, and pharmacies, advancing physical AI toward practical applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
