FTP-1: First Generalist Tactile Foundation Model Unifying 21 Sensors for Diverse Robots

FTP-1, a new generalist tactile foundation policy trained on the 3,000‑hour FTP‑1‑Dataset covering 21 heterogeneous sensors from 26 sources, introduces a morphology‑aware token space and an independent tactile transformer expert, achieving up to 31.6‑percentage‑point gains on unseen sensors and consistently outperforming prior VLA baselines across 14 real‑world manipulation tasks.

Machine Heart

Jun 27, 2026

Background

Vision‑language‑action (VLA) models such as π₀.₅ and GR00T N1.5 demonstrate that large‑scale heterogeneous pre‑training can produce transferable manipulation policies. Contact‑rich tasks (insertion, force‑controlled wiping, in‑hand adjustment, bottle‑cap twisting) still lack a comparable tactile foundation because existing tactile policies are tightly coupled to specific sensors and hardware.

Why Tactile Foundations Have Been Missing

Sensor heterogeneity: GelSight, Contactile, and force/torque sensors differ in format, resolution, and physical form, preventing direct reuse of learned experience.

Simple fusion fails: Injecting raw tactile tokens into a VLM backbone interferes with visual‑language knowledge (e.g., Tactile‑VLA achieves 35.8 % success vs. π₀.₅’s 45.3 %).

No unified pre‑training corpus: Unlike ImageNet‑scale visual data, tactile robotics has lacked a large, cross‑sensor dataset.

Research Question

Can a single tactile policy absorb heterogeneous tactile experience and transfer to sensors and robot bodies that were never seen during pre‑training?

Proposed Solution: MTTS + Independent Tactile Expert

Morphology‑Aware Tactile Token Space (MTTS): Maps any tactile input, whether image‑based (GelSight), array‑based (Contactile), or force/torque signals, into 24 functional‑region tokens (e.g., thumb tip, index tip, palm, wrist). The token embedding encodes the location on the end effector rather than the raw sensor type, so the tokens keep consistent semantics across parallel grippers, dexterous hands, and different sensor families.

Independent Tactile Transformer Expert (~300 M parameters): All tactile tokens are routed to this dedicated module, and the action head reads the resulting representation. Gradients are blocked from the visual‑language expert, which preserves existing VLM knowledge while the tactile expert learns reusable tactile features.
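The region-slot idea behind MTTS can be illustrated with a minimal Python sketch. It is not the released implementation: only the count of 24 regions and the four example region names come from the article, while the remaining region names, the `TactileReading` type, and the summary-statistics `encode` function (a stand-in for a learned per-sensor encoder) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

NUM_REGIONS = 24  # region count stated in the article

# Only the first four names come from the article; the rest are placeholders.
REGIONS: List[str] = ["thumb_tip", "index_tip", "palm", "wrist"] + [
    f"region_{i}" for i in range(4, NUM_REGIONS)
]


@dataclass
class TactileReading:
    region: str              # where on the end effector the sensor sits
    modality: str            # "image", "array", or "state"
    values: Sequence[float]  # flattened raw signal, any length


def encode(reading: TactileReading) -> List[float]:
    """Toy stand-in for a learned per-sensor encoder: fixed-width summary."""
    v = list(reading.values)
    if not v:
        raise ValueError("empty tactile reading")
    return [sum(v) / len(v), max(v), min(v)]


def to_region_tokens(
    readings: Sequence[TactileReading],
) -> List[Optional[List[float]]]:
    """One slot per functional region; None marks regions with no sensor."""
    slots: List[Optional[List[float]]] = [None] * NUM_REGIONS
    for r in readings:
        # The slot index carries the location; the sensor type does not.
        slots[REGIONS.index(r.region)] = encode(r)
    return slots


# A parallel gripper with an image-type fingertip sensor and a wrist F/T sensor.
tokens = to_region_tokens([
    TactileReading("index_tip", "image", [0.0, 0.5, 1.0, 0.5]),
    TactileReading("wrist", "state", [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]),
])
```

Because every embodiment fills the same 24 slots, a gripper and a dexterous hand differ only in which slots are populated, which is the property that lets a downstream module treat them uniformly.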

During downstream fine‑tuning, a new sensor encoder can be trained from scratch while re‑using the pre‑trained tactile expert, MTTS embeddings, and the shared vision‑tactile transformer.
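The fine‑tuning recipe in the paragraph above amounts to an initialization plan. The sketch below spells it out under stated assumptions: the module names are hypothetical labels for the components the article lists, and the article does not say whether the reused modules are frozen or updated further during fine‑tuning, so the plan records only where each module's weights come from.

```python
from typing import Dict, Iterable

# Hypothetical labels for the components the article says are reused.
REUSED_FROM_PRETRAINING = (
    "tactile_expert",
    "mtts_embeddings",
    "vision_tactile_transformer",
)


def build_init_plan(
    model_modules: Iterable[str], new_sensor_encoder: str
) -> Dict[str, str]:
    """Map each module to 'random_init' or 'load_checkpoint'."""
    plan: Dict[str, str] = {}
    for name in model_modules:
        if name == new_sensor_encoder:
            plan[name] = "random_init"      # trained from scratch
        elif name in REUSED_FROM_PRETRAINING:
            plan[name] = "load_checkpoint"  # weights from tactile pre-training
        else:
            raise ValueError(f"no initialization rule for module {name!r}")
    return plan


plan = build_init_plan(
    ["new_sensor_encoder", *REUSED_FROM_PRETRAINING],
    new_sensor_encoder="new_sensor_encoder",
)
```

Raising on unknown module names is a deliberate choice in this sketch: it keeps a module from silently starting at random weights when it was expected to come from the checkpoint.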

FTP‑1 Dataset

The dataset comprises roughly 3,000 hours of tactile manipulation data from 26 sources, covering 21 distinct sensors (7 image‑type, 5 array‑type, 9 state‑type). After resampling, the mix is ~20 % human demonstrations, ~30 % dexterous‑hand data, and ~50 % parallel‑gripper data. Sharpa contributed 4,000 long‑horizon demonstrations collected with the Dynamic Tactile Array (DTC) on the Sharpa North platform; all annotations are standardized under MTTS and language instructions are diversified via GPT‑4o. The dataset is positioned as the “ImageNet of tactile robotics”.
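The figures above can be sanity‑checked and turned into a sampling rule with a short Python sketch. The sensor counts and the 20/30/50 mix are from the article; the category labels and the use of weighted random sampling to realize the mix are assumptions, since the article does not describe how the resampling is implemented.

```python
import random
from collections import Counter
from typing import Dict, List

# Sensor counts by type, as reported in the article.
SENSOR_COUNTS: Dict[str, int] = {"image": 7, "array": 5, "state": 9}
TOTAL_SENSORS = sum(SENSOR_COUNTS.values())  # should equal the stated 21

# Approximate post-resampling mix from the article (labels are hypothetical).
TARGET_MIX: Dict[str, float] = {
    "human": 0.20,
    "dexterous_hand": 0.30,
    "parallel_gripper": 0.50,
}


def sample_sources(batch_size: int, rng: random.Random) -> List[str]:
    """Draw the data category for each example in proportion to TARGET_MIX."""
    names = list(TARGET_MIX)
    weights = [TARGET_MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)


rng = random.Random(0)  # fixed seed so the draw is repeatable
n = 10_000
counts = Counter(sample_sources(n, rng))
shares = {name: counts[name] / n for name in TARGET_MIX}
```

With a large enough draw, `shares` settles close to the target mix regardless of how many raw hours each category contributed.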

Evaluation Protocol

FTP‑1 checkpoints were distributed to five independent research groups, which fine‑tuned and evaluated on 14 tasks spanning in‑hand adjustment, force‑controlled pressing, insertion/withdrawal, flexible‑object interaction, and long‑horizon dual‑arm manipulation. Tasks were split into:

Seen‑sensor settings: Sensors present in the pre‑training mix.

Unseen‑sensor settings: Sensors absent from pre‑training.
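The split above is a set‑membership test on each task's sensor. A minimal Python sketch follows; the task names and the `sensor_a`/`sensor_b` labels are placeholders, because the article does not list which sensor each of the 14 tasks uses. FlexivXense is the one sensor the article explicitly describes as unseen.

```python
from typing import Dict, List, Set, Tuple


def split_by_sensor(
    task_sensor: Dict[str, str], pretraining_sensors: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (seen_tasks, unseen_tasks), each sorted by task name."""
    seen = sorted(t for t, s in task_sensor.items() if s in pretraining_sensors)
    unseen = sorted(
        t for t, s in task_sensor.items() if s not in pretraining_sensors
    )
    return seen, unseen


# Placeholder inputs for illustration only.
pretraining_sensors = {"sensor_a", "sensor_b"}
task_sensor = {
    "task_1": "sensor_a",
    "task_2": "FlexivXense",
    "task_3": "sensor_b",
}
seen_tasks, unseen_tasks = split_by_sensor(task_sensor, pretraining_sensors)
```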

Results

On seen sensors, FTP‑1 improves success rate by 17.2 percentage points over π₀.₅ (62.5 % vs. 45.3 %).

On unseen sensors, FTP‑1 outperforms the strongest baseline by 31.6 percentage points.

In the UniVTAC simulation environment, average success reaches 66.7 %, 17.5 pp above the best baseline; focusing on contact‑heavy tasks, FTP‑1 attains 59.5 % vs. 42.0 % for the architecture‑only variant (FTP‑π₀.₅).

Task‑specific example – “Twist Cap”: FTP‑1 maintains stable pressure and slows insertion when tactile feedback indicates mis‑alignment, whereas π₀.₅ simply pushes.

Ablation with NTP‑1 (no tactile pre‑training) shows a clear drop on the unseen FlexivXense sensor, confirming that FTP‑1 learns transferable tactile knowledge rather than sensor‑specific tricks.
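The gains above are stated in percentage points, which are absolute differences between success rates and not relative improvements. The short Python sketch below recomputes the gaps from the success rates quoted in this article; the helper names are made up for illustration, and the relative figures are derived here and do not appear in the source.

```python
def pp_gain(new_rate: float, base_rate: float) -> float:
    """Absolute difference between two success rates, in percentage points."""
    return round(new_rate - base_rate, 1)


def relative_gain(new_rate: float, base_rate: float) -> float:
    """Improvement as a percentage of the baseline success rate."""
    return round(100.0 * (new_rate - base_rate) / base_rate, 1)


# Seen-sensor setting: FTP-1 at 62.5 % vs. π₀.₅ at 45.3 %.
seen_pp = pp_gain(62.5, 45.3)
seen_rel = relative_gain(62.5, 45.3)

# UniVTAC contact-heavy tasks: FTP-1 at 59.5 % vs. FTP-π₀.₅ at 42.0 %.
contact_pp = pp_gain(59.5, 42.0)
contact_rel = relative_gain(59.5, 42.0)

# Naive fusion: Tactile-VLA at 35.8 % vs. π₀.₅ at 45.3 % (a negative gap).
fusion_pp = pp_gain(35.8, 45.3)
```

Here `seen_pp` reproduces the reported 17.2 points, while `seen_rel` shows that the same gap is a considerably larger number when expressed relative to the baseline.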

Key Components

MTTS – a universal tactile token language covering 21 sensors.

FTP‑1‑Dataset – ~3,000 hours of heterogeneous tactile experience.

Reusable tactile expert that can be fine‑tuned for new sensors without retraining from scratch.

Pre‑trained model, dataset, and training code are publicly released at https://ftp1-policy.github.io/.
