
MINT: Enabling Strong Generalization and One‑Shot Transfer for Vision‑Language‑Action Models

MINT introduces spectrally disentangled action tokenization and an intent-driven policy that let Vision-Language-Action models generalize compositionally, transfer from a single demonstration, and achieve state-of-the-art performance and robustness across benchmark suites and real-world robot experiments.

Machine Heart

Jun 10, 2026

Current Vision‑Language‑Action (VLA) models excel in fixed scenes but fail to generalize when object positions, lighting, or backgrounds change, often requiring large numbers of new demonstrations for each task.

The paper identifies two core challenges: compositional generalization (combining learned skills A, B, and C into new sequences) and few-shot or one-shot transfer (learning a new task from only a handful of demonstrations, or from just one).

To address these, researchers at Shanghai Chuangzhi Institute and Shanghai Jiao Tong University propose MINT, a VLA architecture that treats action trajectories as temporal signals and applies Spectrally Disentangled Action Tokenization (SDAT). SDAT maps each trajectory to multi-scale tokens: a coarse S1 "Intent" token representing the low-frequency, high-level intent, and progressively finer "Execution" tokens S2 through SK capturing high-frequency control details. Together they form a pyramid of representations.

Residual learning ensures finer tokens model only the residual information not captured by coarser tokens.

Coarse‑to‑fine multi‑scale reconstruction uses each token set to reconstruct the trajectory, preserving information at every scale.
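The residual, coarse-to-fine idea can be illustrated with a small hand-rolled pyramid. This is only a sketch under stated assumptions: the article does not describe SDAT's learned encoder or its token format, so plain window averaging stands in for the tokenizer, and the names `pool`, `unpool`, and `residual_pyramid` as well as the scale sizes are hypothetical.

```python
import numpy as np

def pool(x, n):
    """Average a (T, D) signal over n equal windows; T must be divisible by n."""
    T, D = x.shape
    return x.reshape(n, T // n, D).mean(axis=1)

def unpool(code, T):
    """Hold each of the n window values constant to get back T time steps."""
    return np.repeat(code, T // code.shape[0], axis=0)

def residual_pyramid(traj, scales=(1, 4, 16)):
    """Encode coarse to fine; each scale summarizes what coarser scales missed."""
    T = traj.shape[0]
    recon = np.zeros_like(traj)
    codes, errors = [], []
    for n in scales:
        code = pool(traj - recon, n)       # summarize the current residual only
        recon = recon + unpool(code, T)    # cumulative reconstruction so far
        codes.append(code)
        errors.append(float(np.mean((traj - recon) ** 2)))
    return codes, errors

# Toy 16-step, 2-D trajectory (assumption: a random walk, not robot data).
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=(16, 2)), axis=0)
codes, errors = residual_pyramid(traj)
print([c.shape for c in codes])  # [(1, 2), (4, 2), (16, 2)]
print(errors)                    # reconstruction error after each scale
```

In this toy version the single coarsest code plays the role of the Intent token and the finer codes play the role of Execution tokens. Because every level encodes only the residual, the fine codes average to zero over each coarser window, so they carry no information the coarser levels already hold, and the reconstruction error cannot increase as levels are added.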

Frequency‑domain reconstruction computes loss in the frequency domain, explicitly separating low‑ and high‑frequency components.
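A frequency-domain loss of this kind might look roughly like the following sketch. The article does not give the actual loss, so the band split at a fixed bin (`cutoff`), the squared-error form, and the weights `w_low` and `w_high` are all assumptions made for illustration.

```python
import numpy as np

def spectral_loss(pred, target, cutoff=4, w_low=1.0, w_high=1.0):
    """Squared spectral error, reported separately below and above bin `cutoff`."""
    P = np.fft.rfft(pred, axis=0, norm="ortho")
    Y = np.fft.rfft(target, axis=0, norm="ortho")
    err = np.abs(P - Y) ** 2
    low = float(err[:cutoff].mean())    # slow, intent-like components
    high = float(err[cutoff:].mean())   # fast, control-detail components
    return w_low * low + w_high * high, low, high

# Toy target: one slow cycle plus a small fast ripple over 32 steps.
t = np.arange(32) / 32
slow = np.sin(2 * np.pi * 1 * t)
ripple = 0.1 * np.sin(2 * np.pi * 8 * t)
target = slow + ripple

_, low_a, high_a = spectral_loss(slow, target)          # ripple is missing
_, low_b, high_b = spectral_loss(target + 0.5, target)  # constant offset
print(low_a, high_a)
print(low_b, high_b)
```

A prediction that misses the ripple is penalized only in the high band, while a prediction with a constant offset is penalized only in the low band. That separation is what lets coarse and fine errors be weighted or routed independently.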

Policy learning follows an "Intent → Execution" hierarchy: the model first predicts the Intent token, then generates the Execution tokens level by level, and finally decodes all tokens into a continuous control trajectory. This staged reasoning first fixes the desired behavior and then fills in the necessary control details, improving learning efficiency and stability on long-horizon tasks.
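The generation order described above can be sketched as a simple loop. Everything here is a hypothetical stand-in: `intent_head` and `make_execution_head` replace the learned prediction heads, and `decode` is a toy decoder that holds each token over its time window and sums the levels. The article does not specify MINT's real interfaces.

```python
import numpy as np

def generate(obs, predictors):
    """Run predictors coarse to fine; each sees the observation and all coarser tokens."""
    tokens = []
    for predict in predictors:  # predictors[0] produces the intent token
        tokens.append(predict(obs, tuple(tokens)))
    return tokens

def decode(tokens, horizon):
    """Toy decoder: hold each token over its window and sum the levels."""
    traj = np.zeros((horizon, tokens[0].shape[1]))
    for tok in tokens:
        traj += np.repeat(tok, horizon // tok.shape[0], axis=0)
    return traj

# Hypothetical stand-ins for learned prediction heads.
def intent_head(obs, coarser):
    return obs["goal"].reshape(1, -1)  # one coarse token summarizing the motion

def make_execution_head(n):
    def head(obs, coarser):
        scale = 0.1 * np.abs(coarser[0]).max()  # detail size depends on the intent
        ramp = np.linspace(-1.0, 1.0, n).reshape(n, 1)
        return scale * ramp * np.ones_like(coarser[0])
    return head

obs = {"goal": np.array([0.4, -0.2])}
heads = [intent_head, make_execution_head(2), make_execution_head(4)]
tokens = generate(obs, heads)
trajectory = decode(tokens, horizon=8)
print([tok.shape for tok in tokens])  # [(1, 2), (2, 2), (4, 2)]
print(trajectory.shape)               # (8, 2)
```

The point of the structure is that every finer head is conditioned on the already-fixed intent, so the coarse decision is made once and the later stages only add detail around it.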

Because the Intent token encodes abstract behavior, it can be injected directly as a task specification, enabling one‑shot transfer: a single demonstration provides an Intent token, which is inserted into the policy to generate the required execution details for a new task without further training.
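As a rough illustration of intent injection, the sketch below overrides the predicted first token with one extracted from a single demonstration and leaves the rest of the pipeline untouched. The helper names (`extract_intent`, `generate`, `intent_head`, `execution_head`) and the use of a time average as the "intent" are assumptions for this toy, not the paper's encoder or policy.

```python
import numpy as np

def extract_intent(demo):
    """Hypothetical intent encoder: the coarsest summary of a (T, D) demonstration."""
    return demo.mean(axis=0, keepdims=True)

def generate(obs, predictors, intent=None):
    """Coarse-to-fine generation; a supplied intent token replaces the predicted one."""
    first = predictors[0](obs, ()) if intent is None else intent
    tokens = [first]
    for predict in predictors[1:]:
        tokens.append(predict(obs, tuple(tokens)))
    return tokens

def intent_head(obs, coarser):
    """Stand-in for a trained head that only knows the old task."""
    return np.zeros((1, 2))

def execution_head(obs, coarser):
    """Stand-in for a trained head whose detail is conditioned on the intent."""
    return 0.5 * np.repeat(coarser[0], 4, axis=0)

# A single demonstration of the new task (toy values).
demo = np.array([[0.0, 1.0], [0.2, 1.0], [0.4, 1.0], [0.6, 1.0]])

default = generate(None, [intent_head, execution_head])
injected = generate(None, [intent_head, execution_head], intent=extract_intent(demo))
print(default[1])
print(injected[1])
```

No parameters change between the two calls. Only the first token differs, and the unchanged execution head produces different control details because it is conditioned on that token.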

Experimental results

On the LIBERO, CALVIN, and MetaWorld benchmarks, MINT surpasses existing state‑of‑the‑art methods:

LIBERO: MINT‑30M (97.1% success) outperforms SmolVLA (88.8%); MINT‑4B (98.3%) exceeds π₀.₅ (96.9%).

CALVIN: MINT‑4B shows superior performance on long‑sequence tasks, confirming stable long‑horizon execution.

MetaWorld “hard” tasks: MINT‑4B achieves a success rate nearly three times that of π₀.

Robustness tests on LIBERO-Plus with varied camera angles, lighting, backgrounds, and visual noise show that MINT's performance drop is far smaller than that of OpenVLA or π₀.₅. MINT maintains success rates of 84.6% to 96.6% under strong perturbations, highlighting the importance of modeling behavioral intent for generalization.

For skill transfer, MINT attains 90% success on a new task with only one demonstration, compared to 42% for fine‑tuning‑based methods. It also demonstrates compositional generalization: after seeing only skills A and B, a single demo enables execution of the combined A→B task.

Real‑world validation

Using a Piper-X 6-DoF arm, the team trained on tasks such as picking and placing bananas, stacking blocks, and inserting markers with only 20 demonstrations each, then tested on an unseen cup-stacking task. MINT-4B improved overall success by 29% over π₀.₅, showed higher precision in stacking and marker insertion, and successfully transferred the abstract "stack" intent to the new cup-stacking task, where other methods failed.

These results confirm that MINT learns transferable behavior structures rather than mere trajectory imitation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
