MINT: Enabling Strong Generalization and One‑Shot Transfer for Vision‑Language‑Action Models
MINT introduces a spectrally disentangled tokenization and intent‑driven strategy that lets Vision‑Language‑Action models generalize compositionally, transfer with a single demonstration, and achieve state‑of‑the‑art performance and robustness across benchmark suites and real‑world robot experiments.
