Task Tokens Cut Per-Task Trainable Parameters 125× and Boost Convergence 6× for Embodied AI

The Task Tokens method introduced by an Israeli research team reduces the number of trainable parameters per task by up to 125‑fold and speeds up convergence by six times, while preserving the flexibility of Behavior Foundation Models and demonstrating strong performance, robustness, and compatibility across a suite of embodied control tasks.


Recent advances in imitation learning have led to Transformer‑based Behavior Foundation Models (BFMs) that can generate human‑like robot control from high‑level goals, but fine‑grained tasks often require cumbersome prompt engineering.

To address this, the authors propose Task Tokens, a technique that adapts a Goal-Conditioned Behavior Foundation Model (MaskedMimic) to specific tasks while keeping the base model frozen. Compared with standard baselines, Task Tokens reduce the per-task trainable parameter count by up to 125× and improve convergence speed by up to 6×.

The approach builds on the MaskedMimic architecture, which combines a Transformer with random masking of future goal tokens. Task Tokens introduce three token types:

Prior Token: an optional user-defined behavior prior supplied via text or joint conditions.

Task Token: generated by a task-specific encoder that processes the current goal observation.

State Token: encodes the current environment state.
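To make the three-token scheme concrete, here is a minimal sketch of how the tokens could be assembled into the frozen BFM's input sequence. The function name, dimensions, and ordering are illustrative assumptions, not MaskedMimic's actual interface:

```python
import numpy as np

TOKEN_DIM = 64  # assumed token width, for illustration only

def build_input_sequence(state_token, task_token, prior_token=None):
    """Stack the optional prior, the task token, and the state token
    into one input sequence for the frozen Transformer backbone."""
    tokens = []
    if prior_token is not None:   # user-defined behavior prior (text / joints)
        tokens.append(prior_token)
    tokens.append(task_token)     # output of the trainable task encoder
    tokens.append(state_token)    # current environment state
    return np.stack(tokens)

state = np.zeros(TOKEN_DIM)
task = np.ones(TOKEN_DIM)
print(build_input_sequence(state, task).shape)                      # (2, 64)
prior = np.full(TOKEN_DIM, 0.5)
print(build_input_sequence(state, task, prior_token=prior).shape)   # (3, 64)
```

The key design point is that the prior slot is optional: the same frozen model consumes either a two-token or a three-token sequence, which is what lets users layer extra constraints on top of a trained Task Token.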

The task encoder is implemented as a feed‑forward neural network that receives proprioceptive observations (e.g., target direction, desired speed) and outputs a compact Task Token. During training, Proximal Policy Optimization (PPO) updates only the task encoder while the BFM remains frozen.
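Since only the task encoder is trained, the per-task parameter budget is just the encoder's weights. The sketch below shows a plausible two-layer feed-forward encoder; the hidden size, token width, and observation layout are assumptions (the paper reports only that the trainable footprint is roughly 200 K parameters):

```python
import numpy as np

class TaskEncoder:
    """Feed-forward network mapping goal observations to a Task Token.
    During training, PPO would update only these weights; the BFM stays frozen.
    All dimensions here are hypothetical."""

    def __init__(self, obs_dim, hidden_dim, token_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (obs_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.1, (hidden_dim, token_dim))
        self.b2 = np.zeros(token_dim)

    def __call__(self, obs):
        h = np.tanh(obs @ self.W1 + self.b1)   # single hidden layer
        return h @ self.W2 + self.b2           # compact Task Token

    def num_params(self):
        return sum(p.size for p in (self.W1, self.b1, self.W2, self.b2))

# Example goal observation for the Direction task:
# a 2-D target direction plus a desired speed (layout is an assumption).
encoder = TaskEncoder(obs_dim=3, hidden_dim=256, token_dim=64)
token = encoder(np.array([1.0, 0.0, 1.5]))
print(token.shape)           # (64,)
print(encoder.num_params())  # 17472 at these toy dimensions
```

Even at realistic widths, a network of this shape stays orders of magnitude smaller than the 9.3 M-25 M parameters touched by the fine-tuning baselines discussed below.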

Evaluation uses a standardized suite of five tasks that increase in complexity: Direction (move along a target direction), Steering (move while keeping pelvis orientation), Reach (precise hand‑to‑target reaching), Strike (approach and knock down a target), and Long Jump (run‑up and clear a line). Success criteria are defined for each task (e.g., speed deviation < 20 %, orientation error < 45°, hand‑target distance < 20 cm, jump distance > 1.5 m).
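The thresholds above translate directly into simple success predicates. The following sketch encodes them; the thresholds come from the text, while the function signatures and units (metres, degrees) are assumptions:

```python
import math

def direction_success(actual_speed, desired_speed):
    """Direction task: speed deviation below 20 % of the desired speed."""
    return abs(actual_speed - desired_speed) <= 0.2 * desired_speed

def steering_success(pelvis_yaw_deg, target_yaw_deg):
    """Steering task: pelvis orientation error below 45 degrees,
    with wrap-around handled via modular arithmetic."""
    err = abs((pelvis_yaw_deg - target_yaw_deg + 180.0) % 360.0 - 180.0)
    return err < 45.0

def reach_success(hand_pos, target_pos):
    """Reach task: hand-target distance below 20 cm (positions in metres)."""
    return math.dist(hand_pos, target_pos) < 0.20

def long_jump_success(jump_distance_m):
    """Long Jump task: cleared distance above 1.5 m."""
    return jump_distance_m > 1.5

print(direction_success(1.1, 1.0))                        # True (10 % deviation)
print(steering_success(350.0, 10.0))                      # True (wrapped error = 20 deg)
print(reach_success((0.0, 0.0, 1.0), (0.0, 0.1, 1.0)))    # True (10 cm)
print(long_jump_success(1.4))                             # False
```

Framing the criteria as predicates like these makes the success rates reported below directly comparable across methods.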

Results show that Task Tokens achieve comparable or higher scores than full fine‑tuning on most tasks while using only ~200 K parameters, versus 9.3 M (PULSE) and 25 M (MaskedMimic Fine‑Tune). Convergence occurs within ~50 M steps, far fewer than the ~300 M steps required by PULSE. The method also demonstrates superior robustness under out‑of‑distribution (OOD) perturbations such as altered gravity and ground friction, maintaining higher success rates even at extreme settings (gravity ×1.5, friction ×0.4).

Human studies reveal that participants select motions generated with Task Tokens as more “human‑like” than those from MaskedMimic (joint‑condition only) and MaskedMimic Fine‑Tune, though PULSE scores slightly higher on this metric.

Task Tokens are compatible with other prompting modalities. In the Direction task, adding a user‑defined prior (e.g., head height and orientation constraints) guides the policy toward upright forward walking instead of backward locomotion. In the Strike task, combining a directional prior with a textual goal (“kick the target”) yields a natural kicking behavior rather than a degenerate “whirlwind” motion.

The authors note that full model fine‑tuning often leads to catastrophic forgetting of multimodal prompting abilities, whereas freezing the base model preserves these capabilities.

In conclusion, Task Tokens provide an efficient and effective way to adapt behavior foundation models to new, unseen tasks, balancing parameter efficiency, performance, and robustness. Future work includes validating the approach on a broader range of GC‑BFM architectures, automating task‑specific reward and observation design, transferring the method to real‑world robots, and exploring richer task‑encoder architectures.

embodied AI · PPO · parameter efficiency · Behavior Foundation Models · Multi-Modal Prompting · Task Tokens
Written by

HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
