How OmniNav Unifies Multi‑Task Embodied Navigation with a Fast‑Slow Dual System

OmniNav introduces a unified framework for embodied navigation that handles instruction-goal, object-goal, point-goal, and frontier-based exploration within a single architecture, pairing a fast vision-language-driven waypoint policy with a slow memory-augmented planner to achieve state-of-the-art benchmark results and 5 Hz real-world deployment.


Overview

Embodied navigation is a fundamental challenge for intelligent robots, requiring visual scene understanding, natural-language instruction following, and autonomous exploration. Existing models struggle to provide a unified solution across these heterogeneous navigation paradigms, leading to low success rates and limited generalization. OmniNav is a unified framework that handles instruction-goal, object-goal, point-goal, and frontier-based exploration within a single architecture.
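
The article does not spell out how the four goal types share one model, but conceptually each paradigm reduces to a single goal specification consumed by one policy. The sketch below is purely illustrative; GoalType, NavTask, and their fields are hypothetical names, not OmniNav's actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

# Hypothetical illustration: every name below is an assumption, not OmniNav's API.
class GoalType(Enum):
    INSTRUCTION = auto()  # free-form language instruction (e.g. R2R-CE, RxR-CE)
    OBJECT = auto()       # open-vocabulary object category (e.g. OVON)
    POINT = auto()        # target coordinate in the agent's frame
    FRONTIER = auto()     # explore toward a chosen frontier region

@dataclass
class NavTask:
    goal_type: GoalType
    instruction: Optional[str] = None                # INSTRUCTION goals
    object_name: Optional[str] = None                # OBJECT goals
    target_xy: Optional[Tuple[float, float]] = None  # POINT / FRONTIER goals

# All four paradigms become one prompt-like specification for a single policy:
tasks = [
    NavTask(GoalType.INSTRUCTION, instruction="go past the sofa and stop at the door"),
    NavTask(GoalType.OBJECT, object_name="potted plant"),
    NavTask(GoalType.POINT, target_xy=(3.2, -1.5)),
]
```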

Fast System

The fast system generates high‑precision continuous waypoints (position and orientation) directly from short‑term visual context and sub‑task cues, supporting up to 5 Hz low‑latency closed‑loop control. It employs a flow‑matching policy that produces smooth trajectories, avoiding the error accumulation of discrete action chunks.
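
Flow matching trains a velocity field that transports Gaussian noise toward the target waypoint distribution; at inference, the policy integrates that field for a few steps to emit one continuous waypoint. The following minimal sketch assumes a PyTorch-style network; velocity_net, its signature, and the 3-dimensional (x, y, yaw) waypoint layout are assumptions for illustration, not the published implementation:

```python
import torch

@torch.no_grad()
def sample_waypoint(velocity_net, context, steps: int = 8):
    """Sample a continuous (x, y, yaw) waypoint via flow matching.

    velocity_net(x_t, t, context) -> velocity is a hypothetical network;
    OmniNav's actual architecture and conditioning are not published in
    this summary. We integrate dx/dt = v_theta(x_t, t | context) from
    t=0 (noise) to t=1 (data) with simple Euler steps.
    """
    x = torch.randn(1, 3)                  # start from Gaussian noise over (x, y, yaw)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)     # current integration time
        x = x + dt * velocity_net(x, t, context)  # Euler integration step
    return x.squeeze(0)                    # normalized waypoint; un-normalize downstream
```

Because each call yields a single smooth waypoint rather than a chunk of discrete actions, errors do not compound across a chunk, and a handful of integration steps keeps inference fast enough for 5 Hz closed-loop control.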

Slow System

The slow system provides hierarchical, deliberative planning. It maintains a long-term visual memory and a set of frontier candidates, and uses a vision-language model (VLM) with chain-of-thought reasoning to decompose complex goals and select the next sub-goal or sub-task. When the target appears in memory or in the current view, the planner dispatches its location to the fast system; otherwise it chooses frontier locations semantically related to the goal, enabling active exploration.
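
That decision rule can be summarized in a few lines. This is a hedged sketch assuming the memory lookup, in-view detection, and VLM relevance scoring are available as callables; all names here are illustrative, not OmniNav's code:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Illustrative sketch only; names and scoring are assumptions, not OmniNav's code.
@dataclass
class Frontier:
    location: Tuple[float, float]
    description: str  # short caption of what is visible near this frontier

def plan_next_subgoal(
    goal: str,
    memory_lookup: Callable[[str], Optional[Tuple[float, float]]],
    detect_in_view: Callable[[str], Optional[Tuple[float, float]]],
    frontiers: List[Frontier],
    relevance: Callable[[str, str], float],  # VLM-scored goal/frontier affinity
) -> Tuple[float, float]:
    """Return the next sub-goal location to hand to the fast system."""
    # 1. If the target is already in long-term memory or the current view,
    #    dispatch its location directly.
    hit = memory_lookup(goal) or detect_in_view(goal)
    if hit is not None:
        return hit
    # 2. Otherwise explore: pick the frontier whose surroundings the VLM
    #    judges most semantically related to the goal (chain-of-thought
    #    reasoning collapses here to a scalar relevance score).
    best = max(frontiers, key=lambda f: relevance(goal, f.description))
    return best.location
```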

Data and Training

OmniNav is trained on a large multimodal dataset that combines navigation tasks, embodied question answering, general vision-language data, and referring/grounding datasets. Training follows a two-stage joint scheme: first, an autoregressive stage learns discrete tokens for language-vision-action grounding; second, a lightweight flow-matching stage learns continuous waypoint prediction while retaining roughly 20% discrete data to preserve the VLM's capabilities. Continuous waypoint coordinates are min-max normalized for stable training.
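
As a concrete illustration of the two details above, the sketch below shows min-max waypoint normalization and an 80/20 continuous/discrete data mix. The coordinate bounds and the per-sample mixing scheme are assumptions; the summary gives only the ~20% ratio:

```python
import random
import numpy as np

# Assumed (x, y, yaw) bounds; placeholders, not OmniNav's published ranges.
WAYPOINT_MIN = np.array([-2.0, -2.0, -np.pi])
WAYPOINT_MAX = np.array([ 2.0,  2.0,  np.pi])

def normalize_waypoint(w: np.ndarray) -> np.ndarray:
    """Map raw waypoints into [0, 1] per dimension for stable flow-matching training."""
    return (w - WAYPOINT_MIN) / (WAYPOINT_MAX - WAYPOINT_MIN)

def denormalize_waypoint(w_norm: np.ndarray) -> np.ndarray:
    """Invert the normalization before handing waypoints to the controller."""
    return w_norm * (WAYPOINT_MAX - WAYPOINT_MIN) + WAYPOINT_MIN

def sample_stage2_example(continuous_pool, discrete_pool, discrete_frac=0.2):
    """Stage two keeps ~20% discrete VLM data to preserve language capabilities."""
    pool = discrete_pool if random.random() < discrete_frac else continuous_pool
    return random.choice(pool)
```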

Results

Extensive experiments show that OmniNav achieves state-of-the-art performance across multiple navigation benchmarks. On unseen validation splits, it improves success rates over the previous best methods by 4.4% on R2R-CE, 4.3% on RxR-CE, and 18.4% on OVON. Real-world robot deployments confirm reliable 5 Hz inference and high-frequency closed-loop control.

Conclusion

OmniNav’s fast‑slow dual‑system architecture, continuous waypoint control, unified multimodal interface, and joint discrete‑continuous training together enable robust open‑vocabulary generalization, long‑term planning, and real‑time precise control for embodied navigation.

Tags: Vision Language Model, continuous control, embodied navigation, fast-slow architecture, Multimodal Training, state-of-the-art performance
Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
