How JanusVLN Redefines Vision‑Language Navigation with Dual Implicit Memory
JanusVLN is a Vision‑and‑Language Navigation framework that decouples semantic understanding from spatial geometry through a dual implicit memory. It eliminates the overhead of explicit memory, achieves state‑of‑the‑art performance from RGB video alone, and markedly improves efficiency and generalization across VLN benchmarks.
Introduction
Vision‑and‑Language Navigation (VLN) is a core embodied‑AI task in which an agent must follow natural‑language instructions through complex real‑world environments. Recent multimodal large language models (MLLMs) have advanced VLN, but existing methods rely on explicit memory (textual topological maps, stored image histories), which causes spatial information loss, computational redundancy, and memory bloat, while overlooking the 3D physical world that RGB images depict.
JanusVLN Overview
Inspired by the human brain’s left‑hemisphere semantic processing and right‑hemisphere spatial cognition, we propose JanusVLN, a novel VLN framework with dual implicit memory. It decouples visual semantics from spatial geometry and models each as a compact, fixed‑size neural representation. Using only an RGB video stream, JanusVLN achieves strong 3D spatial reasoning and reduces computation via an efficient incremental update mechanism.
Key Challenges in Existing VLN Methods
Spatial information loss and redundancy: Text‑based maps cannot precisely express object spatial relations, leading to loss of visual and geometric cues.
Low computational and inference efficiency: Storing all past frames forces repeated processing of the entire history at each step.
Memory explosion: Explicit memory size grows linearly or exponentially with navigation length.
Core Contributions
Dual memory: Separate “left‑brain” semantic memory and “right‑brain” spatial memory for clearer navigation.
Implicit memory: Store only high‑level neural embeddings instead of raw high‑dimensional observations.
3D perception from RGB: Infer 3D geometry without depth sensors or LiDAR.
Lightweight design: Fixed‑size memory with dynamic incremental updates avoids memory explosion.
Method
1. Decoupled visual perception: A dual‑encoder architecture uses the Qwen2.5‑VL visual encoder for 2D semantic features and a pretrained 3D visual geometry model (VGGT) to extract spatial geometry directly from RGB video. The two feature streams are fused by a lightweight MLP and combined with instruction embeddings for action prediction (see the fusion sketch after this list).
2. Dual implicit neural memory: Each encoder's attention key‑value cache serves as a compact implicit memory, maintained separately for semantics and geometry.
3. Hybrid incremental update: A sliding‑window FIFO cache stores the most recent n frames, while an initial window permanently retains the first few frames as global anchors. This keeps memory at a fixed size and eliminates repeated computation over the full history (a minimal cache sketch follows this list).
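To make step 1 concrete, here is a minimal PyTorch sketch of what the decoupled fusion could look like. The module name, feature dimensions, and MLP depth are illustrative assumptions, not the released JanusVLN implementation; the paper specifies only that a lightweight MLP fuses the two feature streams.

```python
# Illustrative sketch, not the official JanusVLN code: fuse 2D semantic
# features with 3D geometric features via a lightweight MLP (step 1).
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Hypothetical fusion head; all dimensions are assumed, not from the paper."""
    def __init__(self, sem_dim: int = 1280, geo_dim: int = 1024, out_dim: int = 1280):
        super().__init__()
        # Lightweight MLP projecting concatenated semantic + geometric
        # features into a single embedding space for the LLM backbone.
        self.fuse = nn.Sequential(
            nn.Linear(sem_dim + geo_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem_feats: torch.Tensor, geo_feats: torch.Tensor) -> torch.Tensor:
        # sem_feats: (B, N, sem_dim) from the 2D semantic encoder (Qwen2.5-VL ViT)
        # geo_feats: (B, N, geo_dim) from the 3D geometry encoder (VGGT)
        return self.fuse(torch.cat([sem_feats, geo_feats], dim=-1))
```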
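And a similarly hedged sketch of the hybrid incremental update from steps 2–3, treating each encoder's attention key‑value cache as the implicit memory. The class name, window sizes, and plain‑Python containers are assumptions for illustration; in the real system the entries would be per‑layer attention key/value tensors.

```python
# Illustrative sketch, not the official JanusVLN code: a fixed-size implicit
# memory combining permanent initial anchors with a FIFO sliding window
# (steps 2-3). Cached entries are opaque objects here.
from collections import deque

class HybridKVCache:
    def __init__(self, n_init: int = 4, n_recent: int = 12):
        self.init_kv = []                        # first frames kept forever as global anchors
        self.recent_kv = deque(maxlen=n_recent)  # sliding window over recent frames
        self.n_init = n_init

    def update(self, frame_kv) -> None:
        """Ingest one new frame's cached keys/values; O(1) per navigation step."""
        if len(self.init_kv) < self.n_init:
            self.init_kv.append(frame_kv)        # fill the permanent initial window first
        else:
            self.recent_kv.append(frame_kv)      # deque evicts the oldest (FIFO)

    def memory(self) -> list:
        # Bounded attention context: anchors + recent window, independent
        # of how long the navigation trajectory has run.
        return self.init_kv + list(self.recent_kv)
```

Because eviction happens on append, per‑step maintenance cost and memory footprint stay constant no matter how long the episode runs, which is what removes the memory‑explosion failure mode of explicit histories.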
Experiments
Quantitative results: On the VLN‑CE benchmark, JanusVLN outperforms multimodal methods that use panoramic views, odometry, or depth maps, achieving a 10.5–35.5% higher success rate (SR) with only monocular RGB. It also surpasses SOTA RGB‑only methods (e.g., NaVILA, StreamVLN) by 10.8% and 3.6% SR, respectively, while using less auxiliary training data.
On the challenging RxR‑CE dataset, JanusVLN improves SR by 3.3–30.7% over previous approaches, demonstrating strong generalization.
Qualitative results: Visualizations show successful navigation in tasks requiring depth perception, 3D direction, and spatial relations, thanks to the spatial geometry memory.
Conclusion
JanusVLN introduces the first dual implicit neural memory for VLN, decoupling semantics from geometry and providing a fixed‑size, efficient memory that resolves the memory‑bloat and computation issues of prior methods. By leveraging an MLLM and a 3D geometry foundation model, it achieves state‑of‑the‑art performance with only RGB input, paving the way for next‑generation embodied agents with advanced spatial cognition.
