How JanusVLN Redefines Vision‑Language Navigation with Dual Implicit Memory

JanusVLN presents a groundbreaking Vision‑and‑Language Navigation framework that decouples semantic understanding from spatial geometry using dual implicit memory, eliminates explicit memory overhead, achieves state‑of‑the‑art performance with only RGB video input, and dramatically improves efficiency and generalization across VLN benchmarks.

Amap Tech

Introduction

Vision‑and‑Language Navigation (VLN) is a core embodied‑AI task that requires an agent to follow natural‑language instructions in complex real‑world environments. Recent multimodal large language models (MLLMs) have advanced VLN, but existing methods rely on explicit memory (textual topological maps, stored image histories), causing spatial information loss, computational redundancy, and memory bloat, while ignoring the 3D physical nature of RGB images.

JanusVLN Overview

Inspired by the human brain’s left‑hemisphere semantic processing and right‑hemisphere spatial cognition, we propose JanusVLN, a novel VLN framework with dual implicit memory. It decouples visual semantics from spatial geometry and models each as a compact, fixed‑size neural representation. Using only an RGB video stream, JanusVLN achieves strong 3D spatial reasoning and reduces computation via an efficient incremental update mechanism.

Key Challenges in Existing VLN Methods

Spatial information loss and redundancy: Text‑based maps cannot precisely express object spatial relations, leading to loss of visual and geometric cues.

Low computational and inference efficiency: Storing all past frames forces repeated processing of the entire history at each step.

Memory explosion: Explicit memory size grows linearly or exponentially with navigation length.

Core Contributions

Dual memory: Separate “left‑brain” semantic memory and “right‑brain” spatial memory for clearer navigation.

Implicit memory: Store only high‑level neural embeddings instead of raw high‑dimensional observations.

3D perception from RGB: Infer 3D geometry without depth sensors or LiDAR.

Lightweight design: Fixed‑size memory with dynamic incremental updates avoids memory explosion.

Method

1. Decoupled visual perception: A dual‑encoder architecture uses a Qwen2.5‑VL visual encoder for 2D semantic features and a pretrained 3D visual geometry model (VGGT) to extract spatial geometry from RGB video. The two features are fused via a lightweight MLP and combined with instruction embeddings for action prediction.
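The fusion step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, weight shapes, and the `fuse` function are illustrative stand-ins for the real Qwen2.5-VL and VGGT encoder outputs and the lightweight fusion MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real encoders are Qwen2.5-VL (2D semantics)
# and VGGT (3D spatial geometry).
D_SEM, D_GEO, D_MODEL = 8, 6, 10

# Per-frame features from the two encoders (random stand-ins here).
semantic_feat = rng.standard_normal(D_SEM)
geometry_feat = rng.standard_normal(D_GEO)

# Lightweight MLP fusion: concatenate the two streams, project, apply a
# nonlinearity, and project to the model dimension.
W1 = rng.standard_normal((D_SEM + D_GEO, 16))
W2 = rng.standard_normal((16, D_MODEL))

def fuse(sem, geo):
    x = np.concatenate([sem, geo])
    h = np.maximum(x @ W1, 0.0)   # ReLU
    return h @ W2                 # fused visual token, combined downstream
                                  # with instruction embeddings

fused = fuse(semantic_feat, geometry_feat)
print(fused.shape)  # (10,)
```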

2. Dual implicit neural memory: The key‑value caches of the attention modules of each encoder serve as compact implicit memories, maintained separately for semantics and geometry.
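Conceptually, the two memories are simply the per-encoder key-value caches, appended to once per frame so past frames never need re-encoding. A toy sketch (the structure and names are illustrative, not from the paper's code):

```python
# Dual implicit memory: one KV cache per encoder, maintained separately.
dual_memory = {
    "semantic": [],  # KV entries from the 2D semantic encoder
    "spatial":  [],  # KV entries from the 3D geometry encoder
}

def cache_frame(t, sem_kv, geo_kv):
    """Append the current frame's key-value pairs to each stream separately;
    only compact embeddings are stored, never raw observations."""
    dual_memory["semantic"].append((t, sem_kv))
    dual_memory["spatial"].append((t, geo_kv))

for t in range(3):
    cache_frame(t, f"sem_kv_{t}", f"geo_kv_{t}")

print(len(dual_memory["semantic"]), len(dual_memory["spatial"]))  # 3 3
```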

3. Hybrid incremental update: A sliding‑window FIFO cache stores the most recent n frames, while an initial window permanently retains the first few frames as global anchors, enabling fixed‑size memory and eliminating repeated computation.
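The update policy described above can be sketched as a small cache class. All names and window sizes here are illustrative assumptions, not the paper's code; the point is that memory size stays constant no matter how long the episode runs.

```python
from collections import deque

class HybridImplicitCache:
    """Sketch of the hybrid incremental update: the first n_init frames are
    kept permanently as global anchors, and a FIFO sliding window holds the
    most recent n_recent frames."""

    def __init__(self, n_init=2, n_recent=3):
        self.n_init = n_init
        self.initial = []                      # permanent anchor frames
        self.recent = deque(maxlen=n_recent)   # FIFO sliding window

    def update(self, frame_embedding):
        # New frames first fill the permanent initial window...
        if len(self.initial) < self.n_init:
            self.initial.append(frame_embedding)
        else:
            # ...then enter the FIFO window; the oldest entry is evicted.
            self.recent.append(frame_embedding)

    def memory(self):
        # What the encoder attends over: anchors plus recent frames.
        return self.initial + list(self.recent)

cache = HybridImplicitCache(n_init=2, n_recent=3)
for t in range(10):
    cache.update(f"kv_frame_{t}")

# Fixed size (n_init + n_recent = 5) after 10 frames:
print(cache.memory())
# ['kv_frame_0', 'kv_frame_1', 'kv_frame_7', 'kv_frame_8', 'kv_frame_9']
```

In JanusVLN one such cache would be maintained per encoder stream; each step only encodes the newest frame, eliminating repeated computation over the history.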

Experiments

Quantitative results: On the VLN‑CE benchmark, JanusVLN outperforms multimodal methods that use panoramic views, odometry, or depth maps, achieving a 10.5–35.5 % higher success rate (SR) with only monocular RGB. It also surpasses SOTA RGB‑only methods (e.g., NaVILA, StreamVLN) by 10.8 % and 3.6 % SR, respectively, while using less auxiliary training data.

On the challenging RxR‑CE dataset, JanusVLN improves SR by 3.3–30.7 % over previous approaches, demonstrating strong generalization.

Qualitative results: Visualizations show successful navigation in tasks requiring depth perception, 3D direction, and spatial relations, thanks to the spatial geometry memory.

Conclusion

JanusVLN introduces the first dual implicit neural memory for VLN, decoupling semantics and geometry and providing a fixed‑size, efficient memory that resolves the memory‑bloat and computation issues of prior methods. By leveraging MLLM and a 3D geometry foundation model, it achieves state‑of‑the‑art performance with only RGB input, paving the way for next‑generation embodied agents with advanced spatial cognition.

Tags: multimodal LLM · 3D spatial reasoning · Dual Implicit Memory · Vision-Language Navigation
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.