Tsinghua’s Spatial‑TTT Beats Gemini: Continuous Spatial Intelligence Wins ECCV 2026

Spatial‑TTT is a 2‑billion‑parameter open‑source multimodal model that maintains and refreshes a spatial memory while processing video streams of up to 120 minutes. It combines fast‑weight updates, a hybrid TTT architecture, a spatial‑predictive mechanism, and dense 3D scene supervision, and it outperforms Gemini‑3‑pro and other closed‑source baselines on multiple spatial‑intelligence benchmarks with over 40% lower memory and compute cost.

Machine Learning Algorithms & Natural Language Processing

Jun 22, 2026

Problem

In robotics, autonomous driving, and AR, spatial understanding requires remembering, linking, and continuously updating information from long video streams rather than relying on a single image snapshot.

Spatial‑TTT Overview

Spatial‑TTT treats model parameters as fast‑weight memory that is updated online while processing video chunks, allowing the model to accumulate 3‑D evidence over time.
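The fast‑weight idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single linear fast‑weight matrix `W`, one gradient step per chunk on a squared reconstruction loss, and a read‑before‑write order. The function names, the loss, and the learning rate are all hypothetical.

```python
import numpy as np

def ttt_chunk_update(W, K, V, lr=0.1):
    """One gradient step on the loss 0.5 * ||K @ W - V||^2 / n (assumed loss)."""
    n = K.shape[0]
    err = K @ W - V          # (n, d) prediction error for this chunk
    grad = K.T @ err / n     # (d, d) gradient with respect to W
    return W - lr * grad

def process_stream(chunks, d, lr=0.1):
    """Read from the fast weights, then write the current chunk into them."""
    W = np.zeros((d, d))     # fast-weight memory starts empty
    outputs = []
    for Q, K, V in chunks:
        outputs.append(Q @ W)              # read: uses evidence from earlier chunks
        W = ttt_chunk_update(W, K, V, lr)  # write: fold this chunk into memory
    return W, outputs
```

Because `W` has a fixed size, the memory cost of this sketch does not grow with the number of chunks, which is the property that makes fast weights attractive for long streams.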

Design 1 – Hybrid TTT architecture

In the decoder, TTT layers and standard self‑attention layers are interleaved at a 3:1 ratio, so 75 % of the layers are TTT layers, which write long‑range information into fast weights.

The remaining 25 % are anchor self‑attention layers that preserve the pretrained multimodal semantic abilities.
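A small sketch of what a 3:1 interleaving could look like as a layer schedule. The article gives only the ratio, so the position of the anchor layer within each group of four, the helper name `hybrid_layer_pattern`, and the 8‑layer example depth are assumptions.

```python
def hybrid_layer_pattern(num_layers, ttt_per_anchor=3):
    """Label each decoder layer as 'ttt' or 'attn'.

    Assumption: the anchor self-attention layer closes each group,
    i.e. every (ttt_per_anchor + 1)-th layer is 'attn'.
    """
    period = ttt_per_anchor + 1
    return ["attn" if (i + 1) % period == 0 else "ttt" for i in range(num_layers)]

pattern = hybrid_layer_pattern(8)
ttt_fraction = pattern.count("ttt") / len(pattern)
```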

Large‑chunk updates together with sliding‑window attention keep GPU utilization high and retain local spatio‑temporal interactions.
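Sliding‑window attention restricts each token to a fixed number of recent tokens. The sketch below builds such a mask, assuming a causal window (a token sees itself and the `window - 1` tokens before it); the article does not specify the window size or whether the window is causal.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """mask[i, j] is True when query i may attend to key j.

    Assumption: causal attention, limited to the last `window` positions.
    """
    i = np.arange(seq_len)[:, None]   # query positions as a column
    j = np.arange(seq_len)[None, :]   # key positions as a row
    return (j <= i) & (i - j < window)
```

With this mask, attention cost grows linearly with sequence length for a fixed window, while anything older than the window has to be carried by the fast weights.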

Design 2 – Spatial‑predictive mechanism

Standard TTT projects each visual token independently, ignoring local geometry and temporal continuity. Spatial‑TTT adds lightweight 3‑D spatio‑temporal convolutions to the Q/K/V projections, enabling the fast‑weight module to learn a mapping from spatio‑temporal context to spatio‑temporal context. This captures geometric correspondence, view changes, and continuity, stabilising online updates.
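To make the contrast with per‑token projection concrete, here is a hedged NumPy sketch in which each of Q, K, and V is a linear projection followed by a depthwise 3‑D convolution over a (time, height, width) token grid. The ordering (projection first, then convolution), the depthwise form, the zero padding, and all names are assumptions; the article only states that lightweight 3‑D spatio‑temporal convolutions are added to the Q/K/V projections.

```python
import numpy as np

def depthwise_conv3d_same(x, kernel):
    """Zero-padded 'same' depthwise 3-D convolution (cross-correlation form).

    x:      (T, H, W, C) grid of visual tokens
    kernel: (kt, kh, kw, C) with odd kt, kh, kw; one filter per channel
    """
    kt, kh, kw, _ = kernel.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw), (0, 0)))
    T, H, W, _ = x.shape
    out = np.zeros_like(x, dtype=float)
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                # Shifted view of the padded grid, weighted per channel.
                out += xp[dt:dt + T, dh:dh + H, dw:dw + W, :] * kernel[dt, dh, dw]
    return out

def spatial_qkv(x, proj, kernels):
    """Project tokens, then mix each token with its spatio-temporal neighbours."""
    return {
        name: depthwise_conv3d_same(x @ proj[name], kernels[name])
        for name in ("q", "k", "v")
    }
```

With a kernel that is 1 at its centre and 0 elsewhere, this reduces to the ordinary per‑token projection, so the convolution can be read as a strict generalisation of the standard case.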

Design 3 – Dense 3D scene‑description supervision

Typical spatial‑intelligence datasets provide sparse Q&A supervision that covers only tiny parts of a scene. Spatial‑TTT builds a dense 3‑D scene‑description dataset that requires the model to generate a full scene walkthrough (global context, object categories, counts, spatial relations). Training proceeds in two stages:

Stage 1: spatial‑aware progressive training on dense scene descriptions builds global 3‑D awareness.

Stage 2: training on millions of spatial VQA examples refines direction judgment, distance estimation, counting, room‑size estimation, and route planning.

Experimental Results

On VSI‑Bench, Spatial‑TTT‑2B achieves an average score of 64.4, surpassing all open‑source and closed‑source baselines, with especially strong performance on absolute distance, relative direction, route planning, and appearance‑order tasks.

On MindCube‑Tiny, Spatial‑TTT reaches 76.2 % accuracy, 12.3 percentage points higher than Gemini‑3‑pro (63.9 %) and 24.5 points above the open‑source MindCube‑3B (51.7 %).
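The MindCube‑Tiny margins follow directly from the three accuracies reported here; a quick check, using only those figures:

```python
# MindCube-Tiny accuracies (%) as reported in this article.
scores = {"Spatial-TTT": 76.2, "Gemini-3-pro": 63.9, "MindCube-3B": 51.7}

# Margin of Spatial-TTT over each baseline, in percentage points.
margins = {
    name: round(scores["Spatial-TTT"] - acc, 1)
    for name, acc in scores.items()
    if name != "Spatial-TTT"
}
```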

For the long‑duration VSI‑SUPER counting tasks (10, 30, 60, 120 min videos), Spatial‑TTT scores 31.8, 45.6, 36.2, and 38.4 respectively, while many other models either collapse in performance or run out of memory.

Ablation studies show that removing the spatial‑predictive mechanism drops the VSI‑Bench average to 62.1, removing dense supervision drops it to 61.3, and discarding the hybrid architecture (using pure TTT) reduces it to 53.9, confirming that all three designs contribute synergistically.
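Expressed as drops from the full model's 64.4 average, the ablation numbers can be compared side by side. The snippet uses only the scores reported in this article; the dictionary keys are informal labels, not names from the paper.

```python
full_model = 64.4  # VSI-Bench average of the full Spatial-TTT-2B

# VSI-Bench averages after removing one design at a time.
ablations = {
    "no spatial-predictive mechanism": 62.1,
    "no dense supervision": 61.3,
    "pure TTT (no hybrid architecture)": 53.9,
}

drops = {name: round(full_model - score, 1) for name, score in ablations.items()}
largest = max(drops, key=drops.get)
```

The hybrid architecture accounts for by far the largest single drop (10.5 points), against 2.3 and 3.1 points for the other two designs.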

Efficiency Analysis

With a 1024‑frame input, Spatial‑TTT‑2B uses 11.9 GB peak GPU memory and 799.4 TFLOPs, compared with 21.2 GB and 1403.1 TFLOPs for industry‑leading closed‑source models, a saving of over 40 % in both memory and compute. Models that rely on explicit geometric encoders (e.g., Spatial‑MLLM‑4B) cannot run at 512 or 1024 frames.
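The "over 40 %" figure can be verified from the four numbers reported for the 1024‑frame setting:

```python
def saving_pct(ours, baseline):
    """Relative saving of `ours` versus `baseline`, in percent."""
    return round((1 - ours / baseline) * 100, 1)

memory_saving = saving_pct(11.9, 21.2)       # peak GPU memory, GB
compute_saving = saving_pct(799.4, 1403.1)   # compute, TFLOPs
```

This gives roughly 43.9 % less memory and 43.0 % less compute.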

Conclusion

Spatial‑TTT demonstrates that a fast‑weight based dynamic memory can continuously integrate new observations into an internal spatial state, enabling persistent world‑state modeling for physical agents that must accumulate and refine knowledge over prolonged operation.

Paper: https://arxiv.org/pdf/2603.12255

Project page: https://liuff19.github.io/Spatial-TTT/

GitHub: https://github.com/THU-SI/Spatial-TTT/

