Tsinghua’s Spatial‑TTT Beats Gemini: Continuous Spatial Intelligence Wins ECCV 2026
Spatial‑TTT is a 2‑billion‑parameter open‑source multimodal model that maintains and refreshes a spatial memory while processing video streams of up to 120 minutes. It combines fast‑weight updates, a hybrid TTT architecture, a spatial‑predictive mechanism, and dense 3D scene supervision. It outperforms Gemini‑3‑pro and other closed‑source baselines on multiple spatial‑intelligence benchmarks while using over 40% less memory and compute.
Problem
In robotics, autonomous driving, and AR, spatial understanding requires remembering, linking, and continuously updating information from long video streams rather than relying on a single image snapshot.
Spatial‑TTT Overview
Spatial‑TTT treats model parameters as fast‑weight memory that is updated online while processing video chunks, allowing the model to accumulate 3‑D evidence over time.
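To make the fast-weight idea concrete, here is a minimal pure-Python sketch. It is not the paper's update rule: it assumes a linear fast weight, a squared-error objective, and one gradient step per chunk. The names `ttt_chunk_update`, `run_stream`, and the learning rate are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def chunk_loss(W, keys, values):
    """Mean squared reconstruction error of the fast weight on one chunk."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        total += sum((p - t) ** 2 for p, t in zip(pred, v))
    return total / len(keys)

def ttt_chunk_update(W, keys, values, lr=0.1):
    """One gradient step on the chunk-averaged loss 0.5 * ||W k - v||^2 (assumed objective)."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        for i in range(d_out):
            err = pred[i] - v[i]
            for j in range(d_in):
                grad[i][j] += err * k[j] / len(keys)
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)] for i in range(d_out)]

def run_stream(chunks, dim, lr=0.1):
    """Process chunks in order; the fast weight W is the only state carried forward."""
    W = [[0.0] * dim for _ in range(dim)]
    losses = []
    for keys, values in chunks:
        losses.append(chunk_loss(W, keys, values))  # loss before this chunk is written
        W = ttt_chunk_update(W, keys, values, lr)
    return W, losses

# Toy stream: every chunk is generated by the same hidden linear map W_true.
random.seed(0)
DIM = 4
W_true = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
chunks = []
for _ in range(200):
    keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
    chunks.append((keys, [matvec(W_true, k) for k in keys]))

W_fast, losses = run_stream(chunks, DIM)
print(f"loss before first chunk: {losses[0]:.3f}, before last chunk: {losses[-1]:.2e}")
```

The memory cost in this sketch is the size of `W_fast`, which does not grow with the number of chunks. That is the property that makes a fast-weight memory attractive for long streams.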
Design 1 – Hybrid TTT architecture
In the decoder, TTT layers and standard self‑attention layers are interleaved at a 3:1 ratio. The TTT layers, which make up 75 % of the decoder, write long‑range information into fast weights.
The remaining 25 % are anchor self‑attention layers that preserve the pretrained multimodal semantic abilities.
Large‑chunk updates together with sliding‑window attention keep GPU utilization high and retain local spatio‑temporal interactions.
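The layer arrangement and the local attention pattern can be sketched as follows. Several details are assumptions for illustration, not taken from the source: the 28-layer depth, the position of the anchor layer inside each group of four, and a causal attention window. The function names are hypothetical.

```python
def hybrid_layout(num_layers, ttt_per_anchor=3):
    """Return a layer-type list with `ttt_per_anchor` TTT layers per anchor attention layer.

    Assumption: the anchor layer closes each group of (ttt_per_anchor + 1) layers.
    """
    period = ttt_per_anchor + 1
    return ["attn" if (i + 1) % period == 0 else "ttt" for i in range(num_layers)]

def sliding_window_mask(seq_len, window):
    """Boolean mask where query q may attend key k only if 0 <= q - k < window.

    Assumption: the window is causal (a token sees itself and the previous window - 1 tokens).
    """
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]

layout = hybrid_layout(28)          # 28 layers is a hypothetical depth
ttt_share = layout.count("ttt") / len(layout)
mask = sliding_window_mask(5, 2)

print(layout[:8])
print(f"TTT share: {ttt_share:.0%}")
```

With a fixed window, attention cost per token stays constant as the stream grows. Anything older than the window has to be recalled through the fast weights.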
Design 2 – Spatial‑predictive mechanism
Standard TTT projects each visual token independently, ignoring local geometry and temporal continuity. Spatial‑TTT adds lightweight 3‑D spatio‑temporal convolutions to the Q/K/V projections, enabling the fast‑weight module to learn a mapping from spatio‑temporal context to spatio‑temporal context. This captures geometric correspondence, view changes, and continuity, stabilising online updates.
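As a rough illustration of mixing neighbouring tokens before they reach the fast-weight module, the sketch below applies a small 3-D filter over a (time, height, width) grid of already projected features. The kernel size, the zero padding, the symmetric (non-causal) temporal window, and the sharing of one kernel across channels are all simplifying assumptions. `conv3d_same` is a hypothetical helper, not the project's API.

```python
def conv3d_same(grid, kernel):
    """Apply a 3-D filter over grid[t][y][x] (each cell a list of C channel values).

    Zero padding keeps the output the same shape as the input. The same scalar
    kernel is applied to every channel (a simplification of a depthwise conv).
    """
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    C = len(grid[0][0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = [[[[0.0] * C for _ in range(W)] for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            tt = t + dt - kt // 2
                            yy = y + dy - kh // 2
                            xx = x + dx - kw // 2
                            if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                                w = kernel[dt][dy][dx]
                                for c in range(C):
                                    out[t][y][x][c] += w * grid[tt][yy][xx][c]
    return out

# Toy input: 3 frames of 3x3 tokens, 2 channels each, all ones.
ones = [[[[1.0, 1.0] for _ in range(3)] for _ in range(3)] for _ in range(3)]
# 3x3x3 averaging kernel: each output token becomes the mean of its neighbourhood.
box = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]
mixed = conv3d_same(ones, box)

print(f"centre token: {mixed[1][1][1][0]:.3f}, corner token: {mixed[0][0][0][0]:.3f}")
```

In this sketch each output token depends on its spatial and temporal neighbours, so keys and values passed to the fast-weight update would carry local context instead of a single isolated token.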
Design 3 – Dense 3D scene‑description supervision
Typical spatial‑intelligence datasets provide sparse Q&A supervision that covers only tiny parts of a scene. Spatial‑TTT builds a dense 3‑D scene‑description dataset that requires the model to generate a full scene walkthrough (global context, object categories, counts, spatial relations). Training proceeds in two stages:
Stage 1: Spatial‑aware progressive training on dense descriptions to form a global 3‑D awareness.
Stage 2: Millions of spatial VQA examples refine direction judgment, distance estimation, counting, room‑size estimation, and route planning.
Experimental Results
On VSI‑Bench, Spatial‑TTT‑2B achieves an average score of 64.4, surpassing all open‑source and closed‑source baselines, with especially strong performance on absolute distance, relative direction, route planning, and appearance‑order tasks.
On MindCube‑Tiny, Spatial‑TTT reaches 76.2 % accuracy, 12.3 percentage points higher than Gemini‑3‑pro (63.9 %) and 24.5 points above the open‑source MindCube‑3B (51.7 %).
For the long‑duration VSI‑SUPER counting tasks (10, 30, 60, 120 min videos), Spatial‑TTT scores 31.8, 45.6, 36.2, and 38.4 respectively, while many other models either collapse in performance or run out of memory.
Ablation studies show that removing the spatial‑predictive mechanism drops the VSI‑Bench average to 62.1, removing dense supervision drops it to 61.3, and discarding the hybrid architecture (using pure TTT) reduces it to 53.9, confirming that all three designs contribute synergistically.
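The size of each ablation effect follows directly from the reported VSI‑Bench averages. The short calculation below uses only the numbers quoted above; the dictionary labels are shorthand introduced here.

```python
full_model = 64.4  # VSI-Bench average of the full model, as reported above

ablations = {
    "no spatial-predictive mechanism": 62.1,
    "no dense supervision": 61.3,
    "pure TTT (no hybrid architecture)": 53.9,
}

# Drop in VSI-Bench average relative to the full model, rounded to one decimal.
drops = {name: round(full_model - score, 1) for name, score in ablations.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```

The drops are 2.3, 3.1, and 10.5 points respectively, so removing the hybrid architecture costs more than three times as much as either of the other two ablations.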
Efficiency Analysis
With a 1024‑frame input, Spatial‑TTT‑2B uses 11.9 GB peak GPU memory and 799.4 TFLOPs, compared with industry‑leading closed‑source models that require 21.2 GB and 1403.1 TFLOPs. That is a saving of roughly 44 % in memory and 43 % in compute. Models that rely on explicit geometric encoders (e.g., Spatial‑MLLM‑4B) cannot run at 512 or 1024 frames.
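The savings can be checked from the four reported figures. The helper name `saving` is introduced here for illustration.

```python
def saving(ours, baseline):
    """Fractional reduction of `ours` relative to `baseline`."""
    return 1.0 - ours / baseline

memory_saving = saving(11.9, 21.2)       # peak GPU memory in GB
compute_saving = saving(799.4, 1403.1)   # TFLOPs at 1024 frames

print(f"memory saving:  {memory_saving:.1%}")
print(f"compute saving: {compute_saving:.1%}")
```

Both ratios fall between 43 % and 44 %, which is consistent with the "over 40 %" claim.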
Conclusion
Spatial‑TTT demonstrates that a fast‑weight based dynamic memory can continuously integrate new observations into an internal spatial state, enabling persistent world‑state modeling for physical agents that must accumulate and refine knowledge over prolonged operation.
Paper: https://arxiv.org/pdf/2603.12255
Project page: https://liuff19.github.io/Spatial-TTT/
GitHub: https://github.com/THU-SI/Spatial-TTT/
Machine Learning Algorithms & Natural Language Processing
