How LoGeR Extends 3D Reconstruction to Thousands of Frames with Hybrid Memory
LoGeR, a new long‑context geometric reconstruction framework from DeepMind and UC Berkeley, uses a hybrid memory module combining test‑time‑training (TTT) and sliding‑window attention (SWA) to enable feed‑forward 3D reconstruction over sequences of up to tens of thousands of frames, achieving state‑of‑the‑art accuracy on KITTI, VBR, 7‑Scenes, ScanNetV2 and TUM‑Dynamics benchmarks.
Problem Statement
Existing feed‑forward dense 3D reconstruction networks operate on short context windows (tens to a few hundred frames) and are trained on limited short‑term data. This creates two bottlenecks: (1) the quadratic cost of bidirectional attention restricts the usable context length, and (2) there is a severe scarcity of long‑range training data, preventing reliable reconstruction of city‑scale or minute‑level video sequences.
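To make the cost argument concrete, here is a back‑of‑the‑envelope sketch; the frame and token counts are illustrative assumptions, not numbers from the paper:

```python
# Rough cost model: full bidirectional attention over all frames is quadratic in
# the total token count, while block-wise processing with a fixed block size is
# linear in the number of blocks. All numbers below are illustrative only.
def full_attention_cost(num_frames, tokens_per_frame):
    n = num_frames * tokens_per_frame
    return n * n                                        # pairwise token interactions

def blockwise_cost(num_frames, tokens_per_frame, block_size):
    tokens_per_block = block_size * tokens_per_frame
    num_blocks = -(-num_frames // block_size)           # ceil division
    return num_blocks * tokens_per_block ** 2           # dense attention inside each block only

frames, tpf, block = 10_000, 256, 128
print(full_attention_cost(frames, tpf) / blockwise_cost(frames, tpf, block))
# ~77x fewer interactions at 10k frames, and the gap keeps growing with length
```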
Method Overview
LoGeR processes video streams in overlapping blocks (e.g., 128‑frame segments) so that computation grows linearly with sequence length while preserving local geometric fidelity. The core is a hybrid memory module that combines:
Test‑Time‑Training (TTT) memory: a parametric fast‑weight mechanism that compresses global geometric cues (coarse shape, scene scale) across blocks, anchoring a global coordinate frame and mitigating scale drift.
Sliding‑Window Attention (SWA): a non‑parametric, lossless attention window that propagates high‑resolution features from the previous block to the current one, ensuring fine‑grained geometric alignment (a minimal code sketch of how the two mechanisms combine follows this list).
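A minimal PyTorch‑style sketch of how such a hybrid memory could be wired, assuming a simplified per‑block token interface; the class names, update rule, and hyperparameters are illustrative stand‑ins, not the LoGeR implementation:

```python
import torch
import torch.nn as nn

class TTTMemory(nn.Module):
    """Parametric fast-weight memory: a small matrix updated at test time from
    each processed block, standing in for the paper's TTT update rule."""
    def __init__(self, dim, lr=0.1):
        super().__init__()
        self.lr = lr
        self.register_buffer("fast_weight", torch.zeros(dim, dim))

    @torch.no_grad()
    def update(self, block_tokens):                    # block_tokens: (B, T, dim)
        # Compress the block into a summary vector and fold it into the weights.
        summary = block_tokens.mean(dim=(0, 1))        # (dim,)
        self.fast_weight.mul_(1 - self.lr).add_(self.lr * torch.outer(summary, summary))

    def forward(self, tokens):
        # Read out the stored global cues (coarse shape, scene scale).
        return tokens + tokens @ self.fast_weight

class SlidingWindowAttention(nn.Module):
    """Non-parametric carry-over: current-block queries attend to the previous
    block's tokens concatenated with the current block's tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_tokens, prev_tokens=None):
        ctx = cur_tokens if prev_tokens is None else torch.cat([prev_tokens, cur_tokens], dim=1)
        out, _ = self.attn(cur_tokens, ctx, ctx)
        return cur_tokens + out
```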
Block‑wise Processing
Each block undergoes dense bidirectional attention for high‑quality local inference. After a block is processed, the TTT layer updates its fast weights with compressed geometry and applies them to the next block. SWA layers attend to tokens from the current and previous blocks (C^{m‑1} ∪ C^{m}) at only four network depths, keeping memory and compute overhead low.
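Continuing the toy modules above, a hypothetical per‑block driver might look like the following; the loop structure and names are assumptions, and in the paper SWA sits at only four network depths rather than once per block:

```python
def process_stream(frame_tokens, local_encoder, ttt, swa, block_size=128):
    """frame_tokens: (B, T_total, dim); local_encoder: dense bidirectional
    attention applied within a single block (any callable)."""
    predictions, prev_block = [], None
    for start in range(0, frame_tokens.shape[1], block_size):
        block = frame_tokens[:, start:start + block_size]
        block = local_encoder(block)        # high-quality local inference within the block
        block = ttt(block)                  # read global geometry from the fast weights
        block = swa(block, prev_block)      # align against the previous block's tokens
        ttt.update(block)                   # fold this block's geometry into the fast weights
        predictions.append(block)
        prev_block = block.detach()         # only the most recent block is carried forward
    return predictions
```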
Alignment and Global Consistency
A pure feed‑forward alignment step re‑projects predictions into a globally consistent coordinate system after each block, eliminating drift without requiring loop‑closure detection.
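As a rough illustration of what such a re‑projection can look like, the sketch below estimates a closed‑form similarity transform (the standard Umeyama estimator) on points from frames shared between consecutive blocks and applies it to the whole block; this is a generic stand‑in, not LoGeR's actual alignment module:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity transform (s, R, t) mapping src points onto dst;
    a generic stand-in for the paper's alignment step."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                                   # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()     # scale factor
    t = mu_d - s * R @ mu_s
    return s, R, t

def reproject_block(block_points, overlap_local, overlap_global):
    # Estimate the transform on points seen by both blocks, then re-project the
    # whole block into the global coordinate system.
    s, R, t = umeyama_sim3(overlap_local, overlap_global)
    return s * block_points @ R.T + t
```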
Training Curriculum
To cope with limited long‑context data, a progressive curriculum is used:
Start with 48‑frame sequences split into 4 blocks.
Increase the block density to 12 blocks while keeping the sequence length fixed.
Scale up to a 128‑frame context (20 blocks), training on H200 GPUs.
During this schedule the model gradually shifts its reliance from SWA to TTT, learning to store and retrieve global geometry; a schematic of the schedule is sketched below.
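Only the sequence lengths and block counts below are quoted from the text; the staging structure, step budgets, and function names are illustrative assumptions:

```python
# Curriculum stages as described above; stage boundaries and step budgets are assumed.
CURRICULUM = [
    {"stage": 1, "seq_len": 48,  "num_blocks": 4},    # short sequences, coarse blocking
    {"stage": 2, "seq_len": 48,  "num_blocks": 12},   # same length, denser blocking
    {"stage": 3, "seq_len": 128, "num_blocks": 20},   # long context, trained on H200 GPUs
]

def stage_for_step(step, steps_per_stage=50_000):
    """Return the active curriculum stage for a given training step."""
    return CURRICULUM[min(step // steps_per_stage, len(CURRICULUM) - 1)]

# A training loop would call stage_for_step(step) each iteration and re-chunk
# its input sequences into the requested number of blocks before the forward pass.
```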
Experimental Results
KITTI: Absolute Trajectory Error (ATE) reduced by more than 74% compared with prior feed‑forward methods; LoGeR also outperforms the strongest optimization‑based baseline (VGGT‑Long) by 32.5%.
VBR (up to 19,000 frames): Maintains a consistent global scale, whereas baselines exhibit severe drift.
7‑Scenes (50–500 frames): Beats state‑of‑the‑art low‑complexity methods (Point3R, CUT3R, TTT3R, StreamVGGT, VGGT, π³) in reconstruction quality and pose accuracy.
ScanNetV2 and TUM‑Dynamics: Achieves lower camera pose error than all compared methods.
Qualitative visualizations show stable reconstruction over 20,000‑frame sequences, preserving global structure while baseline methods drift.
Paper and Resources
Paper title: LoGeR: Long‑Context Geometric Reconstruction with Hybrid Memory
arXiv PDF: https://arxiv.org/pdf/2603.03269
Project page: https://loger-project.github.io/