How LoGeR Enables Minute‑Long 3D Reconstruction with Hybrid Memory
This article presents LoGeR, a long‑context geometric reconstruction framework that combines a test‑time‑training memory with sliding‑window attention to achieve minute‑scale, fully feed‑forward 3D reconstruction, with superior accuracy on benchmarks such as KITTI and VBR.
Introduction
Long‑context memory is essential for large models that must preserve information across thousands of video frames in 3D reconstruction tasks. Conventional feed‑forward 3D reconstruction networks use short context windows, which limits their ability to model long‑range dependencies and leads to scale drift on city‑scale or minute‑long video sequences.
Architecture: LoGeR (Long‑Context Geometric Reconstruction)
LoGeR processes a video stream as a sequence of fixed‑size blocks. Within each block, bidirectional attention provides high‑fidelity intra‑block inference, while a hybrid memory module propagates information across blocks without requiring post‑hoc optimization.
Hybrid Memory Module
Test‑time‑training (TTT) memory: a parametric component that learns a compressed representation of geometric information and anchors a global coordinate frame. The TTT memory stores a set of fast weights W that are updated after each block, mitigating scale drift and providing long‑range, lossy compression.
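The fast‑weight update can be sketched as follows. This is a minimal illustration assuming a linear memory trained with a token‑reconstruction objective; the paper's actual loss, parameterization, and learning rate may differ, and all names here are illustrative:

```python
import numpy as np

def ttt_update(W, block_tokens, lr=0.1):
    """One test-time-training step on the fast weights W.

    The memory is modeled as a linear map asked to reconstruct the
    block's token features; W takes one gradient step per block, so
    the memory footprint stays fixed regardless of sequence length.
    """
    X = block_tokens
    err = X @ W - X            # reconstruction residual
    grad = X.T @ err / len(X)  # gradient of 0.5 * ||X W - X||^2 / n
    return W - lr * grad

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))           # fixed-size memory state
for _ in range(5):             # five incoming blocks of 16 tokens each
    block = rng.normal(size=(16, d))
    W = ttt_update(W, block)
```

After a few blocks, W has absorbed enough statistics to partially reconstruct unseen tokens, which is the sense in which it compresses long‑range context.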
Sliding‑window attention (SWA): a non‑parametric component that copies uncompressed token features from the previous block into the current one (tokens C^{m‑1} ∪ C^{m}). SWA is inserted sparsely (four layers) and operates only on adjacent blocks, delivering lossless short‑range context and fine‑grained geometric alignment.
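A minimal single‑head sketch of one SWA layer, assuming plain dot‑product attention with keys and values drawn from the concatenated adjacent blocks (the real layers are multi‑head and sit inside a transformer backbone):

```python
import numpy as np

def swa_layer(curr, prev):
    """Single-head sliding-window attention (illustrative simplification).

    Queries come from the current block; keys/values are the previous
    and current blocks concatenated (C^{m-1} U C^{m}), so each token
    attends over uncompressed features from exactly one block back.
    """
    kv = curr if prev is None else np.concatenate([prev, curr], axis=0)
    scores = curr @ kv.T / np.sqrt(curr.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv

rng = np.random.default_rng(1)
prev = rng.normal(size=(6, 4))  # tokens of block m-1
curr = rng.normal(size=(6, 4))  # tokens of block m
out = swa_layer(curr, prev)     # one output row per current-block token
```

Because the key/value set never grows beyond two blocks, each SWA layer's cost is constant per block rather than quadratic in total sequence length.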
The combination yields linear‑time, fixed‑size memory propagation that scales to thousands of frames while preserving dense local geometry.
Block‑wise Processing
The input video is divided into n blocks (e.g., 4–20). Each block is processed independently by the backbone network with bidirectional attention, and the hybrid memory then passes information to the next block. This design bounds computational cost (quadratic attention is confined to a single block) and ensures that the training distribution (short blocks) matches the inference distribution.
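The overall loop can be sketched like this. Every name and operation here is a hypothetical stand‑in (the memory read, SWA mixing, and backbone are heavily simplified); the point is the control flow: only two fixed‑size pieces of state cross block boundaries, so cost grows linearly with the number of blocks:

```python
import numpy as np

def process_stream(tokens, block_size, lr=0.1):
    """Linear-time block-wise loop with hybrid memory (illustrative)."""
    n, d = tokens.shape
    W = np.zeros((d, d))      # TTT fast weights (fixed size)
    prev = None               # SWA cache: raw tokens of the previous block
    outputs = []
    for start in range(0, n, block_size):
        block = tokens[start:start + block_size]
        ctx = block + block @ W              # read long-range TTT memory
        if prev is not None:
            ctx = ctx + prev.mean(axis=0)    # crude stand-in for SWA mixing
        outputs.append(ctx)
        # update the fixed-size memory after processing the block
        W -= lr * block.T @ (block @ W - block) / len(block)
        prev = block
    return np.concatenate(outputs), W

tokens = np.random.default_rng(2).normal(size=(48, 8))  # 48 frame tokens
out, W = process_stream(tokens, block_size=12)          # 4 blocks of 12
```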
Training Curriculum
LoGeR is trained with a progressive curriculum that gradually increases context length and block density:
Stage 1: 48‑frame sequences split into 4 blocks.
Stage 2: Increase block density to 12 blocks while keeping the total sequence length at 48 frames.
Stage 3: Expand the context to 128 frames and up to 20 blocks using H200 GPUs.
This curriculum forces the model to first rely on SWA for local consistency and later shift to the TTT memory for global alignment.
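The three stages above can be captured in a small configuration table. This is a hypothetical config sketch, not the authors' training script; block sizes are derived by ceiling division since 128 frames do not divide evenly into 20 blocks:

```python
# Hypothetical configuration mirroring the three curriculum stages.
CURRICULUM = [
    {"stage": 1, "frames": 48,  "blocks": 4},   # coarse blocks, SWA dominates
    {"stage": 2, "frames": 48,  "blocks": 12},  # denser blocks, same length
    {"stage": 3, "frames": 128, "blocks": 20},  # long context, TTT dominates
]

for cfg in CURRICULUM:
    # ceil division: the last block may be shorter than the others
    cfg["block_size"] = -(-cfg["frames"] // cfg["blocks"])
```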
Evaluation
KITTI benchmark (open‑loop): LoGeR trained on 128‑frame sequences reduces absolute trajectory error (ATE) by more than 74% compared with prior feed‑forward methods and outperforms the strongest optimization‑based baseline (VGGT‑Long) by 32.5%.
VBR dataset (up to 19,000 frames): LoGeR produces stable, globally consistent reconstructions where baseline methods exhibit severe scale drift. The TTT module naturally anchors the global scale, enabling accurate reconstruction over ultra‑long sequences.
Short‑sequence datasets (7‑Scenes, ScanNetV2, TUM‑Dynamics; up to ~1,000 frames): LoGeR consistently surpasses state‑of‑the‑art methods such as Point3R, CUT3R, TTT3R, StreamVGGT, and bidirectional‑attention baselines in both 3D point‑cloud quality and camera pose accuracy.
Quantitative results are reported in the original paper's tables and figures, showing average performance gains of 30%–35% over the best existing baselines across all benchmarks.
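For readers unfamiliar with the headline metric, ATE reduces to a position RMSE between estimated and ground‑truth camera trajectories. A minimal sketch follows; benchmark protocols typically align the estimated trajectory to the ground truth first (e.g. Umeyama similarity alignment), a step omitted here for brevity:

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error as position RMSE (no alignment step)."""
    residuals = est_xyz - gt_xyz                        # (n, 3) offsets
    return float(np.sqrt(np.mean(np.sum(residuals**2, axis=1))))

gt = np.stack([np.linspace(0.0, 10.0, 50),
               np.zeros(50), np.zeros(50)], axis=1)     # straight-line path
est = gt + 0.05                                         # 5 cm bias per axis
ate = ate_rmse(est, gt)                                 # = 0.05 * sqrt(3)
```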
References
Paper: LoGeR: Long‑Context Geometric Reconstruction with Hybrid Memory
arXiv preprint: https://arxiv.org/pdf/2603.03269
Project website: https://loger-project.github.io/
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
