How LoGeR Enables Minute‑Long 3D Reconstruction with Hybrid Memory
This article presents LoGeR, a long‑context geometric reconstruction framework that combines a test‑time‑training memory with sliding‑window attention to achieve minute‑scale, fully feed‑forward 3D reconstruction, with superior accuracy on benchmarks such as KITTI and VBR.
Introduction
Long‑context memory is essential for large models that must preserve information across thousands of video frames in 3D reconstruction tasks. Conventional feed‑forward 3D reconstruction networks use short context windows, which limits their ability to model long‑range dependencies and leads to scale drift on city‑scale or minute‑long video sequences.
Architecture: LoGeR (Long‑Context Geometric Reconstruction)
LoGeR processes a video stream as a sequence of fixed‑size blocks. Within each block, bidirectional attention provides high‑fidelity intra‑block inference, while a hybrid memory module propagates information across blocks without requiring post‑hoc optimization.
Hybrid Memory Module
Test‑time‑training (TTT) memory: a parametric component that learns a compressed representation of geometric information and anchors a global coordinate frame. The TTT memory stores a set of fast weights W that are updated after each block, mitigating scale drift and providing long‑range, lossy compression.
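The fast‑weight update can be sketched as follows. This is a minimal illustration assuming a linear memory trained with a token‑reconstruction objective; the paper's actual loss, parameterization, and learning rate may differ, and all names here are illustrative:

```python
import numpy as np

def ttt_update(W, block_tokens, lr=0.1):
    """One test-time-training step on the fast weights W.

    The memory is modeled as a linear map asked to reconstruct the
    block's token features; W takes one gradient step per block, so
    the memory footprint stays fixed regardless of sequence length.
    """
    X = block_tokens
    err = X @ W - X            # reconstruction residual
    grad = X.T @ err / len(X)  # gradient of 0.5 * ||X W - X||^2 / n
    return W - lr * grad

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))           # fixed-size memory state
for _ in range(5):             # five incoming blocks of 16 tokens each
    block = rng.normal(size=(16, d))
    W = ttt_update(W, block)
```

After a few blocks, W has absorbed enough statistics to partially reconstruct unseen tokens, which is the sense in which it compresses long‑range context.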
Sliding‑window attention (SWA): a non‑parametric component that copies uncompressed token features from the previous block into the current one (tokens C^{m‑1} ∪ C^{m}). SWA is inserted sparsely (four layers) and operates only on adjacent blocks, delivering lossless short‑range context and fine‑grained geometric alignment.
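A minimal single‑head sketch of one SWA layer, assuming plain dot‑product attention with keys and values drawn from the concatenated adjacent blocks (the real layers are multi‑head and sit inside a transformer backbone):

```python
import numpy as np

def swa_layer(curr, prev):
    """Single-head sliding-window attention (illustrative simplification).

    Queries come from the current block; keys/values are the previous
    and current blocks concatenated (C^{m-1} U C^{m}), so each token
    attends over uncompressed features from exactly one block back.
    """
    kv = curr if prev is None else np.concatenate([prev, curr], axis=0)
    scores = curr @ kv.T / np.sqrt(curr.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv

rng = np.random.default_rng(1)
prev = rng.normal(size=(6, 4))  # tokens of block m-1
curr = rng.normal(size=(6, 4))  # tokens of block m
out = swa_layer(curr, prev)     # one output row per current-block token
```

Because the key/value set never grows beyond two blocks, each SWA layer's cost is constant per block rather than quadratic in total sequence length.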
The combination yields linear‑time, fixed‑size memory propagation that scales to thousands of frames while preserving dense local geometry.
Block‑wise Processing
The input video is divided into n blocks (e.g., 4–20). Each block is processed independently by the backbone network with bidirectional attention, and the hybrid memory then passes information to the next block. This design bounds computational cost (quadratic attention is confined to a single block) and ensures that the training distribution (short blocks) matches the inference distribution.
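The overall loop can be sketched like this. Every name and operation here is a hypothetical stand‑in (the memory read, SWA mixing, and backbone are heavily simplified); the point is the control flow: only two fixed‑size pieces of state cross block boundaries, so cost grows linearly with the number of blocks:

```python
import numpy as np

def process_stream(tokens, block_size, lr=0.1):
    """Linear-time block-wise loop with hybrid memory (illustrative)."""
    n, d = tokens.shape
    W = np.zeros((d, d))      # TTT fast weights (fixed size)
    prev = None               # SWA cache: raw tokens of the previous block
    outputs = []
    for start in range(0, n, block_size):
        block = tokens[start:start + block_size]
        ctx = block + block @ W              # read long-range TTT memory
        if prev is not None:
            ctx = ctx + prev.mean(axis=0)    # crude stand-in for SWA mixing
        outputs.append(ctx)
        # update the fixed-size memory after processing the block
        W -= lr * block.T @ (block @ W - block) / len(block)
        prev = block
    return np.concatenate(outputs), W

tokens = np.random.default_rng(2).normal(size=(48, 8))  # 48 frame tokens
out, W = process_stream(tokens, block_size=12)          # 4 blocks of 12
```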
Training Curriculum
LoGeR is trained with a progressive curriculum that gradually increases context length and block density:
Stage 1: 48‑frame sequences split into 4 blocks.
Stage 2: Increase block density to 12 blocks while keeping the total sequence length at 48 frames.
Stage 3: Expand the context to 128 frames and up to 20 blocks using H200 GPUs.
This curriculum forces the model to first rely on SWA for local consistency and later shift to the TTT memory for global alignment.
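The three stages above can be captured in a small configuration table. This is a hypothetical config sketch, not the authors' training script; block sizes are derived by ceiling division since 128 frames do not divide evenly into 20 blocks:

```python
# Hypothetical configuration mirroring the three curriculum stages.
CURRICULUM = [
    {"stage": 1, "frames": 48,  "blocks": 4},   # coarse blocks, SWA dominates
    {"stage": 2, "frames": 48,  "blocks": 12},  # denser blocks, same length
    {"stage": 3, "frames": 128, "blocks": 20},  # long context, TTT dominates
]

for cfg in CURRICULUM:
    # ceil division: the last block may be shorter than the others
    cfg["block_size"] = -(-cfg["frames"] // cfg["blocks"])
```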
Evaluation
KITTI benchmark (open‑loop): LoGeR trained on 128‑frame sequences reduces absolute trajectory error (ATE) by more than 74% compared with prior feed‑forward methods and outperforms the strongest optimization‑based baseline (VGGT‑Long) by 32.5%.
VBR dataset (up to 19,000 frames): LoGeR produces stable, globally consistent reconstructions where baseline methods exhibit severe scale drift. The TTT module naturally anchors the global scale, enabling accurate reconstruction over ultra‑long sequences.
Short‑sequence datasets (7‑Scenes, ScanNetV2, TUM‑Dynamics; up to ~1,000 frames): LoGeR consistently surpasses state‑of‑the‑art methods such as Point3R, CUT3R, TTT3R, StreamVGGT, and bidirectional‑attention baselines in both 3D point‑cloud quality and camera pose accuracy.
Quantitative results are reported in the original paper's tables and figures, showing average performance gains of 30%–35% over the best existing baselines across all benchmarks.
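For readers unfamiliar with the headline metric, ATE reduces to a position RMSE between estimated and ground‑truth camera trajectories. A minimal sketch follows; benchmark protocols typically align the estimated trajectory to the ground truth first (e.g. Umeyama similarity alignment), a step omitted here for brevity:

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error as position RMSE (no alignment step)."""
    residuals = est_xyz - gt_xyz                        # (n, 3) offsets
    return float(np.sqrt(np.mean(np.sum(residuals**2, axis=1))))

gt = np.stack([np.linspace(0.0, 10.0, 50),
               np.zeros(50), np.zeros(50)], axis=1)     # straight-line path
est = gt + 0.05                                         # 5 cm bias per axis
ate = ate_rmse(est, gt)                                 # = 0.05 * sqrt(3)
```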
References
Paper: LoGeR: Long‑Context Geometric Reconstruction with Hybrid Memory
arXiv preprint: https://arxiv.org/pdf/2603.03269
Project website: https://loger-project.github.io/
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
