LinStereo Bridges the Last Mile of Stereo Matching (ECCV 2026)

LinStereo replaces the ConvGRU update with a position‑aware linear attention module and adds a multi‑scale cost volume and monocular depth initialization. It cuts Middlebury occlusion‑region error by 37% relative to DEFOM‑Stereo, outperforms larger models, and achieves strong zero‑shot underwater performance while remaining parameter‑efficient.

Machine Heart

Jul 4, 2026

Current stereo‑matching pipelines share a common structure: a pretrained backbone extracts features, a cost volume is built, and a ConvGRU iteratively refines disparity. The authors identify the ConvGRU‑based update as the bottleneck, especially in large occluded regions, weakly textured areas, and underwater scenes, where its limited receptive field hampers information propagation.

LinStereo architecture

LinStereo introduces three complementary modules:

PALA (Position‑Aware Linear Attention) replaces ConvGRU, allowing each pixel to attend to the whole image at every iteration.

HSCV (Hierarchical Multi‑Scale Cost Volume) retains multi‑scale features by constructing cost volumes at 1/4, 1/8 and 1/16 resolutions, each with a four‑level disparity pyramid.

DPI (Depth‑Prior Initialization) uses the frozen Depth Anything V3 backbone to generate a monocular depth map; SIFT matches compute a scale and shift that are converted into a reliable disparity initialization.

PALA: linear‑complexity global attention

PALA substitutes the ConvGRU’s local update with global attention. To avoid the O(N²) cost of softmax attention over N pixels, queries and keys are passed through an ELU+1 kernel activation, and the associativity of matrix multiplication lets the key‑value product be computed first, reducing complexity to O(N·C_h²). Empirically, a single iteration costs 3.50 ms, compared with 3.63 ms for ConvGRU, so global attention comes with no slowdown.
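The associativity trick can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single head, no learned projections, and hypothetical function names. It only shows that computing the key‑value product first gives the same result as forming the full N×N score matrix.

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """Kernel attention in O(N * C * D). q, k: (N, C); v: (N, D)."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    kv = phi_k.T @ v               # (C, D) summary of all keys and values
    k_sum = phi_k.sum(axis=0)      # (C,) used for normalization
    num = phi_q @ kv               # (N, D)
    den = phi_q @ k_sum + eps      # (N,)
    return num / den[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed the O(N^2) way, for comparison only."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    scores = phi_q @ phi_k.T       # (N, N) explicit score matrix
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(2, 64, 8))
    v = rng.normal(size=(64, 4))
    print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

The fast path never materializes an N×N matrix; its largest intermediate is the C×D key‑value summary, which is why the cost grows linearly with the number of pixels.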

Because kernel‑based attention can lose positional cues, PALA applies 2‑D rotary position embeddings (RoPE) only to the attention numerator ("asymmetric RoPE"). An ablation on KITTI shows EPE 1.01 with asymmetric RoPE versus 1.05 without; on the underwater TartanAir‑UW benchmark, RMSE improves from 2.18 to 2.08 (≈5% gain).
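One possible reading of "asymmetric RoPE" is sketched below in NumPy. Several details are assumptions rather than facts from the article: the rotation is applied after the ELU+1 feature map, half of the channels encode the row and half the column, the denominator uses the unrotated features so that it stays positive, and all function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1, strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (N, C) by angles proportional to pos (N,). C must be even."""
    c = x.shape[1]
    freqs = base ** (-np.arange(0, c, 2) / c)   # (C/2,) one frequency per channel pair
    ang = pos[:, None] * freqs[None, :]         # (N, C/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, ys, xs):
    """First half of the channels encodes the row, second half the column (C divisible by 4)."""
    half = x.shape[1] // 2
    return np.concatenate([rope_1d(x[:, :half], ys), rope_1d(x[:, half:], xs)], axis=1)

def asymmetric_rope_attention(q, k, v, ys, xs, eps=1e-6):
    phi_q, phi_k = feature_map(q), feature_map(k)
    rq, rk = rope_2d(phi_q, ys, xs), rope_2d(phi_k, ys, xs)
    num = rq @ (rk.T @ v)                     # position-aware numerator, still linear in N
    den = phi_q @ phi_k.sum(axis=0) + eps     # unrotated denominator, guaranteed positive
    return num / den[:, None]
```

Because a rotary embedding makes the query‑key product depend only on the offset between two pixels, shifting every coordinate by the same amount leaves the output of this sketch unchanged.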

HSCV: preserving multi‑scale information

HSCV builds cost volumes at three scales (1/4, 1/8, 1/16) and, within each scale, a four‑level disparity pyramid. Removing HSCV raises KITTI EPE by 0.06 and underwater AbsRel by 0.003. Although the individual gain is modest, it compounds when combined with PALA.
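The layout of three scales with four pyramid levels each can be sketched as follows. This is an illustration under stated assumptions, not the paper's construction: it uses an all‑pairs dot‑product correlation along each image row, builds the pyramid by average pooling along the disparity axis, and simulates the coarser scales by 2×2 average pooling of the features instead of taking them from a backbone. All function names are hypothetical.

```python
import numpy as np

def avg_pool_last(x, k=2):
    """Average-pool the last axis by a factor k (drops a trailing remainder)."""
    n = x.shape[-1] // k * k
    return x[..., :n].reshape(*x.shape[:-1], n // k, k).mean(axis=-1)

def downsample2x(f):
    """2x2 average pooling of a feature map of shape (H, W, C)."""
    h, w = f.shape[0] // 2 * 2, f.shape[1] // 2 * 2
    f = f[:h, :w]
    return f.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def correlation_volume(f_left, f_right):
    """Correlate every left pixel with every right pixel on the same row: (H, W, W)."""
    c = f_left.shape[-1]
    return np.einsum("hwc,hvc->hwv", f_left, f_right) / np.sqrt(c)

def build_hscv(f_left, f_right, num_scales=3, num_levels=4):
    """Return a list of scales, each holding a list of pyramid levels."""
    volumes = []
    for _ in range(num_scales):
        pyramid = [correlation_volume(f_left, f_right)]
        for _ in range(num_levels - 1):
            pyramid.append(avg_pool_last(pyramid[-1]))
        volumes.append(pyramid)
        # Move to the next coarser scale (stand-in for 1/4 -> 1/8 -> 1/16 features).
        f_left, f_right = downsample2x(f_left), downsample2x(f_right)
    return volumes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fl, fr = rng.normal(size=(2, 16, 32, 8))
    for scale in build_hscv(fl, fr):
        print([level.shape for level in scale])
```

Each pyramid level halves the resolution of the matching axis, so coarse levels cover large displacements cheaply while the finest level keeps pixel‑accurate matching costs.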

DPI: monocular depth‑based initialization

Depth Anything V3 provides an affine‑invariant depth map. SIFT matches between the left and right images estimate the required scale and shift, which are transformed into an initial disparity. SIFT fails in 3.7 % of cases; the system falls back to zero initialization, incurring only a 0.08‑pixel increase in EPE.
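The scale‑and‑shift alignment can be sketched as a least‑squares fit. The sketch assumes the monocular output is related to disparity by an affine map (disparity ≈ scale · mono + shift), takes the sparse matches as given rather than running SIFT, and uses a hypothetical `min_matches` threshold to decide when to fall back to zero initialization; none of these details are confirmed by the article.

```python
import numpy as np

def init_disparity(mono, match_xy_left, match_x_right, min_matches=8):
    """
    mono:           (H, W) monocular prediction, assumed affine-related to disparity.
    match_xy_left:  (M, 2) integer (x, y) pixel coordinates of matches in the left image.
    match_x_right:  (M,) x coordinates of the same matches in the right image.
    Returns an (H, W) initial disparity map.
    """
    if len(match_xy_left) < min_matches:
        # Too few matches: fall back to zero initialization.
        return np.zeros_like(mono)
    xs, ys = match_xy_left[:, 0], match_xy_left[:, 1]
    sparse_disp = xs - match_x_right              # disparity = x_left - x_right
    m = mono[ys, xs]                              # monocular values at the matched pixels
    A = np.stack([m, np.ones_like(m)], axis=1)    # design matrix for [scale, shift]
    (scale, shift), *_ = np.linalg.lstsq(A, sparse_disp, rcond=None)
    return np.clip(scale * mono + shift, 0.0, None)
```

A plain least‑squares fit is sensitive to wrong matches; a robust variant (for example RANSAC over the same two parameters) would be a natural substitute, though the article does not say which estimator is used.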

Experimental results

LinStereo uses a ViT‑B backbone (127 M parameters, of which >100 M are frozen). On the Middlebury occlusion set it achieves an EPE of 1.33, which is 16% lower than FoundationStereo and 37% lower than DEFOM‑Stereo (2.11). The gain is attributed to global attention retrieving information from distant, unoccluded regions.

Zero‑shot underwater evaluation shows LinStereo leading across all metrics despite no underwater training data, with qualitative results demonstrating coherent depth maps in severely degraded regions.

Speed‑accuracy trade‑offs: with T = 2 iterations the system runs at 12.5 FPS on 480×640 images while maintaining a SQUID AbsRel of 0.05. Stacking three PALA blocks increases the parameter count to 147 M but degrades KITTI EPE from 1.01 to 1.05, suggesting that the deeper stack over‑fits.

SeaStereo dataset

The authors release the SeaStereo‑Dataset, containing 40,320 underwater stereo pairs with dense disparity. The data cover seven Jerlov water types, are rendered in Blender with ShapeNetCore objects against real ocean backgrounds, and address the scarcity of public underwater stereo data.

Parameter efficiency

Only ~10 M parameters require training; the remainder (>100 M) are frozen from Depth Anything V3, demonstrating that a lightweight decoder can leverage large pretrained priors.

Paper: https://arxiv.org/abs/2606.25437. Code: https://github.com/u7079256/LinStereo.
