Why Linear Attention Lags Behind Softmax and How Two Simple Tweaks Close the Gap
The paper analytically identifies injectivity and local modeling as the two key factors causing the performance gap between linear and Softmax attention, proposes the InLine attention modifications to restore these properties, and demonstrates through extensive Vision Transformer experiments that the enhanced linear attention matches or surpasses Softmax while retaining linear computational cost.
Problem Statement
Vision Transformers rely on Softmax (dot‑product) attention for strong long‑range modeling, but its quadratic complexity O(N²) makes high‑resolution inputs prohibitively expensive. Linear attention reduces the complexity to O(N) by replacing the Softmax with a kernel that enables the computation order to be changed, yet its empirical performance lags far behind Softmax.
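To make the complexity difference concrete, the following single-head PyTorch sketch contrasts the two computation orders (the elu+1 feature map and the function names are illustrative choices, not the paper's code): Softmax attention materializes an N×N score matrix, whereas the kernelized form reorders the product so only d×d intermediates are built.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """q, k, v: (N, d). Builds an N x N score matrix -> O(N^2 d)."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    """Kernelized attention with phi = elu + 1 (a common positive feature map).
    Reordering phi(q) @ (phi(k)^T v) keeps only d x d intermediates -> O(N d^2)."""
    pq, pk = F.elu(q) + 1, F.elu(k) + 1
    kv = pk.T @ v                               # (d, d) key-value summary
    z = pq @ pk.sum(dim=0, keepdim=True).T      # (N, 1) division-based normalizer
    return (pq @ kv) / (z + 1e-6)
```

For N ≫ d the reordered form is where the O(N) scaling comes from; the question the paper addresses is why this cheaper form has historically traded away accuracy.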
Key Technical Insights
Injectivity of the attention mapping: Softmax attention is injective (different query vectors produce distinct attention distributions), whereas standard linear attention is not, causing different queries to receive identical weight vectors and leading to semantic confusion.
Effective local modeling: Although attention has a global receptive field, a strong bias toward each query's 3×3 spatial neighborhood is essential for high performance. Softmax attention exhibits a markedly stronger local bias than existing linear variants.
Both claims are supported by theoretical proofs (Propositions 1-3) and extensive experiments on DeiT-T, which show that linear attention frequently assigns identical scores to distinct queries and that masking local tokens degrades Softmax attention far more than it does linear attention.
Analysis of the Gap
4.1 Injectivity
Define the attention mapping f_query that maps a query q to its attention scores. Under mild assumptions, Proposition 1 proves Softmax attention is injective, while Proposition 2 shows linear attention is not. Consequently, two different queries q₁≠q₂ can yield the same score vector, collapsing distinct semantics into the same output.
Illustrative example (Fig. 1): four collinear query vectors of different lengths. Softmax assigns each a distinct score distribution (the longer the query, the more concentrated its attention), whereas linear attention with the identity kernel φ(x)=x produces identical scores for all four queries, demonstrating the confusion. A more nonlinear kernel such as φ(x)=exp(x) makes matters worse, assigning the same scores even to vectors pointing in different directions.
Empirical verification on ImageNet-1K (DeiT-T): for each image, the authors count query pairs whose attention-score vectors lie within an L₂ distance of 1e-3 of each other. With Softmax, nearly all images show zero such collisions, while linear attention exhibits thousands of collisions per image (Fig. 2). Table 1 quantifies the resulting accuracy loss.
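The collision statistic can be approximated in a few lines; the toy collinear queries below mirror the Fig. 1 setup and the 1e-3 threshold mirrors the criterion above, but the data and helper names are illustrative rather than the paper's protocol.

```python
import torch

def attention_scores(queries, keys, mode="softmax"):
    """One normalized score vector per query; identity kernel phi(x) = x in the linear case."""
    logits = queries @ keys.T
    if mode == "softmax":
        return torch.softmax(logits, dim=-1)
    return logits / logits.sum(dim=-1, keepdim=True)     # division-normalized linear attention

def count_collisions(scores, tol=1e-3):
    """Number of query pairs whose score vectors are closer than tol in L2 distance."""
    dist = torch.cdist(scores, scores)
    return int(((dist < tol).sum().item() - scores.shape[0]) // 2)  # drop self-pairs, halve symmetry

keys = torch.randn(16, 8)
direction = torch.randn(8)
queries = torch.stack([c * direction for c in (0.5, 1.0, 2.0, 4.0)])  # collinear, different lengths

for mode in ("softmax", "linear"):
    print(mode, count_collisions(attention_scores(queries, keys, mode)))
# Softmax typically reports 0 colliding pairs; the linear variant reports all 6.
```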
4.2 Local Modeling Capability
Attention’s global nature does not guarantee effective local modeling. By summing the attention weights assigned to the 3×3 spatial window of each query, the authors observe a strong local bias in all three mechanisms, but Softmax allocates a larger proportion (≈0.45) than linear variants (≈0.30), especially in shallow layers (Fig. 3).
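The local-mass measurement can be reproduced roughly as follows, assuming an H×W token grid and a row-normalized attention matrix (tensor shapes and names are illustrative):

```python
import torch

def local_attention_mass(attn, H, W, radius=1):
    """attn: (N, N) row-normalized attention over an H*W token grid.
    Returns the mean fraction of attention each query places on its
    (2*radius+1) x (2*radius+1) neighborhood (3x3 by default)."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    # local_mask[i, j] is True if token j lies in the window around token i
    local_mask = (ys[:, None] - ys[None, :]).abs() <= radius
    local_mask &= (xs[:, None] - xs[None, :]).abs() <= radius
    return (attn * local_mask).sum(dim=-1).mean().item()

attn = torch.softmax(torch.randn(49, 49), dim=-1)   # toy attention map on a 7x7 grid
print(local_attention_mass(attn, H=7, W=7))
```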
Masking experiments (Table 2) reveal two facts:
Removing local tokens dramatically reduces accuracy, whereas random removal of the same number of tokens has a minor effect.
The degradation is far more severe for Softmax than for the proposed linear variant, confirming that stronger local bias contributes to Softmax’s superiority.
Proposed Solution: InLine Attention
To restore injectivity, the division‑based normalization in linear attention is replaced by subtraction, yielding the following formulation:
\tilde{A}(q, k_i) = \phi(q)^{\top}\phi(k_i) - \frac{1}{N}\left(\sum_{j=1}^{N}\phi(q)^{\top}\phi(k_j) - 1\right)

Proposition 3 proves that this mapping is injective, i.e., for any q₁≠q₂ the resulting attention vectors differ, while the scores of each query still sum to 1. The computational cost remains linear: the shared term \sum_{j}\phi(q)^{\top}\phi(k_j) = \phi(q)^{\top}\sum_{j}\phi(k_j) is computed once per query, and the weighted sum over values is evaluated through the usual associative reordering \phi(Q)\,(\phi(K)^{\top}V).
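Under this reconstruction, a single-head sketch of the subtraction-normalized scores and their linear-time evaluation looks as follows (the kernel choice and names are illustrative, not the official implementation):

```python
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1   # illustrative kernel; the paper also studies identity, ReLU and exp

def inline_attention(q, k, v):
    """Subtraction-normalized linear attention, single head, q/k/v of shape (N, d).
    Score(i, j) = phi(q_i)^T phi(k_j) - (1/N) * (sum_s phi(q_i)^T phi(k_s) - 1),
    so every row of scores sums to 1, yet the output needs only O(N d^2) work."""
    N = q.shape[0]
    pq, pk = phi(q), phi(k)
    kv = pk.T @ v                       # (d, d) key-value summary
    s = pq @ pk.sum(dim=0)              # (N,)  = sum_s phi(q_i)^T phi(k_s)
    return pq @ kv - ((s - 1.0) / N)[:, None] * v.sum(dim=0)[None, :]

# sanity check against the explicit O(N^2) formulation
q, k, v = torch.randn(3, 16, 8).unbind(0)
A = phi(q) @ phi(k).T
A = A - (A.sum(dim=-1, keepdim=True) - 1.0) / q.shape[0]
assert torch.allclose(A @ v, inline_attention(q, k, v), atol=1e-4)
print(A.sum(dim=-1))   # every row sums to 1
```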
To compensate for the weaker local bias, a lightweight residual is added:
y = \text{InLine}(X) + \text{MLP}(\text{AvgPool}(X)) \odot X_{\text{local}}

where X_{\text{local}} denotes the tokens in the 3×3 spatial window of each query. The residual adds only O(N) operations, preserving the overall linear complexity O(N·H·D) for H heads of dimension D.
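One plausible way to realize this residual on a token sequence arranged on an H×W grid is sketched below, reading X_local as a depthwise 3×3 aggregation of each query's neighborhood and the pooled MLP output as a per-channel gate; the module layout and hidden size are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LocalResidual(nn.Module):
    """Adds MLP(AvgPool(X)) * X_local on top of the InLine attention output.
    X_local is modeled here as a depthwise 3x3 aggregation of neighboring tokens,
    which costs O(N) and keeps the overall complexity linear."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, dim), nn.Sigmoid())
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, attn_out, x, H, W):
        # attn_out, x: (B, N, C) with N = H * W
        B, N, C = x.shape
        g = self.gate(x.mean(dim=1))                          # (B, C): MLP(AvgPool(X))
        x_local = self.local(x.transpose(1, 2).reshape(B, C, H, W))
        x_local = x_local.reshape(B, C, N).transpose(1, 2)    # back to (B, N, C)
        return attn_out + g[:, None, :] * x_local             # InLine(X) + gate ⊙ X_local
```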
Implementation Details
Backbone: Swin‑Transformer (window‑based attention). The original Softmax attention is swapped with linear attention, then the injectivity fix and local residual are added incrementally.
Training: AdamW optimizer, 300 epochs, cosine decay with a 20-epoch linear warm-up, initial LR = 0.001 (as reported in the paper), weight decay = 0.05. Data augmentations include RandAugment, Mixup, CutMix, and random erasing; EMA is used when training InLine-CSwin. A minimal optimizer/scheduler sketch follows this list.
Datasets: ImageNet‑1K (1.28 M train / 50 k val, 1000 classes), COCO (118 k train / 5 k val), ADE20K (20 k train / 2 k val, 150 classes).
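For reference, the optimizer and learning-rate schedule listed above can be assembled in plain PyTorch roughly as follows; the placeholder model and the SequentialLR composition are assumptions, and the actual training scripts live in the repository linked at the end.

```python
import torch

model = torch.nn.Linear(768, 1000)   # placeholder; substitute the InLine-Swin model

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=20)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=280)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[20])

for epoch in range(300):
    # ... one training epoch over ImageNet-1K (RandAugment, Mixup, CutMix, random erasing) ...
    scheduler.step()   # stepped per epoch: 20 warm-up epochs, then cosine decay
```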
Empirical Results
Classification (ImageNet‑1K)
InLine‑Swin‑T improves top‑1 accuracy from 81.3 % (baseline) to 82.4 % (+1.1 %).
Injectivity alone yields up to +9.8 % accuracy when using problematic kernels (e.g., φ(x)=exp(x)).
Adding the local residual consistently boosts performance across window sizes; larger windows benefit more when the residual is present.
Object Detection (COCO)
InLine‑PVT‑S achieves 6.7 % higher box AP than PVT‑T at comparable FLOPs.
InLine‑PVT‑L surpasses PVT‑M by 3.4 % AP while using fewer FLOPs.
Semantic Segmentation (ADE20K)
When integrated into SemanticFPN and UperNet, InLine consistently raises mIoU while reducing computation.
Throughput
Figure 5 shows that InLine models maintain high inference speed as window size grows, unlike Softmax‑based Swin where latency spikes due to O(N²) scaling. In high‑resolution scenarios (e.g., 1024×1024), InLine delivers a 2‑3× speedup with comparable accuracy.
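A back-of-the-envelope comparison illustrates why the gap widens with resolution (the numbers are illustrative, not taken from the paper). Per head, and ignoring the linear projections,

\text{Softmax: } \mathcal{O}(N^{2}d) \qquad \text{vs.} \qquad \text{Linear/InLine: } \mathcal{O}(Nd^{2})

Taking N = 4096 tokens inside one attention scope and head dimension d = 32 gives N²d ≈ 5.4×10⁸ versus Nd² ≈ 4.2×10⁶, a factor of N/d = 128 in attention cost alone; since the rest of the network scales linearly in both cases, an end-to-end speedup of 2-3× at high resolution is plausible.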
Comparison with Prior Linear Attentions
Table 9 reports results against CosFormer, Nystromformer, Efficient Attention, TransNormer, FLatten, and MLLA. InLine outperforms all baselines without extra modules, demonstrating that injectivity and local bias are the dominant factors.
Ablation Studies
Kernel functions: the identity kernel offers the best trade-off, with ReLU and exponential kernels giving only slight gains (Table 10).
Local residual: removing it drops accuracy by 0.5‑1.2 % across tasks, confirming its importance.
Window size: without the residual, larger windows do not improve performance; with the residual, accuracy rises monotonically with window size.
Conclusion
Restoring injectivity by replacing division with subtraction and augmenting linear attention with a minimal local‑attention residual closes the performance gap with Softmax attention. The resulting InLine attention retains linear time‑complexity O(N·H·D) while achieving equal or superior results on ImageNet‑1K classification, COCO detection, and ADE20K segmentation, and it offers substantial speed advantages on high‑resolution inputs.
All code, pretrained models, and training scripts are available at https://github.com/LeapLabTHU/InLine.