Does Scale Stealthily Hijack Attention? PMDformer’s Simple Subtraction Fix for Long-Term Forecasting
The paper identifies scale differences between patches as a hidden source of attention distortion in long‑term time‑series forecasting, introduces PMDformer with Patch Mean Decoupling, Neighbor Variable Attention, and Trend Recovery Attention, and demonstrates state‑of‑the‑art accuracy and efficiency across eight benchmark datasets.
Long‑term time series forecasting (LTSF) is critical for domains such as energy management, finance, and traffic prediction. Existing patch‑based Transformer models suffer from a fundamental issue: in non‑stationary series, scale disparities between patches obscure shape similarity, causing the attention mechanism to learn misleading relationships and limiting prediction accuracy.
To address this, researchers from Southwest University of Finance and Economics, Shanghai Institute of Science and Intelligence, Fudan University, and Chengdu Hengtou Technology propose PMDformer, an innovative framework built on Patch Mean Decoupling (PMD). The model comprises three synergistic modules that together form a complete technical solution.
1. Patch Mean Decoupling (PMD): For each patch, the temporal mean is subtracted, separating the original patch into a long‑term trend component (the mean) and a residual shape component. Unlike conventional normalization, PMD performs only mean subtraction, preserving the amplitude variations and structural shape within the patch (see the first sketch after this list).
2. Neighbor Variable Attention (PVA): Based on the insight that recent cross‑variable interactions are the most informative for the target series, PVA restricts cross‑variable self‑attention to tokens within the most recent patch rather than the entire historical window. This design yields two advantages: (a) it captures the most relevant recent shape similarity while avoiding interference from weak or spurious correlations in earlier patches, and (b) it reduces computational complexity from O(C²N) to O(C²), markedly improving efficiency (see the second sketch after this list).
3. Trend Recovery Attention (TRA): While PMD enhances shape modeling, it attenuates the long‑term trend signal. TRA addresses this by using only shape embeddings for the Query and Key channels, ensuring attention scores reflect shape similarity, and by injecting the previously extracted patch mean into the Value channel via addition. This decoupled design lets the model encode both local shape patterns and global trend dynamics, producing more stable forecasts (see the third sketch after this list).
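To make the decoupling step concrete, here is a minimal PyTorch sketch of per‑patch mean subtraction as described in point 1. The tensor layout [batch, variables, patches, patch_length] and the function name are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def patch_mean_decoupling(patches: torch.Tensor):
    """Split each patch into its mean (trend) and a zero-mean shape residual.

    patches: [batch, variables, num_patches, patch_len] -- assumed layout.
    """
    # Mean over the temporal dimension of each patch: [B, C, N, 1]
    patch_mean = patches.mean(dim=-1, keepdim=True)
    # Residual keeps amplitude variation and within-patch shape; only the level is removed
    shape = patches - patch_mean
    return shape, patch_mean
```

Because only the per‑patch level is removed, amplitude and shape information inside each patch remains intact, which is what distinguishes PMD from full instance normalization.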
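The next sketch illustrates the idea behind Neighbor Variable Attention from point 2: cross‑variable attention is computed only over the tokens of the most recent patch, so the score matrix scales as O(C²) rather than O(C²N). The class name, the use of nn.MultiheadAttention, and the way the mixed token is written back are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeighborVariableAttention(nn.Module):
    """Cross-variable attention computed only on the most recent patch token,
    so the score matrix is C x C instead of (C*N) x (C*N)."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, C, N, d_model] patch embeddings per variable (assumed layout)
        last = tokens[:, :, -1, :]              # [B, C, d_model]: most recent patch only
        mixed, _ = self.attn(last, last, last)  # attention over the C variables: O(C^2)
        # Write the cross-variable context back onto the latest patch token
        out = tokens.clone()
        out[:, :, -1, :] = mixed
        return out
```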
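Finally, a sketch of the Trend Recovery Attention idea from point 3: Queries and Keys come from the shape embeddings alone, while the stored patch means are added back on the Value path. The mean_proj layer that lifts the scalar mean into model dimension is a hypothetical detail; the paper only specifies that the mean is injected into the Value channel by addition.

```python
import torch
import torch.nn as nn

class TrendRecoveryAttention(nn.Module):
    """Attention whose scores come from shape embeddings only, while the
    previously removed patch means are added back into the Value path."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical projection lifting the scalar patch mean into model space
        self.mean_proj = nn.Linear(1, d_model)

    def forward(self, shape_emb: torch.Tensor, patch_mean: torch.Tensor) -> torch.Tensor:
        # shape_emb: [B*C, N, d_model] shape tokens; patch_mean: [B*C, N, 1] (assumed layouts)
        q = k = shape_emb                           # scores reflect shape similarity only
        v = shape_emb + self.mean_proj(patch_mean)  # re-inject the trend into the Values
        out, _ = self.attn(q, k, v)
        return out
```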
The authors evaluated PMDformer on eight widely used real‑world datasets covering electricity, weather, energy, and traffic domains. Compared with eight recent baselines, PMDformer achieved the lowest MSE and MAE on seven of the eight datasets, demonstrating consistent and superior performance across multiple prediction horizons (96, 192, 336, 720 steps).
In terms of computational efficiency, scaling experiments that increased the number of variables from 100 to 3000 and the sequence length from 144 to 5400 showed that PMDformer consistently required less GPU memory than PatchTST, iTransformer, and ModernTCN. The memory savings stem from the PVA module’s reduced attention complexity, which becomes especially pronounced in high‑dimensional multivariate scenarios.
In summary, the work reveals that coupling patch means (trend) with residual shapes systematically harms attention’s ability to model shape similarity. By applying a simple mean‑subtraction operation together with carefully designed trend‑recovery and neighbor‑variable attention mechanisms, PMDformer improves both prediction accuracy and computational efficiency without increasing model complexity. Future work will extend the approach to higher‑dimensional multivariate data and explore multimodal fusion with text and image inputs for smarter forecasting in energy, finance, and transportation.