Time Series Forecasting Augmentation: Frequency, Decomposition, and Patch Techniques

This article reviews why classic classification augmentations fail for forecasting, introduces the essential data‑label consistency requirement, and systematically categorizes effective time‑series augmentation methods—including frequency‑domain (RobustTAD, FreqMask, FreqMix), decomposition (STAug), and patch‑based approaches (WaveMask, WaveMix, Dominant Shuffle, Temporal Patch Shuffle)—backed by extensive experiments on long‑term, short‑term, and classification tasks.

Data Party THU

Data augmentation is indispensable in modern machine learning. In computer vision it is crucial for training good models, and in time‑series classification a mature set of techniques (jittering, scaling, window slicing, time warping, permutation, rotation) exists. However, forecasting differs because the target is a continuous signal that follows the input, making many classification‑oriented augmentations unsuitable.

Why Classification Augmentations Fail in Forecasting

Techniques such as jittering, scaling, window warping, and permutation preserve class labels but disrupt the relationship between the look‑back window x and the prediction horizon y. Altering the input without a corresponding, consistent change to the target breaks the input‑target alignment, causing performance to drop below the non‑augmented baseline.

Data‑Label Consistency: A Necessary Condition

Let the concatenated sequence be s = x \parallel y. Augmentation should be applied to s as a whole, then split back into (\tilde{x}, \tilde{y}) = Split(\mathcal{A}(s)). This preserves the natural continuity between input and target, which is essential for forecasting.
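The consistency requirement can be sketched in a few lines of NumPy. The helper below is illustrative (the names `augment_consistently` and `jitter` are not from any paper): any whole-sequence augmentation is applied to s = x ∥ y before splitting back into input and target.

```python
import numpy as np

def augment_consistently(x, y, augment):
    """Apply an augmentation to the concatenated sequence s = x || y,
    then split back so the input-target alignment is preserved."""
    s = np.concatenate([x, y])               # s = x || y
    s_aug = augment(s)                       # A(s): any whole-sequence augmentation
    return s_aug[:len(x)], s_aug[len(x):]    # (x~, y~) = Split(A(s))

# Toy augmentation: additive Gaussian jitter applied to the whole sequence.
rng = np.random.default_rng(0)
jitter = lambda s: s + rng.normal(0.0, 0.01, size=s.shape)

x = np.sin(np.linspace(0, 6, 96))            # look-back window
y = np.sin(np.linspace(6, 8, 32))            # prediction horizon
x_aug, y_aug = augment_consistently(x, y, jitter)
```

Because the noise is drawn over the concatenated sequence, the perturbation at the boundary between \tilde{x} and \tilde{y} stays continuous, which is exactly what input-only augmentation destroys.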

Classification of Forecast Augmentation Methods

Frequency‑based: RobustTAD, FreqMask, FreqMix, WaveMask, WaveMix, Dominant Shuffle

Decomposition‑based: STAug

Other: wDBA, MBB, Upsample

Patch‑based: Temporal Patch Shuffle (TPS)

RobustTAD

RobustTAD performs a discrete Fourier transform on the concatenated sequence, perturbs selected frequency bands (either amplitude or phase), and applies the inverse transform. The perturbation magnitude is controlled by a proportion of the spectrum, and amplitude variants replace magnitudes with samples from a Gaussian distribution, while phase variants add a small offset. Originally designed for anomaly detection, the amplitude variant is also used for multivariate forecasting.
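A minimal sketch of the amplitude variant, assuming the "replace magnitudes with Gaussian samples" step draws samples centered on the original magnitudes (the paper's exact parameterization may differ; `ratio` and `sigma` are illustrative names):

```python
import numpy as np

def robusttad_amplitude(s, ratio=0.1, sigma=0.1, rng=None):
    """Sketch of RobustTAD's amplitude variant: perturb the magnitudes of a
    random fraction of frequency bins, keep the phases, and invert."""
    rng = rng or np.random.default_rng()
    S = np.fft.rfft(s)
    amp, phase = np.abs(S), np.angle(S)
    k = max(1, int(ratio * len(S)))                  # proportion of the spectrum
    idx = rng.choice(len(S), size=k, replace=False)  # bins to perturb
    # Replace magnitudes with Gaussian samples centered on the originals
    # (scale choice is illustrative).
    amp[idx] = np.abs(rng.normal(amp[idx], sigma))
    return np.fft.irfft(amp * np.exp(1j * phase), n=len(s))
```

The phase variant would instead add a small offset to `phase[idx]` while leaving `amp` untouched.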

FreqMask and FreqMix

Both start with s = x \parallel y and compute S = rFFT(s). FreqMask applies a binary mask M to zero out selected frequencies: \tilde{S} = M \odot S, \tilde{s} = irFFT(\tilde{S}). FreqMix mixes two sequences in the frequency domain:

\tilde{S} = M \odot S_1 + (1 - M) \odot S_2, \quad \tilde{s} = irFFT(\tilde{S})

These operations are simple yet effective, forcing models to be robust to missing frequency components.

Figure: FreqMask and FreqMix illustration
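Both operations can be sketched with NumPy's real FFT; `mask_ratio` and `mix_ratio` are illustrative hyperparameter names, not the papers' notation:

```python
import numpy as np

def freq_mask(s, mask_ratio=0.1, rng=None):
    """FreqMask sketch: zero out a random fraction of rFFT bins."""
    rng = rng or np.random.default_rng()
    S = np.fft.rfft(s)
    M = rng.random(len(S)) >= mask_ratio             # binary keep-mask
    return np.fft.irfft(M * S, n=len(s))

def freq_mix(s1, s2, mix_ratio=0.2, rng=None):
    """FreqMix sketch: S~ = M ⊙ S1 + (1 - M) ⊙ S2."""
    rng = rng or np.random.default_rng()
    S1, S2 = np.fft.rfft(s1), np.fft.rfft(s2)
    M = rng.random(len(S1)) >= mix_ratio             # take S1 where True, else S2
    return np.fft.irfft(np.where(M, S1, S2), n=len(s1))
```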

WaveMask and WaveMix (Time‑Frequency Localization)

Fourier transforms lose temporal location. The Short‑Time Fourier Transform (STFT) uses a fixed window, while wavelets provide multi‑resolution analysis. WaveMask and WaveMix first apply a discrete wavelet transform W = WaveDec(s) = \{W^{(1)}, \dots, W^{(L+1)}\}. For each level l, WaveMask masks coefficients, \tilde{W}^{(l)} = M^{(l)} \odot W^{(l)}, and WaveMix mixes coefficients from two sequences:

\tilde{W}^{(l)} = M^{(l)} \odot W_1^{(l)} + (1 - M^{(l)}) \odot W_2^{(l)}

Reconstructing with the inverse DWT yields the augmented sequence.

Figure: WaveMask and WaveMix pipelines
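To keep the idea self-contained, the sketch below implements WaveMask with a single-level Haar DWT written by hand; the actual methods use deeper, library-based wavelet decompositions (e.g. PyWavelets), and `mask_ratio` is an illustrative name:

```python
import numpy as np

def wave_mask_haar(s, mask_ratio=0.2, rng=None):
    """WaveMask sketch with a one-level Haar DWT (assumes even-length s)."""
    rng = rng or np.random.default_rng()
    even, odd = s[0::2], s[1::2]
    approx = (even + odd) / np.sqrt(2)            # level-1 approximation coeffs
    detail = (even - odd) / np.sqrt(2)            # level-1 detail coeffs
    M = rng.random(len(detail)) >= mask_ratio     # mask only the detail level
    detail = M * detail
    # Inverse Haar transform back to the time domain.
    out = np.empty_like(s)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out
```

With `mask_ratio=0` the transform round-trips exactly, which is a convenient sanity check that the reconstruction is correct.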

Dominant Shuffle

Dominant Shuffle selects the top‑k dominant frequencies Ω_k from the FFT of s, shuffles only those components, and leaves the rest untouched: \tilde{S}_{Ω_k} = Shuffle(S_{Ω_k}), then \tilde{s} = IFFT(\tilde{S}). This avoids overly aggressive perturbation of the whole spectrum. In the TPS benchmark, however, Dominant Shuffle is not the strongest method overall.

Figure: Dominant Shuffle illustration
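A compact sketch of the idea: pick the k largest-magnitude bins (keeping the DC term fixed so the mean is preserved) and permute only those. The handling of the DC bin and the parameter name `k` are illustrative choices, not taken from the paper:

```python
import numpy as np

def dominant_shuffle(s, k=3, rng=None):
    """Dominant Shuffle sketch: permute only the k largest-magnitude
    frequency components; leave the rest of the spectrum intact."""
    rng = rng or np.random.default_rng()
    S = np.fft.rfft(s)
    mags = np.abs(S)
    mags[0] = 0.0                                # keep the DC term fixed
    top_k = np.argsort(mags)[-k:]                # Ω_k: dominant frequencies
    S[top_k] = S[rng.permutation(top_k)]         # shuffle only those bins
    return np.fft.irfft(S, n=len(s))
```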

STAug (Decomposition‑Based)

STAug applies Empirical Mode Decomposition (EMD) to two sequences, obtaining intrinsic mode functions (IMFs). It then recombines the IMFs using mixup‑style interpolation weights sampled from a uniform distribution, producing a new sequence that blends temporal features. The method suffers from high memory consumption; in the TPS experiments it could not be evaluated on the ECL and Traffic datasets due to GPU memory limits.

Figure: STAug decomposition and recombination
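Assuming the IMFs have already been extracted (EMD itself would come from a library such as PyEMD and is omitted here), the recombination step can be sketched as a per-IMF mixup; the exact weighting scheme in STAug may differ from this one-weight-per-IMF assumption:

```python
import numpy as np

def staug_recombine(imfs1, imfs2, rng=None):
    """STAug recombination sketch: mix each pair of IMFs from two sequences
    with a weight drawn from U(0, 1), then sum to rebuild a new sequence."""
    rng = rng or np.random.default_rng()
    w = rng.uniform(0.0, 1.0, size=len(imfs1))   # one mixup weight per IMF pair
    mixed = [wi * a + (1 - wi) * b
             for wi, a, b in zip(w, imfs1, imfs2)]
    return np.sum(mixed, axis=0)                 # reconstruct the blended sequence
```

Because every IMF of the output carries information from both source sequences, the result blends temporal features at multiple scales, which is the source of both the method's strength and its memory cost.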

Other Non‑Frequency Methods

wDBA aligns sequences with Dynamic Time Warping (DTW) and averages them, producing high‑quality synthetic samples at a large computational cost. MBB decomposes a series into trend, seasonality, and residual via STL, then bootstraps residual blocks. Upsample extracts a continuous segment and linearly interpolates it back to the original length, acting as a local magnifier; it consistently provides a strong non‑frequency baseline.

Figure: Upsample pipeline
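The Upsample baseline is simple enough to sketch directly; `segment_ratio` is an illustrative name for the cropped fraction:

```python
import numpy as np

def upsample_augment(s, segment_ratio=0.5, rng=None):
    """Upsample sketch: crop a random contiguous segment and linearly
    interpolate it back to the original length (a 'local magnifier')."""
    rng = rng or np.random.default_rng()
    n = len(s)
    seg_len = max(2, int(segment_ratio * n))
    start = rng.integers(0, n - seg_len + 1)
    segment = s[start:start + seg_len]
    # Stretch the segment back onto n evenly spaced points.
    return np.interp(np.linspace(0, seg_len - 1, n),
                     np.arange(seg_len), segment)
```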

From Image Patch to Time‑Series Patch

Patch‑based augmentation is well‑established in vision (e.g., PatchShuffle, PatchMix) because images have spatial redundancy. Time series lack such redundancy; shuffling non‑overlapping patches creates hard boundaries and breaks input‑target alignment. Therefore, patch ideas must be re‑thought for the temporal domain.

Figure: PatchShuffle in vision

Temporal Patch Shuffle (TPS)

The TPS pipeline works as follows:

Concatenate the look‑back window and prediction horizon into a continuous sequence s = x \parallel y to enforce data‑label consistency.

Temporal Patching: extract overlapping patches of length p with stride d < p (the stride is denoted d to avoid clashing with the sequence s). Overlap ensures smooth transitions during reconstruction.

Variance Scoring: compute the variance of each patch across all channels (after normalisation). Low‑variance patches contain fewer structural details and are safer to perturb.

Selective Shuffle: shuffle a proportion \alpha of the lowest‑variance patches; the remaining patches stay in place.

Reconstruction: place each patch back (shuffled or not) and average overlapping regions to smooth discontinuities.

Split the reconstructed sequence back into the augmented input \tilde{x} and target \tilde{y}.
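The steps above can be sketched for the univariate case as follows; hyperparameter names (`patch_len`, `stride`, `alpha`) are illustrative, and the multivariate version would score variance across channels:

```python
import numpy as np

def temporal_patch_shuffle(x, y, patch_len=16, stride=8, alpha=0.5, rng=None):
    """TPS sketch: overlapping patches, variance scoring, selective shuffle
    of the lowest-variance patches, overlap-averaged reconstruction."""
    rng = rng or np.random.default_rng()
    s = np.concatenate([x, y])                       # data-label consistency
    starts = np.arange(0, len(s) - patch_len + 1, stride)
    patches = np.stack([s[i:i + patch_len] for i in starts])
    variances = patches.var(axis=1)
    order = np.argsort(variances)                    # low variance first
    n_shuffle = int(alpha * len(patches))
    chosen = order[:n_shuffle]                       # safest patches to perturb
    patches[chosen] = patches[rng.permutation(chosen)]
    # Reconstruction: place patches back and average overlapping regions.
    acc = np.zeros(len(s))
    cnt = np.zeros(len(s))
    for i, p in zip(starts, patches):
        acc[i:i + patch_len] += p
        cnt[i:i + patch_len] += 1
    s_aug = acc / np.maximum(cnt, 1)
    return s_aug[:len(x)], s_aug[len(x):]            # split back into (x~, y~)
```

Note the averaging over overlaps: it is what prevents the hard patch boundaries that make naive (non-overlapping) patch shuffling fail on time series.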

Figure: TPS pipeline illustration

Ablation Study

Key findings:

Data‑label consistency is decisive; augmenting only the input while keeping the target unchanged causes the largest performance drop.

Overlapping patches are crucial; replacing them with non‑overlapping patches degrades results noticeably.

Variance‑aware ordering provides a modest gain, especially when only a subset of patches is shuffled.

Operating directly in the time domain outperforms frequency‑domain variants of the same patch operation.

Higher shuffle ratios (≈0.7–1.0) generally yield stronger improvements.

Overall, the study emphasizes that forecasting augmentation must inject *controlled* randomness that respects the signal’s temporal structure.

Long‑Term Forecasting Results

TPS was evaluated on nine long‑term datasets using five recent backbones (TSMixer, DLinear, PatchTST, TiDE, LightTS). Across all backbones, TPS achieved the best average MSE, improving the second‑best baseline by 2.08%–10.51% (the largest 10.51% gain on LightTS).

Figure: Long‑term forecasting comparison

Short‑Term Traffic Forecasting

Using PatchTST as the backbone on four traffic datasets (PeMS‑03,‑04,‑07,‑08), TPS again delivered the strongest overall enhancement, with MSE improvements ranging from 0.00% to 7.14% and never degrading performance.

Figure: Short‑term traffic forecasting results

Extension to Time‑Series Classification

For classification, TPS removes the concatenation step and shuffles patches at the sample level. On 30 univariate UCR datasets (MiniRocket) and 10 multivariate UEA datasets (MultiRocket), TPS achieved the highest average accuracy, improving the best competitor by 0.50% (UCR) and 1.10% (UEA), and ranking in the top‑2 on a majority of datasets.

Figure: Classification results

Conclusion

TPS’s advantage stems from three factors: it avoids costly decomposition steps, it does not indiscriminately disturb the entire spectrum, and it preserves input‑target alignment through data‑label consistency. By applying controlled, variance‑aware shuffling with overlapping patches and averaging, TPS consistently outperforms other augmentations across long‑term forecasting, short‑term traffic prediction, and time‑series classification, establishing a new state‑of‑the‑art across tasks and model families.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: data augmentation, time series forecasting, frequency domain, temporal patch shuffle, wavelet transform
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
