KANMixer: A New KAN‑Centric Paradigm for Long‑Term Time Series Forecasting
This article reviews the KANMixer model, which places Kolmogorov‑Arnold Networks (KANs) at the core of a lightweight architecture for long‑term time series forecasting, detailing its design, benchmark experiments on seven real‑world datasets, ablation analyses, and computational trade‑offs versus MLP and Transformer baselines.
Background: Long‑term time series forecasting (LTSF) is critical in energy, weather, traffic, and other domains. LSTM‑based models and Transformers dominated early work, but DLinear showed that simple linear models can surpass Transformers, prompting researchers to augment MLP‑based LTSF models with hand‑crafted external modules. However, MLPs lack hierarchical locality and inductive bias, and their performance gains are saturating.
Problem definition: The paper investigates whether a Kolmogorov‑Arnold Network (KAN) can serve as the core modeling component for LTSF, addressing three issues: missing inductive bias, performance saturation, and limited representational capacity of flat MLPs.
Method: KANMixer is a lightweight architecture built around KAN and consists of three modules: (1) an explicit multi‑scale processing module that generates multi‑scale representations via average pooling and concatenates them into a unified hidden representation X^{ms}; (2) an implicit time‑mixing module that hierarchically fuses the scales through N stacked mixing blocks, where each scale i is updated with the feature of the adjacent coarser scale i−1, up‑sampled to scale i's resolution and passed through a KAN layer; (3) a KAN‑based prediction head that maps each scale's hidden feature Z_{N}^{i} to a scale‑specific forecast and sums the forecasts for the final output.
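To make this three‑module layout concrete, here is a minimal PyTorch sketch, not the authors' code: SimpleKANLayer is a stand‑in that replaces learnable B‑spline activations with a fixed Gaussian basis plus a linear base branch, and all names and hyper‑parameters (seq_len, pred_len, pooling factor, scale count) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleKANLayer(nn.Module):
    """Stand-in KAN layer: per-input Gaussian basis expansion plus a linear
    base branch. A faithful KAN would use learnable B-spline activations."""
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        # fixed basis centres on [-1, 1]; real KANs learn an adaptive grid
        self.register_buffer("centres", torch.linspace(-1, 1, n_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(d_out, d_in, n_basis))
        self.base = nn.Linear(d_in, d_out)

    def forward(self, x):                                     # x: (..., d_in)
        phi = torch.exp(-4.0 * (x.unsqueeze(-1) - self.centres) ** 2)
        spline = torch.einsum("...ib,oib->...o", phi, self.coef)
        return self.base(x) + spline

class KANMixerSketch(nn.Module):
    """Minimal sketch of the three-module layout described above."""
    def __init__(self, seq_len=96, pred_len=96, n_scales=3, pool=2, n_blocks=2):
        super().__init__()
        self.pool, self.n_scales = pool, n_scales
        self.lens = [seq_len // pool**i for i in range(n_scales)]  # scale 0 = finest
        # (2) implicit time-mixing: per block, one KAN layer per fused scale
        self.blocks = nn.ModuleList(
            nn.ModuleList(SimpleKANLayer(L, L) for L in self.lens[:-1])
            for _ in range(n_blocks)
        )
        # (3) KAN prediction head: one scale-specific forecaster per resolution
        self.heads = nn.ModuleList(SimpleKANLayer(L, pred_len) for L in self.lens)

    def forward(self, x):                         # x: (B, seq_len, C)
        z = x.transpose(1, 2)                     # (B, C, L): mix along time
        # (1) explicit multi-scale module: identity scale plus pooled copies
        scales = [z] + [F.avg_pool1d(z, self.pool**i)
                        for i in range(1, self.n_scales)]
        for block in self.blocks:
            for i in range(self.n_scales - 2, -1, -1):        # coarse -> fine
                up = F.interpolate(scales[i + 1], size=scales[i].shape[-1],
                                   mode="linear")
                scales[i] = scales[i] + block[i](up)
        # sum the scale-specific forecasts for the final output
        y = sum(head(s) for head, s in zip(self.heads, scales))
        return y.transpose(1, 2)                  # (B, pred_len, C)

# Smoke test: 8 series of length 96 with 7 channels -> 96-step forecast
y = KANMixerSketch()(torch.randn(8, 96, 7))       # y.shape == (8, 96, 7)
```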
Experiment setup: Seven real‑world datasets (ETTh1/2, ETTm1/2, Exchange Rate, Weather, Electricity) are split into training/validation/test sets (6:2:2 or 7:1:2, depending on the dataset). Baselines include the KAN‑based TimeKAN, the Transformer‑based iTransformer and PatchTST, the MLP‑based TimeMixer and DLinear, and the CNN‑based TimesNet. All models are trained with the Adam optimizer (lr = 0.01) and batch size 32, and results are averaged over five runs. Evaluation metrics are mean squared error (MSE) and mean absolute error (MAE).
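For reference, the reported metrics and optimizer settings amount to a few lines of PyTorch; the nn.Linear model below is a placeholder, not part of the paper's setup.

```python
import torch
import torch.nn as nn

# Evaluation metrics used across all benchmarks.
def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return ((pred - target) ** 2).mean()

def mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return (pred - target).abs().mean()

# Optimizer settings as reported: Adam with lr = 0.01, batch size 32,
# results averaged over five independent runs.
model = nn.Linear(96, 96)  # placeholder forecasting model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
batch_size = 32
```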
Main results: Across 28 benchmark settings, KANMixer achieves the best MSE in 16 and the best MAE in 11; on ETTh1, for example, it improves average MSE by 4.9 %. Compared with more complex models such as WPMixer and TimeMixer, KANMixer is structurally simpler yet delivers superior accuracy. Against the KAN‑based TimeKAN, which relies on cascaded frequency decomposition, KANMixer uses KAN directly as the core component and produces more stable results.
Ablation studies: Replacing KAN layers with MLPs degrades performance markedly, confirming the advantage of KAN's adaptive basis functions. The KAN prediction head contributes the most to accuracy; removing it causes a large rise in MSE/MAE. Among basis functions, B‑splines perform best across prediction horizons, while Chebyshev, Fourier, and wavelet bases underperform. Explicit decomposition (DFT or moving‑average) harms KAN performance, whereas the multi‑scale module benefits it, indicating that KAN relies more on data‑driven learning than on handcrafted priors.
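Since the B‑spline basis is the ablation winner, a short sketch of the basis a KAN edge expands its input through may be useful. This is the standard Cox–de Boor recursion, not code from the paper:

```python
import torch

def bspline_basis(x: torch.Tensor, grid: torch.Tensor, order: int = 3) -> torch.Tensor:
    """Evaluate B-spline basis functions at points x (Cox-de Boor recursion).

    x: (N,) evaluation points; grid: (G,) monotonically increasing knots.
    Returns (N, G - order - 1) basis values; a KAN edge takes a learnable
    linear combination of these columns as its activation function.
    """
    x = x.unsqueeze(-1)                                   # (N, 1)
    b = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)   # degree-0 indicators
    for k in range(1, order + 1):
        left = (x - grid[: -(k + 1)]) / (grid[k:-1] - grid[: -(k + 1)]) * b[:, :-1]
        right = (grid[k + 1:] - x) / (grid[k + 1:] - grid[1:-k]) * b[:, 1:]
        b = left + right
    return b

# Example: 8 cubic basis functions on a uniform knot grid covering [-1, 1]
grid = torch.linspace(-1.4, 1.4, 12)
basis = bspline_basis(torch.linspace(-1, 1, 5), grid)     # shape (5, 8)
```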
Computational efficiency: KANMixer's MACs and parameter count exceed those of plain MLP models but remain far below those of Transformer models (PatchTST: 90.57 G MACs vs. KANMixer: ≈ 5.89 G). Training time is longer, mainly because CUDA kernels for KAN operations are not yet optimized, an engineering rather than a theoretical limitation.
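To reproduce this kind of comparison, parameters can be counted directly and MACs estimated with a profiler such as thop, one common choice; the paper does not specify its tooling, and thop only counts standard layer types, so custom KAN operations may need manual hooks:

```python
import torch
from thop import profile  # pip install thop

model = KANMixerSketch(seq_len=96, pred_len=96)  # sketch from the Method section
dummy = torch.randn(1, 96, 7)                    # (batch, seq_len, channels)

# thop counts MACs for registered layers (e.g., nn.Linear); the einsum inside
# the stand-in KAN layer is not counted, so treat this as a lower bound.
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f} G | params: {params / 1e6:.2f} M")

# Parameter count needs no external tooling:
print(sum(p.numel() for p in model.parameters()))
```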