How HORAI Uses Large‑Scale Multimodal Pretraining to Boost Time‑Series Forecasting and Anomaly Detection
The article reviews the HORAI model, which introduces a frequency‑enhanced multimodal pretraining paradigm and the massive MM‑TS dataset, showing that integrating derived images, endogenous text, and real‑world news dramatically improves zero‑shot forecasting and anomaly detection across six domains.
Background
Time‑series analysis is widely used in energy management, medical monitoring, and financial forecasting. Existing methods rely almost exclusively on the numeric modality, limiting their ability to capture the complex, multi‑faceted dynamics of real‑world processes.
Recent advances in NLP and multimodal learning demonstrate that large‑scale pretraining on complementary modalities can improve generalisation. Inspired by this, the authors propose a multimodal foundation model for time‑series analysis that incorporates textual and visual information.
Problem Definition
The paper identifies three core challenges: (1) the lack of a unified multimodal pretraining paradigm and large‑scale multimodal corpus for time‑series; (2) difficulty in aligning heterogeneous modalities within a single architecture; and (3) insufficient generalisation across diverse domains.
Method
3.1 Large‑Scale Multimodal Time‑Series Dataset (MM‑TS)
Dataset overview : MM‑TS is the first large‑scale multimodal time‑series pretraining dataset. It integrates three modalities (numeric series, derived line‑chart images, and textual descriptions) covering six domains (energy, healthcare, network, nature, traffic, economics) with over one billion time points. The series span multiple granularities, from seconds to months. The visual modality is generated by rendering line charts directly from the series, while the textual modality combines automatically generated endogenous descriptions (via GPT‑4o) with external news retrieved from the GDELT database.
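As a concrete illustration of how the visual modality could be produced, the sketch below renders a line‑chart image from a raw series with matplotlib. The figure size, resolution, and styling are assumptions, since the paper does not specify its rendering parameters.

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def render_series_image(series: np.ndarray, path: str) -> None:
    """Render a numeric series as a line-chart image (hypothetical settings)."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # assumed 224x224 px, matching ViT-Base inputs
    ax.plot(series, linewidth=1.0)
    ax.axis("off")  # assumption: axes are dropped so the model sees only the curve
    fig.savefig(path)
    plt.close(fig)

# Example: one synthetic series rendered to disk
render_series_image(np.sin(np.linspace(0, 12 * np.pi, 512)), "series.png")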
Text construction pipeline :
Context synthesis :
Endogenous text generation : GPT‑4o receives prompts describing statistical properties of the series (trend, seasonality, stability) and produces structured descriptions; a minimal sketch of this step follows the list.
Exogenous news retrieval : Metadata (domain, time range, region) forms a query to GDELT; retrieved articles are summarised by GPT‑4o to provide contextual background.
Quality alignment :
Logical consistency check : A specialised GPT‑4o model evaluates the coherence between generated text and retrieved news, discarding hallucinated pairs.
Integrated quality assessment : Multiple LLM judges score each pair on factual correctness and semantic clarity; only samples with high consensus scores are retained.
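To ground the pipeline above, here is a minimal Python sketch of the two text sources and the consensus filter. The prompt wording, the GDELT query fields, and the score threshold are all assumptions; the paper describes the components but not their exact formats.

import numpy as np

def endogenous_prompt(series: np.ndarray, domain: str, granularity: str) -> str:
    """Build a GPT-4o prompt from simple statistical properties (hypothetical template)."""
    trend = "upward" if series[-1] > series[0] else "downward"
    stability = float(np.std(np.diff(series)))
    return (
        f"Describe this {granularity} {domain} series. "
        f"Overall trend: {trend}; mean: {series.mean():.2f}; "
        f"first-difference std (stability): {stability:.2f}. "
        "Summarise trend, seasonality, and stability in 2-3 sentences."
    )

def gdelt_query(domain: str, start: str, end: str, region: str) -> dict:
    """Metadata-based query for exogenous news (field names are illustrative)."""
    return {"query": f"{domain} {region}", "startdatetime": start, "enddatetime": end}

def keep_pair(judge_scores: list[float], threshold: float = 0.8) -> bool:
    """Retain a text/series pair only if LLM judges agree it is factual and clear."""
    return float(np.mean(judge_scores)) >= threshold  # consensus threshold is assumed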
3.2 HORAI Model
Overall architecture : HORAI is a frequency‑enhanced multimodal foundation model built on an autoregressive backbone. It consists of a frequency‑guided cross‑modal encoder and a time‑frequency decoder.
Frequency‑enhanced cross‑modal encoder :
Numeric series X_{ts} are normalised (X_{norm}) and transformed to the frequency domain via FFT (X_{freq}).
A ratio parameter alpha defines a threshold tau, producing a low‑frequency mask M_{low} and a mid‑high‑frequency mask M_{mh}. Applying the masks and inverting the FFT yields a low‑frequency series X_{low} and a mid‑high‑frequency series X_{mh}.
Each component is split into patches of size S and projected to embeddings E_{low}, E_{mh}, and E_{ts}.
Text embeddings E_{text} are extracted by the Qwen‑0.5B encoder; image embeddings E_{img} come from a ViT‑Base encoder.
Low‑frequency embeddings are aligned with text embeddings, and mid‑high‑frequency embeddings with image embeddings, through a Flow‑Attention mechanism. For the text branch, queries, keys, and values are constructed as:

Q = proj(E_{low})
K, V = proj(E_{text})

Flow‑Attention then computes token‑wise flow over these, producing the frequency‑aware aligned representation E_{text}'; the image branch is built analogously from E_{mh} and E_{img}, yielding E_{img}'.
Adaptive multimodal fusion concatenates E_{img}' and E_{text}', applies a linear projection to obtain a gating signal G, and uses a sigmoid gate sigma to weight the multimodal embedding before adding the series embedding:

E_{mm} = sigma(G) * concat(E_{img}', E_{text}')
E_{fused} = E_{mm} + E_{ts}
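Putting the encoder steps together, here is a minimal PyTorch sketch of the frequency decomposition, patch embedding, alignment, and gated fusion. It is written under stated assumptions, not as the authors' implementation: a real FFT over the time axis with the cutoff at a fraction alpha of the spectrum, standard cross‑attention standing in for Flow‑Attention, and an extra projection so the gated concatenation matches the series embedding width; the paper's Qwen‑0.5B and ViT‑Base encoders would supply e_text and e_img.

import torch
import torch.nn as nn

def split_by_frequency(x: torch.Tensor, alpha: float = 0.05):
    """Split a normalised series (B, T) into low- and mid-high-frequency parts."""
    freq = torch.fft.rfft(x, dim=-1)                        # X_freq
    tau = max(1, int(alpha * freq.shape[-1]))               # threshold implied by ratio alpha
    m_low = torch.zeros_like(freq)
    m_low[..., :tau] = 1                                    # M_low keeps the lowest bins
    x_low = torch.fft.irfft(freq * m_low, n=x.shape[-1], dim=-1)       # X_low
    x_mh = torch.fft.irfft(freq * (1 - m_low), n=x.shape[-1], dim=-1)  # X_mh
    return x_low, x_mh

class PatchEmbed(nn.Module):
    """Split into non-overlapping patches of size S and project to d_model."""
    def __init__(self, patch_size: int, d_model: int):
        super().__init__()
        self.s = patch_size
        self.proj = nn.Linear(patch_size, d_model)
    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, T), T divisible by S
        return self.proj(x.unfold(-1, self.s, self.s))      # (B, T//S, d_model)

class CrossModalAlign(nn.Module):
    """Stand-in for Flow-Attention: Q from a frequency component, K/V from the other modality."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, e_freq, e_modal):                     # (E_low, E_text) or (E_mh, E_img)
        aligned, _ = self.attn(e_freq, e_modal, e_modal)
        return aligned                                      # E_text' / E_img'

class GatedFusion(nn.Module):
    """E_mm = sigma(G) * concat(E_img', E_text'); E_fused = E_mm + E_ts."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 2 * d_model)     # linear projection giving G
        self.out = nn.Linear(2 * d_model, d_model)          # assumed: map concat back to d_model
    def forward(self, e_img, e_text, e_ts):
        e_cat = torch.cat([e_img, e_text], dim=-1)
        e_mm = torch.sigmoid(self.gate(e_cat)) * e_cat      # sigma(G) * concat(...)
        return self.out(e_mm) + e_ts                        # E_fused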
Time‑frequency decoder :
The fused representation passes through a time‑frequency MoE‑FFN, where each expert captures patterns from specific domains.
A time‑frequency router receives the fused token H, projects it to a temporal vector H_{temp} (via an MLP) and a frequency vector H_{freq} (via FFT + MLP). A learned gate G_{router} merges the two signals, producing the router output H_{r}.
Top‑K routing selects the most relevant experts; their outputs are weighted and summed to form the final token representation (a minimal routing sketch follows below).
Autoregressive training predicts the next token X_{i+1} from the current token X_i using a GPT‑style objective.
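To make the routing concrete, below is a minimal PyTorch sketch of the time‑frequency router and Top‑K expert selection. The expert width, the use of rFFT magnitudes over the feature dimension, and the sigmoid merge are assumptions, and the dense expert computation is for clarity rather than efficiency (a real MoE dispatches tokens sparsely).

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeFreqMoE(nn.Module):
    """Time-frequency router with Top-K expert selection (illustrative sketch)."""
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.temp_proj = nn.Sequential(                     # H -> H_temp (MLP branch)
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_experts))
        self.freq_proj = nn.Sequential(                     # H -> H_freq (FFT + MLP branch)
            nn.Linear(d_model // 2 + 1, d_model), nn.ReLU(), nn.Linear(d_model, n_experts))
        self.gate = nn.Linear(d_model, 1)                   # learned gate G_router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:     # h: (B, L, d_model) fused tokens
        h_temp = self.temp_proj(h)
        h_freq = self.freq_proj(torch.fft.rfft(h, dim=-1).abs())  # magnitudes of feature FFT
        g = torch.sigmoid(self.gate(h))
        h_r = g * h_temp + (1 - g) * h_freq                 # router output H_r
        weights, idx = torch.topk(F.softmax(h_r, dim=-1), self.k, dim=-1)  # Top-K routing
        expert_out = torch.stack([e(h) for e in self.experts], dim=-2)     # (B, L, E, d)
        out = torch.zeros_like(h)
        for slot in range(self.k):                          # weighted sum of selected experts
            sel = idx[..., slot]
            chosen = torch.gather(
                expert_out, -2,
                sel[..., None, None].expand(*sel.shape, 1, h.size(-1))).squeeze(-2)
            out = out + weights[..., slot:slot + 1] * chosen
        return out

moe = TimeFreqMoE(d_model=256)
tokens = moe(torch.randn(2, 64, 256))                       # one decoder MoE-FFN step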
Experiments
4.1 Experimental Setup
HORAI is pretrained on MM‑TS and evaluated on downstream forecasting and anomaly‑detection benchmarks that do not overlap with the pretraining data. Forecasting datasets include TimeMMD and additional sets spanning agriculture, climate, energy, environment, social welfare, traffic, EWJ, KR, and MDT. Anomaly‑detection datasets cover weather, energy, KR, EWJ, and MDT with anomaly ratios between 5.81 % and 17.23 %.
Baselines comprise five state‑of‑the‑art time‑series foundation models (ChatTime, VisionTS, ROSE, Timer, MOIRAI) and four multimodal‑specific models (GPT4MTS, TATS, GPT4TS, TimeVLM) for forecasting, as well as nine single‑modal and multimodal models for anomaly detection.
Training uses Adam (lr = 5e‑4) for 20 epochs with an early‑stopping patience of 5. No drop‑last strategy is applied, and foundation models are evaluated by zero‑shot inference.
4.2 Forecasting Results
HORAI achieves the best performance in 15 out of 18 cases. Compared with the single‑modal baseline ROSE, HORAI reduces MSE by 29.6 %. Even against fully‑supervised multimodal models, HORAI outperforms GPT4MTS by 11.4 % and TimeVLM by 12.0 % in zero‑shot settings, demonstrating strong generalisation from multimodal pretraining.
4.3 Anomaly Detection Results
HORAI attains top AUC‑ROC, VUS‑ROC, and VUS‑PR scores in 13 out of 15 scenarios. Against the generic detector DADA, HORAI improves AUC‑ROC by 13.4 %, VUS‑ROC by 19.5 %, and VUS‑PR by 19.2 % under zero‑shot inference. Compared with the multimodal model GPT4TS, improvements are 12.2 %, 20.2 %, and 22.6 % respectively.
4.4 Ablation Studies
Removing image and text modalities degrades performance, confirming that both semantic and visual cues are beneficial.
Modality exchange (aligning high‑frequency series with text and low‑frequency with images) yields worse results, highlighting the importance of frequency‑aware alignment.
Replacing MoE‑FFN with a standard FFN reduces performance, showing that expert routing captures diverse patterns.
Removing frequency information from the router also harms results, indicating that frequency cues guide token‑to‑expert assignment.
4.5 Model Analysis
Sensitivity analysis varies the frequency threshold alpha and the number of experts K. The best forecasting performance occurs at alpha = 0.05, which cleanly separates low and mid‑high frequencies. Top‑2 or Top‑3 experts provide the optimal trade‑off between expressiveness and redundancy.
Modality and alignment ablations confirm that both text and image modalities contribute, but their impact differs across datasets. Frequency‑guided alignment and the original modality pairing are crucial for maintaining performance.
Encoder ablations replace the Qwen text encoder with smaller models and swap ViT for a Swin Transformer. Larger text encoders (e.g., LLaMA) yield slight gains, while the two visual encoders perform comparably.
Conclusion
HORAI introduces a frequency‑enhanced multimodal pretraining paradigm and the MM‑TS dataset, demonstrating that integrating derived images and contextual news with time‑series data substantially improves zero‑shot forecasting and anomaly detection across diverse domains.