Paper Reading: CoRA – A Multimodal Covariate Adaptation Framework for Time‑Series Foundation Models
CoRA freezes pretrained time‑series foundation models, extracts multimodal covariate embeddings, evaluates their causal relevance with a trainable Granger‑Causal Embedding, and injects them via a zero‑initialised condition module, achieving up to 31.1% MSE reduction across single‑ and multi‑modal forecasting tasks.
Background
Time‑series forecasting is essential in domains such as weather, supply chain, and finance. Large‑scale pretrained time‑series foundation models (TSFMs) such as TimesFM, Chronos, and Sundial achieve strong zero‑shot generalisation, but most are pretrained on single‑variable series, limiting their ability to incorporate multimodal covariates (additional series, text, images) in real‑world scenarios.
Problem Definition
The paper identifies three challenges: (1) single‑variable pretraining prevents direct use of multivariate or multimodal covariates; (2) covariate‑target dependencies are often domain‑specific, non‑causal, and noisy, requiring a data‑driven quantification of causal contribution; (3) naïve adaptation (e.g., inserting covariate modules) disrupts the pretrained embedding space, causing catastrophic forgetting and unstable training.
Method: CoRA Framework
3.1 Freeze the Base Model as a Feature Extractor
Each modality (time‑series, text, image) is processed by a dedicated pretrained encoder (TSFM, LLM, ViT). Encoders remain frozen to retain learned knowledge. For time‑series covariates the embedding from the last timestep is used; for text and image covariates the timestep‑averaged embeddings are taken. The target variable is encoded by the TSFM backbone’s last‑timestep embedding.
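The extraction rules above are simple pooling choices over the frozen encoders' outputs. A minimal numpy stand‑in, where random arrays play the role of the frozen encoders' per‑timestep embeddings (shapes are illustrative assumptions, not the paper's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 96, 64                          # lookback length, embedding width (assumed)

ts_cov_emb = rng.normal(size=(T, d))   # per-timestep TSFM embeddings of a covariate series
text_emb   = rng.normal(size=(T, d))   # per-timestep LLM embeddings of text covariates
image_emb  = rng.normal(size=(T, d))   # per-timestep ViT embeddings of image covariates
target_emb = rng.normal(size=(T, d))   # TSFM backbone embeddings of the target series

# Time-series covariates: take the last-timestep embedding.
h_ts = ts_cov_emb[-1]
# Text and image covariates: take the timestep-averaged embedding.
h_text  = text_emb.mean(axis=0)
h_image = image_emb.mean(axis=0)
# Target: the backbone's last-timestep embedding.
h_target = target_emb[-1]
```

Each covariate thus collapses to a single d‑dimensional vector regardless of modality, which is what lets the next stage treat them uniformly.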
3.2 Covariate Causal Evaluation: Granger‑Causal Embedding (GCE)
A trainable Granger‑Causal Embedding matrix W_GC aligns multimodal covariate embeddings into a unified latent space and quantifies their causal impact on the target based on Granger causality theory. The process consists of (1) aligning embeddings, and (2) concatenating all modality embeddings followed by a Softmax‑weighted aggregation.
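A sketch of the two GCE steps in numpy. The dot‑product relevance score against the target embedding is our assumption about how the Softmax weights are derived; the paper's exact scoring function may differ, and W_GC would of course be trained rather than randomly initialised:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d = 64
# Per-modality covariate embeddings from the frozen encoders (stand-ins).
cov_embs = [rng.normal(size=d) for _ in range(3)]   # e.g. series, text, image
h_target = rng.normal(size=d)

# Trainable Granger-Causal Embedding matrix (randomly initialised here).
W_GC = rng.normal(size=(d, d)) * 0.02

# (1) Align each modality's embedding into a unified latent space.
aligned = np.stack([W_GC @ h for h in cov_embs])    # (num_cov, d)

# (2) Score each aligned covariate against the target (assumed dot product)
#     and aggregate with Softmax weights.
scores = aligned @ h_target                          # (num_cov,)
weights = softmax(scores)
H = weights @ aligned        # causally weighted covariate embedding, shape (d,)
```

The Softmax weights act as soft covariate selection: covariates whose aligned embeddings contribute little to predicting the target are down‑weighted rather than hard‑pruned.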
3.3 Zero‑Initialisation Condition Injection
A lightweight MLP maps the causally weighted covariate embedding H to a scaling factor α, a shift factor β, and a bias γ. These parameters are injected into the TSFM’s prediction head, allowing covariate information to modulate the forecast. Both the MLP and the alignment parameters are zero‑initialised so that the adapted model starts from the exact pretrained state, preventing catastrophic forgetting.
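The zero‑initialisation guarantee can be checked in a few lines. In this sketch only the MLP's output layer is zeroed (the common adaLN‑zero pattern, so gradients can still flow through the hidden layer); that detail, the α/β/γ dimensionalities, and the scalar toy head are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

H = rng.normal(size=d)            # causally weighted covariate embedding (stand-in)
h_target = rng.normal(size=d)     # backbone's last-timestep target embedding
head_W = rng.normal(size=d) * 0.1 # toy pretrained linear prediction head (scalar output)

# Lightweight MLP; the OUTPUT layer is zero-initialised so that
# alpha = beta = gamma = 0 before any training step.
W1 = rng.normal(size=(d, d)) * 0.02
b1 = np.zeros(d)
W2 = np.zeros((2 * d + 1, d))
b2 = np.zeros(2 * d + 1)

hid = np.maximum(W1 @ H + b1, 0.0)          # ReLU hidden layer
out = W2 @ hid + b2                          # all zeros at initialisation
alpha, beta, gamma = out[:d], out[d:2 * d], out[2 * d]

# Inject scale / shift / bias into the prediction head's input.
y_adapted = head_W @ (h_target * (1.0 + alpha) + beta) + gamma
y_base = head_W @ h_target                   # frozen model's original forecast
```

At initialisation `y_adapted == y_base`: the adapted model reproduces the pretrained forecast exactly, and covariate influence grows only as training moves α, β, γ away from zero.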
Experiments
4.1 Datasets and Baselines
Single‑modal covariates: ETT (transformer temperature), Weather, ECL (electric load), EPF (electric price). Multi‑modal covariates: RT‑1 (robot images), Time‑MMD (text). Baselines include adaptation methods (AdaPTS, ChronosX, UniCA), deep forecasting models (TimeXer, iTransformer, PatchTST, N‑BEATSx) and TSFMs (Sundial, TimesFM, Chronos‑Bolt, FlowState).
4.2 Main Results
Single‑modal Covariate Forecasting
On long‑term datasets (ETTh1, ETTh2) CoRA reduces MSE by 31.1 % and MAE by 19.8 % relative to the strongest baseline, TimeXer, and improves over UniCA by 18.7 %.
For short‑term EPF forecasting CoRA lowers MSE by 9.4 % relative to TimeXer and by 6.4 % relative to AdaPTS.
Multi‑modal Covariate Forecasting
With image covariates (RT‑1 subset) CoRA cuts MSE by 12.7 % and CRPS by 8.8 % versus the best supervised model.
For text covariates (Time‑MMD) CoRA reduces MSE by 3.0 % and CRPS by 3.7 % compared with UniCA.
Few‑Shot Forecasting (EPF)
When only 1 %–25 % of training samples are available, CoRA achieves 15 %–20 % lower MSE than TimeXer and 5 %–10 % lower MSE than ChronosX, demonstrating rapid adaptation under data scarcity.
Multivariate Forecasting
On multivariate datasets (ETT, Weather) CoRA lowers average MSE by 14.5 % and MAE by 12.2 % versus TimeXer, highlighting its advantage for joint target prediction.
Model Analysis
Generality
CoRA is compatible with various TSFMs (Sundial, TimesFM, Chronos‑Bolt, FlowState) and yields average MSE reductions ranging from 3.3 % to 14.2 %, confirming broad applicability.
Ablation Study
Removing covariates (w/o covariate) increases MSE by 6.5 % → covariates are essential.
Removing the adaptive layer‑norm injection (w/o adaLN) raises MSE by 12.9 % → the condition injection is effective.
Removing GCE (w/o selection) raises MSE by 8.3 % → causal weighting matters.
Removing zero‑initialisation (w/o zero‑init) raises MSE by 4.3 % → stable initialisation prevents forgetting.
Interpretability
The Pearson correlation between GCE scores and traditional Granger‑Geweke causality tests is 0.58 — a moderate positive agreement, suggesting that the learned GCE weights track classical measures of covariate causal contribution rather than arbitrary attention patterns.
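The classical measure that GCE is compared against is computable directly. A minimal numpy sketch of the Granger‑Geweke measure F_{x→y} = log(σ²_restricted / σ²_full), estimated by OLS on a fixed lag order over synthetic data (lag choice and estimation are simplifications of the full test):

```python
import numpy as np

def geweke_causality(x, y, p=2):
    """Granger-Geweke measure log(var_restricted / var_full):
    does adding x's past p lags reduce the error of predicting y?"""
    n = len(y)
    Y = y[p:]
    ones = np.ones((n - p, 1))
    lag_y = np.column_stack([y[p - k: n - k] for k in range(1, p + 1)])
    lag_x = np.column_stack([x[p - k: n - k] for k in range(1, p + 1)])

    def resid_var(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return (Y - X @ beta).var()

    var_r = resid_var(np.hstack([ones, lag_y]))          # y's own past only
    var_f = resid_var(np.hstack([ones, lag_y, lag_x]))   # plus x's past
    return np.log(var_r / var_f)

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()
z = rng.normal(size=n)        # an unrelated covariate

print(geweke_causality(x, y), geweke_causality(z, y))
```

The causal covariate x scores far above the unrelated z; correlating such per‑covariate scores with GCE's Softmax weights (e.g. via `np.corrcoef`) is one way to reproduce the paper's interpretability check.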