FinCast: A Foundation Model for Financial Time‑Series Forecasting
FinCast is a decoder‑only Transformer foundation model for financial time‑series forecasting. It tackles non‑stationarity, multi‑domain diversity, and multi‑resolution challenges through input chunking with frequency embeddings, a sparse Mixture‑of‑Experts (MoE) decoder, and a point‑quantile (PQ) loss, achieving zero‑shot and supervised gains over state‑of‑the‑art baselines while running five times faster on consumer GPUs.
Background
Financial time‑series forecasting underpins economic stability, policy making, and sustainable investment, but it faces three major challenges: (1) non‑stationarity – data distributions shift over time due to structural changes, investor behavior, and policy interventions; (2) multi‑domain diversity – stocks, FX, futures, etc., exhibit distinct patterns; (3) multi‑resolution – second‑level high‑frequency data capture noise‑driven fluctuations while week‑level data reflect macro trends, making it hard for a single‑resolution model to generalize.
Problem Definition
Existing financial forecasting models suffer from (1) poor generalization across distribution drifts, domains, and resolutions; (2) over‑fitting and heavy domain‑specific fine‑tuning; (3) lack of uncertainty modeling, as mean‑squared‑error loss drives predictions toward the mean.
Method
FinCast is the first foundation model designed specifically for financial time‑series prediction. It adopts a decoder‑only Transformer architecture and introduces three innovations:
Input chunking & frequency embedding
Decoder sparse Mixture‑of‑Experts (MoE) with RMSNorm and causal attention
Output block with point‑quantile (PQ) loss
3.1 Input chunking and frequency embedding
Raw series X is split into N non‑overlapping blocks of length P. Each block is instance‑normalized to remove scale bias, then passed through a residual MLP to obtain the input hidden state h_{input}. A discrete frequency index f is assigned to each block; a learnable embedding Embed_{freq}(f) encodes the time‑resolution (second‑level, day‑level, etc.), enhancing the model’s ability to capture patterns across granularities.
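To make this step concrete, here is a minimal PyTorch‑style sketch of chunking, per‑block instance normalization, a residual MLP projection, and a learned frequency embedding; the hidden size, patch length, and number of frequency buckets are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Chunk a series into non-overlapping blocks, instance-normalize each block,
    project with a residual MLP, and add a learned frequency embedding.
    Sketch only: dimensions and frequency buckets are assumptions."""
    def __init__(self, patch_len=32, d_model=512, n_freq=7):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))
        self.freq_embed = nn.Embedding(n_freq, d_model)  # second/minute/.../month buckets

    def forward(self, x, freq_id):
        # x: (batch, L) univariate series with L divisible by patch_len; freq_id: (batch,)
        B, L = x.shape
        patches = x.view(B, L // self.patch_len, self.patch_len)   # (B, N, P)
        mu = patches.mean(dim=-1, keepdim=True)
        sigma = patches.std(dim=-1, keepdim=True) + 1e-5
        patches = (patches - mu) / sigma                            # instance norm per block
        h = self.proj(patches)
        h = h + self.mlp(h)                                         # residual MLP
        h = h + self.freq_embed(freq_id).unsqueeze(1)               # broadcast over blocks
        return h, (mu, sigma)                                       # stats reused for de-normalization
```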
3.2 Decoder MoE
RMSNorm replaces LayerNorm, normalizing activations by their root‑mean‑square (without mean‑centering) to improve stability during large‑scale pre‑training.
Causal self‑attention masks future tokens; the query vector Q is dynamically scaled by a learnable parameter α to adapt to the feature dimension.
A sparse expert mixture routes each token to its top‑k experts (k = 2) via a gating network: gating scores s_{i,n} are passed through a softmax, the top‑k scores are kept, and the selected experts' outputs are weighted and summed (see the sketch below).
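A minimal sketch of the RMSNorm and top‑k gating described above, assuming PyTorch; the expert width and expert count here are placeholders rather than FinCast's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by root-mean-square instead of mean/variance (no centering)."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.scale * x / rms

class SparseMoE(nn.Module):
    """Route each token to its top-k experts and sum their outputs,
    weighted by the renormalized gating scores."""
    def __init__(self, d_model=512, n_experts=4, k=2, d_ff=2048):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                      # x: (batch, tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)               # gating scores s_{i,n}
        topv, topi = scores.topk(self.k, dim=-1)               # keep top-k experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)           # renormalize kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topi[..., slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```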
3.3 Output block and PQ‑loss
The decoder output passes through a residual MLP to project onto the prediction horizon H, then inverse‑normalization restores the original scale.
The PQ‑loss jointly optimizes four terms: point loss, quantile loss (for uncertainty), trend‑consistency loss (first‑order difference alignment), and expert regularization (balance loss L_{balance} and router‑entropy loss L_{router‑z}) to avoid expert collapse.
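The point, quantile, and trend‑consistency terms can be sketched as below; the expert‑regularization terms are omitted, and the quantile levels and term weights are assumptions rather than the paper's values.

```python
import torch

def pq_loss(point_pred, quant_pred, target, quantiles=(0.1, 0.5, 0.9),
            w_point=1.0, w_quant=1.0, w_trend=1.0):
    """point_pred: (B, H); quant_pred: (B, H, Q); target: (B, H).
    Weights and quantile levels are illustrative assumptions."""
    # Point term: plain squared error toward the target.
    point = (point_pred - target).pow(2).mean()

    # Quantile (pinball) term: penalizes under-/over-prediction asymmetrically per level q.
    err = target.unsqueeze(-1) - quant_pred                    # (B, H, Q)
    q = torch.tensor(quantiles, device=target.device)
    pinball = torch.maximum(q * err, (q - 1) * err).mean()

    # Trend-consistency term: align first-order differences of prediction and target.
    trend = ((point_pred[:, 1:] - point_pred[:, :-1])
             - (target[:, 1:] - target[:, :-1])).pow(2).mean()

    return w_point * point + w_quant * pinball + w_trend * trend
```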
3.4 Training and inference
Pre‑training data: 2.4 M financial series (≈200 B time points) covering crypto, FX, futures, and stocks, with resolutions from seconds to months.
Training configuration: 1 B‑parameter decoder, 4‑expert MoE (top‑2 routing), context length 1024 (high‑frequency) / 256 (low‑frequency), global batch size 8192, AdamW (lr = 2e‑4, weight decay = 0.05), 147,152 optimization steps.
Inference: autoregressive block‑wise decoding supports arbitrary context length L and horizon H, running efficiently on an 8 GB consumer GPU.
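A sketch of block‑wise autoregressive decoding for an arbitrary horizon H, under an assumed `model(series, freq_id)` interface that returns the next block of predictions; the real pre‑trained model's API may differ.

```python
import torch

@torch.no_grad()
def forecast(model, context, horizon, freq_id, patch_len=32):
    """Autoregressively roll the model forward patch_len points at a time
    until at least `horizon` points have been generated.
    The model interface and patch length are assumptions for illustration."""
    series = context.clone()                      # (B, L) observed history
    generated = []
    steps = -(-horizon // patch_len)              # ceil division: number of blocks to decode
    for _ in range(steps):
        next_block = model(series, freq_id)       # (B, patch_len) next-block forecast
        generated.append(next_block)
        series = torch.cat([series, next_block], dim=-1)    # feed the prediction back in
    return torch.cat(generated, dim=-1)[:, :horizon]        # trim to the requested horizon
```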
Experiments
4.1 Datasets and benchmarks
Zero‑shot benchmark: 3,632 series (4.38 M points) spanning crypto, FX, stocks, and futures from minute to week granularity, disjoint from the pre‑training data.
Supervised benchmark: US_71 (71 high‑liquidity US stocks, 2016–2023) and US_14L (14 large‑cap stocks, 2005–2023), split 7:1:2 for train/validation/test.
4.2 Results
Zero‑shot performance: FinCast beats TimesFM, Chronos‑T5, and TimesMOE at all horizons (h = 10, 30, 60), reducing average MSE by 20% and MAE by 10%; it ranks first on 23 of 36 sub‑datasets by MSE and 25 of 36 by MAE.
Supervised performance: Both the zero‑shot and fine‑tuned variants (fine‑tuning only the output block and the last 10% of MoE layers) surpass the SOTA baselines PCIE, PatchTST, and D‑Va; zero‑shot MSE ↓23%, MAE ↓16%; fine‑tuned MSE ↓26%, MAE ↓19%.
4.3 Ablation study
Removing the sparse MoE raises MSE by 9.32% (expert homogenization). Removing the PQ‑loss raises MSE by 7.62% (prediction collapse). Removing the frequency embedding raises MSE by 4.38% (loss of explicit resolution encoding).
4.4 Inference speed
On an NVIDIA RTX 4060 (8 GB VRAM), FinCast’s inference is five times faster than generic time‑series models while delivering higher accuracy, meeting real‑time requirements of high‑frequency trading.
4.5 Qualitative analysis
Zero‑shot predictions accurately capture minute‑level crypto volatility, daily stock trends, and weekly futures cycles, avoiding the flat predictions of baseline models.
Supervised predictions retain trend consistency under distribution shifts (e.g., sudden drop at the end of the input window), whereas baselines become overly conservative due to limited training distribution.